I love my job. One of the reasons it's so cool is that I get to try to solve mysteries in between my work with Total World Domination & Crushing the Helpless.
My team has been working on Netscan for quite a few years. We have a fairly impressive set of statistics and measurements from the last five or six years, which I'm quite pleased with, and which turn out to be a lot of fun. There are a lot of different ways to cut and splice the data, and we're just beginning to scratch the surface. (Incidentally, large portions of these data are available online at the Netscan website; further, universities can request 500-GB chunks of our data for their own research.)
Let me walk you through one of our latest mysteries. Perhaps you'll find it interesting too.
I should note that this project isn't quite typical of what we do: there are a lot of projects going on. Many of them are ethnographic, or design-oriented, or more traditionally sociological. But in general, we are trying to understand the social phenomena around online systems -- and if there's one thing going for Usenet, it's that it's a seething mass of social phenomena.
This mystery, like any good suspense, is presented as a multi-part post. We'll do this one in four parts; I'll put up one part every day, through Sunday. None of them are painfully technical.
Part I: Counting Messages
At some point, I wanted to know how many messages we had. This chart shows the number of Netscan messages. The X axis is the date: January 2000 (when Netscan's collection started) through October 2004. The Y axis shows total posts per day; each horizontal line marks 200,000 posts, so the high point is just a little over a million posts.
The darker orange line is a 30-day running average. There's enough day-to-day variation that I need a trend line, or the data becomes hard to read: the noise begins to drown out the signal.
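For the curious, here is a rough sketch of how a running average like that can be computed. It is only an illustration, not the code behind the chart: I'm assuming a trailing window (each day averaged with the days before it), the name `trailing_mean` is made up, and the sample numbers are invented rather than taken from Netscan.

```python
from collections import deque


def trailing_mean(daily_counts, window=30):
    """Average each day with up to `window - 1` days before it.

    Assumption: a trailing window. Early days, which have fewer than
    `window` predecessors, are averaged over whatever is available.
    """
    recent = deque()   # the counts currently inside the window
    running_sum = 0
    smoothed = []
    for count in daily_counts:
        recent.append(count)
        running_sum += count
        if len(recent) > window:
            # Drop the oldest day once the window is full.
            running_sum -= recent.popleft()
        smoothed.append(running_sum / len(recent))
    return smoothed


# Invented daily post counts, for illustration only (not Netscan data).
sample = [700_000, 900_000, 650_000, 880_000, 720_000, 910_000]
print(trailing_mean(sample, window=3))
```

A wider window gives a smoother line but reacts more slowly to a real change, which is worth keeping in mind when reading the dips and climbs.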
Again, this chart shows unique messages per day. You'll note a few features of this graph: it seems to be generally rising, linearly, over time. There are a few periods of particular interest: a particularly noisy period in late 2000; a large dip in January 2003; a smaller one in June 2003; and a bit of a climb in April through June of 2004.
Now, as a person who hangs out with sociologists, I like to be able to explain behavior. What happened in 2000? In 2003? In mid-2004? Is it depression over a Super Bowl result, or international terrorism?
Or is it just a technical glitch on our side?
Now, you should note that this chart isn't "all messages in Usenet"; this is "all the messages that our server saw." No one has quite the same view of all of Usenet, due to its fairly anarchic design. There are various good descriptions of what happens, but here's an approximation:
When I post a message to, say, alt.candy-lovers.drgoodbar, it is sent to my local server. This server, periodically, communicates with a set of other servers that it knows. Those servers feed me their latest posts, and my server feeds them my latest posts. So this post is offered to a few other servers. Maybe one of them doesn't like posts sent anywhere in the alt hierarchy, and so it drops it. Perhaps a different server has a glitch, and doesn't get it. A third server is an incoming-only connection, and so doesn't want my updates. But my post's immortality is assured: a fourth server picks up the message, a few minutes or hours later, and propagates it. That server passes it on to some others, and so on.
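If it helps to see that as code, here is a toy model of the flooding, under some loud assumptions: every server name is made up, the `propagate` function is hypothetical, and each server either always takes the post or never does. Real news servers are a good deal more complicated than this.

```python
from collections import deque


def propagate(origin, peers, accepts):
    """Flood one post outward from `origin`; return the servers that end up with it.

    peers:   dict of server name -> list of servers it offers its posts to
    accepts: dict of server name -> bool; False means the server drops the
             post or fails to receive it (assumed True when not listed)
    """
    have = {origin}
    queue = deque([origin])
    while queue:
        server = queue.popleft()
        for peer in peers.get(server, []):
            if peer in have or not accepts.get(peer, True):
                continue  # already has the post, or will not take it
            have.add(peer)
            queue.append(peer)  # this peer now offers the post onward
    return have


# Hypothetical topology mirroring the story above.
peers = {
    "local": ["no_alt", "glitchy", "fourth"],  # "incoming_only" is never offered our posts
    "incoming_only": ["local"],
    "fourth": ["fifth", "sixth"],
}
accepts = {"no_alt": False, "glitchy": False}

print(sorted(propagate("local", peers, accepts)))
```

In this toy run the post misses three of the local server's four neighbors and still reaches everything downstream of the fourth. Run it from a different starting server and you get a different set, which is why no two servers see quite the same Usenet.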
So messages come in fits and starts, in great surges as they queue up somewhere and then flow more smoothly. Our own collector sometimes has problems and loses messages. So maybe this is just a problem on our side.
Tomorrow: Is the variation just a glitch?
January 20, 2005 12:15 PM | in Data and Documents
Do bsittler and I get credit somewhere in this story for any suggestions we made on Sunday? *g*
Posted by: Auros at January 20, 2005 01:51 PM