Yesterday, we examined a chart of Usenet traffic, and saw some mysterious bumps. We ended our study with the question of which of those bumps were due to our own reliability, and which we could attribute to outside phenomena.
(Today's post is a little bit technical, and can probably be skipped without too much guilt. I'll catch you up tomorrow.)
Now, measuring your own reliability is a little tough. We could, I guess, compare our feeds to yet another feed coming in from somewhere, or double-check our incoming messages against Google. But each of those is expensive, and requires even more processing -- and with our statistics engine, we've got a lot of stuff churning away already.
We've chosen a different take. We look instead at internal measures: things that just don't add up in our own database.
In particular, we can look for messages where we see someone answering a message … but we don't have the original. That suggests we failed to get the original, leaving behind an "orphan," so called because it is a message without its parent.
There will always be a background level of these "orphans." The first message may have been cancelled, for example, without us seeing either the original or the cancel. But we can make a good estimate.
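For the curious, here is a minimal sketch of the orphan check in Python. It is an illustration, not our actual code: it assumes each stored message is a dict with hypothetical `id`, `date`, and `references` fields, and that the last entry in `references` names the message being answered.

```python
from collections import Counter

def count_orphans_per_day(messages):
    """Count replies whose parent message is missing, grouped by day.

    `messages` is a list of dicts with hypothetical keys:
      'id'         -- the message's own ID
      'date'       -- the day it arrived, e.g. '2005-01-19'
      'references' -- IDs of earlier messages in the thread, oldest first
    """
    # Every message ID we actually hold.
    known_ids = {m["id"] for m in messages}

    orphans = Counter()
    for m in messages:
        refs = m.get("references") or []
        # Assumption: the last reference is the direct parent.
        if refs and refs[-1] not in known_ids:
            orphans[m["date"]] += 1
    return orphans
```

A message with no references at all starts a thread, so it can never count as an orphan.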
Orphans turn out to be a pretty good measure of our reliability. On days when we're scooping up most of Usenet, we don't have many orphans. When we are missing things, for one reason or another, our orphan count shoots upward. In fact, you can see that some of the downward spikes -- like the really big one in February, 2003 -- are matched by a surge in orphans. That's a pretty good sign that we lost stuff.
Here's a count of our orphans per day. The orphans on the chart are counted at one-tenth the scale of the posts: the high peak of the orphan trend-line, in March of 2001, is about 60,000 a day.
Where orphans are flat, we're doing well. That's when we give our database guys a raise. (It does make sense that orphans have dropped in the last few years: we're talking to more news servers now, so we have a better flow of messages; and our collection mechanism has, by and large, improved, so we're getting higher-quality data.)
It looks like most of the fluctuations can be accounted for by the orphan count. Got a dip? We must have been missing stuff that day.
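That reasoning can be sketched in a few lines of Python. This is only a rough illustration: the 7-day window and the two thresholds are made-up values, not numbers we actually use.

```python
from statistics import median

def explain_dips(posts, orphans, window=7, dip=0.8, surge=1.5):
    """Flag days where post volume dips, and say whether orphans explain it.

    `posts` and `orphans` are equal-length lists of daily counts in date
    order. The window and thresholds are illustrative assumptions.
    """
    results = []
    for i in range(window, len(posts)):
        # Baselines: the median of the preceding `window` days.
        base_posts = median(posts[i - window:i])
        base_orphans = median(orphans[i - window:i])

        if posts[i] < dip * base_posts:
            # A dip plus an orphan surge suggests we missed messages.
            lost = orphans[i] > surge * base_orphans
            label = "likely lost messages" if lost else "unexplained dip"
            results.append((i, label))
    return results
```

A dip that comes with an orphan surge points at our own collection; a dip with flat orphans needs some other explanation.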
But take a good look at that curve. One thing in the recent past stands out to me -- total volume went up during the summer. Way up. Up by a good 40% or so: at its peak, that's around a million messages a day, up from around 700,000 messages a day a few months before that.
This is peculiar. The orphan count is flat, so it's not some sort of strange server bump. This is simply more messages. Lots more messages.
Tomorrow, we'll take a look into several different Usenet sub-hierarchies and try to drill down into this spike.
January 20, 2005 01:12 PM | in Data and Documents