Thanks to the communities team at MSR for walking through this process with me. A shout out to Ben Sittler & Auros, who surely said useful things about this: I just don't remember what.
Chris asked whether our dataset is spam-filtered: No, it's not. We pull straight from the sources, with no filtering. (On the other hand, spam is usually pretty easy to detect: for one thing, people don't reply to it, and fairly often, the spammers show up dozens or hundreds of times in the same newsgroups.) Chris, incidentally, has been trying to do some research arguing that one man's spam is another man's ham. I'm pretty sure that he's just plain wrong on this, unless you really want to look at my blog to find people who will sell you cat food. Or lesbian bondage movies. Or generic Cialis. (If Chris is right, then pretty much anything anyone knows about marketing -- ideas like "target audience" -- is just plain wrong.)
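For the curious, here's roughly what that "easy to detect" heuristic might look like in code. This is only a sketch, not anything Netscan actually runs: the `Post` record, the `likely_spam_authors` function, and the cutoff of 24 posts (my stand-in for "dozens") are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record; these field names are assumptions for illustration.
    author: str
    newsgroup: str
    reply_count: int  # number of replies this post received


def likely_spam_authors(posts, min_posts=24):
    """Flag authors who post heavily to a single newsgroup and never get a reply.

    min_posts is an arbitrary cutoff standing in for "dozens of times".
    """
    # How many times each author posted to each newsgroup.
    posts_per_pair = Counter((p.author, p.newsgroup) for p in posts)

    # Total replies each author received across all of their posts.
    replies_per_author = Counter()
    for p in posts:
        replies_per_author[p.author] += p.reply_count

    return {
        author
        for (author, _group), count in posts_per_pair.items()
        if count >= min_posts and replies_per_author[author] == 0
    }
```

Calling `likely_spam_authors(posts)` on a list of `Post` records returns the set of flagged authors. Anyone who has ever gotten a reply is let off the hook, which is crude, but so is the observation it's based on.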
Joshua asks about the spike in orphans in early 2001. That's actually a side-effect, I think, of the smoothing: there's a lot of jumping up and down in that time period. Those were the early days of Netscan, and our server was up and down, which means that our data was a mess. (Far more mysterious is the recent spike in orphans in microsoft.public. We really should be tracking those messages pretty well.)
January 23, 2005 07:20 PM