January 31, 2005

I'm not egotistical, I'm monitoring my online profile

Sure, you may ego-surf every few days. But me? I keep a search running through Bloglines that lets me know when I'm mentioned. (This is Kibozing for the 21st century.)

And now a new blog has popped up. Danyel Smith, hip-hop editor and writer, is now Danyel Smith, blogger.

Which means that someone, somewhere, is going to type Danyel Blog and be horribly confused.

This is what happens when very-nearly-unique names (we need not mention this person) collide.

Posted by danyelf at 11:18 AM | Comments (0) | TrackBack

January 28, 2005

Dr. Updates, or How I Learned to Stop Worrying, and Love the Toast

Another day, another cryptic movie reference. New blog entry posted on Raindrop

Posted by danyelf at 10:34 AM | Comments (0) | TrackBack

January 27, 2005

Who knew?

This is just distressing: Microsoft Indexing Service was there all along, and it worked, and it worked well. Your Windows 2000 box was able to do high-speed full-text searches of your hard drive.

It's just that, um, it wasn't connected to the search box without a magic invocation.

Jon Udell has more

Posted by danyelf at 11:25 AM | Comments (0) | TrackBack

January 26, 2005

Data Synchronization in Practice (See also: Raindrop)

I just got a slice of the Microsoft Research blog Raindrop. I'll be posting there periodically: for the next little while, I'll post placeholders here pointing to there, but I'll sooner or later be posting all work-related material there.

The Danyel-only view of Raindrop lives here

And my own post came up just now, entitled Synchronization in Practice

Posted by danyelf at 04:37 PM | Comments (0) | TrackBack

Wikipedia has errors?

A while ago, I talked about Wikipedia and the ways it has of correcting errors. This was in the midst of great controversy: is Wikipedia reliable? Can you trust a system that might have errors as much as you can trust the stolid, ever-accurate Britannica?

One counter-argument, of course, is that if you find an error in Wikipedia, you can fix it.

Today, Many2Many points to an article in the Times Online

A schoolboy with a fascination for Poland and wildlife has uncovered several significant errors in the latest — the fifteenth — edition of the Encyclopaedia Britannica.

And now, of course, Britannica readers are stuck with the mistakes...

And this is a gratuitous reference to Borges because I can.

Posted by danyelf at 03:27 PM | Comments (0) | TrackBack

JUNG & Netscan on Many-to-Many

Ross Mayfield has been chatting with Marc Smith about Netscan

His blog entry (and the associated Flickr series) shows some cool pictures of the reply-to network in Usenet: that is, the network of who replies to messages by someone else.

Of course, the images are generated with JUNG, the open-source network calculation and visualization package.


This is particularly interesting in the context of AOL dropping newsgroup support, which has been much discussed (here and here and here and here).

I'll put in my own two cents' worth below the fold...

There are a number of good reasons why AOL might want to drop newsgroup support. Some people have pointed out that Usenet is less of a critical resource than it once was, and more people are moving to blogs and other fora.

It's definitely the case that Usenet-sans-binaries has remained roughly flat for a while. Here's the number of daily posts, as recorded by Netscan, over the last four years, MINUS ALL POSTS IN ALT.BINARIES (x axis is the number of days since January 1, 2000. Sorry for the awkwardness, but I don't feel like fighting Excel right now.)

usenet levels

So here's how I read the AOL story:

AOL has historically viewed itself as an editor (presenting a happier bit of the internet), so I can imagine that maintaining newsgroups has perpetually been a thorn in its side: a quick skim through the group titles turns up a distressing amount of sex and pirated and cracked. Especially if their interface is one of the traditional “here’s 30K newsgroups, which do you want?” types.

Combine that with their lawsuit from Harlan Ellison and the fact that newsgroups have to sit on their own server (which means that AOL is “storing” and “holding” the data, which might make them liable), and I can see their desire for getting rid of them.

So I’d read it this way:

  • AOL figures that Google Groups is taking up the slack for them (for people who are not seeking binaries)
  • AOL figures that Easynews (and similar) will take up the slack for them (for people who are seeking binaries, especially dubiously-legal ones)
  • AOL figures that either way, any liability that follows Usenet around becomes not their problem
  • AOL gets to not ship and support a newsreader

And, from the AOL perspective, Usenet is flat (as seen above).

There’s growth on microsoft.public, more mixed on the rest of Usenet. Here’s daily post counts for all of Usenet, minus all posts to alt.binaries. It’s pretty much flat, possibly downward trending.

Doesn't mean that Usenet isn't interesting -- just that they are less likely to lose customers over not having it.

Update: Similar thoughts at Tim Jarret's

Maybe I'll pack him a nice shiny late-2004 vintage Treemap.

Posted by danyelf at 02:19 PM | Comments (0) | TrackBack

January 23, 2005

MSR footnotes

Thanks to the communities team at MSR for walking through this process with me. A shout out to Ben Sittler & Auros, who surely said useful things about this: I just don't remember what.

Chris asked whether our dataset is spam-filtered: No, it's not. We pull sources straight. (On the other hand, spam is usually pretty easy to detect: for one thing, people don't reply to it, and fairly often, the spammers show up dozens or hundreds of times in the same newsgroups.) Chris, incidentally, has been trying to do some research arguing that one man's spam is another man's ham. I'm pretty sure that he's just plain wrong on this, unless you really want to look at my blog to find people who will sell you catfood. Or lesbian bondage movies. Or generic cialis. (If Chris is right, then pretty much anything anyone knows about marketing -- ideas like "target audience" -- is just plain wrong.)
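
To make that parenthetical concrete, here's a rough sketch of the two signals: an author who posts many times to the same newsgroup and never gets a reply. The tuple layout, the function name, and the threshold of 24 posts are assumptions made up for illustration; this is not a filter that Netscan runs.

```python
from collections import Counter


def likely_spam_authors(posts, min_posts=24):
    """Flag (author, newsgroup) pairs that post a lot and never get replies.

    posts: iterable of (author, newsgroup, reply_count) tuples, one per post.
    min_posts: hypothetical cutoff standing in for "dozens of times".
    """
    totals = Counter()
    replies = Counter()
    for author, newsgroup, reply_count in posts:
        totals[(author, newsgroup)] += 1
        replies[(author, newsgroup)] += reply_count
    return {
        key
        for key, post_count in totals.items()
        if post_count >= min_posts and replies[key] == 0
    }
```

A busy author who draws replies is left alone, and so is someone who posts a handful of unanswered messages; only the combination of both signals gets flagged.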

Joshua asks about the spike in orphans in early 2001. That's actually a side-effect, I think, of the smoothing: there's a lot of jumping up and down in that time period. These were during the early days of Netscan, and our server was up and down, which means that our data was a mess. (Far more mysterious is the recent spike in orphans in microsoft.public. We really should be tracking those messages pretty well.)

Posted by danyelf at 07:20 PM | Comments (0) | TrackBack

Hunting Yenc: Tracing the Great Usenet Bump (Part IV)

Back in the first post, we found a large bump in Usenet traffic. On the second day, we showed that the bump was not due to an obvious flaw in our system--at least, it wasn't due to the same flaws we've run into before. On the third day, we traced the bump to alt.binaries, and from there found that it spread itself across a lot of groups. alt.binaries.dvd seemed to most dramatically have it.

Let's continue by tracing through and seeing individual authors.

I took a closer look at alt.binaries.dvd. The most frequent poster on alt.binaries.dvd is some guy calling himself yenc. yenc@power-post.org. Indeed, he has a great many names:

	yenc@power-post.org (Yenc-PP-A&A)
	yenc@power-post.org (anonymous@anonymous.com)
	"Builder" <Yenc@power-post.org>
	"daathal" <Yenc@power-post.org>
	"Dognorah" <Yenc@power-post.org>

and oodles of others …

Here's a daily count of yenc's posts.

A quick web search points out that yenc@power-post.org isn't just one person. yenc is the default name generated by Power-Post software for yEncoding. Which means that what we're seeing here is a whole lot of people, posting at software defaults.

(yEncoding? That's the sequel to UUEncoding, and is another way of breaking up binaries and posting them to newsgroups.)
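
For anyone who wants to repeat the "one address, many names" check on their own headers, here's a minimal sketch. It pulls the first thing that looks like an email address out of each From line and groups the leftover display names under it. The regular expression is a deliberate simplification of real address syntax, and the function name is made up for this example.

```python
import re
from collections import defaultdict

# Simplified pattern: a run of non-delimiter characters around an "@".
ADDRESS = re.compile(r'[^\s<>()"]+@[^\s<>()"]+')


def names_by_address(from_headers):
    """Map each lowercased address to the set of display names seen with it."""
    grouped = defaultdict(set)
    for header in from_headers:
        match = ADDRESS.search(header)
        if match is None:
            continue  # no recognizable address on this line
        address = match.group(0).lower()
        # Whatever is left once the address is cut out is treated as the name.
        leftover = header[:match.start()] + header[match.end():]
        grouped[address].add(leftover.strip(' <>()"'))
    return dict(grouped)
```

Lowercasing the address is what folds Yenc@power-post.org and yenc@power-post.org into a single bucket.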

Now take a quick look at that peak there. That's 300,000 daily messages[1], pretty close to the size of the spike we're trying to account for.

So what did this Yenc have to say?

One person hypothesized that this was a spike generated by a movie release -- maybe this is a few thousand copies of Return of the King? Another suggested that this was a surge of some more-illicit material.

Here's a random selection of post titles from 2004 posted by yenc@power-post.org (Yenc-PP-A&A)

	ManxTT2002- "manxTT2002.part005.rar" yEnc (142/161)
	#alt.binaries.cd.image.xbox @ efnet + 28484 [09/32] - "ins-fn2k4x.r05" yEnc (094/201)
	(DERWI) [Twins] - "twins.part083.rar" yEnc FTD: 195277 (195/201)
	isleofmanTTextras1.- "isleofmanTTextras.part092.rar" yEnc (025/161)
	isleofmanTTextras1.- "isleofmanTTextras.part102.rar" yEnc (081/122)
	#alt.binaries.svcd@Efnet #5317 Garfield "vcd-garfieldts1.r16" yEnc (25/36)
	(www.abstartrek.org) 04 of 32 - "TOS - 101 - The Man Trap.part04.rar" yEnc (14/24)
	(199013) Mooimakertje.part05.rar (17/27)
	isleofmanTTextras1.- "isleofmanTTextras.part076.rar" yEnc (027/161)
	(DERWI) [Twins] - "twins.part082.rar" yEnc FTD: 195277 (051/201)
	Karperfilmpjes "CarpseX aflevering 05.wmv" By VanManiac (007/390)
	(DERWI) [verzoek repost ANNE] - "anne.part123.rar" yEnc FTD: 205123 (130/201)
	§ #alt.binaries.svcd@Efnet \/ Spiderman 2 TS CD1 \/ #5422 "vcd-spiderman2ts1.vol165+40.PAR2" § (01/54)
	(R.S.V.P. #208020) [06/94] - "rs0932.part05.rar" yEnc (051/201)
	Hajni from mikesapartment.com | The hottest women I have ever seen! [03/25] - "hajni-03.mpg" yEnc (08/19)
	§ #alt.binaries.svcd@Efnet \/ \/ #5394 "vcd-garfieldts1.r22" § (33/40)
	Karperfilmpjes "HollandseKarpersessies- 01-Pannekoek.wmv" By VanManiac (128/342)

I can send you a larger set if you want, but this is pretty much representative. It's … stuff. Games and movies and songs and pirated Dutch films and all the other things that look like the binary Usenet today.

Which tells me that it isn't content that's driving this spike. This is not a sudden surge of interest in anything in particular. Nor is it a particular person: I'm pretty sure that no individual is rolling out 300K messages per day. Yenc-PP-A&A is, I'm pretty sure, another aggregated alias.

So what is it? We need to explain two things. The up curve, or why binary posting took off like mad in early 2004, and the down curve, or why binary posting suddenly started to drop, out through October 2004.

A few notes:
· As far as I can tell, the YENC release notes suggest that no new versions of Power-Post have come out since mid-2003. So it's probably not the excitement over a new software version.
· BitTorrent downloads have been rising steadily, but don't seem to have the sort of spike that would explain the drop…

Any other clever ideas?


1. Ok, full confession: looking back at the table, I'm no longer quite so sure what query our data guy used to get me this table. I need to double check my figures: I quietly suspect that this is weekly, not daily messages. I also quietly suspect that it's just Yenc-PP-A&A and not any of the other Yenc identities.

Posted by danyelf at 07:17 PM | Comments (3) | TrackBack

January 20, 2005

Tracing the Great Usenet Bump (Part III)

Over the last two days, we've looked at the number of messages across Usenet. We seem to have found an interesting phenomenon: a surge in messages that doesn't connect to an obvious technical failure on our part.

One obvious question is to figure out where those posts are hiding.

Let's see if we can trace this out a little further. Here's the curves for three different news hierarchies, microsoft.public, tw, and alt.binaries.

In each of them, the orphans are the smaller scale (and are usually something like 10% of the message scale).

These were pretty arbitrary choices, but there is a method to this madness. These are three of the biggest hierarchies in our records.

We look pretty carefully at microsoft.public, and we have better archives, so if the phenomenon is there, it'll be easy to pin down. alt.binaries often gets scapegoated as a home for odd behavior, because it's so very big and heavily trafficked with illegal material (CD rips, DVD copies, cracked computer games, and so on).

microsoft.public.* traffic
tw.* traffic

Now, I'm not sure what's going on with the bump in the orphans count in microsoft.public, but the post count isn't going nuts. Our feed from Taiwan isn't quite as reliable as I might like, but it's pretty stable too.

And, wow, alt.binaries has our bump. Peak to baseline, it looks to be somewhere around the 300,000 that we're interested in.

Now, a lesser research team might accept this and call it a day. "Yeah, that's those binary posters," they might say. "Not even Google tracks 'em -- they just want to swap pr0n and copies of Return of the King."

And while this is true, there is always more data to sift. Let's go on. Maybe this whole phenomenon is buried within a single newsgroup or two. (Then again, maybe it's not).

Let's try to be systematic, now. I got a table of the posts-per-day for every alt.binaries group with more than ten thousand posts. For each group, I compared a baseline period (May 2003 - Feb 2004) to the bump period (May 2004 - June 2004), and checked whether the average posts per day during the later period was more than twice the average posts per day of the earlier period.

Put another way: for each group, I looked for cases where the average number of daily messages from May 2003 through Feb 2004 was under half the average for May 2004 through June 2004. I also looked for groups that had a maximum day over 50,000 messages. Less than that, and we just aren't able to build up to this large a spike.
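
Here's a minimal sketch of those two checks, assuming each group's data is a dict mapping a calendar date to that day's post count. The function names are invented for this example, and I've kept the two tests separate because the description doesn't say whether a group had to pass both.

```python
from datetime import date

BASELINE = (date(2003, 5, 1), date(2004, 2, 29))  # May 2003 - Feb 2004
BUMP = (date(2004, 5, 1), date(2004, 6, 30))      # May 2004 - June 2004


def mean_posts(daily, period):
    """Average posts per day over the days we have data for in the period."""
    start, end = period
    counts = [n for day, n in daily.items() if start <= day <= end]
    return sum(counts) / len(counts) if counts else 0.0


def more_than_doubled(daily):
    """True if the baseline average is under half the bump-period average."""
    return mean_posts(daily, BASELINE) < mean_posts(daily, BUMP) / 2


def has_big_day(daily, threshold=50_000):
    """True if any single day saw more than `threshold` posts."""
    return max(daily.values(), default=0) > threshold
```

A group with no posts at all in the baseline period counts as doubled as soon as it has any traffic in the bump period.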

It's not hard to find echoes of the bump in a number of groups ...


but the most dramatic one that I'm finding seems to be in alt.binaries.dvd. Alt.binaries.dvd had nothing before January, 2003, but got up to 50,000 posts per day during our time period. Tomorrow, we'll look a little more into what happened.


Posted by danyelf at 02:15 PM | Comments (0) | TrackBack

Isolating Technical Issues (Part II)

Yesterday, we examined a chart of Usenet traffic, and saw some mysterious bumps. We ended our study with the question of which of those bumps were due to our own reliability, and which we could attribute to outside phenomena.

(Today's post is a little bit technical, and can probably be skipped without too much guilt. I'll catch you up tomorrow.)

Now, how to measure your own reliability is a little tough. We could, I guess, compare our feeds to yet another feed coming in from somewhere, or double check our incoming messages against Google. But each of those is expensive, and requires even more processing -- and with our statistics engine, we've got a lot of stuff churning away already.

We've chosen a different take. We look instead at internal measures: things that just don't add up in our own database.

In particular, we can look for messages where we see someone answering a message … but we don't have the original. That suggests we failed to get the original, leaving behind an "orphan," so called because it is a message without its parent.

There will always be a background level of these "orphans". The first message may have been cancelled, for example, without us seeing either the original or the cancel. But we can make a good estimate.

Orphans turn out to be a pretty good measure of our reliability. On days when we're scooping up most of Usenet, we don't have many orphans. When we are missing things, for one reason or another, our orphan count shoots upward. In fact, you can see that some of the downward spikes -- like the really big one in February, 2003 -- are matched by a surge in orphans. That's a pretty good sign that we lost stuff.
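
The orphan idea is simple enough to sketch in a few lines. This is an illustration only, not our database query: it assumes each message has already been reduced to a (message_id, parent_id) pair, with parent_id set to None for a message that starts a thread.

```python
def count_orphans(messages):
    """Count replies whose parent message is missing from the collection.

    messages: iterable of (message_id, parent_id) pairs; parent_id is None
    for a message that is not a reply.
    """
    messages = list(messages)
    seen_ids = {message_id for message_id, _ in messages}
    return sum(
        1
        for _, parent_id in messages
        if parent_id is not None and parent_id not in seen_ids
    )
```

Running it over one day's messages at a time would give a per-day count, though a reply often arrives on a later day than its parent, so the set of seen IDs really wants to cover a longer window than the messages being checked.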

Here's a count of our orphans per day. These orphans on the chart are counted at one-tenth the scale of the posts: the high peak of the orphan trend line, in March of 2001, is about 60,000 a day.

Where orphans are flat, we're doing well. That's when we give our database guys a raise. (It does make sense that orphans have dropped in the last few years: we're talking to more news servers, now, and so we have a better flow of messages; our collection mechanism has, by and large, improved, so we're getting higher-quality data).

It looks like most of the fluctuations can be accounted for by the orphan count. Got a dip? We must have been missing stuff that day.

But take a good look at that curve. There's one thing that stands out in the recent past to me -- total volume went up during the summer. Way up. Up by a good 50% or so: at its peak, that's around a million messages a day, up from around 700,000 messages a few months before that.

This is peculiar. The orphan count is flat, so it's not some sort of strange server bump. This is simply more messages. Lots more messages.

Tomorrow, we'll take a look into several different Usenet sub-hierarchies and try to drill down into this spike.

Posted by danyelf at 01:12 PM | Comments (0) | TrackBack

A Day at MSR: Chasing the Great Usenet Bump (Part I)

I love my job. One of the reasons it's quite so cool is because I get to try to solve mysteries in between my work with Total World Domination & Crushing the Helpless.

My team has been working for quite a few years on Netscan. We have a fairly impressive set of statistics and measurements over the last five or six years, which I'm quite pleased with. And which turn out to be a lot of fun. There's a lot of different ways to cut and splice the data, and we're just beginning to scratch its surface. (Incidentally, large portions of these data are available online, at the Netscan website; further, universities can request 500-GB-sized chunks of our data for their own research.)

Let me walk you through one of our latest mysteries. Perhaps you'll find it interesting too.

I should note that this project isn't quite typical of what we do: there's a lot of projects going on. Many of them are ethnographic, or design-oriented, or more traditionally sociological. But in general, we are trying to understand the social phenomena around online systems -- and if there's one thing going for Usenet, it's that it's a seething mass of social phenomena.

This mystery, like any good suspense, is presented as a multi-part post. We'll do this one in four parts; I'll put up one part every day, through Sunday. None of them are painfully technical.

Part I: Counting Messages

At some point, I wanted to know how many messages we had. This chart shows the number of Netscan messages (click to zoom in). The X axis is the date: January 2000 (when Netscan's collection started) through October 2004. The Y axis shows total posts per day; every line shows 200,000 posts, so the high point is just a little over a million posts.

The darker orange line is a 30 day running average: there's enough day-to-day variation that I need to run a trend-line average, or the data becomes hard to read: the noise begins to stifle the signal.
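
If you'd like to smooth a series of your own the same way, here's a minimal sketch of a 30-day running average. It assumes a trailing window (each day is averaged with the days before it), which may not be exactly how the chart's trend line was computed.

```python
from collections import deque


def running_average(daily_counts, window=30):
    """Trailing mean over the last `window` days (fewer at the very start)."""
    recent = deque()
    total = 0
    averages = []
    for count in daily_counts:
        recent.append(count)
        total += count
        if len(recent) > window:
            total -= recent.popleft()  # drop the day that fell out of the window
        averages.append(total / len(recent))
    return averages
```

For the first 29 days the average is taken over however many days exist so far, so the smoothed line starts noisy and settles down.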

Again, this chart shows unique messages per day. You'll note a few features of this graph: it seems to be generally rising, linearly, over time. There's a few periods of particular interest: a particularly-noisy period in late 2000; a large dip in January, 2003; a smaller one in June, 2003; and a bit of a climb in April through June of 2004.

Now, as a person who hangs out with sociologists, I like to be able to explain behavior. What happened in 2000? In 2003? In mid-2004? Is it depression over a super-bowl result, or international terrorism?

Or is it just a technical glitch on our side?

Now, you should note that this chart isn't "all messages in Usenet"; this is "all the messages that our server saw." No one quite has the same view of all of Usenet, due to its fairly anarchic design. There are various good descriptions of what happens, but here's an approximation:

When I post a message to, say, alt.candy-lovers.drgoodbar, it is sent to my local server. This server, periodically, communicates with a set of other servers that it knows. Those servers feed me their latest posts, and my server feeds them my latest posts. So this post is offered to a few other servers. Maybe one of them doesn't like posts sent anywhere in the alt hierarchy, and so it drops it. Perhaps a different server has a glitch, and doesn't get it. A third server is an incoming-only connection, and so doesn't want my updates. But my post's immortality is assured: a fourth server picks up the message, a few minutes or hours later, and propagates it. That server passes it on to some others, and so on.
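
That story can be sketched as a toy flood-fill over a graph of servers. Everything here is an assumption made for illustration: the server names are invented, a server that drops or misses the post is just marked as not accepting it, and real news servers negotiate over NNTP rather than consulting a dictionary.

```python
from collections import deque


def propagate(origin, peers, accepts):
    """Return the set of servers that end up holding a post.

    peers:   dict mapping a server to the servers it offers its posts to
    accepts: dict mapping a server to False if it refuses or misses this
             post; servers not listed are assumed to take it
    """
    holding = {origin}
    queue = deque([origin])
    while queue:
        server = queue.popleft()
        for neighbor in peers.get(server, []):
            if neighbor in holding or not accepts.get(neighbor, True):
                continue
            holding.add(neighbor)
            queue.append(neighbor)  # this server now offers the post onward
    return holding


# The incoming-only server never appears in my server's peer list at all.
peers = {
    "local": ["no_alt", "glitchy", "relay"],
    "no_alt": ["far_c"],
    "relay": ["far_a", "far_b"],
}
accepts = {"no_alt": False, "glitchy": False}
reached = propagate("local", peers, accepts)
```

In this toy network the post reaches relay, far_a and far_b, but never far_c, because the only path to it runs through the server that drops alt posts. That's the sense in which no two servers see quite the same Usenet.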

So messages come in fits and starts, in great surges as they queue up somewhere and then flow more smoothly. Our own collector sometimes has problems and loses messages. So maybe this is just a problem on our side.

Tomorrow: Is the variation just a glitch?

Posted by danyelf at 12:15 PM | Comments (1) | TrackBack

January 12, 2005

Travel Plans

For those who wait, with bated breath, to know where I am and where I'm going, here's some future highlights.

· This weekend, January 15-16: the Triumphant Berkeley Return Trip. I have much of Saturday to wander about on my own; then I head off to Chesh & Sarah's for a gathering. Sunday morning is vegetarian buddhist thai brunch! Contact me for more information as danyelf -at - acm.org

· Sometime early-mid February: University of Maryland.

· February 16-20, plus a few additional days on one side or the other: Redondo Beach, CA for the SUNBELT conference, with planned side-trips to Irvine, San Diego, and LA

Posted by danyelf at 05:42 PM | Comments (0) | TrackBack

Netscan talks RSS

So I'm sorry to have been out of touch, but projects here at MSR are blasting ahead at top speed. It's pretty exciting, really. One of my favorite innovations is that we're developing new ways to look at our Usenet overview system, Netscan.

Netscan has been collecting message headers for four or five years now, and tracks lots of statistics per-author, per-newsgroup, or in several other combinations.

You can now get an RSS feed for virtually every page of netscan -- get the newest statistics, datasets, and percentages for your favorite author, newsgroup, or thread delivered right to your door!

http://netscan.research.microsoft.com/Tech has lots of "RSS" buttons on it that will follow your favorite (technical) newsgroup.

Technical, incidentally, because Netscan scrapes those daily. Nontechnical newsgroups are still batched up onto a hard disk and only updated once a month.

It's kind of fun.

Posted by danyelf at 05:40 PM | Comments (0) | TrackBack

Calling them on "Excellent" Customer Service

I'm sure that you've also had lots of people on the phone announce that they wish to provide you with "excellent" customer service. Before telling you that their computer is dead, they can't call you back, and they refuse to fix your problem.

I've decided to take the word "excellent" as a promise.

Here's the transcript...

So I'm ordering the various stuff for my new place. Land line. DSL over land line. ISP for DSL over land line. And so on. Qwest helpfully offers me one-stop shopping -- as I work my way through my menus, I order two of them. (The third, the ISP, is easy to call.)

Then I get an email the next day. It starts off by telling me that their computer glitched, and missed my order for DSL. Could I please let them know whether I want it, and what ISP I will use? ("We are committed to excellent customer service.")

- Sure. I give them the information.

- Ok. You will need to wait until your line is installed to order DSL. ("We are committed to excellent customer service.")

- Huh? But I already did order DSL! And you asked me to confirm it!

- Well, we're afraid--the next email says--we can't process that request, and we'll need you to cancel your current order, then call this number and restart. ("We are committed to excellent customer service.")

- Ok. In my world, I reply tartly, "excellent" customer service involves processing a customer's actual request, as opposed to asking them to cancel their order in order to restart their order. It's just one of those things.

Interestingly, I did get a response. And they decided to move forward, and were able to add DSL to my previous order.

Posted by danyelf at 05:22 PM | Comments (0) | TrackBack