March 2, 2010

Real-Time is Prime-Time for Scams

I had a brief panic earlier this week. For a few hours I thought the Internet was on the verge of collapse. The sudden concern was brought about by trying to find season 1 episode 1 of "Glee" online somewhere.

My first thought was to go to Hulu, as this has become my new authority for legal online television. It turns out that only the last 5 episodes of Glee are available on Hulu. A fascinating conversation with a Fox executive taught me that this is the result of a legacy method of licensing television content that has been shoehorned into the Internet TV age. Shows used to follow a well-ordered path from brand-new to syndication, which doesn't match well with the public's expectation of finding every show archived on the Internet somewhere. That, however, is tangential to my panic.

Unsuccessful on Hulu, I started doing a general Google search and turned up many, many, many pages that purported to be Glee Season 1 Episode 1 but were really just gateway videos to ... you guessed it ... porn.

I immediately related this experience to the aftermath of the Haiti and now Chile earthquakes. Immediately following both disasters, Internet sites sprang up that fraudulently offered to take donations on behalf of victims, or that redirected you to their own issue or product of primary concern, which was rarely related to the disaster.

A final example of this effect comes from Twitter. In the Twitterverse, whenever a meme is created, usually with a hashtag, it is not long before the griefers and scammers show up. They post their VIAGRA ad with the hottest Twitter meme hashtag and destroy the conversation for everyone else.

My panic peaked at this point. How has the Internet survived so long in the face of this stuff? Has it just grown to the point where this is now economically feasible? Are we in a new era of the web which looks like the spam era of email? As part of the work that I've been doing on Information Retrieval, I was able to consider how powerful the signal from PageRank must be to overcome this, and to have been overcoming it for so long.

PageRank is a technique in which links from one page to another confer authority on the destination page. The paths that people can take through the Internet by following links therefore reveal a great deal about where the good content is and where the bad content is. The links represent the efforts of human curation on the Internet. Every link that you put on your web page helps PageRank sift the garbage from the gold.
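To make the idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank in Python. It is not Google's actual implementation. The page names, the tiny example web, the damping value of 0.85 and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    # Collect every page, including ones that are only ever linked to.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with authority spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its authority equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its authority to everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nobody links to the spam page.
web = {
    "blog": ["fan_wiki", "official_site"],
    "fan_wiki": ["official_site"],
    "official_site": ["fan_wiki"],
    "spam_page": ["official_site"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web the spam page links out but nobody links back to it, so it never rises above the baseline share, while the pages that other people chose to link to accumulate authority.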

However, this doesn't work with real-time information because PageRank is pretty slow. It takes time for people to add those links. It takes time to figure out the shape of the Internet and it takes time to report the results back to Internet searchers. Apparently it takes more than half a season of Glee, because I can only find garbage today.

So PageRank works well for archived data, but what can work for real-time data? Maybe social networks can. If you can leverage social networks to immediately vote on the content being created by the real-time web, then perhaps the social network can replace PageRank for ephemeral data. All that remains is a way of figuring out what people think is good or bad, in the same way that looking at a link tells you whether people think content is good or bad.

So what started out as a panic that the Internet was about to collapse really gave me a new appreciation for PageRank. In some ways the link structure of the Internet is the social network that we have been leveraging all along. My panic also made me realize that we need a new signal for real-time ephemeral data - like news and tweets - to sift the good from the bad. My panic has subsided now that I know the shape of the problem a little better. I think the problem is large, but it would be cool to solve it.

