February 26, 2010

Twitter During Emergencies

people looking at the fargo flood
Photo courtesy of DahKohlmeyer

A few months ago on the University of California, Irvine campus we had an incident in which a student wearing camouflage was seen walking onto campus from a secondary road coming from a somewhat remote area (grain of salt: we are in Irvine, of course). Given the stories about violence on campus that are consistently told in the mainstream media, the people who saw him became alarmed. The fear was that he intended to massacre students in a Virginia Tech-style rampage.

What followed was an explosion of real-time information sharing, involving an extremely heterogeneous mix of media. One system that was involved is called ZotAlert, a terrific text-message-based system used by the UCI Police to alert the entire university community to emerging dangers. It has been used to warn about violent incidents in and around campus, as well as burglaries and other crimes, within a very short window after they are reported.

Of course, individual text messaging, Facebook, Twitter, email, and a variety of less well-known social media were involved in the explosion of information. But so were face-to-face conversations and standard phone calls.

I remember that the first information I received was from my wife, who was talking to a friend, who had received a text message from another friend relaying information from a mobile phone call with her daughter! The relayed message was that they were in lockdown at the pool and there was a guy running around with a gun shooting people.

Then I read a Facebook report that the guy had a Nerf gun. I synthesized and retweeted those two bits of information and was emailed by a reporter from the O.C. Register who wanted to confirm my update, which he had seen on Facebook. I, of course, couldn't confirm it, as I was just passing along information that I had heard.

Later, a student wearing camouflage was arrested in the student union, which was also in lockdown at the time. It eventually turned out he was the wrong guy, one who had simply made an unfortunate fashion choice that morning.

Hours later the information flood settled down and it was revealed that the guy with the gun was a student with a paintball gun who had been shooting paintballs in a field and was heading home. He was mostly oblivious to the craziness going on around him because he wasn't online, and he wasn't apprehended in any reasonable amount of time. He eventually apologized profusely for being dumb enough to carry a paintball gun onto a college campus and be seen doing it.

I had a couple of observations about this event. The first was that the ZotAlert system was very authoritative and, not surprisingly, slower to send out the facts. This was, hopefully, because they were trying to make sure they actually had facts.

Another observation was that social media were extremely effective at getting out the word that something was going on. The subject and accuracy of that something varied to a great degree. For a situation as potentially dangerous as this could have been, I think that counts as a success. Even if you can't communicate the right information, you would like everyone to be on guard and in the right frame of mind to respond appropriately when they get firsthand information.

Lastly, it was interesting to see how fractured this media space is. Every social media tool I was involved with was lighting up; no single tool had a monopoly on the communication. Each was used according to its individual strengths and to reach particular people. It appears that our community has a pretty good sense of which tools different people pay attention to. So when I want to reach my wife I send a text message, but if I want to reach my department I send an email.

In the paper "Chatter on the Red: What Hazards Threat Reveals about the Social Life of Microblogged Information" by Starbird, Palen, Hughes and Vieweg and published in CSCW 2010, the authors look formally at some of these effects.

Their data source was tweets sent out around the time of the 2009 flood in the Red River Valley on the U.S./Canada border. This event lasted for several months, so the nature of the information was much less about being individually safe for the next few hours and much more about being safe as a community for weeks.

They commented on the fractured and heterogeneous nature of social media:

"Collection and analysis of large data sets generated from CMC [Computer Mediated Communication] during newsworthy events first reveals an utterly unsurprising observation: that publicly available CMC is heterogeneous and unwieldy. ... Our tweet-by-tweet analysis of the Local- Individual Users Streams indicates that most are broadcasting autobiographical information in narrative form, though many contain elements of commentary and the sharing of higher-level information as well. Even as some Twitterers shift focus to the flood, most continue tweeting within their established Twitter persona."

And although they made somewhat contradictory comments about what Twitter is, this quote reinforced my view that, as a technology, Twitter is an infrastructure for low-bandwidth multicast:

" Twitter, a new incarnation of computer mediated chat, is a platform without formal curation mechanisms for the massive amount of information generated by its (burgeoning) user base. There is no rating or recommendation system support—key features of commerce sites like Amazon and information aggregators like Digg. Nor is there a complex system of validation that, for example, Wikipedia has implemented. Also unlike Wikipedia, content passed through Twitter is short-lived, and therefore cannot be discussed, verified and edited. While most social media have “places” for interaction, interaction in Twitter occurs in and on the data itself, through its distribution, manipulation, and redistribution. Without regular retransmission, communications quickly get lost in the noise and eventually die off."

Another difference between the UCI incident and this flood was that the time scale allowed people to do more self-organizing and to create more digital tools to help manage the information flow than would be possible over the course of hours.

The authors reinforced a belief about geolocated data that I previously blogged about: there is nothing about Twitter that should make you think localized data is really more local. Twitter is a means to broadcast and subscribe, and what you do on top of that is communicate. Like all communication, it is human-centered and not easily parsable by a machine. So the researchers spent a lot of effort curating a set of tweets that were related to the flooding. There was one caveat, which I'll mention below.

One interesting fact that emerged from their study was that only about 10% of the tweets in the flood dataset were original. That 10% was split between autobiographical narrative and the introduction of new knowledge. This is the same pattern of use seen in "Is It Really About Me? Message Content in Social Awareness Streams" between Meformers and Informers.

A curious note about this dataset, though, was that the localized data was three times as likely to be original. This is a reasonable expectation given the dataset, but it speaks to a place in which local and localized data do merge.

The tweets that weren't original were sometimes syntheses of other tweets: people editing, curating, and combining data from others. Another group of people posted educational tweets related to the unfolding events.

A final interesting behavior that was observed was the sensor-stream-to-Twitter-account phenomenon, in which some talented folks connected a sensor measuring flood levels to a Twitter account that periodically tweeted its readings. This is something I would like to explore in much greater depth.
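Just to make that idea concrete, here is a minimal sketch of what such a sensor-to-Twitter bridge might look like. This is not the system the folks in the paper built: `read_water_level()` is a hypothetical stand-in for a sensor driver, the OAuth-authenticated `session` is assumed to exist, and the endpoint path is only illustrative.

```python
import time

# Hypothetical stand-in for a driver that reads the river gauge.
def read_water_level():
    raise NotImplementedError("replace with your sensor's driver")

# Illustrative endpoint; Twitter's API paths and auth have changed over time.
TWEET_URL = "https://api.twitter.com/1.1/statuses/update.json"

def tweet_readings(session, station_id, interval_seconds=900):
    """Post the current reading every 15 minutes (or whatever interval)."""
    while True:
        level_ft = read_water_level()
        status = "Station %s: river level %.2f ft at %s UTC" % (
            station_id, level_ft, time.strftime("%H:%M", time.gmtime()))
        # `session` is assumed to be an OAuth-authenticated requests session
        # (e.g., requests_oauthlib.OAuth1Session); error handling omitted.
        session.post(TWEET_URL, data={"status": status})
        time.sleep(interval_seconds)
```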

Posted by djp3 at 9:48 AM | Comments (0) | TrackBack (0)

February 25, 2010

Using social networks to guide recommendation systems

punk bffs
Photo courtesy of Walt Jabsco

The problem of trying to incorporate social networks into collaborative filtering recommendations seems to be a hot research topic right now. The basic setup is that one has a dataset consisting of many ratings, each given by a user to a thing like a movie or product and taking a value from 0 to 1. What we would like to do is predict how much a given user will like something they have never rated before.

Collaborative filtering approaches vary along two axes: user versus item similarity, and memory-based versus model-based.

The first axis describes what kind of similarity you are leveraging in order to make your recommendations. User similarity asserts that people who have rated things similarly to you in the past are likely to rate a new thing the same way you will. Item similarity asserts that you are likely to rate a new thing the same way you have rated similar things in the past. The former requires a way of determining whether users are similar, the latter a way of determining whether items are similar. In either case you can use the ratings themselves as the basis for similarity, or you can use some external knowledge to judge it.

The second axis describes how you store the information on which you base your decisions. In a memory-based approach, you just keep all of your ratings around, and when it comes time to predict a rating for an unseen item, you go to your data and do your analysis. In a model-based approach, every time a new rating is observed, the model for that user is updated; when it is time to rate a new unseen item, the model is consulted for a rating.
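To make the memory-based, user-similarity corner of that space concrete, here is a small sketch (my own toy example, not taken from either paper below): cosine similarity between users' rating vectors, then a similarity-weighted average of the neighbors' ratings.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are items, np.nan = unrated.
# Ratings are on the 0-to-1 scale described above.
R = np.array([
    [0.9, 0.8, np.nan, 0.2],
    [1.0, np.nan, 0.7, 0.1],
    [0.2, 0.1, 0.9, np.nan],
])

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return 0.0
    a, b = u[both], v[both]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict(R, user, item):
    """Memory-based, user-similarity prediction for an unseen item."""
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        sims.append(cosine(R[user], R[other]))
        ratings.append(R[other, item])
    if not sims or sum(sims) == 0:
        return float(np.nanmean(R[:, item]))  # fall back to the item average
    return float(np.dot(sims, ratings) / sum(sims))

print(predict(R, user=0, item=2))  # how much might user 0 like item 2?
```

A model-based approach would instead fit something like a matrix factorization offline and only consult the learned factors at query time.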

Two recent papers that explore these issues are "On Social Networks and Collaborative Recommendation" by Konstas, Stathopoulos, and Jose and "Learning to Recommend with Social Trust Ensemble" by Ma, King, and Lyu, both from SIGIR 2009.

The first paper, "On Social..." undertook the task of trying to create a list of songs that a user would like based on their previous history of listening to songs in Last.fm, who their friends are (and their history), and then a collection of tags which applied to users and music. There approach was an attempt to merge both user and item similarity with a social network in a memory based approach. Because they ultimately created a play list sorted by most-likely-to-be-liked, they had a hard time comparing their results to traditional collaborative filtering systems which produce a hypothetical how-much-do-you-like-a-song rating for every song given a user.

Their approach was very interesting to me because they basically created a graph in which users, songs, and tags were nodes and the relations between them were represented as weighted edges. Then they ran the PageRank algorithm over the personalized graph, pulled out the songs that were most highly ranked, and that was the recommendation. The weights on the edges of the graph required some black-magic tuning.
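Roughly in that spirit (and definitely not their exact formulation), here is a sketch using networkx's personalized PageRank over a toy user/song/tag graph; the node names and edge weights are made up.

```python
import networkx as nx

# Toy graph: users, songs, and tags as nodes; weighted edges stand in for
# listening counts, friendships, and taggings (all weights are invented).
G = nx.Graph()
G.add_edge("user:alice", "song:blue_monday", weight=5)    # listens
G.add_edge("user:alice", "user:bob", weight=1)            # friendship
G.add_edge("user:bob", "song:atmosphere", weight=3)
G.add_edge("song:atmosphere", "tag:post-punk", weight=1)  # tagging
G.add_edge("song:blue_monday", "tag:synthpop", weight=1)
G.add_edge("user:alice", "tag:synthpop", weight=2)

# Bias the random walk toward the target user, then rank the song nodes.
personalization = {n: (1.0 if n == "user:alice" else 0.0) for n in G}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization,
                     weight="weight")
playlist = sorted((n for n in G if n.startswith("song:")),
                  key=lambda n: scores[n], reverse=True)
print(playlist)
```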

It appears that any memory-based system that requires significant computation, like the previous one, suffers from a real-time response challenge. If you are creating a system like Pandora, it's not really a problem because you have quite a bit of time in which to pick the next song: about as long as the current song takes to play.

But if your system is more of a query-response system where you ask "Will I like song X?" then you really only have milliseconds to get an answer. This suggests to me that a model-based approach is nearly required in fast-response systems.

The second paper, "Learning to Recommend..." was similar in spirit but very different in execution. The goal of the authors of this paper was to create a recommendation system of products based on the reviews in Epinions.com. The key feature that the authors wanted to include was a social-recommendation component and the basic assumption they were exploring: What you like is based on a combination of your own tastes and the tastes of your social network. When it comes to epinions this work shows that to be true in a 40/60 split respectively.

So they cast the problem of collaborative filtering as a graphical model and used results that showed how the matrix manipulations associated with collaborative filtering can be solved as just such a model. Then they showed how a social graph can also be cast as a graphical model. The first graphical model says how likely you are to like something based on your previous history of ratings, and the second says how likely your social network is to like something. They then combined the two graphical models and derived an optimization technique for finding a locally optimal solution to the original matrix problem. The result was a model that answered a query on an item quickly and did better than simply looking at a user's tastes by themselves or a social network's tastes by themselves.
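The real model learns latent factors for users and items jointly, but the intuition of the ensemble can be sketched as a simple blend of two point predictions. This is my own simplification: `alpha=0.4` mirrors the rough 40/60 split mentioned above, and the numbers are invented.

```python
def blended_rating(own_prediction, trusted_predictions, alpha=0.4):
    """Blend a user's own predicted rating with the average prediction
    of the people they trust; alpha weights the user's own taste."""
    if not trusted_predictions:
        return own_prediction
    social = sum(trusted_predictions) / len(trusted_predictions)
    return alpha * own_prediction + (1 - alpha) * social

# My own model says 0.8; three trusted reviewers' models say 0.5, 0.6, 0.7.
print(blended_rating(0.8, [0.5, 0.6, 0.7]))  # -> 0.68
```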

Posted by djp3 at 10:14 AM | Comments (0) | TrackBack (0)

February 24, 2010

Twitter is a Low-Bandwidth Multicast Internet

lots of water pipes
Photo courtesy of hockadilly

In the paper "Is it Really About Me? Message Content in Social Awareness Streams" by Mor Naaman, Jeffrey Boase, and Chih-Hui Lai and published in CSCW 2010, the authors look at Twitter messages and attempt characterize what is going on in a representative sample of Twitter users' communication.

Personally, I believe that Twitter is fundamentally an infrastructure play. I think that "Twitter" is going to become a series of extremely well-engineered pipes that enable people to post information to, and receive information from, an infinite number of streams. It is basically recreating the Internet as something based on multicast rather than on one-to-one connections. Or, seen another way, it is becoming a new radio spectrum for the Internet: tune in or tune out as you will. This is phenomenally valuable. It is doing for ephemeral information what the Internet did for archival information.

Most analyses, including the one cited above, however, treat Twitter as if it were a consumer-grade tool. In my mind, that is like saying you need to study the water pipes in your neighborhood to see how people are using them. Basically, people use water pipes to get water; there isn't much more to say about it. After that, you are studying what people do with water, not the infrastructure that delivers it. At this point, of course, we've never seen this type of water pipe, so it still makes sense to look at both the pipes and the content.

This study does both, and it doesn't do so naively. It talks about Twitter as one of many "social awareness streams" through which people communicate with each other, and then the authors go on to study the content of those streams.

What they found is fascinating. For example, in support of my claim that Twitter is really an infrastructure for ephemeral data, they found that in a random sample of 3379 messages spread across 350 users who were active on Twitter and used it for non-commercial purposes, over 196 different applications were used to put information into Twitter. That blew my mind. I know of about 10 applications that post to Twitter, so 196 actively used applications amazed me.

Another fascinating fact, which has also been reported by Pew researchers, is that 25% of the posts were from mobile platforms. I think this speaks more to the growth of mobile networks and smartphones than it does to real-time data generation. I suspect that, in the same way that phone calls are becoming more and more dominated by mobile connections, Twitter will eventually be almost entirely mobile. That's not the nature of Twitter as much as it's the nature of the future network.

Using open coding methods, the authors took the 3379 messages, developed a categorization for them, and then had independent coders assign categories to each message for analysis.

The result was that the largest single category of messages was "Me Now" messages: messages that described what people were feeling or doing at that moment. Messages were categorized this way 41% of the time, and the frequency went up to 51% when considering just mobile messages.

This led to an attempt to categorize people into two personas: "Informers," who pass links along, and "Meformers," who pass their personal context along. The authors left as open work the question of which structural properties of the social graph characterize these two groups.

So although I believe that Twitter is primarily an infrastructure for low-bandwidth multicast, it is still new enough that studies like this one, which characterize the emergent ways people are using it, are valuable and informative. This one confirmed some of my intuitions and helped put some numbers to the trends.

Posted by djp3 at 9:26 AM | Comments (4) | TrackBack (0)

February 23, 2010

Local data vs Localized data: When are they the same?

compass rose
Photo courtesy of future15pic

I've been thinking about geolocated status messages a bit recently and have just started to look into studies that have been conducted about such data.

This started after I had seen some pretty cool visualizations of geocoded Facebook data. The visualizations showed a globe, and emanating from it were icons corresponding to events happening around the world (on Facebook) that were also associated with a latitude and longitude. It was amazing to see the activity visually. As you watch, you can see people waking up with the sun and making friends across the globe.

After a few minutes though I began to wonder if any of this mattered. It was certainly cool to see something like this, because basically it had never been seen before, but I began to wonder if the location from which the data was generated really mattered.

So I did a little follow-on experiment. I looked into the Twitter API and ran an experiment over a week. For one day I looked at all the geotagged tweets coming from the UCI campus (generously defined). Then, over the course of 6 days, I looked at all the tweets coming from the U.S. Capitol and about 4 blocks in all directions.
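For the curious, the queries I ran were roughly of this shape. This is only a sketch against the search API's `geocode` parameter: the endpoint path, authentication requirements, and coordinates are all approximate, and the API has changed since then.

```python
import requests

# Illustrative endpoint; at the time this was the public search API.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def tweets_near(lat, lon, radius="1mi", session=requests):
    """Fetch recent tweets geotagged near a point (lat, lon)."""
    params = {
        "q": "",  # some API versions require a non-empty query term
        "geocode": "%f,%f,%s" % (lat, lon, radius),
        "count": 100,
    }
    resp = session.get(SEARCH_URL, params=params)
    resp.raise_for_status()
    return resp.json().get("statuses", [])

# Roughly the UCI campus, "generously defined" (coordinates approximate).
for tweet in tweets_near(33.645, -117.843, radius="2mi"):
    print(tweet["user"]["screen_name"], tweet["text"])
```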

I was very surprised to see that only 2 tweets emanated from UCI's campus, and a similar number came from D.C. So, roughly 2 tweets per day for a large geolocated area. The tweets were not very local either. In UCI's case, one was a tweet meme about Toyota's recent recall problems and the other was a Foursquare check-in. In D.C.'s case the results were similar.

This was a surprise to me because I was led to believe (by the media, I suppose) that there is a flood of geotagged tweets arising around us. If only you could tap into this stream you would be amazed at the richness of information around you. That turned out to just not be true at all.

Now, I've begun to think of geotagged data more like email. Why would you expect that the location where you sent an email from was correlated with the content of the email? Certainly some of it is. And certainly some of the content is being made because of the location. But it isn't like the content is all about the location. As if every email that I sent from my office was some description of my office.

I believe the same pertains to Twitter, Facebook etc. Generally speaking, there isn't any reason to think that geotagged status updates are related to the content any more than you would expect email to be related to the location from which it is sent.

My attention to the issue has caused me to read two papers on it recently. The first is "Where We Twitter" by two students in my department, Sam Kaufman and Judy Chen. It isn't a huge contribution, but it makes the point that there are some use cases for good local data, and it draws out the distinction I just described above by naming two types of information: "localized data" and "local data". I was mostly complaining about localized data above; that is data which has a place where it was generated but which is not necessarily about that place. This is in contrast to "local data", which is about a location but wasn't necessarily authored in that location. Geotagged tweets conflate the two types of information.

The second paper was more rigorous and was published in CSCW 2010: "On the 'Localness' of User-Generated Content". In it, the authors, Brent Hecht and Darren Gergle, study how much local data is generated near the location it describes. The high-order takeaway was that Flickr showed much more alignment between local data and the location where it was created. Wikipedia, in contrast, does not tend to have nearly as much local data generated near the location it describes. A simple but interesting explanation is that the easiest way to geotag Flickr photos is to take a photo with a camera that automatically tags it with a location, so entering local data into Flickr requires you to actually be at the location.

Posted by djp3 at 10:02 AM | Comments (1) | TrackBack (0)