Directions in Search over Social Media

In his keynote at the Search in Social Media workshop at CIKM, Andrew Tomkins suggested that there is plenty of room for academic IR research progress in social media.  I happen to agree.

Community generated content has been all the rage for a few years: blogs, Wikipedia, online forums, twitter, Yahoo! Answers, and the list goes on. Many of these generate a large volume of archived data — some in the form of more or less polished documents, like a blog post or Wikipedia article; others, like twitter, are snippets of an often one-sided conversation and broadcast messages.

From the IR researcher’s perspective, is it worth studying these artifacts of “social media”? Is there something that distinguishes these from other document collections? If so, how can we leverage that distinction in our retrieval models? This post aims to answer a couple of these questions and hopefully bring up a few more.

First and foremost, we need to identify whether there is value in providing access to artifacts of social media.  Some, like twitter, seem to be mostly ephemeral, only (generally) interesting in the moment and quickly fading from view.  Even the twitter search engine advertises: “See what’s happening — right now” and the results (as far as I can tell) are only ranked chronologically.  

Many other types of social media, some existing long before Web 2.0 was born, can be real treasure troves of information. There exists an online forum, public mailing list, newsgroup, or message board for virtually every special interest group under the sun, from gardening, to home-brewing, to Apple computers. These are often heavily trafficked, populated with real subject matter experts, and host a rich information exchange. I would argue that the content created through these social media outlets presents enormous value to searchers, and information retrieval research has a lot to contribute in this corner of social media.

What makes these document collections different from what has been previously studied? Can we just treat them the same as web pages? Or do they need special consideration?

In many of these collections, the unit of retrieval (what we consider a document) is not fixed, but rather dependent on the task. Consider online forums, often organized into topical sub-forums, which in turn are organized into conversation threads of individual posts. Some information needs may only require a single post as a result, some require the context of the full conversation thread, and others may need to retrieve a pertinent sub-forum.

These collections often offer another orthogonal axis of retrieval — the author. In highly trafficked message boards and mailing lists, tens or hundreds of thousands of users with varying levels of expertise contribute to the conversation. One may wish to find subject matter experts to address a question to, or favor message threads with contributions from those more likely to know the answer.
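
To make the two axes concrete, here is a minimal sketch of my own, not an established retrieval model. It assumes every post already carries a query-dependent relevance score, and it ranks any larger unit (thread, sub-forum, or author) by its best-scoring post. The `Post` fields, the `rank_units` helper, and the max-score aggregation are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # All fields are hypothetical; a real forum collection will differ.
    post_id: str
    thread_id: str
    subforum_id: str
    author: str
    score: float  # assumed query-dependent relevance score for this post


def rank_units(posts, key):
    """Rank retrieval units (named by `key`) by their best-scoring post."""
    best = {}
    for post in posts:
        unit = getattr(post, key)
        best[unit] = max(best.get(unit, float("-inf")), post.score)
    # Highest score first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up example data.
posts = [
    Post("p1", "t1", "brewing", "alice", 0.9),
    Post("p2", "t1", "brewing", "bob", 0.2),
    Post("p3", "t2", "gardening", "carol", 0.5),
]

for unit in ("post_id", "thread_id", "subforum_id", "author"):
    print(unit, rank_units(posts, unit))
```

Swapping the `key` argument is all it takes to change the unit of retrieval. A real system would probably want something smarter than the maximum, since it ignores how much of a thread or an author's output is on topic.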

These factors, of course, are not entirely unique to social media search, and have to some degree been addressed in previous research. This question of identifying the granularity of the unit of retrieval has been addressed at the document level (for example in XML element retrieval at INEX), but not so much at the collection level. Resource ranking in federated search and cluster-based retrieval bear some resemblance to the selection of a topical sub-collection, such as a sub-forum ranking. Author-ranking has also been studied at TREC in the Blog and Enterprise Tracks. But each of these has been studied in isolation, without much regard to the interaction between the different aspects of the collection. To my knowledge, no IR testbeds exist that contain the rich collection structure offered in these types of social media.

This, in my mind, is the real promise of research in search over social media. These collections provide multiple levels of organizational granularity, different axes of organization, multiple types of searchable objects, and relations among those objects.  I predict that this will be an interesting and fertile direction of information retrieval research — pushing the systems to support more sophisticated multi-dimensional indexing and extending existing retrieval models to handle rich relationships between documents.

Is the science of IR improving?

I’m just back from the annual meeting of ASIST (American Society for Info Science and Technology) in Columbus, OH.   I gave a talk during one of the five sessions on IR, and after all the speakers were through there was a session of audience questions.  Andrew Dillon lobbed a provocative question our way:  how do we know if IR as a field is making forward progress?  (I’m paraphrasing, of course).  An uncomfortable pause set in, followed by obligatory sidestepping, e.g. “first we need to define progress.”  It’s a fair question, though: we see incremental progress reported in the literature, but getting a high-level sense of the field’s forward motion strikes me as harder to come by.

I offered an off-the-cuff answer that I suspect readers might comment on.  Actually it was two answers.

First, surely there is meaning in the increasing competition to publish in the field’s best venues.  This isn’t news, but the following figure showcases the fact that getting a paper into SIGIR is indeed growing more difficult (many more people are trying).

 

Of course SIGIR is not synonymous with the field, but I think the figure speaks to the question Andrew asked.  Unless the SIGIR community is spinning its wheels, increasing competition among researchers suggests expectations and standards for “successful research” are climbing.

My second answer had to do with the diversity of tasks that fit under the umbrella term of IR.  Looking at TREC over the years we see new tasks appear (and disappear), new problems to tackle.  I argued that the field is indeed making progress, and we can see that progress in this creativity.  We are solving problems that we didn’t know existed (e.g. adversarial IR) or that actually didn’t exist (e.g. blog search) only several years ago.  Does this creativity imply improvement?  I argued that it does.

WSDM Accepted Papers Posted

WSDM 2009 accepted papers posted here.

What is different about highly effective retrievals?

Several recent (and several not so recent) papers have focused on methods of evaluating IR systems without relevance judgments.  The appeal of this approach is obvious; forming relevance judgments is arguably the hardest part of building a test collection.  Additionally, ranking systems without judgments has implications for fusion-based IR where we would like to combine various systems’ output while bearing in mind our confidence in each system’s results.  A reliable way to rank systems without relevance judgments would make fusion in rapidly changing, very large corpora much more tractable.

I’ll save the question of how valuable judgment-free test collections would actually be for another post.  Here I’m interested in a slightly different matter.  I must preface my discussion, however, with an admission that this is an area of research I’ve come to recently.  I am SURE that some of our readers are more familiar with the literature and results in this area than I am… please bring your comments.

The issue that interests me is the problem of identifying the best-performing systems during judgment-free evaluation.  Most approaches to judgment-free system ranking can identify really poor performers.  They also do a serviceable job ranking systems that perform fairly well.  But judgment-free rankings tend to fall apart when it comes to identifying systems that perform much better than average.

This problem appeared in a relatively early paper by Ian Soboroff and others and it has continued to be problematic since then (Aslam and Savell discuss it, as well).

The mechanics at work here are easy enough to understand. Most judgment-free ranking is based on an analysis of the documents that are commonly retrieved by many systems. Systems that perform fairly well tend to return many of the same documents as other ‘pretty good’ systems. Poor performers tend to miss these documents.  But what about the best performers?  Aslam and Savell argue that most judgment-free evaluation leads to a “tyranny of the masses,” punishing systems that do anything really different from the norm.  Wu and Crestani suggest that the best performers “are somewhat peculiar”; they do something qualitatively different from average performers.  Simply by deviating from the norm, then, the best systems look bad under the common judgment-free lenses.
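
A toy version of this mechanic, written as a hedged sketch rather than a reproduction of any published method: treat a document as pseudo-relevant if at least `min_systems` runs retrieve it in their top `depth`, then score each run by the fraction of its top documents that are pseudo-relevant. The function name, the parameters, and the example runs are hypothetical.

```python
from collections import Counter


def popularity_scores(runs, depth=10, min_systems=2):
    """Score each run by how many of its top documents other runs also return.

    `runs` maps a system name to its ranked list of document ids for one topic.
    """
    # Count, for each document, how many systems retrieved it in their top `depth`.
    counts = Counter()
    for ranking in runs.values():
        counts.update(set(ranking[:depth]))

    # Documents retrieved by enough systems stand in for relevance judgments.
    pseudo_relevant = {doc for doc, n in counts.items() if n >= min_systems}

    scores = {}
    for name, ranking in runs.items():
        top = ranking[:depth]
        hits = sum(1 for doc in top if doc in pseudo_relevant)
        scores[name] = hits / len(top) if top else 0.0
    return scores


# Made-up runs: two similar systems and one that does something different.
runs = {
    "sys_a": ["d1", "d2", "d3"],
    "sys_b": ["d1", "d2", "d4"],
    "sys_c": ["d7", "d8", "d9"],
}
scores = popularity_scores(runs, depth=3)
print(scores)
```

In this made-up example `sys_c` scores zero simply because nobody else retrieved its documents, whether or not those documents are relevant.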

If the best systems are doing something qualitatively different from the great unwashed, what is that difference?  Can we model it in order to improve our ranking of systems in the absence of relevance judgments?

Most of the literature on this topic focuses on TREC data.  In this context it is often the case that the best retrievals result from complex manual runs, as opposed to automatic, title-only runs. When I started looking into this a bit, I assumed that high-performance runs, by virtue of resulting from detailed statements of information need, would retrieve relevant documents that were missed by most other systems (e.g. documents that were relevant but that lacked terms from the topic title).

Pursuing this hypothesis will take some real work, but I was surprised by this figure:

 

Average recall for TREC-8 systems

The plot shows #rel_returned/#rel averaged over the 50 topics used in TREC-8 (ranking is by MAP).  Now it certainly could be true that the best systems are finding relevant documents that other systems are not.  But the best performers don’t appear to be finding more relevant documents than others.  
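
For readers who want the computation spelled out, the following is a small sketch of the quantity as I have described it: per-topic recall (#rel_returned/#rel), averaged over topics. The data layout (dictionaries keyed by topic) and the function name are assumptions, and topics with no relevant documents are skipped.

```python
def average_recall(run, qrels):
    """Mean over topics of (#relevant documents returned / #relevant documents).

    `run` maps a topic id to the list of document ids a system returned.
    `qrels` maps a topic id to the set of document ids judged relevant.
    """
    recalls = []
    for topic, relevant in qrels.items():
        if not relevant:
            continue  # recall is undefined when nothing is relevant
        returned = set(run.get(topic, []))
        recalls.append(len(returned & relevant) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Made-up judgments and run for two topics.
qrels = {"t1": {"d1", "d2", "d3", "d4"}, "t2": {"d5", "d6"}}
run = {"t1": ["d1", "d9", "d2"], "t2": ["d5"]}
print(average_recall(run, qrels))
```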

To me the mystery here is why these high-performing runs appear so bad using most judgment-free evaluation measures.  Retrieving “hidden” relevant documents would indeed lead to apparent bad performance under a tyranny of the masses.  But these systems don’t have especially high recall (quite the contrary, in fact).  Are they retrieving hidden relevant documents and failing to return obvious ones?  That seems unlikely.

What are the best performers doing that sets them apart from the crowd?  Can we account for this difference in judgment-free evaluation?  Until we can, I can only be skeptical: what are we really measuring when we estimate performance using Cranfield-type methods without relevance judgments?

Blogs, queries, corpora

In 2006, I was studying information retrieval at the University of Massachusetts and, during a Friday of extreme impatience, I installed WordPress, started Apache and created a blog called “Information Retrieval”. After a handful of posts over the course of six months, the comments queue filled with spam and WordPress stopped working. It is with this dubious evidence that I have been asked by my esteemed colleagues to write the first post of “Probably Irrelevant”. The talent represented by those nominating me will ensure that “Probably Irrelevant” will see a little more life than “Information Retrieval” (if it has not already based on the title alone).

Now, it seems appropriate that the inaugural post of an information retrieval blog should address the subject of “blog search”. Unfortunately, I am dreadfully less qualified than my co-authors to discuss the state of the art. So, I apologize in advance for errors, omissions, or general ridiculousness and lay blame on Kevyn and Jonathan.

Now, when I started “Information Retrieval”, one of the first messages I received was from a senior member of the IR community. He wrote,

Maybe you could blog about why anyone is interested in blogs :-)

I replied,

I’ll keep this in mind when you’re chairing a session on blog search at SIGIR 2010.

I will not identify the original commenter but encourage conference attendees to pay attention in Geneva.

Of course, this comment deserves some thought. One of the issues with blog search is the under-defined taxonomy of queries. The TREC Blog Track defines the following tasks:

  • blog post retrieval (i.e. “Find me posts about X.”)
  • opinion retrieval (i.e. “What do people think about X?”)
  • polarity (i.e. “Find me positive posts about X.”)
  • feed distillation (i.e. “Find me a blog with a principal, recurring interest in X.”)

One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers? Or are they merely contrived by the track organizers while trading pints at the Gaithersburg Marriott? These are questions, not criticisms. I think these are fine tasks but we have to be careful to define queries which are representative of those being issued to blog search engines or, more generally, fulfill some desire users have. The problem with a new corpus is that how users interact with it is still not completely developed. What users will actually use these systems? Casual blog readers? Marketers? Political scientists? Sociologists?

The majority of time in an “Introduction to Information Retrieval” course is devoted to modeling documents. And, yes, we have sophisticated models of documents. We decompose individual documents using passages, sentences, or other exploitable structure. We also model the corpus as a whole either explicitly (e.g. cluster-based retrieval, latent semantic indexing, regularization) or implicitly (e.g. pseudo-relevance feedback).

For an information retrieval researcher, a corpus without queries is a corpse. Queries make information retrieval different from unsupervised learning. Also, because they are so short, queries make information retrieval different from traditional text classification. While information retrieval research has focused on ranking documents given a query, prior to the late 1990s, there were very few (published) results on modeling queries in aggregate. However, with the advent of web search engines, there has been a growing body of work on such models. These include descriptive studies of web query frequencies and user clicking behavior as well as models for query similarity and clicking behavior. These results have mainly been presented for web users and queries; I would be very interested in seeing whether the results generalize to non-web search scenarios.

To come back to blog search, I believe we need a better understanding of both the corpus and the queries before defining tasks. Blog corpora exist and are actively being studied. I am less certain about blog queries. One approach would be to inspect query logs from blog search engines for different retrieval scenarios and then improve performance for those scenarios. Of course, some of us are engineers who sometimes desire to build a tool because we believe it would be used. However, if there is a mismatch between what we believe will be useful and what users find useful, then we have wasted time.*

I’ve touched on a lot in this first post and hope it serves as a starting point of discussion. So, welcome to “Probably Irrelevant”.

*I just became aware of a paper to be presented at CIKM entitled “What Should Blog Search Look Like?” which I hope will answer some of these questions.

Editor’s Note: Many thanks to Fernando for authoring our first post.  He couldn’t have chosen a more timely topic: the TREC 2008 Blog Track judgements are underway, Iadh Ounis has recently posted a call for suggestions for the 2009 tasks, Jeff Dalton has an insightful response, and Marti Hearst’s paper is now online.

Welcome to Probably Irrelevant

IR is far, far more than a branch of computer science, concerned primarily with issues of algorithms, computers, and computing.

Tefko Saracevic, Acceptance address for the 1997 Gerard Salton Award.

Probably Irrelevant is a group blog on information retrieval and all things related. It serves as an open forum for IR research and development discussion. We aspire to have a wide range of IR researchers and practitioners contribute to the blog — from academia and industry, professors and students, evangelists and critics.

Of course, if you’d like to contribute, please leave a comment or contact us.