Finding relevance judgements in the wild

We recently heard our poster on online forum search was accepted to SIGIR 09, and I’ve been wanting to post something about the test setup we used in that study.

There’s no existing IR test collection for such a task, although some similar datasets do exist. For various reasons we weren’t able to create a traditional test collection, with user-issued queries and deep pools of relevance judgements. But this particular dataset, and possibly other online dialog archives, can be mined to produce a ready-made IR test collection.

The users of the online forum we’ve been looking at frequently include links in their forum posts, often to previous messages and threads in the same forum. These links are sometimes posted in response to a new user’s question, and refer the user to a previous instance of the same (or a similar) question and an answer contributed by another user. Here are a few examples to illustrate my point. This interaction among forum users can be used as a form of query/relevance judgement pair. See the paper for a few more details on how we characterize the presence of a question-post/answer-link pair.
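To make the idea concrete, here is a minimal sketch of how such pairs could be mined. It is not the procedure from the paper: the `Post` record, the `showthread.php?t=<id>` link format, and the rule "any reply that links to a different thread yields a judgement" are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical link format; real forum software will differ.
THREAD_LINK = re.compile(r"showthread\.php\?t=(\d+)")


@dataclass
class Post:
    thread_id: int
    position: int  # 0 = the post that opens the thread (the "question")
    body: str


def mine_judgements(posts):
    """Map each question thread to the set of threads its replies link to."""
    judgements = {}
    for post in posts:
        if post.position == 0:
            continue  # only replies can point the asker to an earlier thread
        for match in THREAD_LINK.finditer(post.body):
            target = int(match.group(1))
            if target != post.thread_id:  # ignore links back to the same thread
                judgements.setdefault(post.thread_id, set()).add(target)
    return judgements


posts = [
    Post(10, 0, "How do I prune tomato plants?"),
    Post(10, 1, "This was covered in showthread.php?t=3"),
    Post(10, 2, "See also showthread.php?t=10 (this thread)"),
    Post(11, 0, "Related: showthread.php?t=3"),
]
qrels = mine_judgements(posts)
print(qrels)  # {10: {3}}
```

The opening post of thread 10 plays the role of the query, and thread 3 becomes its single judged-relevant thread. A real pipeline would also need to decide whether the opening post is actually a question and whether the linked thread actually contains an answer.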

This type of test collection creation does have some distinct advantages over the typical retrieval test collections used at TREC. First, the queries represent real information needs of real users of the online forum. Many TREC queries are pulled from search engine logs, but frequently (as in the Blog Track’s Feed Distillation task) the queries are invented by participants or assessors. Second, the information needs present in the online forum posts are much more verbose than typical keyword queries on a web search engine, giving a retrieval system more evidence to use in relevance scoring. Third, the “relevance judgement”, provided by another forum user linking to a previous thread, also presents in-situ relevance information: it is sensitive not only to the original question, but also to the overall nature of the forum and the time when the question was asked.

There are several drawbacks inherent in this type of corpus creation, most importantly with regard to the exhaustiveness of the relevance assessment. Typically in TREC-style collection development, ranked results from several retrieval systems are pooled and those pooled documents are assessed for relevance. When the systems’ output is sufficiently diverse and relevance assessment is sufficiently deep, this produces a reasonably complete relevance assessment for each query — if a relevant document is in the collection, it would most likely be retrieved by one of the systems and be judged by being admitted into the pool. The method of collecting relevance judgements we use in our SIGIR poster, on the other hand, will not produce anything close to an exhaustive set of relevant threads. In the great majority of cases, only a single thread is linked to in a subsequent reply message. There is no guarantee that this thread is the best or only relevant thread in the collection. For this reason, we must take care not to assume non-judged threads are necessarily irrelevant.
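The contrast can be sketched in a few lines. The first function below is a simplified version of depth-k pooling, and the second is reciprocal rank, one measure that only asks where a known-relevant thread lands in a ranking. The run names and thread IDs are made up, and reciprocal rank is used purely as an illustration here, not as a statement about the measure reported in the poster.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each system's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def reciprocal_rank(ranking, relevant):
    """1/rank of the first known-relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked output from two retrieval systems for one query.
runs = {
    "system_a": ["t3", "t7", "t1", "t9"],
    "system_b": ["t7", "t2", "t3", "t4"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['t2', 't3', 't7']

# With a mined collection, only one thread per query is typically judged.
print(reciprocal_rank(runs["system_a"], {"t3"}))  # 1.0
print(reciprocal_rank(runs["system_b"], {"t3"}))  # 0.3333333333333333
```

In the pooled setting every thread in `pool` would be shown to an assessor. In the mined setting only `t3` carries a judgement, so the threads ranked above it by `system_b` are unjudged rather than known to be irrelevant.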

There are plenty of datasets that seem to be ready-made for classification or regression tasks, without any need for annotation — for example the classic 20 newsgroups for text classification and Yahoo! Answers for a number of prediction tasks. For relevance ranking, however, I haven’t seen any ready-made datasets with real relevance judgements, as opposed to noisy interaction indicators such as click-through statistics. Conversation archives like the one we use offer one way to mine behavioral data for relevance judgements, offering ground-truth preferable in many ways to post-hoc relevance assessment.

Directions in Search over Social Media

In his keynote at the Search in Social Media workshop at CIKM, Andrew Tomkins suggested that there is plenty of room for academic IR research progress in social media.  I happen to agree.

Community generated content has been all the rage for a few years: blogs, Wikipedia, online forums, twitter, Yahoo! Answers, and the list goes on. Many of these generate a large volume of archived data — some in the form of more or less polished documents, like a blog post or Wikipedia article; others, like twitter, are snippets of an often one-sided conversation and broadcast messages.

From the IR researcher’s perspective, is it worth studying these artifacts of “social media”? Is there something that distinguishes these from other document collections? If so, how can we leverage that distinction in our retrieval models? This post aims to answer a couple of these questions and hopefully bring up a few more.

First and foremost, we need to identify whether there is value in providing access to artifacts of social media. Some, like Twitter, seem to be mostly ephemeral, only (generally) interesting in the moment and quickly fading from view. Even the Twitter search engine advertises: “See what’s happening — right now” and the results (as far as I can tell) are only ranked chronologically.

Many other types of social media, some existing long before Web 2.0 was born, can be real treasure-troves of information. There exists an online forum, public mailing list, newsgroup or message board for virtually every special interest group under the sun, from gardening, to home-brewing, to Apple computers. These are often heavily trafficked, populated with real subject matter experts, and host a rich information exchange. I would argue that the content created through these social media outlets presents an enormous value to searchers, and information retrieval research has a lot to contribute in this corner of social media.

What makes these document collections different from what has been previously studied? Can we just treat them the same as web pages? Or do they need special consideration?

In many of these collections, the unit of retrieval (what we consider a document) is not fixed, but rather dependent on the task. Consider online forums, often organized into topical sub-forums, which in turn are organized into conversation threads of individual posts. Some information needs may only require a single post as a result, some require the context of the full conversation thread, and others may need to retrieve a pertinent sub-forum.
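One simple way to picture this is to score posts once and then roll those scores up the forum hierarchy. The sketch below is only an illustration under assumed inputs: the post scores, the ID mappings, and the choice of `max` or `sum` as the combining function are all hypothetical, and other ways of aggregating evidence are certainly possible.

```python
from collections import defaultdict


def roll_up(child_scores, parent_of, combine=max):
    """Aggregate child scores into one score per parent."""
    grouped = defaultdict(list)
    for child, score in child_scores.items():
        grouped[parent_of[child]].append(score)
    return {parent: combine(scores) for parent, scores in grouped.items()}


# Hypothetical retrieval scores for individual posts.
post_scores = {"p1": 0.75, "p2": 0.25, "p3": 0.5}
post_to_thread = {"p1": "t1", "p2": "t1", "p3": "t2"}
thread_to_forum = {"t1": "brewing", "t2": "brewing"}

# A thread is as good as its best post; a sub-forum sums its threads.
thread_scores = roll_up(post_scores, post_to_thread)
forum_scores = roll_up(thread_scores, thread_to_forum, combine=sum)

print(thread_scores)  # {'t1': 0.75, 't2': 0.5}
print(forum_scores)   # {'brewing': 1.25}
```

The same post-level evidence yields three different rankings (posts, threads, sub-forums), and which one to show depends on the information need.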

These collections often offer another orthogonal axis of retrieval — the author. In highly trafficked message boards and mailing lists, tens or hundreds of thousands of users with varying levels of expertise contribute to the conversation. One may wish to find subject matter experts to address a question to, or favor message threads with contributions from those more likely to know the answer.

These factors, of course, are not entirely unique to social media search, and have to some degree been addressed in previous research. This question of identifying the granularity of the unit of retrieval has been addressed at the document level (for example in XML element retrieval at INEX), but not so much at the collection level. Resource ranking in federated search and cluster-based retrieval bear some resemblance to the selection of a topical sub-collection, such as a sub-forum ranking. Author-ranking has also been studied at TREC in the Blog and Enterprise Tracks. But each of these has been studied in isolation, without much regard to the interaction between the different aspects of the collection. To my knowledge, no IR testbeds exist that contain the rich collection structure offered in these types of social media.

This, in my mind, is the real promise of research in search over social media. These collections provide multiple levels of organizational granularity, different axes of organization, multiple types of searchable objects, and relations among those objects.  I predict that this will be an interesting and fertile direction of information retrieval research — pushing the systems to support more sophisticated multi-dimensional indexing and extending existing retrieval models to handle rich relationships between documents.