Finding relevance judgements in the wild

We recently heard our poster on online forum search was accepted to SIGIR 09, and I’ve been wanting to post something about the test setup we used in that study.

There’s no existing IR test collection for such a task, although some similar datasets do exist. For various reasons we weren’t able to create a traditional test collection, with user-issued queries and deep pools of relevance judgements. However, this particular dataset, and possibly other online dialog archives, can be mined to produce a ready-made IR test collection.

The users of the online forum we’ve been looking at frequently include links in their forum posts, often to previous messages and threads in the same forum. These links are sometimes posted in response to a new user’s question, and refer that user to a previous instance of the same (or a similar) question and an answer contributed by another user. Here are a few examples to illustrate my point. This interaction among forum users can be used as a form of query/relevance judgement pair. See the paper for a few more details on how we characterize the presence of a question-post/answer-link pair.
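To make the idea concrete, here is a minimal sketch of how such pairs might be mined. It is not the procedure from the paper. The forum host, the `t` URL parameter, and the shape of the `threads` input are all hypothetical, and the sketch treats any reply that links to another thread as an answer link, which is cruder than the characterization we use.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical values: a real forum will have its own host and URL layout.
FORUM_HOST = "forum.example.com"
THREAD_PARAM = "t"

URL_RE = re.compile(r"https?://\S+")


def linked_thread_ids(body):
    """Return ids of same-forum threads linked from the text of one post."""
    ids = []
    for url in URL_RE.findall(body):
        url = url.rstrip(".,;:!?)]")  # drop trailing sentence punctuation
        parsed = urlparse(url)
        if parsed.netloc != FORUM_HOST:
            continue  # ignore links that leave the forum
        values = parse_qs(parsed.query).get(THREAD_PARAM)
        if values:
            ids.append(values[0])
    return ids


def mine_pairs(threads):
    """Build (query_thread_id, query_text, linked_thread_id) triples.

    `threads` maps a thread id to its posts in order; each post is a dict
    with a "body" key, and the first post is assumed to be the question.
    """
    pairs = []
    for thread_id, posts in threads.items():
        if len(posts) < 2:
            continue  # no replies, so no candidate answer links
        question = posts[0]["body"]
        for reply in posts[1:]:
            for target in linked_thread_ids(reply["body"]):
                if target != thread_id:  # skip links back to the same thread
                    pairs.append((thread_id, question, target))
    return pairs
```

Each triple can then be read as a query (the question post) with one known-relevant document (the linked thread).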

This type of test collection creation does have some distinct advantages over the typical retrieval test collections used at TREC. First, the queries represent real information needs of real users of the online forum. Many TREC queries are pulled from search engine logs, but frequently (as in the Blog Track’s Feed Distillation task) the queries are invented by participants or assessors. Second, the information needs expressed in the forum posts are much more verbose than typical keyword queries on a web search engine, giving a retrieval system more evidence to use in relevance scoring. Finally, the “relevance judgement”, provided by another forum user linking to a previous thread, is made in situ: it is sensitive not only to the original question, but also to the overall nature of the forum and the time when the question was asked.

There are several drawbacks inherent in this type of corpus creation, most importantly with regard to the exhaustiveness of the relevance assessment. Typically in TREC-style collection development, ranked results from several retrieval systems are pooled and those pooled documents are assessed for relevance. When the systems’ output is sufficiently diverse and relevance assessment is sufficiently deep, this produces a reasonably complete relevance assessment for each query — if a relevant document is in the collection, it would most likely be retrieved by one of the systems and be judged by being admitted into the pool. The method of collecting relevance judgements we use in our SIGIR poster, on the other hand, will not produce anything close to an exhaustive set of relevant threads. In the great majority of cases, only a single thread is linked to in a subsequent reply message. There is no guarantee that this thread is the best or only relevant thread in the collection. For this reason, we must take care not to assume non-judged threads are necessarily irrelevant.
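One way to respect that caveat is to score a ranking only by where the linked thread appears, in the style of known-item search. The sketch below uses reciprocal rank for this. It is an illustration under my own assumptions (the function names and input shapes are hypothetical), not necessarily the measure reported in the poster.

```python
def reciprocal_rank(ranking, linked):
    """1/rank of the first linked thread in `ranking`, or 0.0 if none appears.

    Unjudged threads are skipped over, not counted as known-irrelevant.
    """
    for rank, thread_id in enumerate(ranking, start=1):
        if thread_id in linked:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, judgements):
    """Average reciprocal rank over all queries that have a linked thread.

    `runs` maps a query id to a ranked list of thread ids; `judgements`
    maps a query id to the set of thread ids linked in replies.
    """
    scores = [
        reciprocal_rank(runs.get(query_id, []), linked)
        for query_id, linked in judgements.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because an unjudged thread ranked above the linked one may itself be relevant, a score computed this way is best read as a lower bound on the reciprocal rank of the first truly relevant thread.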

There are plenty of datasets that seem to be ready-made for classification or regression tasks, without any need for annotation: for example, the classic 20 newsgroups for text classification and Yahoo! Answers for a number of prediction tasks. For relevance ranking, however, I haven’t seen any ready-made datasets with real relevance judgements, as opposed to noisy interaction indicators such as click-through statistics. Conversation archives like the one we use offer one way to mine behavioral data for relevance judgements, yielding ground truth that is preferable in many ways to post-hoc relevance assessment.

What is different about highly effective retrievals?

Several recent (and several not so recent) papers have focused on methods of evaluating IR systems without relevance judgments.  The appeal of this approach is obvious; forming relevance judgments is arguably the hardest part of building a test collection.  Additionally, ranking systems without judgments has implications for fusion-based IR where we would like to combine various systems’ output while bearing in mind our confidence in each system’s results.  A reliable way to rank systems without relevance judgments would make fusion in rapidly changing, very large corpora much more tractable.

I’ll save the question of how valuable judgment-free test collections would actually be for another post.  Here I’m interested in a slightly different matter.  I must preface my discussion, however, with an admission that this is an area of research I’ve come to recently.  I am SURE that some of our readers are more familiar with the literature and results in this area than I am… please bring your comments.

The issue that interests me is the problem of identifying the best-performing systems during judgment-free evaluation.  Most approaches to judgment-free system ranking can identify really poor performers.  They also do a serviceable job ranking systems that perform fairly well.  But judgment-free rankings tend to fall apart when it comes to identifying systems that perform much better than average.

This problem appeared in a relatively early paper by Ian Soboroff and others and it has continued to be problematic since then (Aslam and Savell discuss it, as well).

The mechanics at work here are easy enough to understand. Most judgment-free ranking is based on an analysis of the documents that are commonly retrieved by many systems. Systems that perform fairly well tend to return many of the same documents as other ‘pretty good’ systems. Poor performers tend to miss these documents.  But what about the best performers?  Aslam and Savell argue that most judgment-free evaluation leads to a “tyranny of the masses,” punishing systems that do anything really different from the norm.  Wu and Crestani suggest that the best performers “are somewhat peculiar”; they do something qualitatively different from average performers.  Simply by deviating from the norm, then, the best systems look bad under the common judgment-free lenses.
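A toy version of this mechanism is easy to write down. The sketch below is a deliberately simplified illustration for a single topic, not any of the specific methods from the papers mentioned: it treats a document as pseudo-relevant when enough systems return it near the top, then scores each system by how many of those documents it returns. The function names, `depth`, and `min_systems` are my own assumptions.

```python
from collections import Counter


def pseudo_relevant(runs, depth=10, min_systems=2):
    """Documents returned in the top `depth` by at least `min_systems` systems."""
    counts = Counter()
    for ranking in runs.values():
        counts.update(set(ranking[:depth]))  # one vote per system per document
    return {doc for doc, votes in counts.items() if votes >= min_systems}


def judgment_free_scores(runs, depth=10, min_systems=2):
    """Precision at `depth` against the pseudo-relevant set, per system.

    `runs` maps a system name to its ranked list of document ids for one topic.
    """
    pseudo = pseudo_relevant(runs, depth, min_systems)
    return {
        name: sum(1 for doc in ranking[:depth] if doc in pseudo) / depth
        for name, ranking in runs.items()
    }
```

Given three systems where two overlap heavily and a third returns entirely different documents, the third scores zero no matter how good its documents actually are. That is the tyranny of the masses in miniature.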

If the best systems are doing something qualitatively different from the great unwashed, what is that difference?  Can we model it in order to improve our ranking of systems in the absence of relevance judgments?

Most of the literature on this topic focuses on TREC data.  In this context it is often the case that the best retrievals result from complex manual runs, as opposed to automatic, title-only runs. When I started looking into this a bit, I assumed that high-performance runs, by virtue of resulting from detailed statements of information need, would retrieve relevant documents that were missed by most other systems (e.g. documents that were relevant but that lacked terms from the topic title).

Pursuing this hypothesis will take some real work, but I was surprised by this figure:

 

[Figure: Average recall for TREC-8 systems]

The plot shows #rel_returned/#rel averaged over the 50 topics used in TREC-8 (ranking is by MAP).  Now it certainly could be true that the best systems are finding relevant documents that other systems are not.  But the best performers don’t appear to be finding more relevant documents than others.  
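For reference, the quantity on the vertical axis can be computed along these lines. This is a minimal sketch: the function name and the dictionary-based inputs are assumptions for illustration, and topics with no relevant documents are simply skipped.

```python
def average_recall(run, qrels):
    """Mean over topics of #rel_returned / #rel for one system.

    `run` maps a topic id to the list of document ids the system returned;
    `qrels` maps a topic id to the set of documents judged relevant.
    """
    recalls = []
    for topic, relevant in qrels.items():
        if not relevant:
            continue  # recall is undefined when a topic has no relevant documents
        returned = set(run.get(topic, []))
        recalls.append(len(returned & relevant) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0
```

Note that this is set-based recall over everything a run returns, so it says nothing about where in the ranking the relevant documents appear, which is what MAP rewards.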

To me the mystery here is why these high-performing runs appear so bad using most judgment-free evaluation measures.  Retrieving “hidden” relevant documents would indeed lead to apparently bad performance under a tyranny of the masses.  But these systems don’t have especially high recall (quite the contrary, in fact).  Are they retrieving hidden relevant documents and failing to return obvious ones?  That seems unlikely.

What are the best performers doing that sets them apart from the crowd?  Can we account for this difference in judgment-free evaluation?  Until we can, I can only be skeptical: what are we really measuring when we estimate performance using Cranfield-type methods without relevance judgments?