What is different about highly effective retrievals?

Several recent (and several not so recent) papers have focused on methods of evaluating IR systems without relevance judgments.  The appeal of this approach is obvious; forming relevance judgments is arguably the hardest part of building a test collection.  Additionally, ranking systems without judgments has implications for fusion-based IR where we would like to combine various systems’ output while bearing in mind our confidence in each system’s results.  A reliable way to rank systems without relevance judgments would make fusion in rapidly changing, very large corpora much more tractable.
I’ll save the question of how valuable judgment-free test collections would actually be for another post.  Here I’m interested in a slightly different matter.  I must preface my discussion, however, with an admission that this is an area of research I’ve come to recently.  I am SURE that some of our readers are more familiar with the literature and results in this area than I am… please bring your comments.  
The issue that interests me is the problem of identifying the best-performing systems during judgment-free evaluation.  Most approaches to judgment-free system ranking can identify really poor performers.  They also do a serviceable job ranking systems that perform fairly well.  But judgment-free rankings tend to fall apart when it comes to identifying systems that perform much better than average.

This problem appeared in a relatively early paper by Ian Soboroff and others, and it has remained a problem ever since (Aslam and Savell discuss it as well).

The mechanics at work here are easy enough to understand. Most judgment-free ranking is based on an analysis of the documents that are commonly retrieved by many systems. Systems that perform fairly well tend to return many of the same documents as other ‘pretty good’ systems. Poor performers tend to miss these documents.  But what about the best performers?  Aslam and Savell argue that most judgment-free evaluation leads to a “tyranny of the masses,” punishing systems that do anything really different from the norm.  Wu and Crestani suggest that the best performers “are somewhat peculiar”; they do something qualitatively different from average performers.  Simply by deviating from the norm, then, the best systems look bad under the common judgment-free lenses.
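
To make that mechanism concrete, here is a toy sketch of popularity-based scoring. It is not the procedure from any of the papers mentioned above, just the general idea reduced to a few lines. The function name, the depth cutoff, the min_systems threshold, and the data are all made up for illustration.

```python
from collections import Counter


def popularity_scores(runs, depth=3, min_systems=2):
    """Score each system by how many of its top documents are 'popular'.

    runs: dict mapping a system name to its ranked list of doc ids for one topic.
    A document counts as pseudo-relevant if at least `min_systems` systems
    return it in their top `depth`. All names here are hypothetical.
    """
    counts = Counter()
    for ranking in runs.values():
        counts.update(set(ranking[:depth]))  # one vote per system per document
    pseudo_relevant = {doc for doc, n in counts.items() if n >= min_systems}
    return {
        system: sum(doc in pseudo_relevant for doc in ranking[:depth]) / depth
        for system, ranking in runs.items()
    }


# Made-up data: three similar systems and one that does something different.
runs = {
    "sys_a": ["d1", "d2", "d3"],
    "sys_b": ["d1", "d2", "d4"],
    "sys_c": ["d2", "d1", "d5"],
    "outlier": ["d7", "d8", "d9"],
}
scores = popularity_scores(runs)
```

In this toy data the three similar systems tie, and the outlier scores zero no matter whether d7, d8 and d9 are actually relevant. That is the "tyranny of the masses" effect in miniature.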

If the best systems are doing something qualitatively different from the great unwashed, what is that difference?  Can we model it in order to improve our ranking of systems in the absence of relevance judgments?

Most of the literature on this topic focuses on TREC data.  In this context it is often the case that the best retrievals result from complex manual runs, as opposed to automatic, title-only runs. When I started looking into this a bit, I assumed that high-performance runs, by virtue of resulting from detailed statements of information need, would retrieve relevant documents that were missed by most other systems (e.g. documents that were relevant but that lacked terms from the topic title).

Pursuing this hypothesis will take some real work, but I was surprised by this figure:

 

[Figure: Average recall for TREC-8 systems]

The plot shows #rel_returned/#rel averaged over the 50 topics used in TREC-8 (ranking is by MAP).  Now it certainly could be true that the best systems are finding relevant documents that other systems are not.  But the best performers don’t appear to be finding more relevant documents than others.  
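
For readers who want the measure spelled out, here is a minimal sketch of #rel_returned/#rel averaged over topics. It assumes a run is a mapping from topic to retrieved doc ids and the judgments are a mapping from topic to the set of relevant doc ids. The function name and the toy data are hypothetical, not taken from TREC.

```python
def average_recall(retrieved, qrels):
    """Mean over topics of (#relevant returned) / (#relevant).

    retrieved: dict mapping topic id -> list of returned doc ids.
    qrels: dict mapping topic id -> set of judged-relevant doc ids.
    Topics with no relevant documents are skipped.
    """
    per_topic = []
    for topic, relevant in qrels.items():
        if not relevant:
            continue
        returned = set(retrieved.get(topic, []))
        per_topic.append(len(returned & relevant) / len(relevant))
    return sum(per_topic) / len(per_topic) if per_topic else 0.0


# Toy judgments and a toy run (two topics only).
qrels = {"t1": {"d1", "d2", "d3", "d4"}, "t2": {"d5", "d6"}}
run = {"t1": ["d1", "d9", "d2"], "t2": ["d5", "d6", "d8"]}
avg = average_recall(run, qrels)  # (2/4 + 2/2) / 2 = 0.75
```

Note that this ignores rank entirely, which is why a run can sit at the top of a MAP ordering and still look unremarkable here.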

To me the mystery here is why these high-performing runs appear so bad using most judgment-free evaluation measures.  Retrieving “hidden” relevant documents would indeed lead to apparent bad performance under a tyranny of the masses.  But these systems don’t have especially high recall (quite the contrary, in fact).  Are they retrieving hidden relevant documents and failing to return obvious ones?  That seems unlikely.

What are the best performers doing that sets them apart from the crowd?  Can we account for this difference in judgment-free evaluation?  Until we can, I can only be skeptical: what are we really measuring when we estimate performance using Cranfield-type methods without relevance judgments?

5 Responses to “What is different about highly effective retrievals?”

  1. Why not always use human-judged relevance information? On Amazon Mechanical Turk you can get hundreds of relevance judgments for a dollar.

    Sorry to promote my own posts, but here’s the evidence…

    http://blog.doloreslabs.com/2008/09/amt-fast-cheap-good-machine-learning/
    http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/

  2. Brendan — thanks for the links — great posts.

    Judgment-free evaluation has utility beyond absolute system assessment. In metasearch, or unsupervised rank aggregation, we often want to know which system’s output is more reliable at query time in order to inform how results are merged.

  3. Before starting a discussion about performance prediction, the following questions need to be answered:

    why are you interested in performance prediction? do you want to merge results? abandon (or select) a ranking system for the query? abandon (or select) a ranking system for all queries? not all of these suggest optimizing the same metric.
    do you really not have relevance information? performance predictors often provide relevance surrogates (e.g. frequently retrieved documents across systems). does your system—from cold-start or warmed-up—have other sources of relevance surrogates?

  4. A few explanations for why your top runs may have low relret

    pooling depth. compute the metric at the pooling depth. there could be unjudged relevant documents. automatic runs may be saved because their unjudged documents are more likely to be within the pooling depth of another run.
    manual runs may be high precision, as opposed to high recall and we can certainly have methods with high precision which are low recall.

  5. Fernando– I suspect you’re right, and it is likely that the top-performing systems are retrieving unjudged relevant documents. But by virtue of _being_ the best performers, they are also retrieving the judged relevant docs. So it’s easy for me to believe that there are some very good systems far down in the rankings (having had the bad luck to retrieve unjudged relevant docs). But it’s not clear to me how relevant documents that weren’t judged due to pooling would hurt the relret for high performers.
