What is different about highly effective retrievals?

Several recent (and several not so recent) papers have focused on methods of evaluating IR systems without relevance judgments.  The appeal of this approach is obvious; forming relevance judgments is arguably the hardest part of building a test collection.  Additionally, ranking systems without judgments has implications for fusion-based IR where we would like to combine various systems’ output while bearing in mind our confidence in each system’s results.  A reliable way to rank systems without relevance judgments would make fusion in rapidly changing, very large corpora much more tractable.
I’ll save the question of how valuable judgment-free test collections would actually be for another post.  Here I’m interested in a slightly different matter.  I must preface my discussion, however, with an admission that this is an area of research I’ve come to recently.  I am SURE that some of our readers are more familiar with the literature and results in this area than I am… please bring your comments.  
The issue that interests me is the problem of identifying the best-performing systems during judgment-free evaluation.  Most approaches to judgment-free system ranking can identify really poor performers.  They also do a serviceable job ranking systems that perform fairly well.  But judgment-free rankings tend to fall apart when it comes to identifying systems that perform much better than average.

This problem appeared in a relatively early paper by Ian Soboroff and others and it has continued to be problematic since then (Aslam and Savell discuss it, as well).

The mechanics at work here are easy enough to understand. Most judgment-free ranking is based on an analysis of the documents that are commonly retrieved by many systems. Systems that perform fairly well tend to return many of the same documents as other ‘pretty good’ systems. Poor performers tend to miss these documents.  But what about the best performers?  Aslam and Savell argue that most judgment-free evaluation leads to a “tyranny of the masses,” punishing systems that do anything really different from the norm.  Wu and Crestani suggest that the best performers “are somewhat peculiar”; they do something qualitatively different from average performers.  Simply by deviating from the norm, then, the best systems look bad under the common judgment-free lenses.
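
To make the "tyranny of the masses" concrete, here is a minimal sketch of a popularity-based scorer. It is not the method from any of the papers mentioned above; the function name `popularity_scores`, the `depth` and `min_systems` parameters, and the toy runs are all assumptions made for illustration. A document counts as pseudo-relevant if enough systems retrieve it, and a system is scored by how many of its top documents are pseudo-relevant.

```python
from collections import Counter


def popularity_scores(runs, depth=10, min_systems=2):
    """Score each system by the share of its top documents that other systems also return.

    `runs` maps a (hypothetical) system name to its ranked list of document ids.
    """
    # Count how many systems return each document within their top `depth`.
    doc_counts = Counter()
    for ranking in runs.values():
        doc_counts.update(set(ranking[:depth]))

    # Assumption: a document returned by at least `min_systems` systems is pseudo-relevant.
    pseudo_relevant = {doc for doc, n in doc_counts.items() if n >= min_systems}

    # A system's score is the fraction of its top documents that are pseudo-relevant.
    scores = {}
    for name, ranking in runs.items():
        top = ranking[:depth]
        scores[name] = sum(doc in pseudo_relevant for doc in top) / len(top) if top else 0.0
    return scores


# Toy data: three systems that largely agree, and one that returns mostly unique documents.
runs = {
    "typical_a": ["d1", "d2", "d3", "d4"],
    "typical_b": ["d2", "d1", "d4", "d5"],
    "typical_c": ["d1", "d3", "d2", "d5"],
    "maverick": ["d9", "d8", "d1", "d7"],
}
scores = popularity_scores(runs, depth=4)
print(scores)
```

In this toy example the "maverick" run scores far below the others whether or not d7, d8 and d9 are relevant, because the scorer only sees agreement and never sees relevance.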

If the best systems are doing something qualitatively different from the great unwashed, what is that difference?  Can we model it in order to improve our ranking of systems in the absence of relevance judgments?

Most of the literature on this topic focuses on TREC data.  In this context it is often the case that the best retrievals result from complex manual runs, as opposed to automatic, title-only runs. When I started looking into this a bit, I assumed that high-performance runs, by virtue of resulting from detailed statements of information need, would retrieve relevant documents that were missed by most other systems (e.g. documents that were relevant but that lacked terms from the topic title).

Pursuing this hypothesis will take some real work, but I was surprised by this figure:

 

[Figure: Average recall for TREC-8 systems]

The plot shows #rel_returned/#rel averaged over the 50 topics used in TREC-8 (ranking is by MAP).  Now it certainly could be true that the best systems are finding relevant documents that other systems are not.  But the best performers don’t appear to be finding more relevant documents than others.  
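
For readers who want to reproduce the quantity in the plot, the following is a rough sketch of the per-system calculation, assuming the judgments and a run have already been loaded into plain dictionaries. The function name `average_recall`, the data layout, and the toy topic and document ids are illustrative assumptions, not TREC tooling.

```python
def average_recall(retrieved, relevant):
    """Mean over topics of #rel_returned / #rel for a single run.

    `retrieved` maps topic id -> list of returned document ids.
    `relevant` maps topic id -> set of judged-relevant document ids.
    """
    recalls = []
    for topic, rel_docs in relevant.items():
        if not rel_docs:
            continue  # assumption: topics with no relevant documents are skipped
        returned = set(retrieved.get(topic, []))
        recalls.append(len(returned & rel_docs) / len(rel_docs))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy data: two topics, one run.
relevant = {"t1": {"d1", "d2", "d3", "d4"}, "t2": {"d5", "d6"}}
retrieved = {"t1": ["d1", "d9", "d2"], "t2": ["d5", "d6", "d7"]}
avg = average_recall(retrieved, relevant)
print(avg)
```

Here the run finds 2 of 4 relevant documents for the first topic and 2 of 2 for the second, so its average recall is 0.75.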

To me the mystery here is why these high-performing runs appear so bad using most judgment-free evaluation measures.  Retrieving “hidden” relevant documents would indeed lead to apparent bad performance under a tyranny of the masses.  But these systems don’t have especially high recall (quite the contrary, in fact).  Are they retrieving hidden relevant documents and failing to return obvious ones?  That seems unlikely.

What are the best performers doing that sets them apart from the crowd?  Can we account for this difference in judgment-free evaluation?  Until we can, I can only be skeptical: what are we really measuring when we estimate performance using Cranfield-type methods without relevance judgments?

Blogs, queries, corpora

In 2006, I was studying information retrieval at the University of Massachusetts and, during a Friday of extreme impatience, I installed WordPress, started Apache and created a blog called “Information Retrieval”. After a handful of posts over the course of six months, the comments queue filled with spam and WordPress stopped working. It is with this dubious evidence that I have been asked by my esteemed colleagues to write the first post of “Probably Irrelevant”. The talent represented by those nominating me will ensure that “Probably Irrelevant” will see a little more life than “Information Retrieval” (if it has not already based on the title alone).

Now, it seems appropriate that the inaugural post of an information retrieval blog should address the subject of “blog search”. Unfortunately, I am dreadfully less qualified than my co-authors to discuss the state of the art. So, I apologize in advance for errors, omissions, or general ridiculousness and lay blame on Kevyn and Jonathan.

Now, when I started “Information Retrieval”, one of the first messages I received was from a senior member of the IR community. He wrote,

Maybe you could blog about why anyone is interested in blogs :-)

I replied,

I’ll keep this in mind when you’re chairing a session on blog search at SIGIR 2010.

I will not identify the original commenter but encourage conference attendees to pay attention in Geneva.

Of course, this comment deserves some thought. One of the issues with blog search is the under-defined taxonomy of queries. The TREC Blog Track defines the following tasks:

  • blog post retrieval (i.e. “Find me posts about X.”)
  • opinion retrieval (i.e. “What do people think about X?”)
  • polarity (i.e. “Find me positive posts about X.”)
  • feed distillation (i.e. “Find me a blog with a principal, recurring interest in X.”)

One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers? Or are they merely contrived by the track organizers while trading pints at the Gaithersburg Marriott? These are questions, not criticisms. I think these are fine tasks but we have to be careful to define queries which are representative of those being issued to blog search engines or which, more generally, fulfill some desire users have. The problem with a new corpus is that how users interact with it is still not completely developed. What users will actually use these systems? Casual blog readers? Marketers? Political scientists? Sociologists?

The majority of time in an “Introduction to Information Retrieval” course is devoted to modeling documents. And, yes, we have sophisticated models of documents. We decompose individual documents using passages, sentences, or other exploitable structure. We also model the corpus as a whole either explicitly (e.g. cluster-based retrieval, latent semantic indexing, regularization) or implicitly (e.g. pseudo-relevance feedback).

For an information retrieval researcher, a corpus without queries is a corpse. Queries make information retrieval different from unsupervised learning. Also, because they are so short, queries make information retrieval different from traditional text classification. While information retrieval research has focused on ranking documents given a query, prior to the late 1990s, there were very few (published) results on modeling queries in aggregate. However, with the advent of web search engines, there has been a growing body of work on such models. These include descriptive studies of web query frequencies and user clicking behavior as well as models for query similarity and clicking behavior. These results have mainly been presented for web users and queries; I would be very interested in seeing whether the results generalize to non-web search scenarios.

To come back to blog search, I believe we need a better understanding of both the corpus and the queries before defining tasks. Blog corpora exist and are actively being studied. I am less certain about blog queries. One approach would be to inspect query logs from blog search engines for different retrieval scenarios and then improve performance for those scenarios. Of course, some of us are engineers who sometimes desire to build a tool because we believe it would be used. However, if there is a mismatch between what we believe will be useful and what users find useful, then we have wasted time.*

I’ve touched on a lot in this first post and hope it serves as a starting point of discussion. So, welcome to “Probably Irrelevant”.

*I just became aware of a paper to be presented at CIKM entitled “What Should Blog Search Look Like?” which I hope will answer some of these questions.

Editor’s Note: Many thanks to Fernando for authoring our first post.  He couldn’t have chosen a more timely topic: the TREC 2008 Blog Track judgements are underway, Iadh Ounis has recently posted a call for suggestions for the 2009 tasks, Jeff Dalton has an insightful response, and Marti Hearst’s paper is now online.