About Miles: Miles Efron is an assistant professor in the School of Information at the University of Texas. His research touches several areas of information retrieval, but his overarching concern is with statistical model selection, model building, and model averaging in the context of IR.

micro-IR

I’ve been watching with interest as Apple’s iPhone/iPod touch App Store has grown and matured over the last couple of years (yes, I know, me and almost everyone else).  Interacting with apps on my own and, more recently, building a few has started me thinking about what I perceive to be an interesting and, I think, novel mode of information interaction.

For lack of a better term, I think of this phenomenon as “micro information retrieval” (micro-IR).

By micro-IR I mean the practice of farming information needs out across multiple applications.  Each of these micro-IR applications is built around a tightly constrained problem space, and I think it’s this constraint that makes micro-IR interesting.

A couple of examples (apologies for any appearance of commercial endorsement; none intended):

  • the Yelp app: find, say, restaurants near me
  • Loopt: find friends near me
  • Barnes and Noble app: find info on the book in this photo I took
  • Shazam: find the song that is playing into the iPhone mic

These examples swing close to simple database lookups.  But if we take a longer view, a more interesting dynamic comes up.  The apps are simple because each one solves a problem that is tightly constrained, answering a question that would involve complicated interaction in its absence.

By way of another example, I am currently developing an app that answers the question: how many gallons of oil would it take to prepare a given recipe?  The app then ranks candidate recipes in increasing order of petroleum consumption.

And it’s not the case that these sorts of interactions are limited to mobile devices.  Thanks to Gene Golovchinsky for pointing me towards Blueprint, an Eclipse plugin that allows users to search for code snippets from within their IDE, leveraging Flex syntax to finesse the search.

Trying to lasso these examples together in efforts to triangulate on what micro-IR actually is, I’ll note a few overarching commonalities that I see here:

  1. In ad hoc (text) IR a principal intellectual challenge lies in modeling ‘aboutness.’  In micro-IR settings, the creativity comes into play in posing a useful (and tractable) question to answer.  The engineering comes easily after that.
  2. The constrained nature of micro-IR applications leads to a lightweight articulation of information need.  There is a tight coupling here between task, query, and the unit of retrieval, a dynamic that I think is compelling.  Pushing this a bit farther, we might consider the simple act of choosing to use a particular application from those apps on a user’s palette as part of the information need expression.
  3. The tight coupling of task to data to ‘query’ enables a strong contextual element to inform the interaction.  Context constitutes the foreground of the micro-IR interaction.

I don’t want to overstate the distinction between micro- and macro-IR.  Of course applications fall along a spectrum of their similarity to the modalities I’ve laid out here.  But I do think that being aware of micro-IR system characteristics is worthwhile.  Aside from an inherent innovation in how people interact with information, micro-IR opens the door to small-scale developers gaining a wide audience (i.e. the barrier to entry is low).  And concomitant with this is the new monetization model at work in the App Store.

I hope readers will comment on this: is micro-IR something at all?  Is it actually related to IR?  How might we turn our eye to micro-IR with respect to generating bona fide research?  Surely there are better example systems than those I’ve listed…

Is the science of IR improving?

I’m just back from the annual meeting of ASIST (American Society for Information Science and Technology) in Columbus, OH.  I gave a talk during one of the five sessions on IR, and after all the speakers were through there was a session of audience questions.  Andrew Dillon lobbed a provocative question our way: how do we know if IR as a field is making forward progress?  (I’m paraphrasing, of course.)  An uncomfortable pause set in, followed by obligatory sidestepping, e.g. “first we need to define progress.”  It’s a fair question, though: we see incremental progress reported in the literature, but a high-level sense of the field’s forward motion strikes me as harder to come by.

I offered an off-the-cuff answer that I suspect readers might comment on.  Actually it was two answers.

First, surely there is meaning in the increasing competition to publish in the field’s best venues.  This isn’t news, but the following figure showcases the fact that getting a paper into SIGIR is indeed growing more difficult (many more people are trying).

 

Of course SIGIR is not synonymous with the field, but I think the figure speaks to the question Andrew asked.  Unless the SIGIR community is spinning its wheels, increasing competition among researchers suggests that expectations and standards for “successful research” are climbing.

My second answer had to do with the diversity of tasks that fit under the umbrella term of IR.  Looking at TREC over the years we see new tasks appear (and disappear), new problems to tackle.  I argued that the field is indeed making progress, and we can see that progress in this creativity.  We are solving problems that we didn’t know existed (e.g. adversarial IR) or that actually didn’t exist (e.g. blog search) only several years ago.  Does this creativity imply improvement?  I argued that it does.

What is different about highly effective retrievals?

Several recent (and several not so recent) papers have focused on methods of evaluating IR systems without relevance judgments.  The appeal of this approach is obvious; forming relevance judgments is arguably the hardest part of building a test collection.  Additionally, ranking systems without judgments has implications for fusion-based IR where we would like to combine various systems’ output while bearing in mind our confidence in each system’s results.  A reliable way to rank systems without relevance judgments would make fusion in rapidly changing, very large corpora much more tractable.

I’ll save the question of how valuable judgment-free test collections would actually be for another post.  Here I’m interested in a slightly different matter.  I must preface my discussion, however, with an admission that this is an area of research I’ve come to recently.  I am SURE that some of our readers are more familiar with the literature and results in this area than I am… please bring your comments.

The issue that interests me is the problem of identifying the best-performing systems during judgment-free evaluation.  Most approaches to judgment-free system ranking can identify really poor performers.  They also do a serviceable job ranking systems that perform fairly well.  But judgment-free rankings tend to fall apart when it comes to identifying systems that perform much better than average.

This problem appeared in a relatively early paper by Ian Soboroff and others and it has continued to be problematic since then (Aslam and Savell discuss it, as well).

The mechanics at work here are easy enough to understand. Most judgment-free ranking is based on an analysis of the documents that are commonly retrieved by many systems. Systems that perform fairly well tend to return many of the same documents as other ‘pretty good’ systems. Poor performers tend to miss these documents.  But what about the best performers?  Aslam and Savell argue that most judgment-free evaluation leads to a “tyranny of the masses,” punishing systems that do anything really different from the norm.  Wu and Crestani suggest that the best performers “are somewhat peculiar”; they do something qualitatively different from average performers.  Simply by deviating from the norm, then, the best systems look bad under the common judgment-free lenses.
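To make these mechanics concrete, here is a minimal sketch in Python of one overlap-based scoring scheme. It is a deliberately simplified illustration, not the exact method of Soboroff et al. or of Aslam and Savell: the function name `pseudo_precision`, the `min_votes` threshold, and the toy runs are all hypothetical.

```python
from collections import Counter


def pseudo_precision(runs, min_votes=2):
    """Score each system for one topic without relevance judgments.

    runs: dict mapping system name -> list of retrieved doc ids.
    A document counts as pseudo-relevant if at least `min_votes`
    systems retrieved it (an assumed, simplistic rule). A system's
    score is the fraction of its retrieved documents that are
    pseudo-relevant.
    """
    votes = Counter()
    for docs in runs.values():
        votes.update(set(docs))  # each system votes once per document
    pseudo_rel = {doc for doc, n in votes.items() if n >= min_votes}
    return {
        name: (sum(doc in pseudo_rel for doc in docs) / len(docs)) if docs else 0.0
        for name, docs in runs.items()
    }


# Toy data (invented): three conventional systems and one "peculiar" one.
runs = {
    "typical_a": ["d1", "d2", "d3", "d4"],
    "typical_b": ["d1", "d2", "d3", "d5"],
    "typical_c": ["d2", "d3", "d4", "d5"],
    "peculiar": ["d1", "d7", "d8", "d9"],
}
scores = pseudo_precision(runs)
```

In this toy setup the three conventional systems each score 1.0 while the peculiar system scores 0.25, and that happens whether or not d7, d8, and d9 are actually relevant. That is the “tyranny of the masses” in miniature.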

If the best systems are doing something qualitatively different from the great unwashed, what is that difference?  Can we model it in order to improve our ranking of systems in the absence of relevance judgments?

Most of the literature on this topic focuses on TREC data.  In this context it is often the case that the best retrievals result from complex manual runs, as opposed to automatic, title-only runs. When I started looking into this a bit, I assumed that high-performance runs, by virtue of resulting from detailed statements of information need, would retrieve relevant documents that were missed by most other systems (e.g. documents that were relevant but that lacked terms from the topic title).

Pursuing this hypothesis will take some real work, but I was surprised by this figure:

 

Average recall for TREC-8 systems

The plot shows #rel_returned/#rel averaged over the 50 topics used in TREC-8 (ranking is by MAP).  Now it certainly could be true that the best systems are finding relevant documents that other systems are not.  But the best performers don’t appear to be finding more relevant documents than others.  
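For readers who want to reproduce this kind of plot, here is a minimal sketch of how the two quantities could be computed. It assumes that runs and judgments have already been loaded into plain dictionaries, and that MAP is the mean of uninterpolated average precision; the function names and the toy data are hypothetical.

```python
def recall(ranked, relevant):
    """#rel_returned / #rel for one topic."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Uninterpolated average precision for one topic."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def summarize(run, qrels):
    """Return (mean recall, MAP) for one system over all judged topics."""
    topics = list(qrels)
    mean_recall = sum(recall(run.get(t, []), qrels[t]) for t in topics) / len(topics)
    mean_ap = sum(average_precision(run.get(t, []), qrels[t]) for t in topics) / len(topics)
    return mean_recall, mean_ap


# Toy data (invented): two topics rather than TREC-8's 50.
qrels = {"t1": {"d1", "d2"}, "t2": {"d3", "d4", "d5", "d6"}}
run = {"t1": ["d1", "d9", "d2"], "t2": ["d3", "d8", "d4"]}
mean_recall, mean_ap = summarize(run, qrels)
```

Sorting systems by `mean_ap` and plotting `mean_recall` in that order would give a figure of the same shape as the one above, under these assumptions.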

To me the mystery here is why these high-performing runs appear so bad using most judgment-free evaluation measures.  Retrieving “hidden” relevant documents would indeed lead to apparent bad performance under a tyranny of the masses.  But these systems don’t have especially high recall (quite the contrary, in fact).  Are they retrieving hidden relevant documents and failing to return obvious ones?  That seems unlikely.

What are the best performers doing that sets them apart from the crowd?  Can we account for this difference in judgment-free evaluation?  Until we can, I can only be skeptical: what are we really measuring when we estimate performance using Cranfield-type methods without relevance judgments?