Blogs, queries, corpora

In 2006, I was studying information retrieval at the University of Massachusetts and, during a Friday of extreme impatience, I installed WordPress, started Apache and created a blog called “Information Retrieval”. After a handful of posts over the course of six months, the comments queue filled with spam and WordPress stopped working. It is with this dubious evidence that I have been asked by my esteemed colleagues to write the first post of “Probably Irrelevant”. The talent represented by those nominating me will ensure that “Probably Irrelevant” sees a little more life than “Information Retrieval” (if it has not already, based on the title alone).

Now, it seems appropriate that the inaugural post of an information retrieval blog should address the subject of “blog search”. Unfortunately, I am dreadfully less qualified than my co-authors to discuss the state of the art. So, I apologize in advance for errors, omissions, or general ridiculousness and lay blame on Kevyn and Jonathan.

Now, when I started “Information Retrieval”, one of the first messages I received was from a senior member of the IR community. He wrote,

Maybe you could blog about why anyone is interested in blogs :-)

I replied,

I’ll keep this in mind when you’re chairing a session on blog search at SIGIR 2010.

I will not identify the original commenter but encourage conference attendees to pay attention in Geneva.

Of course, this comment deserves some thought. One of the issues with blog search is the under-defined taxonomy of queries. The TREC Blog Track defines the following tasks:

  • blog post retrieval (i.e. “Find me posts about X.”)
  • opinion retrieval (i.e. “What do people think about X?”)
  • polarity (i.e. “Find me positive posts about X.”)
  • feed distillation (i.e. “Find me a blog with a principal, recurring interest in X.”)

One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers? Or are they merely contrived by the track organizers while trading pints at the Gaithersburg Marriott? These are questions, not criticisms. I think these are fine tasks, but we have to be careful to define queries which are representative of those being issued to blog search engines or which, more generally, fulfill some desire users have. The problem with a new corpus is that we do not yet fully understand how users interact with it. Who will actually use these systems? Casual blog readers? Marketers? Political scientists? Sociologists?

The majority of time in an “Introduction to Information Retrieval” course is devoted to modeling documents. And, yes, we have sophisticated models of documents. We decompose individual documents using passages, sentences, or other exploitable structure. We also model the corpus as a whole either explicitly (e.g. cluster-based retrieval, latent semantic indexing, regularization) or implicitly (e.g. pseudo-relevance feedback).

For an information retrieval researcher, a corpus without queries is a corpse. Queries make information retrieval different from unsupervised learning. Also, because they are so short, queries make information retrieval different from traditional text classification. While information retrieval research has focused on ranking documents given a query, prior to the late 1990s, there were very few (published) results on modeling queries in aggregate. However, with the advent of web search engines, there has been a growing body of work on such models. These include descriptive studies of web query frequencies and user clicking behavior as well as models for query similarity and clicking behavior. These results have mainly been presented for web users and queries; I would be very interested in seeing whether the results generalize to non-web search scenarios.

To come back to blog search, I believe we need a better understanding of both the corpus and the queries before defining tasks. Blog corpora exist and are actively being studied. I am less certain about blog queries. One approach would be to inspect query logs from blog search engines for different retrieval scenarios and then improve performance for those scenarios. Of course, some of us are engineers who sometimes want to build a tool simply because we believe it would be used. However, if there is a mismatch between what we believe will be useful and what users find useful, then we have wasted time.*
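As a rough illustration of what a first pass over a query log might look like, here is a minimal sketch in Python. It assumes a hypothetical log format of one "timestamp, tab, query" record per line; the sample lines are invented for illustration and do not come from any real blog search engine. It only counts normalized query strings, which is the simplest descriptive statistic one could compute before grouping queries into retrieval scenarios.

```python
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants count as one query."""
    return " ".join(query.lower().split())


def query_frequencies(log_lines):
    """Count normalized queries from lines of the assumed form 'timestamp<TAB>query'."""
    counts = Counter()
    for line in log_lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        query = normalize(parts[1])
        if query:
            counts[query] += 1
    return counts


# Hypothetical log lines, invented purely for illustration.
sample_log = [
    "2008-10-01T09:00\tChristmas",
    "2008-10-01T09:05\tchristmas ",
    "2008-10-01T09:07\tiphone  reviews",
    "bad line without a tab",
]

counts = query_frequencies(sample_log)
for query, n in counts.most_common():
    print(n, query)
```

A real study would of course go further (sessions, clicks, manual labeling of intent), but even a frequency table like this would show whether the queries people issue resemble the tasks we define.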

I’ve touched on a lot in this first post and hope it serves as a starting point of discussion. So, welcome to “Probably Irrelevant”.

*I just became aware of a paper to be presented at CIKM entitled “What Should Blog Search Look Like?” which I hope will answer some of these questions.

Editor’s Note: Many thanks to Fernando for authoring our first post. He couldn’t have chosen a more timely topic: the TREC 2008 Blog Track judgements are underway, Iadh Ounis has recently posted a call for suggestions for the 2009 tasks, Jeff Dalton has an insightful response, and Marti Hearst’s paper is now online.

3 Responses to “Blogs, queries, corpora”

  1. Fernando, although blog search is an area that I have had a lot of fun working in, I agree that the tasks need to be better defined and more rooted in reality. This really became evident in last year’s TREC track. As in other nascent TREC tracks, the participants developed queries and then judged them after runs were submitted. Some of the submitted queries really seemed like oddball blog search queries (“christmas”) and the corresponding relevance judgements were unrealistic for a real web search task (> 100 relevant blogs). A tighter task definition and queries from real query logs would’ve certainly helped the situation.

    But I don’t think the TREC blog tasks are as separated from reality as you seem to imply. No doubt the research community can benefit from a better understanding of queries and information needs, and that goes for pretty much every IR task considered at TREC. I think, at least to some degree, the blog track tasks were inspired by some commercial examples of blog information access. The distillation task is really doing what Google is doing at the top of their blog search list. Bloglines is another good example. The opinion task is similar to the sentiment mining services BlogPulse provided for corporate customers (based on my understanding from previous discussions with Matt Hurst & Natalie Glance).

  2. While I may be partial to an occasional pint of beer, I would have to say that a great deal of thought goes into the defining of a TREC track and the corresponding tasks! Tasks must be proposed and motivated prior to TREC, in proposals which are rated by the TREC program committee. In essence, we didn’t come up with them on the back of a beer mat.

    You suggest a query log analysis to motivate the tasks. Why indeed, what a great idea! Indeed, it’s funny that the opening of each TREC Blog track overview describes the tasks, and motivates them using a study of blog search query logs. Actually, if you care to read the overview papers or the ICWSM paper, you’ll find that all opinion finding queries for TREC 06 and 07 were sampled from a real query log.

    Nevertheless Fernando, if your company is able to provide more up-to-date query logs to allow the tasks to be refined, your help would be much appreciated.

    Further reading:
    1. Overview of TREC Blog track 2007. C. Macdonald, I. Ounis & I. Soboroff. TREC 2007 Proceedings.
    2. Overview of TREC Blog track 2006. I. Ounis et al. TREC 2006 Proceedings.
    3. On TREC Blog Track. I. Ounis et al. Proceedings of ICWSM 2008. [Video].
    4. A Study of Blog Search. G. Mishne & M. de Rijke. Proceedings of ECIR 2006.

  3. Hey Craig. Thank you for clarifying the process of developing the Blog tracks. We’ll have to discuss getting you those query logs over a pint sometime.
