<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Probably Irrelevant &#187; Blog Search</title>
	<atom:link href="http://probablyirrelevant.org/topics/blog-search/feed/" rel="self" type="application/rss+xml" />
	<link>http://probablyirrelevant.org</link>
	<description>Information Retrieval Research and Development</description>
	<lastBuildDate>Tue, 26 Jul 2011 01:19:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Directions in Search over Social Media</title>
		<link>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/</link>
		<comments>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/#comments</comments>
		<pubDate>Fri, 07 Nov 2008 19:41:42 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Blog Search]]></category>
		<category><![CDATA[Social Media]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=40</guid>
		<description><![CDATA[In his keynote at the Search in Social Media workshop at CIKM, Andrew Tomkins suggested that there is plenty of room for academic IR research progress in social media.  I happen to agree.
Community generated content has been all the rage for a few years:  blogs, Wikipedia, online forums, twitter, Yahoo! Answers, and the list goes on. [...]]]></description>
			<content:encoded><![CDATA[<p><em>In his keynote at the </em><a href="http://ir.mathcs.emory.edu/SSM2008/"><em>Search in Social Media workshop at CIKM</em></a><em>, </em><a href="http://datamining.typepad.com/data_mining/2008/10/search-and-social-media-cikm-2008-rough-notes-from-keynote-by-andrew-tomkins.html"><em>Andrew Tomkins suggested</em></a><em> that there is plenty of room for academic IR research progress in social media.  I happen to agree.</em></p>
<p>Community generated content has been all the rage for a few years:  <a href="http://probablyirrelevant.org">blogs</a>, <a href="http://wikipedia.org">Wikipedia</a>, online forums, <a href="http://twitter.com">twitter</a>, <a href="http://answers.yahoo.com">Yahoo! Answers</a>, and the list goes on.  Many of these generate a large volume of archived data &#8212; some in the form of more or less polished documents, like a blog post or Wikipedia article;  others, like twitter, are snippets of an often one-sided conversation and broadcast messages.</p>
<p>From the IR researcher&#8217;s perspective, is it worth studying these <em>artifacts</em> of &#8220;social media&#8221;?  Is there something that distinguishes these from other document collections?  If so, how can we leverage that distinction in our retrieval models?  This post aims to answer a couple of these questions and hopefully bring up a few more.</p>
<p>First and foremost, we need to identify whether there is value in providing access to artifacts of social media.  Some, like twitter, seem to be mostly ephemeral, only (generally) interesting in the moment and quickly fading from view.  Even the twitter search engine advertises: &#8220;See what&#8217;s happening — right now&#8221; and the results (as far as I can tell) are only ranked chronologically.  </p>
<p>Many other types of social media &#8212; some existing long before Web 2.0 was born &#8212; can be real treasure-troves of information.  There exists an online forum, public mailing list, newsgroup or message board for virtually every special interest group under the sun &#8212; from <a href="http://forums.gardenweb.com/forums/">gardening</a>, to <a href="http://www.homebrewtalk.com/">home-brewing</a>, to <a href="http://forums.macrumors.com/">Apple computers</a>.  These are often heavily trafficked, populated with real subject matter experts, and host a rich information exchange.  I would argue that the content created through these social media outlets presents an enormous value to searchers, and information retrieval research has a lot to contribute in this corner of social media.</p>
<p>What makes these document collections different from what has been previously studied?  Can we just treat them the same as web pages?  Or do they need special consideration?</p>
<p>In many of these collections, the unit of retrieval &#8212; what we consider a document &#8212; is not fixed, but rather dependent on the task.  Consider online forums, often organized into topical sub-forums, which in turn are organized into conversation threads of individual posts.  Some information needs may only require a single post as a result, some require the context of the full conversation thread, and others may need to retrieve a pertinent sub-forum.</p>
<p>These collections often offer another orthogonal axis of retrieval &#8212; the author.  In highly trafficked message boards and mailing lists, tens or hundreds of thousands of users with varying levels of expertise contribute to the conversation.  One may wish to find subject matter experts to address a question to, or favor message threads with contributions from those more likely to know the answer.</p>
<p>These factors, of course, are not entirely unique to social media search, and have to some degree been addressed in previous research.  This question of identifying the granularity of the unit of retrieval has been addressed at the document level (for example in XML element retrieval at <a href="http://inex.is.informatik.uni-duisburg.de/">INEX</a>), but not so much at the collection level.   <a href="http://www.cs.purdue.edu/homes/lsi/f233-Si.pdf">Resource ranking in federated search</a> and <a href="http://www.dcs.gla.ac.uk/Keith/Chapter.3/Ch.3.html">cluster-based retrieval</a> bear some resemblance to the selection of a topical sub-collection, such as a sub-forum ranking.  Author-ranking has also been studied at <a href="http://trec.nist.gov">TREC</a> in the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">Blog</a> and <a href="http://www.ins.cwi.nl/projects/trec-ent/wiki/index.php/Main_Page">Enterprise Tracks</a>.  But each of these has been studied in isolation, without much regard to the interaction between the different aspects of the collection.  To my knowledge, no IR testbeds exist that contain the rich <em>collection</em> structure offered in these types of social media.</p>
<p>This, in my mind, is the real promise of research in search over social media.  These collections provide multiple levels of organizational granularity, different axes of organization, multiple types of searchable objects, and relations among those objects.  I predict that this will be an interesting and fertile direction of information retrieval research &#8212; pushing the systems to support more sophisticated multi-dimensional indexing and extending existing retrieval models to handle rich relationships between documents.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Blogs, queries, corpora</title>
		<link>http://probablyirrelevant.org/2008/09/blogs-queries-corpora/</link>
		<comments>http://probablyirrelevant.org/2008/09/blogs-queries-corpora/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 16:59:08 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Blog Search]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=9</guid>
		<description><![CDATA[In 2006, I was studying information retrieval at the University of Massachusetts and, during a Friday of extreme impatience, I installed WordPress, started Apache and created a blog called “Information Retrieval”.  After a handful of posts over the course of six months, the comments queue filled with spam and WordPress stopped working.  It [...]]]></description>
			<content:encoded><![CDATA[<p>In 2006, I was studying information retrieval at the University of Massachusetts and, during a Friday of extreme impatience, I installed WordPress, started Apache and created a blog called “Information Retrieval”.  After a handful of posts over the course of six months, the comments queue filled with spam and WordPress stopped working.  It is with this dubious evidence that I have been asked by my esteemed colleagues to write the first post of “Probably Irrelevant”.  The talent represented by those nominating me will ensure that “Probably Irrelevant” will see a little more life than “Information Retrieval” (if it has not already based on the title alone).</p>
<p>Now, it seems appropriate that the inaugural post of an information retrieval blog should address the subject of “blog search”.  Unfortunately, I am dreadfully less qualified than my co-authors to discuss the state of the art.  So, I apologize in advance for errors, omissions, or general ridiculousness and lay blame on Kevyn and Jonathan.</p>
<p>Now, when I started “Information Retrieval”, one of the first messages I received was from a senior member of the IR community.  He wrote,</p>
<blockquote><p>Maybe you could blog about why anyone is interested in blogs :-)</p></blockquote>
<p>I replied,</p>
<blockquote><p>I&#8217;ll keep this in mind when you&#8217;re chairing a session on blog search at SIGIR 2010.</p></blockquote>
<p>I will not identify the original commenter but encourage conference attendees to pay attention in Geneva.</p>
<p>Of course, this comment deserves some thought.  One of the issues with blog search is the under-defined taxonomy of queries.  The TREC Blog Track defines the following tasks:</p>
<ul>
<li>blog post retrieval (i.e. “Find me posts about X.”)</li>
<li>opinion retrieval (i.e. “What do people think about X?”)</li>
<li>polarity (i.e. “Find me positive posts about X.”)</li>
<li>feed distillation (i.e. “Find me a blog with a principal, recurring interest in X.”)</li>
</ul>
<p>One question I hope will be resolved in the comments is where these query types came from.  Are they derived from actual blog searchers?   Or are they merely contrived by the track organizers while trading pints at the Gaithersburg Marriott?  These are questions, not criticisms. I think these are fine tasks but we have to be careful to define queries which are representative of those being issued to blog search engines or, more generally, fulfill some desire users have.  The problem with a new corpus is that how users interact with it is still not completely developed.  What users will actually use these systems?  Casual blog readers?  Marketers?  Political scientists?  Sociologists?</p>
<p>The majority of time in an “Introduction to Information Retrieval” course is devoted to modeling documents.  And, yes, we have sophisticated models of documents. We decompose individual documents using passages, sentences, or other exploitable structure.  We also model the corpus as a whole either explicitly (e.g. cluster-based retrieval, latent semantic indexing, regularization) or implicitly (e.g. pseudo-relevance feedback).</p>
<p>For an information retrieval researcher, a corpus without queries is a corpse.  Queries make information retrieval different from unsupervised learning.  Also, because they are so short, queries make information retrieval different from traditional text classification.  While information retrieval research has focused on ranking documents given a query, prior to the late 1990s, there were very few (published) results on modeling queries in aggregate.  However, with the advent of web search engines, there has been a growing body of work on such models.  These include descriptive studies of web query frequencies and user clicking behavior as well as models for query similarity and clicking behavior.  These results have mainly been presented for web users and queries; I would be very interested in seeing whether the results generalize to non-web search scenarios.</p>
<p>To come back to blog search, I believe we need a better understanding of both the corpus and the queries before defining tasks.  Blog corpora exist and are actively being studied.  I am less certain about blog queries.  One approach would be to inspect query logs from blog search engines for different retrieval scenarios and then improve performance for those scenarios.  Of course, some of us are engineers who sometimes desire to build a tool because we believe it would be used.  However, if there is a mismatch between what we believe will be useful and what users find useful, then we have wasted time.*</p>
<p>I’ve touched on a lot in this first post and hope it serves as a starting point of discussion.  So, welcome to &#8220;Probably Irrelevant&#8221;.</p>
<p>*I just became aware of a paper to be presented at CIKM entitled “What Should Blog Search Look Like?” which I hope will answer some of these questions.</p>
<p><em>Editor&#8217;s Note:</em><em> Many thanks to Fernando for authoring our first post.  He couldn&#8217;t have chosen a more timely topic: the TREC 2008 Blog Track judgements are underway, <a href="http://terrierteam.blogspot.com/2008/09/about-blog-search-tasks.html">Iadh Ounis has recently posted a call for suggestions for the 2009 tasks</a>, <a href="http://www.searchenginecaffe.com/2008/09/trec-2009-blog-track-thoughts.html">Jeff Dalton has an insightful response</a>, and <a href="http://people.ischool.berkeley.edu/%7Ehearst/papers/blogsearch08.pdf">Marti Hearst&#8217;s paper is now online</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2008/09/blogs-queries-corpora/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

