<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Probably Irrelevant &#187; Jon Elsas</title>
	<atom:link href="http://probablyirrelevant.org/author/jelsas/feed/" rel="self" type="application/rss+xml" />
	<link>http://probablyirrelevant.org</link>
	<description>Information Retrieval Research and Development</description>
	<lastBuildDate>Tue, 26 Jul 2011 01:19:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>IR in IBM&#8217;s Watson: An interview with Nico Schlaefer</title>
		<link>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/</link>
		<comments>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/#comments</comments>
		<pubDate>Thu, 17 Mar 2011 14:05:01 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Question Answering]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=100</guid>
		<description><![CDATA[Last month, IBM&#8217;s Watson Deep QA system took on two Jeopardy! champions and won. Several researchers at the Language Technologies Institute at Carnegie Mellon University have been involved with the project over the past few years, including Professor Eric Nyberg and his students Hideki Shima and Nico Schlaefer. Nico has been particularly involved in the [...]]]></description>
			<content:encoded><![CDATA[<p><em>Last month, <a href="http://www-03.ibm.com/innovation/us/watson/">IBM&#8217;s Watson Deep QA system</a> took on two <a href="http://www.jeopardy.com/">Jeopardy!</a> champions and won. Several researchers at the <a href="http://www.lti.cs.cmu.edu/">Language Technologies Institute</a> at <a href="http://www.cmu.edu/index.shtml">Carnegie Mellon University</a> have been involved with the project over the past few years, including <a href="http://www.cs.cmu.edu/~ehn/">Professor Eric Nyberg</a> and his students <a href="http://www.cs.cmu.edu/~hideki/">Hideki Shima</a> and <a href="http://www.cs.cmu.edu/~nico/index.html">Nico Schlaefer</a>. Nico has been particularly involved in the IR technology behind Watson, and has answered a few questions on his role in the project. This work forms the basis of his recent thesis proposal on &#8220;Statistical Source Expansion for Question Answering&#8221;.</em></p>
<p><em>If you didn&#8217;t see the Jeopardy! match, check out the <a href="http://www.youtube.com/watch?v=WFR3lOm_xhE">practice match</a>, <a href="http://www.youtube.com/watch?v=ls2IgNiOftA">interview with Professor Nyberg, Hideki and Nico</a>, and <a href="http://www-03.ibm.com/innovation/us/watson/building-watson/how-watson-works.html">background on IBM&#8217;s site</a>.</em></p>
<p><strong>Probably Irrelevant: What role does IR play in a QA system?</strong></p>
<p><strong>Nico Schlaefer: </strong>Watson and other state-of-the-art QA systems find answers in unstructured text, which is indexed and searched with IR systems. From an architectural point of view, QA applications are often built on top of an IR system &#8211; they take a natural language question and transform it into a query that can be handled by the retrieval engine, submit the query and get the search results, and then further process these results by extracting answers and scoring them. So IR plays a key role in question answering, and the performance of a QA system depends heavily on the quality of the search results.</p>
<p><strong>PI: What are good characteristics of an IR system for QA?</strong></p>
<p><strong>NS: </strong>In QA, it is quite common to generate relatively complicated queries that include term weights and proximity operators. Some systems also pre-annotate their sources with syntactic or semantic information and formulate constrained queries that leverage these annotations. In addition, QA systems often do not retrieve whole documents but shorter passages comprising just a few sentences. To be suitable for QA, an IR system should provide a rich query language, support annotations on the source documents, and allow QA systems to retrieve search results of different granularities.</p>
<p><strong>PI: Where does IR fail with respect to QA?</strong></p>
<p><strong>NS: </strong>IR often fails if there is little relevant information for a given question in the sources. The question and relevant documents may use different terminology, which makes it hard for the IR system to retrieve useful text. Query expansion or pseudo-relevance feedback can help to some extent, but often these techniques do not consistently improve performance, and some QA systems only use them as a fallback solution if an initial search does not return anything useful. Obviously, these methods are not going to help if the answer to a question is not in the sources. We developed a different approach &#8211; <strong>statistical source expansion</strong> &#8211; which overcomes some of these issues by augmenting existing sources with more relevant information and by increasing semantic redundancy.</p>
<p><strong>PI: What kinds of resources are expanded?</strong></p>
<p><strong>NS: </strong>We focused on sources we found most useful for answering Jeopardy! questions and also questions from <a href="http://trec.nist.gov">TREC</a> QA evaluations. These include encyclopedias (such as <a href="http://www.wikipedia.org/">Wikipedia</a>) and dictionaries (such as <a href="http://www.wiktionary.org/">Wiktionary</a>). More recently, we also experimented with the <a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">ClueWeb09 corpus</a>, a large web crawl created at CMU which comprises about 12 TB of English web pages.</p>
<p><strong>PI: What techniques and/or tools are you using to identify topics to expand?</strong></p>
<p><strong>NS: </strong>This depends on the sources. For example, when expanding an encyclopedia or a dictionary, we consider each document as a candidate topic. We can then sort the topics by some measure of popularity and focus on expanding the most popular ones. This approach is based on the assumption that Jeopardy! questions (and also questions in most other QA tasks, such as TREC) tend to ask about popular topics, so we get the largest performance gain out of expanding those topics. When expanding other sources that are not organized by topics, such as web crawls or newswire corpora, more sophisticated topic detection techniques become necessary. For example, the most popular topics can be identified using named entity recognizers, statistical methods or dictionaries of known topics.</p>
<p><strong>PI: How do you estimate relevance to those topics?</strong></p>
<p><strong>NS: </strong>We use a statistical model that combines a variety of features to estimate the topicality and textual quality of text passages. For example, one of the topicality features is a likelihood ratio estimated with language models. A topic model is trained using the seed document we&#8217;re expanding or related web pages retrieved for that seed, and a background model is trained on a large collection of text. The ratio of the likelihoods of a text passage under the topic model and the background model is a good indicator of topicality. Textual quality can, for example, be estimated using dictionaries of known words and n-grams. We also look at simple surface features of text passages, such as the length of a passage and its offset in the source document.</p>
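<p><em>To make the likelihood-ratio feature concrete, here is a minimal sketch. It is not Watson code: the unigram models, the add-one smoothing, the regex tokenizer, the per-token normalization, and all names (<code>UnigramModel</code>, <code>build_models</code>, <code>log_likelihood_ratio</code>) are assumptions made for illustration, and the sample texts are invented.</em></p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class UnigramModel:
    """Unigram language model with add-alpha smoothing (an assumed choice)."""

    def __init__(self, texts, vocab_size, alpha=1.0):
        self.counts = Counter()
        for text in texts:
            self.counts.update(tokenize(text))
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size
        self.alpha = alpha

    def log_prob(self, token):
        numerator = self.counts[token] + self.alpha
        denominator = self.total + self.alpha * self.vocab_size
        return math.log(numerator / denominator)


def build_models(seed_texts, background_texts):
    """Train a topic model on the seed texts and a background model on general text."""
    vocab = set()
    for text in list(seed_texts) + list(background_texts):
        vocab.update(tokenize(text))
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens
    topic = UnigramModel(seed_texts, vocab_size)
    background = UnigramModel(background_texts, vocab_size)
    return topic, background


def log_likelihood_ratio(passage, topic, background):
    """Average per-token log ratio of topic likelihood to background likelihood."""
    tokens = tokenize(passage)
    if not tokens:
        return 0.0
    total = sum(topic.log_prob(t) - background.log_prob(t) for t in tokens)
    return total / len(tokens)


# Invented toy data: one seed sentence and three unrelated background sentences.
seed = ["Tourette syndrome is a neurological disorder characterized by motor and vocal tics."]
background_texts = [
    "The stock market closed higher today.",
    "The weather is a mix of sun and rain.",
    "The football team won the match by two goals.",
]
topic_model, background_model = build_models(seed, background_texts)
on_topic = log_likelihood_ratio(
    "involuntary vocal tics are symptoms of a neurological disorder",
    topic_model, background_model)
off_topic = log_likelihood_ratio(
    "the team won the football match today",
    topic_model, background_model)
print(on_topic, off_topic)
```

<p><em>In this toy setup the on-topic passage scores above zero and the off-topic passage below zero. In a real system this score would be one feature among many, combined with the textual quality and surface features mentioned above.</em></p>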
<p><strong>PI: Could you give an example of a question that is helped by source expansion?</strong></p>
<p><strong>NS: </strong>Here is a question for which source expansion helped:</p>
<p>What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?</p>
<p>This is a question from the <a href="http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf">TREC 8 evaluation</a> [pdf], but if written as a statement (&#8220;This rare neurological disease has symptoms such as &#8230;&#8221;) I think it could also pass as a Jeopardy! question. The answer is &#8220;Tourette syndrome&#8221;.</p>
<p>We first tried to answer this question using Wikipedia as a source, and there is indeed an article about &#8220;Tourette syndrome&#8221; in our copy of Wikipedia, but unfortunately it doesn&#8217;t mention most of the keywords in the question and Watson wasn&#8217;t able to get the answer. We then expanded Wikipedia, and &#8220;Tourette syndrome&#8221; was one of the topics that was automatically selected. The expanded article contains the following text passages which, by the way, all come from different websites:</p>
<ul>
<li>Rare neurological disease that causes repetitive motor and vocal tics</li>
<li>The first symptoms usually are involuntary movements (tics) of the face, arms, limbs or trunk.</li>
<li>Tourette’s syndrome (TS) is a neurological disorder characterized by repetitive, stereotyped, involuntary movements and vocalizations called tics.</li>
<li>The person afflicted may also swear or shout strange words, grunt, bark or make other loud sounds.</li>
</ul>
<p>These passages jointly almost perfectly cover the question keywords. I think the only content word that is not in there is &#8220;incoherent&#8221;. This made it very easy for Watson to find the answer.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Finding relevance judgements in the wild</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/</link>
		<comments>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/#comments</comments>
		<pubDate>Tue, 14 Apr 2009 14:45:41 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Social Media]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61</guid>
		<description><![CDATA[We recently heard our poster on online forum search was accepted to SIGIR 09, and I&#8217;ve been wanting to post something about the test setup we used in that study.
There&#8217;s no existing IR test collection for such a task, although some similar datasets do exist.   For various reasons we weren&#8217;t able to create [...]]]></description>
			<content:encoded><![CDATA[<p>We recently heard our <a href="http://www.cs.cmu.edu/~jelsas/papers/SIGIR2009-ForumThreadSearch_poster.pdf">poster on online forum search</a> was accepted to <a href="http://www.sigir2009.org">SIGIR 09</a>, and I&#8217;ve been wanting to post something about the test setup we used in that study.</p>
<p>There&#8217;s no existing IR test collection for such a task, although <a href="http://www.ins.cwi.nl/projects/trec-ent/">some similar datasets do exist</a>.   For various reasons we weren&#8217;t able to create a traditional test collection, with user-issued queries and deep pools of relevance judgements.  But this particular dataset, and possibly other online dialog archives, can be mined to produce a ready-made IR test collection.</p>
<p>The users of <a href="http://forums.macrumors.com/">the online forum we&#8217;ve been looking at</a> frequently include links in their forum posts &#8212; often to previous messages and threads in the same forum. These links are sometimes in response to a new user&#8217;s question, and refer the user to a previous instance of the same (or similar) question and an answer contributed by another user.  Here are <a href="http://forums.macrumors.com/showthread.php?p=1359222">a</a> <a href="http://forums.macrumors.com/showthread.php?p=4879012">few</a> <a href="http://forums.macrumors.com/showthread.php?p=1054727">examples</a> to illustrate my point.  This interaction among forum users can be used as a form of query/relevance judgement pair.  See <a href="http://www.cs.cmu.edu/~jelsas/papers/SIGIR2009-ForumThreadSearch_poster.pdf">the paper</a> for a few more details on how we characterize the presence of a question-post/answer-link pair.</p>
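<p>As a rough illustration of the idea, the sketch below pulls query/judgement pairs out of a single thread. The thread representation, the field names, the link pattern, and the rule that the first post is the question are all simplifying assumptions for this example; they are not the criteria we used in the paper, and the sample posts are made up.</p>

```python
import re

# Assumed link shape: an in-forum link that carries a post id in "p=".
LINK_RE = re.compile(r"showthread\.php\?p=(\d+)")


def mine_pairs(thread):
    """Return (question_text, linked_post_id) pairs from one thread.

    `thread` is a hypothetical list of dicts with "post_id" and "body" keys,
    in chronological order, where the first post is treated as the question.
    """
    question = thread[0]
    pairs = []
    for reply in thread[1:]:
        for target in LINK_RE.findall(reply["body"]):
            pairs.append((question["body"], target))
    return pairs


# Invented example thread.
thread = [
    {"post_id": "100", "body": "How do I reset the SMC on my MacBook?"},
    {"post_id": "101", "body": "Answered here: http://forums.macrumors.com/showthread.php?p=12345"},
    {"post_id": "102", "body": "Thanks, that worked."},
]
pairs = mine_pairs(thread)
print(pairs)
```

<p>Each pair treats the question text as a query and the linked post as a judged-relevant result.</p>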
<p>This type of test collection creation does have some distinct advantages over the typical retrieval test collections used at TREC.  First, the queries represent real information needs of real users of the online forum.  Many TREC queries are pulled from search engine logs, but frequently (as in the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">Blog Track</a>&#8217;s Feed Distillation task) the queries are invented by participants or assessors.  The information needs present in the online forum posts are much more verbose than typical keyword queries on a web search engine, giving a retrieval system more evidence to use in relevance scoring.  The &#8220;relevance judgement&#8221;, provided by another forum user linking to a previous thread, also presents <em>in-situ relevance information</em> &#8212; sensitive not only to the original question, but also to the overall nature of the forum and the time when the question was asked.</p>
<p>There are several drawbacks inherent in this type of corpus creation, most importantly with regard to the exhaustiveness of the relevance assessment.  Typically in TREC-style collection development, ranked results from several retrieval systems are pooled and those pooled documents are assessed for relevance.  When the systems&#8217; output is sufficiently diverse and relevance assessment is sufficiently deep, this produces a reasonably complete relevance assessment for each query &#8212; if a relevant document is in the collection, it would most likely be retrieved by one of the systems and be judged by being admitted into the pool.  The method of collecting relevance judgements we use in our SIGIR poster, on the other hand, will not produce anything close to an exhaustive set of relevant threads.  In the great majority of cases, only a single thread is linked to in a subsequent reply message.  There is no guarantee that this thread is the best or only relevant thread in the collection.   For this reason, we must take care not to assume non-judged threads are necessarily irrelevant.</p>
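<p>For readers unfamiliar with pooling, here is a minimal sketch of the depth-k version described above. The system names, document ids, and pool depth are invented for illustration.</p>

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from each system's ranked list."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Invented ranked lists from three hypothetical retrieval systems.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d8", "d9"],
}

pool = build_pool(runs, depth=2)

# Documents retrieved by some system but never sent to an assessor.
retrieved = {doc for ranking in runs.values() for doc in ranking}
unjudged = retrieved - pool
print(sorted(pool), sorted(unjudged))
```

<p>Only the pooled documents are judged; everything else stays unjudged. With link-mined judgements the judged set per query is usually a single thread, so the unjudged set is far larger.</p>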
<p>There are plenty of datasets that seem to be ready-made for classification or regression tasks, without any need for annotation &#8212; for example the classic <a href="http://people.csail.mit.edu/jrennie/20Newsgroups/">20 newsgroups</a> for text classification and <a href="http://answers.yahoo.com/">Yahoo! Answers</a> for a number of <a href="http://www.mathcs.emory.edu/~eugene/papers/sigir2008-cqa-satisfaction.pdf">prediction</a> <a href="http://www.mathcs.emory.edu/~eugene/papers/acl08s_cqa-personalization-prelim.pdf">tasks</a>.  For relevance ranking, however, I haven&#8217;t seen any ready-made datasets with real relevance <em>judgements</em>, as opposed to noisy interaction indicators such as click-through statistics.  Conversation archives like the one we use offer one way to mine behavioral data for relevance judgements, offering ground-truth preferable in many ways to post-hoc relevance assessment.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Directions in Search over Social Media</title>
		<link>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/</link>
		<comments>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/#comments</comments>
		<pubDate>Fri, 07 Nov 2008 19:41:42 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Blog Search]]></category>
		<category><![CDATA[Social Media]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=40</guid>
		<description><![CDATA[In his keynote at the Search in Social Media workshop at CIKM, Andrew Tomkins suggested that there is plenty of room for academic IR research progress in social media.  I happen to agree.
Community generated content has been all the rage for a few years:  blogs, Wikipedia, online forums, twitter, Yahoo! Answers, and the list goes on. [...]]]></description>
			<content:encoded><![CDATA[<p><em>In his keynote at the </em><a href="http://ir.mathcs.emory.edu/SSM2008/"><em>Search in Social Media workshop at CIKM</em></a><em>, </em><a href="http://datamining.typepad.com/data_mining/2008/10/search-and-social-media-cikm-2008-rough-notes-from-keynote-by-andrew-tomkins.html"><em>Andrew Tomkins suggested</em></a><em> that there is plenty of room for academic IR research progress in social media.  I happen to agree.</em></p>
<p>Community generated content has been all the rage for a few years:  <a href="http://probablyirrelevant.org">blogs</a>, <a href="http://wikipedia.org">Wikipedia</a>, online forums, <a href="http://twitter.com">twitter</a>, <a href="http://answers.yahoo.com">Yahoo! Answers</a>, and the list goes on.  Many of these generate a large volume of archived data &#8212; some in the form of more or less polished documents, like a blog post or Wikipedia article;  others, like twitter, are snippets of an often one-sided conversation and broadcast messages.</p>
<p>From the IR researcher&#8217;s perspective, is it worth studying these <em>artifacts</em> of &#8220;social media&#8221;?  Is there something that distinguishes these from other document collections?  If so, how can we leverage that distinction in our retrieval models?  This post aims to answer a couple of these questions and hopefully bring up a few more.</p>
<p>First and foremost, we need to identify whether there is value in providing access to artifacts of social media.  Some, like twitter, seem to be mostly ephemeral, only (generally) interesting in the moment and quickly fading from view.  Even the twitter search engine advertises: &#8220;See what&#8217;s happening — right now&#8221; and the results (as far as I can tell) are only ranked chronologically.  </p>
<p>Many other types of social media &#8212; some existing long before Web 2.0 was born &#8212; can be real treasure-troves of information.  There exists an online forum, public mailing list, newsgroup or message board for virtually every special interest group under the sun &#8212; from <a href="http://forums.gardenweb.com/forums/">gardening</a>, to <a href="http://www.homebrewtalk.com/">home-brewing</a>, to <a href="http://forums.macrumors.com/">Apple computers</a>.  These are often heavily trafficked, populated with real subject matter experts, and host a rich information exchange.  I would argue that the content created through these social media outlets presents enormous value to searchers, and information retrieval research has a lot to contribute in this corner of social media.</p>
<p>What makes these document collections different than what has been previously studied?  Can we just treat them the same as web pages?  Or do they need special consideration?</p>
<p>In many of these collections, the unit of retrieval &#8212; what we consider a document &#8212; is not fixed, but rather dependent on the task.  Consider online forums, often organized into topical sub-forums, which in turn are organized into conversation threads of individual posts.  Some information needs may only require a single post as a result, some require the context of the full conversation thread, and others may need to retrieve a pertinent sub-forum.</p>
<p>These collections often offer another orthogonal axis of retrieval &#8212; the author.  In highly trafficked message boards and mailing lists, tens or hundreds of thousands of users with varying levels of expertise contribute to the conversation.  One may wish to find subject matter experts to address a question to, or favor message threads with contributions from those more likely to know the answer.</p>
<p>These factors, of course, are not entirely unique to social media search, and have to some degree been addressed in previous research.  This question of identifying the granularity of the unit of retrieval has been addressed at the document level (for example in XML element retrieval at <a href="http://inex.is.informatik.uni-duisburg.de/">INEX</a>), but not so much at the collection level.   <a href="http://www.cs.purdue.edu/homes/lsi/f233-Si.pdf">Resource ranking in federated search</a> and <a href="http://www.dcs.gla.ac.uk/Keith/Chapter.3/Ch.3.html">cluster-based retrieval</a> bear some resemblance to the selection of a topical sub-collection, such as a sub-forum ranking.  Author-ranking has also been studied at <a href="http://trec.nist.gov">TREC</a> in the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">Blog</a> and <a href="http://www.ins.cwi.nl/projects/trec-ent/wiki/index.php/Main_Page">Enterprise Tracks</a>.  But each of these has been studied in isolation, without much regard to the interaction between the different aspects of the collection.  To my knowledge, no IR testbeds exist that contain the rich <em>collection</em> structure offered in these types of social media.</p>
<p>This, in my mind, is the real promise of research in search over social media.  These collections provide multiple levels of organizational granularity, different axes of organization, multiple types of searchable objects, and relations among those objects.  I predict that this will be an interesting and fertile direction of information retrieval research &#8212; pushing the systems to support more sophisticated multi-dimensional indexing and extending existing retrieval models to handle rich relationships between documents.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2008/11/directions-in-search-over-social-media/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Welcome to Probably Irrelevant</title>
		<link>http://probablyirrelevant.org/2008/05/welcome-to-probably-irrelevant/</link>
		<comments>http://probablyirrelevant.org/2008/05/welcome-to-probably-irrelevant/#comments</comments>
		<pubDate>Fri, 02 May 2008 00:37:13 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Site News]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=7</guid>
		<description><![CDATA[IR is far, far more than a branch of computer science, concerned primarily with issues of algorithms, computers, and computing.
Tefko Saracevic, Acceptance address for the 1997 Gerard Salton Award.
Probably Irrelevant is a group blog on information retrieval and all things related. It serves as an open forum for IR research and development discussion. We aspire to [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>IR is far, far more than a branch of computer science, concerned primarily with issues of algorithms, computers, and computing.</p></blockquote>
<p><a href="http://www.scils.rutgers.edu/~tefko/">Tefko Saracevic</a>, Acceptance address for the 1997 Gerard Salton Award.</p>
<p><a href="http://probablyirrelevant.org">Probably Irrelevant</a> is a group blog on information retrieval and all things related. It serves as an open forum for IR research and development discussion. We aspire to have a wide range of IR researchers and practitioners contribute to the blog &#8212; from academia and industry, professors and students, evangelists and critics.</p>
<p>Of course, if you&#8217;d like to contribute, please leave a comment or <a href="mailto:contributions@probablyirrelevant.org">contact us</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2008/05/welcome-to-probably-irrelevant/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

