<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Probably Irrelevant</title>
	<atom:link href="http://probablyirrelevant.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://probablyirrelevant.org</link>
	<description>Information Retrieval Research and Development</description>
	<lastBuildDate>Tue, 26 Jul 2011 01:19:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Informal SIGIR Test of Time Award</title>
		<link>http://probablyirrelevant.org/2011/06/informal-sigir-test-of-time-award/</link>
		<comments>http://probablyirrelevant.org/2011/06/informal-sigir-test-of-time-award/#comments</comments>
		<pubDate>Wed, 29 Jun 2011 21:55:16 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Conferences]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=112</guid>
		<description><![CDATA[I have the fortune of attending ICML this year and hope to report on that next week.   Like other conferences, ICML includes a Test of Time award &#8220;given to papers that time and hindsight proved to be of lasting value to the Machine Learning community.&#8221;  This year, the award went to &#8216;Conditional Random Fields: Probabilistic [...]]]></description>
			<content:encoded><![CDATA[<p>I have the fortune of attending ICML this year and hope to report on that next week.   Like other conferences, ICML includes a Test of Time award &#8220;given to papers that time and hindsight proved to be of lasting value to the Machine Learning community.&#8221;  This year, the award went to &#8216;<a href="http://portal.acm.org/citation.cfm?id=655813" target="_blank">Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</a>&#8217; by John Lafferty, Andrew McCallum, and Fernando Pereira.</p>
<p>As an exercise, I scanned the list of titles from SIGIR 2001 and created a poll to see which papers readers would nominate for an informal SIGIR 2011 Test of Time award.  The poll can be found <a href="http://poll.fm/33am1">here</a>.  Three votes per person.  Poll closes on July 24, 2011.</p>
<p><strong>Update: </strong>Looks like I did not read the fine print of the polling site closely enough: the poll will close on July 24, 2011 or when 100 votes have been received, whichever comes first.  Currently at 63 votes total.  I did specify &#8220;informal&#8221;, didn&#8217;t I?</p>
<p><strong>Update: </strong>The poll closed over the weekend and the top three papers captured a little over 50% of the votes,</p>
<ol>
<li>&#8220;Relevance based language models&#8221;, Victor Lavrenko, W. Bruce Croft (24.39%; citations: 252/ACM,618/G)</li>
<li>&#8220;A statistical learning model of text classification for support vector machines&#8221;, Thorsten Joachims (15.85%; citations: 60/ACM,215/G)</li>
<li>&#8220;A study of smoothing methods for language models applied to Ad Hoc information retrieval&#8221;, Chengxiang Zhai, John Lafferty (13.41%; citations: 281/ACM,709/G)</li>
</ol>
<p>The best paper at SIGIR 2001 was &#8220;Temporal summaries of new topics&#8221;, James Allan, Rahul Gupta, Vikas Khandelwal (1.22%; citations: 43/ACM,132/G).  I cannot find an easy way to get the most cited paper at the conference.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/06/informal-sigir-test-of-time-award/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGIR 2011 ACCEPTED PAPERS THREAD</title>
		<link>http://probablyirrelevant.org/2011/06/sigir-2011-accepted-papers-thread/</link>
		<comments>http://probablyirrelevant.org/2011/06/sigir-2011-accepted-papers-thread/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 19:29:41 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Conferences]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=110</guid>
		<description><![CDATA[Please visit Ian&#8217;s post on Not Relevant for pre-prints of SIGIR 2011 accepted papers.
]]></description>
			<content:encoded><![CDATA[<p>Please visit <a title="SIGIR 2011 Previews" href="http://nonrel.wordpress.com/2011/06/01/sigir-2011-previews/" target="_self">Ian&#8217;s post</a> on <a title="Not Relevant" href="http://nonrel.wordpress.com" target="_self">Not Relevant</a> for pre-prints of <a title="SIGIR 2011 accepted papers" href="http://sigir2011.org/papers.htm" target="_self">SIGIR 2011 accepted papers</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/06/sigir-2011-accepted-papers-thread/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2011-12 Computing Innovation Fellows Opportunities</title>
		<link>http://probablyirrelevant.org/2011/05/2011-12-computing-innovation-fellows-opportunities/</link>
		<comments>http://probablyirrelevant.org/2011/05/2011-12-computing-innovation-fellows-opportunities/#comments</comments>
		<pubDate>Tue, 10 May 2011 01:32:42 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[students]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=107</guid>
		<description><![CDATA[Upcoming PhD graduates should note the call for applications for the Computing Innovation Fellows Program,
The goals of the CIFellows Project are to retain new Ph.D. scholars in research and teaching during challenging economic times, while also supporting intellectual renewal and diversity in the computing fields at U.S. organizations. A total of 107 [...]]]></description>
			<content:encoded><![CDATA[<p>Upcoming PhD graduates should note the call for applications for the <a href="http://cifellows.org/">Computing Innovation Fellows Program</a>,</p>
<blockquote><p>The goals of the CIFellows Project are to retain new Ph.D. scholars in research and teaching during challenging economic times, while also supporting intellectual renewal and diversity in the computing fields at U.S. organizations. A total of 107 Ph.D.s have been supported through the program since 2009 (see the box at right for more details).</p>
<p>These CIFellows have received outstanding research and teaching enrichment experiences, and several have landed permanent positions (including tenure-track faculty appointments) in academia and industry as a result of their experiences.</p>
<p>CRA/CCC will make awards for the 2011-12 academic year. The exact number of awards is contingent upon the quality of applications received as well as the outcome of a proposal for funding that we have submitted.</p></blockquote>
<p>Fellowships support &#8220;positions at universities, industrial research laboratories, and other organizations that are pursuing innovation in computing and its positive impact on society.&#8221;</p>
<p>The deadline is <strong>May 31, 2011</strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/05/2011-12-computing-innovation-fellows-opportunities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IR in IBM&#8217;s Watson: An interview with Nico Schlaefer</title>
		<link>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/</link>
		<comments>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/#comments</comments>
		<pubDate>Thu, 17 Mar 2011 14:05:01 +0000</pubDate>
		<dc:creator>Jon Elsas</dc:creator>
				<category><![CDATA[Question Answering]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=100</guid>
		<description><![CDATA[Last month, IBM&#8217;s Watson Deep QA system took on two Jeopardy! champions and won. Several researchers at the Language Technologies Institute at Carnegie Mellon University have been involved with the project over the past few years, including Professor Eric Nyberg and his students Hideki Shima and Nico Schlaefer. Nico has been particularly involved in the [...]]]></description>
			<content:encoded><![CDATA[<p><em>Last month, <a href="http://www-03.ibm.com/innovation/us/watson/">IBM&#8217;s Watson Deep QA system</a> took on two <a href="http://www.jeopardy.com/">Jeopardy!</a> champions and won. Several researchers at the <a href="http://www.lti.cs.cmu.edu/">Language Technologies Institute</a> at <a href="http://www.cmu.edu/index.shtml">Carnegie Mellon University</a> have been involved with the project over the past few years, including <a href="http://www.cs.cmu.edu/~ehn/">Professor Eric Nyberg</a> and his students <a href="http://www.cs.cmu.edu/~hideki/">Hideki Shima</a> and <a href="http://www.cs.cmu.edu/~nico/index.html">Nico Schlaefer</a>. Nico has been particularly involved in the IR technology behind Watson, and has answered a few questions on his role in the project. This work forms the basis of his recent thesis proposal on &#8220;Statistical Source Expansion for Question Answering&#8221;.</em></p>
<p><em>If you didn&#8217;t see the Jeopardy! match, check out the <a href="http://www.youtube.com/watch?v=WFR3lOm_xhE">practice match</a>, <a href="http://www.youtube.com/watch?v=ls2IgNiOftA">interview with Professor Nyberg, Hideki and Nico</a>, and <a href="http://www-03.ibm.com/innovation/us/watson/building-watson/how-watson-works.html">background on IBM&#8217;s site</a>.</em></p>
<p><strong>Probably Irrelevant: What role does IR play in a QA system?</strong></p>
<p><strong>Nico Schlaefer: </strong>Watson and other state-of-the-art QA systems find answers in unstructured text, which is indexed and searched with IR systems. From an architectural point of view, QA applications are often built on top of an IR system &#8211; they take a natural language question and transform it into a query that can be handled by the retrieval engine, submit the query and get the search results, and then further process these results by extracting answers and scoring them. So IR plays a key role in question answering, and the performance of a QA system depends heavily on the quality of the search results.</p>
<p><strong>PI: What are good characteristics of an IR system for QA?</strong></p>
<p><strong>NS: </strong>In QA, it is quite common to generate relatively complicated queries that include term weights and proximity operators. Some systems also pre-annotate their sources with syntactic or semantic information and formulate constrained queries that leverage these annotations. In addition, QA systems often do not retrieve whole documents but shorter passages comprising just a few sentences. To be suitable for QA, an IR system should provide a rich query language, support annotations on the source documents, and allow QA systems to retrieve search results of different granularities.</p>
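<p><em>Editor&#8217;s illustration: the kind of query described above, with term weights, a proximity constraint and passage-level results, can be sketched in a few lines of Python. This is a toy scorer, not Watson&#8217;s retrieval engine; the function names, the weights, the window size and the sample passages are all assumptions made up for the example.</em></p>

```python
def within_window(tokens, terms, window):
    """True if all terms occur inside some span of `window` consecutive tokens."""
    wanted = set(terms)
    for start in range(len(tokens)):
        if wanted.issubset(tokens[start:start + window]):
            return True
    return False


def score_passage(tokens, weighted_terms, proximity_terms, window=5, bonus=1.0):
    """Sum the weights of matched terms, plus a bonus if the proximity constraint holds."""
    present = set(tokens)
    score = sum(weight for term, weight in weighted_terms.items() if term in present)
    if within_window(tokens, proximity_terms, window):
        score += bonus
    return score


# Hypothetical query and passages, invented for illustration only.
weighted_terms = {"neurological": 2.0, "disease": 1.0, "tics": 2.0}
proximity_terms = ["involuntary", "movements"]

passages = [
    "the first symptoms usually are involuntary movements tics of the face".split(),
    "a rare disease of the liver".split(),
]


def query_score(passage):
    return score_passage(passage, weighted_terms, proximity_terms)


# Rank short passages rather than whole documents.
ranked = sorted(passages, key=query_score, reverse=True)
scores = [query_score(p) for p in ranked]
```

<p><em>Real engines express the same idea declaratively in their query language; the sketch only shows why weights and proximity change the ranking of short passages.</em></p>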
<p><strong>PI: Where does IR fail with respect to QA?</strong></p>
<p><strong>NS: </strong>IR often fails if there is little relevant information for a given question in the sources. The question and relevant documents may use different terminology, which makes it hard for the IR system to retrieve useful text. Query expansion or pseudo-relevance feedback can help to some extent, but often these techniques do not consistently improve performance, and some QA systems only use them as a fallback solution if an initial search does not return anything useful. Obviously, these methods are not going to help if the answer to a question is not in the sources. We developed a different approach &#8211; <strong>statistical source expansion</strong> &#8211; which overcomes some of these issues by augmenting existing sources with more relevant information and by increasing semantic redundancy.</p>
<p><strong>PI: What kinds of resources are expanded?</strong></p>
<p><strong>NS: </strong>We focused on sources we found most useful for answering Jeopardy! questions and also questions from <a href="http://trec.nist.gov">TREC</a> QA evaluations. These include encyclopedias (such as <a href="http://www.wikipedia.org/">Wikipedia</a>) and dictionaries (such as <a href="http://www.wiktionary.org/">Wiktionary</a>). More recently, we also experimented with the <a href="http://boston.lti.cs.cmu.edu/Data/clueweb09/">ClueWeb09 corpus</a>, a large web crawl created at CMU which comprises about 12 TB of English web pages.</p>
<p><strong>PI: What techniques and/or tools are you using to identify topics to expand?</strong></p>
<p><strong>NS: </strong>This depends on the sources. For example, when expanding an encyclopedia or a dictionary, we consider each document as a candidate topic. We can then sort the topics by some measure of popularity and focus on expanding the most popular ones. This approach is based on the assumption that Jeopardy! questions (and also questions in most other QA tasks, such as TREC) tend to ask about popular topics, so we get the largest performance gain out of expanding those topics. When expanding other sources that are not organized by topics, such as web crawls or newswire corpora, more sophisticated topic detection techniques become necessary. For example, the most popular topics can be identified using named entity recognizers, statistical methods or dictionaries of known topics.</p>
<p><strong>PI: How do you estimate relevance to those topics?</strong></p>
<p><strong>NS: </strong>We use a statistical model that combines a variety of features to estimate the topicality and textual quality of text passages. For example, one of the topicality features is a likelihood ratio estimated with language models. A topic model is trained using the seed document we&#8217;re expanding or related web pages retrieved for that seed, and a background model is trained on a large collection of text. The ratio of the likelihoods of a text passage under the topic model and the background model is a good indicator of topicality. Textual quality can, for example, be estimated using dictionaries of known words and n-grams. We also look at simple surface features of text passages, such as the length of a passage and its offset in the source document.</p>
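<p><em>Editor&#8217;s illustration: the likelihood-ratio feature described above can be sketched with two unigram language models. This is a minimal sketch, not the model used in Watson; the add-alpha smoothing, the per-token averaging, the function names and the toy texts are all assumptions chosen to keep the example short.</em></p>

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def log_likelihood_ratio(passage, topic_lm, background_lm):
    """Average per-token log ratio of topic probability to background probability."""
    scored = [word for word in passage if word in topic_lm]
    if not scored:
        return 0.0
    total = sum(math.log(topic_lm[w] / background_lm[w]) for w in scored)
    return total / len(scored)


# Toy texts, invented for illustration only.
seed = "tourette syndrome is a neurological disorder with motor and vocal tics".split()
other = "the weather is nice and the market is open with a sale on fruit".split()
vocab = set(seed) | set(other)

topic_lm = train_unigram(seed, vocab)               # trained on the seed document
background_lm = train_unigram(seed + other, vocab)  # trained on the larger collection

on_topic = "a neurological disorder with vocal tics".split()
off_topic = "the market is open".split()

on_score = log_likelihood_ratio(on_topic, topic_lm, background_lm)
off_score = log_likelihood_ratio(off_topic, topic_lm, background_lm)
```

<p><em>A positive score means the passage is more likely under the topic model than under the background model. In the interview&#8217;s description this ratio is one feature among several, alongside textual quality and surface features.</em></p>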
<p><strong>PI: Could you give an example of a question that is helped by source expansion?</strong></p>
<p><strong>NS: </strong>Here is a question for which source expansion helped:</p>
<p>What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?</p>
<p>This is a question from the <a href="http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf">TREC 8 evaluation</a> [pdf], but if written as a statement (&#8220;This rare neurological disease has symptoms such as &#8230;&#8221;) I think it could also pass as a Jeopardy! question. The answer is &#8220;Tourette syndrome&#8221;.</p>
<p>We first tried to answer this question using Wikipedia as a source, and there is indeed an article about &#8220;Tourette syndrome&#8221; in our copy of Wikipedia, but unfortunately it doesn&#8217;t mention most of the keywords in the question and Watson wasn&#8217;t able to get the answer. We then expanded Wikipedia, and &#8220;Tourette syndrome&#8221; was one of the topics that was automatically selected. The expanded article contains the following text passages which, by the way, all come from different websites:</p>
<ul>
<li>Rare neurological disease that causes repetitive motor and vocal tics</li>
<li>The first symptoms usually are involuntary movements (tics) of the face, arms, limbs or trunk.</li>
<li>Tourette’s syndrome (TS) is a neurological disorder characterized by repetitive, stereotyped, involuntary movements and vocalizations called tics.</li>
<li>The person afflicted may also swear or shout strange words, grunt, bark or make other loud sounds.</li>
</ul>
<p>These passages jointly almost perfectly cover the question keywords. I think the only content word that is not in there is &#8220;incoherent&#8221;. This made it very easy for Watson to find the answer.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/03/ir-in-ibms-watson-an-interview-with-nico-schlaefer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Yahoo Key Scientific Challenges Grant</title>
		<link>http://probablyirrelevant.org/2011/02/yahoo-key-scientific-challenges-grant/</link>
		<comments>http://probablyirrelevant.org/2011/02/yahoo-key-scientific-challenges-grant/#comments</comments>
		<pubDate>Fri, 25 Feb 2011 18:04:49 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[students]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=93</guid>
		<description><![CDATA[Two more weeks for Computer Science graduate students to apply for Yahoo&#8217;s Key Scientific Challenges grant.  Highlights of the grant include,

$5,000 unrestricted research seed funding which can be used for conference fees and travel, lab materials, professional society membership dues, etc.
Exclusive access to select Yahoo! datasets
The unique opportunity to collaborate with our industry-leading scientists
An [...]]]></description>
			<content:encoded><![CDATA[<p>Two more weeks for Computer Science graduate students to apply for Yahoo&#8217;s <a title="Key Scientific Challenges" href="http://labs.yahoo.com/ksc">Key Scientific Challenges grant</a>.  Highlights of the grant include,</p>
<ul>
<li>$5,000 unrestricted research seed funding which can be used for conference fees and travel, lab materials, professional society membership dues, etc.</li>
<li>Exclusive access to select Yahoo! datasets</li>
<li>The unique opportunity to collaborate with our industry-leading scientists</li>
<li>An invitation to this summer&#8217;s exclusive Key Scientific Challenges Graduate Student Summit where you&#8217;ll join the top minds in academia and industry to present your work, discuss research trends and jointly develop revolutionary approaches to fundamental problems</li>
</ul>
<p>Deadline is March 11, 2011.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/02/yahoo-key-scientific-challenges-grant/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Explicit negative feedback comes to the web&#8230;somewhat</title>
		<link>http://probablyirrelevant.org/2011/02/explicit-negative-feedback-comes-to-the-web-somewhat/</link>
		<comments>http://probablyirrelevant.org/2011/02/explicit-negative-feedback-comes-to-the-web-somewhat/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 22:17:13 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Web Search]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=91</guid>
		<description><![CDATA[If you use Chrome, you can block results from certain sites.  Even if this is equivalent to adding [-site:domain], it certainly makes the query easier to specify.  Promoted as a way to filter content farms, it could easily provide data to go beyond simple results filtering.
G is upfront about collecting the data,
If installed, [...]]]></description>
			<content:encoded><![CDATA[<p>If you use Chrome, you can <a href="http://googleblog.blogspot.com/2011/02/new-chrome-extension-block-sites-from.html">block results from certain sites</a>.  Even if this is equivalent to adding [-site:domain], it certainly makes the query easier to specify.  Promoted as a way to filter content farms, it could easily provide data to go beyond simple results filtering.</p>
<p>G is upfront about collecting the data,</p>
<blockquote><p>If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results.</p></blockquote>
<p>Hopefully this means users are starting to get a better idea of how data flows and is exploited by modern information providers.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2011/02/explicit-negative-feedback-comes-to-the-web-somewhat/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Dear Facebook, What is the performance of your face recognizer?  Thanks!</title>
		<link>http://probablyirrelevant.org/2010/08/dear-facebook-what-is-the-performance-of-your-face-recognizer-thanks/</link>
		<comments>http://probablyirrelevant.org/2010/08/dear-facebook-what-is-the-performance-of-your-face-recognizer-thanks/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 14:45:34 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Social Media]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=86</guid>
		<description><![CDATA[As far as I can tell, Facebook must have one of the largest collections of images with face tags.  I can&#8217;t imagine any Facebook employee with even a few weeks of a machine learning course under their belt hasn&#8217;t tried to train a model to perform face recognition on their data.
Does anyone know of [...]]]></description>
			<content:encoded><![CDATA[<p>As far as I can tell, Facebook must have one of the largest collections of images with face tags.  I can&#8217;t imagine any Facebook employee with even a few weeks of a machine learning course under their belt hasn&#8217;t tried to train a model to perform face recognition on their data.</p>
<p>Does anyone know of publications using this proprietary data?  I mean the whole thing, not just samples we all have access to in our local networks.</p>
<p>A few more questions for those with the data:</p>
<ul>
<li>how much does the social network data help in recognizer performance?  other profile data?</li>
<li>if you suppress all image data for an individual, can you still recognize them with non-random accuracy?</li>
<li>can you infer any of the structured content in a profile from image data?</li>
<li>have any companies or government organizations asked you for access to this data?</li>
</ul>
<p>No need to share the data, just the results.</p>
<p><strong>UPDATE: </strong>Facebook (now?) has the following Privacy setting,</p>
<blockquote>
<div><strong>Suggest photos of me to friends</strong>
<div>When photos look like me, suggest my name</div>
</div>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2010/08/dear-facebook-what-is-the-performance-of-your-face-recognizer-thanks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Economic Impact Assessment of NIST’s Text REtrieval Conference (TREC) Program&#8221;</title>
		<link>http://probablyirrelevant.org/2010/07/economic-impact-assessment-of-nist%e2%80%99s-text-retrieval-conference-trec-program/</link>
		<comments>http://probablyirrelevant.org/2010/07/economic-impact-assessment-of-nist%e2%80%99s-text-retrieval-conference-trec-program/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 20:43:14 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Web Search]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=84</guid>
		<description><![CDATA[Thanks to your feedback,
&#8220;&#8230;this study estimates that TREC’s existence was responsible for approximately one-third of an improvement of more than 200% in web search products that was observed between 1999 and 2009.&#8221;
More here.
]]></description>
			<content:encoded><![CDATA[<p>Thanks to <a href="http://probablyirrelevant.org/2010/02/trec-survey/">your feedback</a>,</p>
<blockquote><p>&#8220;&#8230;this study estimates that TREC’s existence was responsible for approximately one-third of an improvement of more than 200% in web search products that was observed between 1999 and 2009.&#8221;</p></blockquote>
<p>More <a href="http://trec.nist.gov/pubs/2010.economic.impact.pdf">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2010/07/economic-impact-assessment-of-nist%e2%80%99s-text-retrieval-conference-trec-program/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR 2010 Best Paper Nominees</title>
		<link>http://probablyirrelevant.org/2010/07/sigir-2010-best-paper-nominees/</link>
		<comments>http://probablyirrelevant.org/2010/07/sigir-2010-best-paper-nominees/#comments</comments>
		<pubDate>Sat, 03 Jul 2010 22:18:57 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Conferences]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=82</guid>
		<description><![CDATA[SIGIR has posted best paper nominees.

A comparison of general vs personalized affective models for the prediction of topical relevance, I. Arapakis, K. Athanasakos, J. Jose
Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. White, J. Huang
Caching Search Engine Results over Incremental Indices, F. Junqueira, R. Blanco, E. Bortnikov, R. Lempel, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.sigir2010.org/doku.php?id=program:awards">SIGIR has posted best paper nominees.</a></p>
<ul>
<li>A comparison of general vs personalized affective models for the prediction of topical relevance, I. Arapakis, K. Athanasakos, J. Jose</li>
<li>Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. White, J. Huang</li>
<li>Caching Search Engine Results over Incremental Indices, F. Junqueira, R. Blanco, E. Bortnikov, R. Lempel, L. Telloli, H. Zaragoza</li>
<li>Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell</li>
<li>Extending Average Precision to Graded Relevance Judgments, S. Robertson, E. Kanoulas, E. Yilmaz</li>
<li>Information Based Model for ad hoc information retrieval, S. Clinchant, E. Gaussier</li>
<li>Multi-style language model for web scale information retrieval, K. Wang, J. Gao, X. Li</li>
<li>Properties of Optimally Weighted Data Fusion in CBMIR, P. Wilkins, A. Smeaton</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2010/07/sigir-2010-best-paper-nominees/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Query logs and information retrieval research</title>
		<link>http://probablyirrelevant.org/2010/06/query-logs-and-information-retrieval-research/</link>
		<comments>http://probablyirrelevant.org/2010/06/query-logs-and-information-retrieval-research/#comments</comments>
		<pubDate>Wed, 02 Jun 2010 01:59:05 +0000</pubDate>
		<dc:creator>Fernando</dc:creator>
				<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Web Search]]></category>

		<guid isPermaLink="false">http://probablyirrelevant.org/?p=76</guid>
		<description><![CDATA[About one year ago,  Bruce Croft asked the IR community for help with getting access to query logs for academia,
The goal of this project is to create a database of web search activity that will be provided to the information retrieval research community to use on current and future information retrieval research projects.
To accomplish [...]]]></description>
			<content:encoded><![CDATA[<p>About one year ago,  Bruce Croft asked the IR community for help with getting access to query logs for academia,</p>
<blockquote><p>The goal of this project is to create a database of web search activity that will be provided to the information retrieval research community to use on current and future information retrieval research projects.</p></blockquote>
<p>To accomplish this, the Lemur Project developed a toolbar to be voluntarily installed by users.  After a year of data collection, <a href="http://lemurstudy.cs.umass.edu/">the project has been aborted</a>,</p>
<blockquote><p>Given that we have gathered the equivalent of less than 6 seconds of Google traffic (assuming 500 million queries per day) in one year, we have decided to terminate the project.</p></blockquote>
<p>This is pretty depressing news.  Admittedly, part of this depression originates from my guilt over not having contributed to the project myself.  However, a more substantial part stems from the potential this data set had to be groundbreaking, perhaps similar to the release of the first Tipster collections.  Although this was way before my time, I imagine the sudden release of a large, public corpus resulted in a tremendous amount of activity and excitement.</p>
<p>Information retrieval research has had large collections of documents for a few decades now.  We evaluate on a few hundred queries and publish results.  With some exceptions, the majority of interest in the field has focused on scaling up corpora.  As a result, we have a rich set of tools to analyze and retrieve documents from large corpora.</p>
<p>There are two things missing from this model: a rich stream of queries coming into the system and a rich stream of interactions between users and documents.  Our friends in the CHI and information science communities have been doing a great job of understanding the important factors involved in user behavior at laboratory scale.  However, I&#8217;m going to draw an analogy here between small scale user studies for IR and document-level NLP analysis for IR that may raise a few eyebrows.  I believe that many IR researchers would argue that, given the choice between corpus-driven approaches and NLP approaches to IR, they would opt for more data.  This is despite the rich analysis NLP can provide.  Similarly, I believe that the fine-grained analysis provided by laboratory studies may be less important than very large scale analysis of user behavior.  Of course, both the results about NLP for IR and the claim about laboratory experiments are based on relatively limited experiments (e.g. small sets of queries).  We should, as a community, continue research in all of these directions.</p>
<p>Having said this, let&#8217;s consider some motivations for web query logs and IR research,</p>
<p><strong>Claim 1. Web query logs will help with the contribution to web search research.</strong></p>
<p>There is no doubt that query logs are important for any search engine, web or otherwise.  However, query logs are only one of the many sources of interaction data available in production.  There are many, many other signals which can be effectively exploited for query understanding and document ranking.   In my opinion, outside of starting its own web search engine, academia will always be scurrying to catch up with industry&#8217;s data sources.</p>
<p>I convinced myself a few years ago that the resources required to build and maintain a web search engine may never exist in academia.  This is not to say that academic IR researchers should give up on having impact on web search engines.  IR research several decades old continues to impact modern search engine design.  What needs to be determined is how the current academic IR researchers can more directly address the problems confronted by web search companies.  I personally believe that a tight coupling between academic and industrial research labs needs to exist.  This could be accomplished in a number of ways.</p>
<ol>
<li> add value to an existing search engine&#8217;s interface.  If search engines provide ranker APIs, academics can develop new interfaces which may attract users and, as a result, interaction data.</li>
<li> teach the IR fundamentals during the academic year, then interact intensively over the summer through internships or other collaborations.  I am most familiar with Yahoo&#8217;s <a href="http://labs.yahoo.com/ksc">Key Scientific Challenges Fellowships</a> and <a href="http://labs.yahoo.com/Academic_Relations/Faculty">Faculty Engagement Grants</a>.  Similar programs exist at other web search engines.</li>
<li> develop high-quality, public web search engine simulators which provide students/researchers with the ability to test algorithms <em>in silico</em>.  Our <a href="http://ciir.cs.umass.edu/~fdiaz/sigir09-DA.pdf">SIGIR 2009 paper</a> made extensive use of simulation whose parameters were grounded in real world data.  Systems research in computer architecture and computer networking has adopted this approach for a while.  SIGIR 2010 will be hosting a workshop on <a href="http://www.dcs.gla.ac.uk/access/simint/">simulated interaction</a>.</li>
</ol>
<p>No doubt there are many, many other alternatives.</p>
<p><strong>Claim 2. Web query logs will help with the contribution to production search research.</strong></p>
<p>As stated earlier, IR research has looked at the document side for many, many problems.  This research has benefited web search as well as search in other domains such as legal, news, and enterprise search.</p>
<p>User behavior data improved production web search engines; user behavior data will no doubt improve production non-web search engines.   Just as with web search though, this data does not exist in academia.</p>
<p>I believe, though, that the barrier to entry for non-web/vertical search engines is somewhat lower.  The collections are smaller and more manageable.  At the same time, document representations can be richer for verticals, interaction is less constrained, and, as a result, the potential for attracting users may be higher than with portal web search engines.</p>
<p>If an academic institution maintained a domain-specific production search engine, academic research could become more relevant to industrial search engines.  For example, academic institutions would easily be able to publish about query logs, interaction, large scale adaptation, and online learning with large scale real world data.  One important, unresolved question is how to reconcile experimental reproducibility with production data, which is often closed for privacy reasons.</p>
<p>Academic IR research will continue to contribute to general IR research.  Students trained in IR fundamentals will continue to be strong candidates for research and development in production search companies.  I believe that there is room for greater impact.  How that happens remains to be seen.</p>
]]></content:encoded>
			<wfw:commentRss>http://probablyirrelevant.org/2010/06/query-logs-and-information-retrieval-research/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

