<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Finding relevance judgements in the wild</title>
	<atom:link href="http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/feed/" rel="self" type="application/rss+xml" />
	<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/</link>
	<description>Information Retrieval Research and Development</description>
	<lastBuildDate>Mon, 14 Sep 2009 18:07:45 -0400</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: mariana_soffer</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1846</link>
		<dc:creator>mariana_soffer</dc:creator>
		<pubDate>Mon, 25 May 2009 14:11:34 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1846</guid>
		<description>And you complain? Come on. Imagine if you had to do this stuff in Spanish (not to mention other languages that are even weirder). How do you parse? Where do you get your training sets from? Not even web scraping works here.</description>
		<content:encoded><![CDATA[<p>And you complain? Come on. Imagine if you had to do this stuff in Spanish (not to mention other languages that are even weirder). How do you parse? Where do you get your training sets from? Not even web scraping works here.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1684</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Mon, 20 Apr 2009 14:36:56 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1684</guid>
		<description>I think your intuition is correct about the FIRST method.  First-posts also tend to be more verbose than answers -- users often respond with a single sentence, which tends to be generally unhelpful in ranking.  

Due to time constraints &amp; such, we weren&#039;t able to annotate many more threads for relevance.  We identified 17k &quot;candidate&quot; question-post/answer-link pairs in the collection, using a few simple heuristics such as the presence of a link in a response message.  Of those 17k, we annotated 550 as to whether or not they actually contained a question/answer pair and identified the 48 we used in the study.  So, we found that roughly 8% of those candidates contain a real question/answer pair, and extrapolating up to the full 17k, I would estimate there are about 1400 question-answer pairs in the collection -- that&#039;s quite a few still to be found.  I&#039;m sure we didn&#039;t find them all.

Ideally, of course, we&#039;d like to use more queries.  I don&#039;t know at this point whether we&#039;ll push forward with this type of test set creation, or whether we&#039;ll do a more traditional relevance assessment.  I&#039;m tempted to do the latter, particularly because it would be nice to see if we observe the same results with the different types of test collections.</description>
		<content:encoded><![CDATA[<p>I think your intuition is correct about the FIRST method.  First-posts also tend to be more verbose than answers &#8212; users often respond with a single sentence, which tends to be generally unhelpful in ranking.  </p>
<p>Due to time constraints &amp; such, we weren&#8217;t able to annotate many more threads for relevance.  We identified 17k &#8220;candidate&#8221; question-post/answer-link pairs in the collection, using a few simple heuristics such as the presence of a link in a response message.  Of those 17k, we annotated 550 as to whether or not they actually contained a question/answer pair and identified the 48 we used in the study.  So, we found that roughly 8% of those candidates contain a real question/answer pair, and extrapolating up to the full 17k, I would estimate there are about 1400 question-answer pairs in the collection &#8212; that&#8217;s quite a few still to be found.  I&#8217;m sure we didn&#8217;t find them all.</p>
<p>Ideally, of course, we&#8217;d like to use more queries.  I don&#8217;t know at this point whether we&#8217;ll push forward with this type of test set creation, or whether we&#8217;ll do a more traditional relevance assessment.  I&#8217;m tempted to do the latter, particularly because it would be nice to see if we observe the same results with the different types of test collections.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Webber</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1683</link>
		<dc:creator>William Webber</dc:creator>
		<pubDate>Mon, 20 Apr 2009 10:10:01 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1683</guid>
		<description>OK, I&#039;ve properly read (rather than skimmed) your poster now.  Very interesting work!

With regard to the relatively good performance of the FIRST method,
is this because the opening query in the linked-from thread is often similar to the opening query in the linked-to thread?

It is rather disappointing that you were only able to identify 48 query/answer pairs out of 375,000 threads.  Is this because that
was all there was, or did you stop looking once you&#039;d found 48?</description>
		<content:encoded><![CDATA[<p>OK, I&#8217;ve properly read (rather than skimmed) your poster now.  Very interesting work!</p>
<p>With regard to the relatively good performance of the FIRST method,<br />
is this because the opening query in the linked-from thread is often similar to the opening query in the linked-to thread?</p>
<p>It is rather disappointing that you were only able to identify 48 query/answer pairs out of 375,000 threads.  Is this because that<br />
was all there was, or did you stop looking once you&#8217;d found 48?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1675</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Fri, 17 Apr 2009 14:00:10 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1675</guid>
		<description>William - I agree it can be a little problematic.  But, the nice thing about mining a conversation archive is that we can observe any feedback from the original asker.

In all three cases I linked to in the post (and most of the cases we used in our evaluation) the original user who asked the question confirms that the linked-to answer thread is in fact relevant to their question in a subsequent post.  If the original poster indicates that the linked-to thread is not relevant, then clearly that shouldn&#039;t be used in an evaluation.  If there&#039;s no indication at all, then we&#039;re back to the same situation that we see in TREC -- where the annotators (or &quot;linkers&quot;) are different than the user who issues the query.</description>
		<content:encoded><![CDATA[<p>William &#8211; I agree it can be a little problematic.  But, the nice thing about mining a conversation archive is that we can observe any feedback from the original asker.</p>
<p>In all three cases I linked to in the post (and most of the cases we used in our evaluation) the original user who asked the question confirms that the linked-to answer thread is in fact relevant to their question in a subsequent post.  If the original poster indicates that the linked-to thread is not relevant, then clearly that shouldn&#8217;t be used in an evaluation.  If there&#8217;s no indication at all, then we&#8217;re back to the same situation that we see in TREC &#8212; where the annotators (or &#8220;linkers&#8221;) are different than the user who issues the query.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: William Webber</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1674</link>
		<dc:creator>William Webber</dc:creator>
		<pubDate>Fri, 17 Apr 2009 01:03:16 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1674</guid>
		<description>A very nice idea, but as you say, the relationship between &quot;linked&quot; and relevant-by-annotation is problematic.  Perhaps a useful experiment would be to have domain experts perform a standard relevance annotation run (without the prompting of links), and see what the overlap between linked and relevant was?</description>
		<content:encoded><![CDATA[<p>A very nice idea, but as you say, the relationship between &#8220;linked&#8221; and relevant-by-annotation is problematic.  Perhaps a useful experiment would be to have domain experts perform a standard relevance annotation run (without the prompting of links), and see what the overlap between linked and relevant was?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Elsas</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1671</link>
		<dc:creator>Jon Elsas</dc:creator>
		<pubDate>Wed, 15 Apr 2009 19:18:48 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1671</guid>
		<description>The point is: there are tradeoffs in all evaluations.  We would like to know whether system X performs better than system Y for a given task.  But in practice, we always have to approximate that task somehow.  

There are a few aspects to the evaluation: the query, which is representative of some information need, and the document assessment.  Ideally, we would like:
(1) to get a query or other description of the information need and the relevance assessment from the same person, to ensure there is agreement between the two 
(2) that person to be actually performing the task in question, yielding queries &amp; information needs we know to be representative of the task
(3) (nearly) exhaustive relevance assessment.  
But, this is never the case in practice.

Sometimes we have a query log, which provides highly representative queries.  In those cases the assessment is done by someone who didn&#039;t originally provide the query, and the information need description is &quot;back fit&quot; to the query observed in the log.  In this case, we cannot ensure criterion (1).

Sometimes we don&#039;t have a query log, in which case the participants or assessors make an attempt to develop information needs that they believe are representative of the task, and then those same people assess those documents for relevance.  In this case we cannot ensure criterion (2).

In the case I described in the blog post, we do meet criterion (2), and maybe (1), but we cannot ensure criterion (3).</description>
		<content:encoded><![CDATA[<p>The point is: there are tradeoffs in all evaluations.  We would like to know whether system X performs better than system Y for a given task.  But in practice, we always have to approximate that task somehow.  </p>
<p>There are a few aspects to the evaluation: the query, which is representative of some information need, and the document assessment.  Ideally, we would like:<br />
(1) to get a query or other description of the information need and the relevance assessment from the same person, to ensure there is agreement between the two<br />
(2) that person to be actually performing the task in question, yielding queries &#038; information needs we know to be representative of the task<br />
(3) (nearly) exhaustive relevance assessment.<br />
But, this is never the case in practice.</p>
<p>Sometimes we have a query log, which provides highly representative queries.  In those cases the assessment is done by someone who didn&#8217;t originally provide the query, and the information need description is &#8220;back fit&#8221; to the query observed in the log.  In this case, we cannot ensure criterion (1).</p>
<p>Sometimes we don&#8217;t have a query log, in which case the participants or assessors make an attempt to develop information needs that they believe are representative of the task, and then those same people assess those documents for relevance.  In this case we cannot ensure criterion (2).</p>
<p>In the case I described in the blog post, we do meet criterion (2), and maybe (1), but we cannot ensure criterion (3).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AC</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1668</link>
		<dc:creator>AC</dc:creator>
		<pubDate>Wed, 15 Apr 2009 12:22:26 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1668</guid>
		<description>The query &quot;celebrity babies&quot; doesn&#039;t seem odd at all.

http://www.google.com/insights/search/#q=%22celebrity%20babies%22&amp;cmpt=q</description>
		<content:encoded><![CDATA[<p>The query &#8220;celebrity babies&#8221; doesn&#8217;t seem odd at all.</p>
<p><a href="http://www.google.com/insights/search/#q=%22celebrity%20babies%22&amp;cmpt=q" rel="nofollow">http://www.google.com/insights/search/#q=%22celebrity%20babies%22&amp;cmpt=q</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Elsas</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1666</link>
		<dc:creator>Jon Elsas</dc:creator>
		<pubDate>Tue, 14 Apr 2009 16:41:57 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1666</guid>
		<description>Iadh -- Sorry for the inaccuracy in my post -- I *meant* to refer to just the distillation task, not the queries used across the track.  I&#039;ve corrected the post to reflect this.  

I agree with you that the queries reflect real information needs, but they were not harvested from query logs of a blog search engine.  They were created by IR researchers trying to test their systems.  I never actually went to a blog search engine and entered the query [celebrity babies], nor did I observe someone doing that.   This was a query that I thought may be reflective of the types of queries such a system would receive.  Do I *know* that the queries I generated are in fact representative of blog search queries?  No, I don&#039;t.  I have no data to base my characterization of feed distillation queries on.  

You might, and if you do, I&#039;d love to hear your thoughts on how representative the feed distillation queries we generated are.</description>
		<content:encoded><![CDATA[<p>Iadh &#8212; Sorry for the inaccuracy in my post &#8212; I *meant* to refer to just the distillation task, not the queries used across the track.  I&#8217;ve corrected the post to reflect this.  </p>
<p>I agree with you that the queries reflect real information needs, but they were not harvested from query logs of a blog search engine.  They were created by IR researchers trying to test their systems.  I never actually went to a blog search engine and entered the query [celebrity babies], nor did I observe someone doing that.   This was a query that I thought may be reflective of the types of queries such a system would receive.  Do I *know* that the queries I generated are in fact representative of blog search queries?  No, I don&#8217;t.  I have no data to base my characterization of feed distillation queries on.  </p>
<p>You might, and if you do, I&#8217;d love to hear your thoughts on how representative the feed distillation queries we generated are.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Iadh</title>
		<link>http://probablyirrelevant.org/2009/04/finding-relevance-judgements-in-the-wild/comment-page-1/#comment-1664</link>
		<dc:creator>Iadh</dc:creator>
		<pubDate>Tue, 14 Apr 2009 15:10:20 +0000</pubDate>
		<guid isPermaLink="false">http://probablyirrelevant.org/?p=61#comment-1664</guid>
		<description>Jon,

{{{

but frequently (as in the Blog Track) the queries are invented by participants or assessors.
}}}

This is at least inaccurate, if not misleading. We have repeatedly said (including on this same blog) that the queries used for the opinion finding and polarity tasks are drawn from real commercial query logs. They are certainly not invented by participants (please read the TREC Blog track overview papers).

Only the topics used for the blog distillation task were made by the participating groups. However, they were certainly not invented (sic) by them. They are based on *real* information needs.</description>
		<content:encoded><![CDATA[<p>Jon,</p>
<p>{{{</p>
<p>but frequently (as in the Blog Track) the queries are invented by participants or assessors.<br />
}}}</p>
<p>This is at least inaccurate, if not misleading. We have repeatedly said (including on this same blog) that the queries used for the opinion finding and polarity tasks are drawn from real commercial query logs. They are certainly not invented by participants (please read the TREC Blog track overview papers).</p>
<p>Only the topics used for the blog distillation task were made by the participating groups. However, they were certainly not invented (sic) by them. They are based on *real* information needs.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
