TAC 2008 Question Answering Track Guidelines

I. Overview

The goal of the TAC QA track is to foster research on systems that search large document collections and retrieve precise answers to questions (rather than entire documents). The focus is on systems that can function in unrestricted domains.

The 2008 QA task focuses on finding answers to opinion questions. The 2008 QA task is similar to the main QA task in TREC 2007 in that the test set will consist of question series. However, each series in 2008 asks for people's opinions about a particular target (rather than general information about the target), and the questions will be asked over only blog documents. There will be two types of questions -- rigid list questions and squishy list questions -- each with its own evaluation measure. The rigid list questions will be evaluated using the same methodology used to evaluate list questions in past TREC QA Track tasks. The squishy list questions will be evaluated with the nugget Pyramid method used to evaluate complex questions as described in (Dang and Lin, 2007).

The test questions for the QA task will be available on the TAC 2008 QA home page on June 24. Submissions are due at NIST on or before July 1, 2008. Each team may submit up to three runs (submission files) for the QA task, ranked by priority. NIST will judge the first- and second-priority runs from each team and (if resources allow) up to one additional run from each team. The first-priority run must be fully automatic; second- and third-priority runs are allowed to be manual.

II. Test Data

Questions

The test set consists of 50 targets, each with a series of 2-4 questions about that target. Each series is an abstraction of a user session with a QA system. Each series will contain a number of rigid list questions and a number of squishy list questions.

The question set will be in the XML format given in the following:

TAC 2008 QA Track sample questions

The format will explicitly tag the target as the target, as well as the type of each question in the series (type is one of RigidList and SquishyList). Each question will have an ID of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Rigid and Squishy List questions are requests for a set of instances of a specified type. Rigid list questions require exact answers to be returned. Responses to "squishy" list questions need not be exact, though excessive length will be penalized.

The questions used in previous TREC QA tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous TREC QA tracks that may be useful for system development. This data includes judgment files, answer patterns, top ranked document lists, and sentence files.

Document set

Answers for all questions in the test set must be drawn from the TREC Blog06 collection. The TREC Blog06 collection is a large sample of the blogsphere, and contains spam as well as possibly non-blogs, e.g. RSS feeds from news broadcasters. It was crawled over an eleven-week period from December 6, 2005 until February 21, 2006. The collection is 148GB in size, consisting of:

38.6GB of feeds
88.8GB of permalink documents
28.8GB of homepages

For the TAC QA task, each instance of an answer must be supported by a document from the permalinks component of the Blog06 collection. (See the section on Assessment environment for technical details about assessment involving blog documents.) There are over 3.2 million permalink documents in the Blog06 collection. Each document in the permalinks collection is the raw HTML content from the Web wrapped between a <DOC>...</DOC> pair. Just after <DOC>, there are some informational metadata tags, including the <DOCNO> element which contains the document ID. More details can be found in the Blog06 README.

The TREC Blog06 collection was created by the University of Glasgow for the TREC 2006 Blog Track. The collection is currently distributed only by the University of Glasgow. License details and information on how to obtain access to the TREC Blog06 collection are provided in http://ir.dcs.gla.ac.uk/test_collections/. Further information on the Blog06 collection and how it was created can be found in the DCS Technical Report TR-2006-224, Department of Computing Science, University of Glasgow at http://www.dcs.gla.ac.uk/~craigm/publications/macdonald06creating.pdf.

Document lists

As a service to the track, for each target, NIST will provide the ranking of the top 1000 documents retrieved by the PRISE search engine when using the target as the query. NIST will not provide document lists for individual questions. Note that this is a service only, provided as a convenience for teams that do not wish to implement their own document retrieval system. There is no guarantee that these rankings will contain all or even any of the documents that actually answer the questions in a series. The document lists will be in the same format as previous years of TREC QA:

qnum rank docid rsv

where:

qnum is the target number
rank is the rank at which PRISE retrieved the doc (most similar doc is rank 1)
docid is the document identifier
rsv is the "retrieval status value" or similarity between the query and document, where larger is better

NIST will also provide the full text of the top 50 documents per target (as given from the above rankings). Participants who would like to receive the text of the top 50 documents must submit or have already submitted to NIST a signed user agreement form for the Blog06 collection. The user agreement form for Blog06 is located at http://tac.nist.gov/data/forms/org_appl_blog06.html. Participating teams that have already obtained the Blog06 collection from the University of Glasgow should fax to NIST their existing (signed) Blog06 organization agreement. Please submit all forms to NIST following the instructions at http://tac.nist.gov/data/forms/index.html.

III. Submission guidelines

Submission format

A submission file for the QA task must be a plain text file containing at least one line for each question in the QA task. Each line in the file must have the form:

qid run-tag docid answer-string

where:

qid is the question number (of the form X.Y)
run-tag is the run ID
docid is the ID of a permalink document
answer-string is a text string with no embedded newlines

Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present. The answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. The total length of all answer-strings for each question cannot exceed 7000 non-white-space characters. The run-tag should be the concatenation of the Team ID and the priority of the run. (For example, if the Team ID is "NISTAssessor" then the run-tag for the first-priority run should be "NISTAssessor1".)

Sample submission file for the first-priority run from the team "NISTAssessor" for the TAC 2008 QA Track sample questions. The submission consists of (non-exhaustive) lists of answers found by NIST assessors during topic development. The text of the supporting documents shows the context in which the answers were found.

Submission procedure

Each team may submit up to three runs, ranked by priority (1-3). The first-priority run must be a completely automatic run. NIST will evaluate the first- and second-priority runs from each team. If resources allow, NIST will evaluate the third-priority run from each team.

NIST will post the questions on the TAC QA web site on June 24 and results will have to be submitted to NIST by 11:59 p.m. (EDT) on July 1, 2008. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the [email protected] mailing list when the test data is released. At that time, NIST will release a routine that checks for common errors in submission files including such things as invalid document numbers, wrong formats, missing data, etc. Participants should check their runs with this script before submitting them to NIST because the automatic submission procedure will reject the submission if the script detects any errors.

Restrictions

For automatic runs, no changes can be made to any component of the QA system or any resource used by the system in response to the test targets and questions. If there is any manual processing of questions, answers or any other part of a system or the resources that it uses, then the resulting run must be classified as a manual run. At the time of submission, each team will be asked to fill out a form stating:

whether the submission is manual or fully automatic. A description of the manual processing (if any) will also be requested.
whether the entire Blog06 collection was searched for each target, or whether the search was limited to the text of the top 50 documents distributed by NIST.

Targets must be processed independently (i.e., the system may not adapt to targets that have already been processed). Questions within a series must be processed in order, without looking ahead. That is, the system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.

IV. Evaluation

Rigid list

For each rigid list question, the system should return an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. The answer-string must contain nothing other than an answer item, and the docid must be the ID of a permalink document in the Blog06 collection that supports answer-string as an answer item. The answer-string does not have to appear literally in a document in order for the document to support it as being a correct answer item.

An answer-string must contain a complete, exact answer item and nothing else. Support, correctness, and exactness will be in the opinion of the assessor. Instances will be judged by human assessors who will assign one of four possible judgments to an instance:

incorrect: the answer-string does not contain a correct answer item;
unsupported: the answer-string contains a correct answer item but the document returned does not support that answer item;
non-exact: the answer-string contains a correct answer item and the document supports that answer item, but the string contains more than just the answer item (or is missing bits of the answer item);
correct: the answer-string consists of exactly a correct answer item and that answer item is supported by the document returned.

Note that this means that if an answer-string contains multiple answer items for the question it will be marked inexact and will thus will not help the question's score.

In addition to judging the individual instances in a response, the assessors will also group correct instances into equivalence classes, where each equivalence class is considered a distinct answer item. Scores will be computed using the number of distinct answer items returned in the set.

Rigid-list-score

The final answer set for a rigid list question will be created from the union of the distinct answer items returned by all participants and answer items found by the NIST assessor during question development. An individual list question will be scored by first computing instance recall (IR) and instance precision (IP) using the final answer set, and combining those scores using the F measure with recall and precision equally weighted. That is:

Squishy list

The response for a squishy list question is syntactically the same as for a rigid list question: an unordered, non-empty set of [answer-string, docid] pairs. The interpretation of this set is different, however. There is no expectation of an exact answer to squishy list questions, although responses will be penalized for excessive length.

For each squishy list question, the assessor will create a list of acceptable information nuggets from the union of the returned responses and the information discovered during question development. All decisions regarding acceptability are in the opinion of the assessor. Once the list of acceptable nuggets is created, the assessor will mark the nuggets contained in each [answer-string, docid] pair. Each nugget that is present will be counted only once.

Some of the acceptable nuggets will be deemed vital, while other nuggets on the list are merely okay. A score for each squishy list question will be computed using multiple assessors' judgments of whether a nugget is vital or okay. Each nugget will be assigned a weight equal to the number of assessors who judged it to be vital; nugget weights will then be normalized so that the maximum weight of nuggets for each squishy list question is 1.0 . See (Lin and Demner-Fushman, HLT/NAACL 2006) for details.

Squishy-list-score

An individual squishy list question will be scored using nugget recall (NR) and an approximation to nugget precision (NP) based on length. These scores will be combined using the F measure with beta=3 (recall weighted more heavily than precision). In particular:

if length < allowance

if length >= allowance

Overall score

The different types of questions (rigid list and squishy list) have different scoring metrics, but each of the two scores has a range of [0.0, 1.0] with 1.0 being the high score. NIST will compute the rigid-list-score and squishy-list-score for each series. The rigid-list-score for a series is the mean of the F scores of the rigid list questions in the series. The squishy-list-score for a series is the mean of the F scores of the squishy list questions in the series. The per-series combined score will be a simple average of these two scores for questions in the series:

The final score for a run will be the mean of the per-series combined scores.

Assessment environment

NIST assessors assess the answer-strings with respect to the viewable text of the supporting document as viewed from a web browser. This means that text in the document that is not displayed when the document is viewed in the browser is not considered by the assessor. That includes META tags, comments, and other text that is only visible by viewing the HTML source.

The NIST assessment platform displays the blog documents to the assessors using a web browser. It tries to show the document as closely as possible to how it would be seen during normal browsing, with two caveats. The first caveat is that SCRIPT sections are removed, so that the pages don't interact adversely with the assessment platform itself. The second caveat is that images and stylesheets are loaded from the web, rather than from a local cache, so if that data has changed then the page appearance can be different. For blogs this last is less of a problem than for the general Web, based on past NIST experience.

The NIST assessment platform (as well as the PRISE retrieval system) is written in Java and uses an open-source HTML and XML parsing tool called NekoHTML, written by Andy Clark. NIST uses NekoHTML when parsing documents for indexing, and also to remove SCRIPT elements before sending documents to the browser.

V. Schedule

TAC 2008 QA Track Schedule
now	Blog06 collection available
June 24	Release of test questions
June 24	Top ranked documents available
July 7, 6:00 AM (EDT)	EXTENDED Deadline for participants' submissions
late September	Release of individual evaluated results

BACK to TAC 2008 QA Track Homepage

Last updated:
Comments to: [email protected]