TAC 2008 Opinion Summarization Task Guidelines

I. Overview

The goal of the TAC Summarization track is to foster research on systems that produce summaries of documents. The focus is on systems that can produce well-organized, fluent summaries of text.

The 2008 Opinion Summarization pilot task is to generate well-organized, fluent summaries of opinions about specified targets, as found in a set of blog documents. As in past query-focused summarization tasks, each summary will be focused by a number of complex questions about the target, none of which can be answered simply with a named entity (or even a list of named entities). The input to the summarization task will come from the TAC 2008 QA task and will comprise a target, some "squishy list" questions about the target, and a set of documents that contain answers to the questions. The output will be a summary for each target that summarizes the answers to the questions. Rather than evaluating content against a set of model summaries, each submitted summary will be evaluated against a nugget Pyramid created during the evaluation of submissions to the QA task.

Much of the test data and many of the evaluation metrics for the opinion summarization task are shared with the opinion QA task. For details about test questions, documents, and "squishy list" evaluation metrics, opinion summarization participants are invited to read the following:

TAC 2008 QA Track Guidelines

The test questions for the opinion summarization task will be available on the TAC 2008 Summarization home page on August 22. Submissions are due at NIST on or before September 2, 2008. Each team may submit up to three runs (submissions) for the opinion summarization pilot task, ranked by priority. NIST will judge the first-priority run from each team and (if resources allow) up to 2 additional runs from each team. Runs may be either manual or automatic.

II. Test Data

The test questions and documents will be a subset of the test data for the TAC 2008 QA task. The opinion summarization test data will consist of:

an XML file (in the same format as the TAC 2008 QA questions) with a number of targets; each target has one or more squishy list questions and one or more document IDs of relevant documents to be summarized
a tarball of a directory containing the relevant documents, with one document per file

The "relevant" documents are those documents containing an answer to a squishy list question, as found by human assessors and QA systems. Optionally, additional input will be available in the form of answer-containing text snippets found by QA systems and/or assessors, along with a supporting document ID for each snippet. The answer-snippet need not appear literally in its associated document, but may be derived from information in the document. Each line of the optional snippet file is of the form:

target-id document-id answer-snippet
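The line format above can be read with a few lines of Python. This is a minimal sketch, not NIST code: it assumes the three fields are separated by whitespace and that the answer-snippet is everything after the second field. The function name and the sample IDs and snippets are hypothetical.

```python
from collections import defaultdict


def parse_snippet_lines(lines):
    """Group (document_id, snippet) pairs by target ID.

    Assumption: fields are whitespace-separated and the snippet is
    the remainder of the line after the document ID.
    """
    snippets = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # malformed line: no snippet text
        target_id, document_id, snippet = parts
        snippets[target_id].append((document_id, snippet))
    return dict(snippets)


# Hypothetical sample lines, for illustration only.
sample = [
    "1001 DOC-A the service was slow\n",
    "1001 DOC-B staff were friendly\n",
    "1002 DOC-C too expensive\n",
]
by_target = parse_snippet_lines(sample)
```

Grouping by target ID keeps all snippets for one summary together, which matches the one-summary-per-target output described in the overview.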

Sample input data (generated using documents and answer-snippets from human assessors):

Sample targets/questions/documents
Sample documents (password-protected, gzipped tar file)
Optional Sample answer-snippets

Actual test data will be generated using documents and answer-snippets from human assessors and TAC 2008 QA participants. There will be approximately 20-25 targets in the test data.

III. Submission guidelines

Submission format

A submission to the opinion summarization task will comprise exactly one file per summary, where the name of each summary file is the numeric ID of the target of the summary. Please include a file for each summary, even if the file is empty. The number of non-whitespace characters in the summary must not exceed 7000 times the number of squishy list questions for the target of the summary. Each file will be read and assessed as a plain text file, so no special characters or markup are allowed. The files must be in a directory whose name is the concatenation of the Team ID and the priority of the run. (For example, if the Team ID is "SYSX" then the directory name for the first-priority run should be "SYSX1".) Please package the directory in a tar file and gzip the tar file before submitting it to NIST.
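The length limit and packaging rules above can be checked locally before submission. The following is a rough sketch under stated assumptions, not the NIST checking script: all function names are hypothetical, and UTF-8 is an assumed encoding, since the guidelines only say "plain text".

```python
import tarfile
from pathlib import Path

CHARS_PER_QUESTION = 7000  # per-question limit stated in the guidelines


def non_whitespace_length(text):
    """Count the characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def within_limit(summary, num_questions):
    """True if the summary respects the 7000-per-question limit."""
    return non_whitespace_length(summary) <= CHARS_PER_QUESTION * num_questions


def run_directory_name(team_id, priority):
    """Concatenate Team ID and run priority, e.g. 'SYSX' + 1 -> 'SYSX1'."""
    return f"{team_id}{priority}"


def package_run(base_dir, team_id, priority, summaries):
    """Write one file per target and bundle the directory as a .tar.gz.

    summaries maps a numeric target ID (as a string) to summary text;
    an empty string still produces an (empty) file.
    """
    run_dir = Path(base_dir) / run_directory_name(team_id, priority)
    run_dir.mkdir(parents=True, exist_ok=True)
    for target_id, text in summaries.items():
        # Encoding is an assumption; the guidelines do not name one.
        (run_dir / target_id).write_text(text, encoding="utf-8")
    archive = run_dir.parent / (run_dir.name + ".tar.gz")
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(run_dir), arcname=run_dir.name)
    return archive
```

For a target with two squishy list questions, `within_limit(summary, 2)` allows up to 14000 non-whitespace characters. The checking script released by NIST remains the authoritative test.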

Submission procedure

Each team may submit up to three runs, ranked by priority (1-3). NIST will evaluate the first-priority run from each team. If resources allow, NIST will evaluate an additional 1 or 2 runs from each team.

NIST will post the test data on the TAC Summarization web site on August 22, and results must be submitted to NIST by 11:59 p.m. (EDT) on September 2, 2008. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the track mailing list when the test data is released. At that time, NIST will also release a script that checks for common errors in submission files, such as invalid IDs and missing summaries. Participants should check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject any submission in which the script detects errors.

Submissions may be either manual or automatic. For automatic runs, no changes can be made to any component of the summarization system or any resource used by the system in response to the current year's test data (targets, questions, or documents). If any part of the system (including resources used) is changed or tuned in response to the current year's test data, then the resulting run must be classified as a manual run. At the time of submission, each team will be asked to fill out a form stating:

whether the submission is manual or fully automatic. A description of the manual processing (if any) will also be requested.
whether or not the system used the optional answer-snippets provided by NIST.

IV. Evaluation

Rather than evaluating content against a set of model summaries, each summary will be evaluated for content using the nugget Pyramid method used to evaluate the squishy list questions in the TAC QA task. The assessor will use the list(s) of acceptable nuggets previously created for the question(s) in the QA track and count the nuggets contained in each summary. Each nugget that is present will be counted only once. Scoring will be the same as for the QA squishy list score, but likely with a lower value for beta (i.e., recall will be weighted less heavily than in the QA task).
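To illustrate the effect of beta, the sketch below uses the standard F-measure, which combines precision and recall with a weight beta on recall. It assumes the QA squishy list score takes this form; the exact definitions of nugget precision and recall are given in the QA Track Guidelines and are not reproduced here. The function name and the input values are illustrative only.

```python
def f_beta(precision, recall, beta):
    """Standard F-measure: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only, not taken from any evaluation.
precision, recall = 0.2, 0.8
for beta in (3.0, 1.0, 0.5):
    print(f"beta={beta}: F={f_beta(precision, recall, beta):.3f}")
```

With high recall and low precision, as in this example, the score falls as beta decreases, which shows how a lower beta reduces the reward for covering many nuggets in a long summary.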

The assessor will also give an overall responsiveness score to each summary. The overall responsiveness score will be an integer from 1 to 10 (10 being best) and will reflect both content and linguistic quality. NIST will use the overall responsiveness score to determine appropriate parameters for scoring, including an appropriate value for beta.

V. Schedule

TAC 2008 Opinion Summarization Task Schedule
August 22	Release of test data
September 2, 11:59pm (EDT)	Deadline for participants' submissions
early October	Release of individual evaluated results

