TAC 2009 Update Summarization Task Guidelines
(Also see general TAC 2009 policies and guidelines at http://tac.nist.gov/2009/)
Overview
The TAC 2009 update summarization task is based on the following scenario:
A user is interested in a particular news story and wants
to track it as it develops over time, so she subscribes to a
news feed that sends her relevant articles as they are submitted from
various news services. However, either there's so much news that she
can't keep up with it, or she has to leave for a while and then wants
to catch up. Whenever she checks up on the news, it bothers her that
most articles keep repeating the same information; she would like to
read summaries that only talk about what's new or different.
The TAC 2009 update summarization task is to generate short fluent
multi-document summaries of news articles. For each topic,
participants are given a topic statement expressing the information
need of a user, and two chronologically ordered batches of articles
about the topic. Participants are asked to generate a 100-word summary
for each batch of articles that addresses the information need of the
user. The summary of the second batch of articles should be written
under the assumption that the user has already read the earlier batch
of articles and should inform the user of new information about the
topic.
The 2009 task repeats the TAC 2008 update summarization task, with the following changes:
-
In 2008, many of the topics had documents that spanned a wide time
period. In 2009, NIST assessors have been more careful to select
relevant documents that are as close together in time as possible,
subject to the availability of relevant documents in the AQUAINT-2
document collection.
-
In 2009, overall responsiveness is being evaluated on a 10-point scale rather than a 5-point scale. The extended scale is intended to give this metric greater discriminative power. Values on the 10-point scale can be mapped to a 5-point scale to allow comparison with past years' evaluations, as sketched below.
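NIST does not prescribe the mapping in these guidelines; purely as a
hypothetical illustration, one natural choice is to pair adjacent
scores (the function name is an invention for this sketch):

    def to_five_point(score_10):
        # Hypothetical mapping (an assumption, not specified by NIST):
        # 1-2 -> 1, 3-4 -> 2, 5-6 -> 3, 7-8 -> 4, 9-10 -> 5.
        return (score_10 + 1) // 2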
The test data for the update summarization task will be available
on the TAC 2009 Summarization Track home page on
July 1, 2009. Submissions are due at NIST on or before July 15. Each
team may submit up to two runs (submissions), and all runs will be
judged. Runs must be fully automatic.
Test Data
The test dataset is composed of 44 topics. Each topic has a topic
statement (title and narrative) and 20 relevant documents which have
been divided into 2 sets: Document Set A and Document Set B. Each
document set has 10 documents, and all the documents in Set A
chronologically precede the documents in Set B.
Test topic statements and document sets will be distributed by NIST via the
TAC 2009 Summarization web page. Teams will need to use their TAC 2009 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2009 team ID and password for the summarization track, teams must submit the following forms to NIST, even if these forms were already submitted in previous TAC cycles.
- Agreement Concerning Dissemination of TAC Results
- AQUAINT-2 Organization form
When submitting forms, please also include the TAC 2009 team ID, the email address of the main TAC 2009 contact person for the team, and a comment saying that the form is from a TAC 2009 registered participant.
Documents
The documents for summarization come from the AQUAINT-2
collection of news articles. The AQUAINT-2 collection is a subset of the LDC English Gigaword Third Edition (LDC catalog number LDC2007T07) and comprises approximately 2.5 GB of text (about 907K documents) spanning the time period of October 2004 - March 2006. Articles are in English and come from a variety of sources including Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and the Associated Press. Each document has an ID consisting of a source code, a date when the document was delivered to LDC, and 4 digits to differentiate documents that come from the same source on the same date; for example, document NYT_ENG_20050311.0029 was received from the New York Times on March 11, 2005.
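For illustration, document IDs of this form can be decomposed
mechanically; the following is a minimal sketch (parse_doc_id is a
hypothetical helper, not part of any TAC or LDC tooling):

    from datetime import date

    def parse_doc_id(doc_id):
        # "NYT_ENG_20050311.0029" -> ("NYT_ENG", date(2005, 3, 11), "0029")
        head, serial = doc_id.split(".")
        source, datestr = head.rsplit("_", 1)
        when = date(int(datestr[:4]), int(datestr[4:6]), int(datestr[6:8]))
        return source, when, serial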
Test Data Format
The topic statements and documents will be in the same format as the TAC 2008 Update Summarization topic statements and documents.
Submission guidelines
System task
Given a topic, the task is to write 2 summaries (one for Document Set A
and one for Document Set B) that address the information need expressed in the
corresponding topic statement.
- The summary for Document Set A should be a straightforward query-focused summary.
- The update summary for Document Set B is also query-focused but should be written under the assumption that the user of the summary has already read the documents in Document Set A.
Each summary should be well-organized and written in English, using
complete sentences. A blank line may be used to separate paragraphs,
but no other formatting is allowed (such as bulleted lists, tables,
bold-face type, etc.). Each summary can be
no longer than 100 words (whitespace-delimited tokens). Summaries over
the size limit will be truncated.
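Because over-length summaries are truncated, participants may want to
enforce the limit themselves before submitting. A minimal sketch of
the whitespace-delimited counting described above (truncate_summary is
a hypothetical helper; note that rejoining with single spaces discards
any paragraph breaks):

    def truncate_summary(text, limit=100):
        # Keep only the first `limit` whitespace-delimited tokens.
        tokens = text.split()
        return " ".join(tokens[:limit])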
Within a topic, the document sets must be processed in
chronological order; i.e., the summarizer cannot look at documents in Set
B when generating the summary for Set A. However, the documents
within a document set can be processed in any order.
All processing of documents and generation of summaries must be
automatic. No changes can be made to any component of the
summarization system or any resource used by the system in response to
the current year's test data.
Submission format
Each team may submit up to two runs. NIST will evaluate all submitted runs.
A run will comprise exactly
one file per summary, where the name of each summary file is the ID of
its document set. Please include a file for each summary, even if the
file is empty. Each file will be read and assessed as a plain text
file, so no special characters or markup are allowed. The files must
be in a directory whose name is the concatenation of the Team ID and
the run number (1 or 2). (For example, if the Team ID is "SYSX" then
the directory name for the first run should be "SYSX1".)
Please package the directory in a tarfile and gzip the tarfile before
submitting it to NIST.
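A minimal sketch of this packaging step using Python's standard
tarfile module (the directory name "SYSX1" follows the example above):

    import tarfile

    run_dir = "SYSX1"  # Team ID "SYSX", run number 1
    # Create a gzip-compressed tarfile containing the run directory.
    with tarfile.open(run_dir + ".tar.gz", "w:gz") as tar:
        tar.add(run_dir)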
Submission procedure
NIST will post the test data on the TAC Summarization web site on July
1, 2009, and results must be submitted to NIST by 11:59 p.m. (EDT) on
July 15, 2009. Results are submitted to NIST using an
automatic submission procedure. Details about the submission
procedure will be emailed to the [email protected] mailing list before
the test data is released. At that time, NIST will release a routine
that checks submission files for common errors such as invalid
IDs, missing summaries, etc. Participants may wish to
check their runs with this script before submitting them to NIST
because the automatic submission procedure will reject the submission
if the script detects any errors.
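NIST's checking routine is distributed separately; purely as a
hypothetical sketch of the kinds of checks mentioned above (the
function and its arguments are assumptions, not NIST's actual script):

    import os

    def check_run(run_dir, expected_ids):
        # Flag missing summary files and files whose names are not
        # valid document set IDs for this year's topics.
        errors = []
        present = set(os.listdir(run_dir))
        for doc_set_id in expected_ids:
            if doc_set_id not in present:
                errors.append("missing summary file: " + doc_set_id)
        for name in present - set(expected_ids):
            errors.append("invalid ID: " + name)
        return errors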
Evaluation
All summaries will first be truncated to 100 words. NIST will then manually evaluate each submitted summary for:
- Content (using Columbia University's Pyramid method)
- Readability/Fluency
- Overall responsiveness
Content:
Multiple model summaries will be used in the Pyramid evaluation of summary content.
Each topic statement and its 2 document sets will be given to 4
different NIST assessors. For each document set, the assessor will
create a 100-word model summary that addresses the information need
expressed in the topic statement.
In the Pyramid evaluation, the assessor will first extract Summary
Content Units (SCUs) from the 4 model summaries for the document set.
Each SCU is assigned a weight equal to the number of model summaries
in which it appears. Once all SCUs have been harvested from the model
summaries, the assessor will determine which of these SCUs can be
found in each of the peer summaries to be evaluated. Repetitive
information is not rewarded, as each SCU contained in the peer summary
is counted only once. The final Pyramid score for a peer summary is
the sum of the weights of the SCUs contained in the summary, divided
by the maximum sum of SCU weights possible for a summary of average
length (where the average length is determined by the mean SCU count
of the model summaries for this document set).
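As a minimal sketch of the scoring arithmetic just described (the data
structures are assumptions made for this illustration; official
scoring uses the Pyramid annotation tools):

    def pyramid_score(peer_scus, scu_weights, avg_scu_count):
        # peer_scus: SCU ids found in the peer summary; scu_weights: maps
        # each SCU id to its weight, i.e. the number of model summaries
        # (at most 4 here) that contain it.
        observed = sum(scu_weights[scu] for scu in set(peer_scus))
        # Maximum attainable score: the avg_scu_count highest-weight SCUs.
        top = sorted(scu_weights.values(), reverse=True)[:avg_scu_count]
        return observed / sum(top)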
Readability/Fluency: The assessor will give a readability/fluency score to each summary. The score reflects the fluency and readability of the summary (independently of whether it contains any information that responds to the topic statement) and is based on factors such as the summary's grammaticality, non-redundancy, referential clarity, focus, and structure and coherence.
Overall Responsiveness: The assessor will give an overall responsiveness score to each summary. The overall responsiveness score is based on both content and readability/fluency.
Readability and Overall Responsiveness will each be judged on the following 10-point scale:
1-2    Very Poor
3-4    Poor
5-6    Barely Acceptable
7-8    Good
9-10   Very Good
TAC 2009 Workshop Presentations and Papers
Each team that submits runs for evaluation is requested to write a paper for the TAC 2009 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2009 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2009 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2009, and the TAC Advisory Committee will select the groups who will present at the workshop. Please see guidelines for papers and presentation proposals at http://tac.nist.gov/2009/reporting_guidelines.html.
Schedule
TAC 2009 Update Summarization Task Schedule
July 1         Release of test data
July 15        Deadline for participants' submissions
September 4    Release of individual evaluated results
September 25   Deadline for TAC 2009 workshop presentation proposals
October 22     Deadline for systems' reports