TAC 2011 Guided Summarization Task Guidelines

(Also see general TAC 2011 policies and guidelines at http://tac.nist.gov/2011/)

Overview

One of the main problems in automatic text summarization is the absence of a single "gold standard" that automatic systems can model. Summarization is guided by a vague notion of the "importance" of facts mentioned in the source text, a concept which is highly subjective and content-dependent. Methods that select high-scoring sentences based on term frequency provide a good baseline for summarization, but they can be hindered by synonyms and equivalent expressions in the source text and, in multi-document summarization, can produce summaries that are highly redundant and difficult to read. A second problem is the reliance on purely extractive methods: experiments on human extractive summarization (Genest et al., 2009) show that even the best content-selection mechanism (i.e., a human summarizer) is unable to create good summaries if it is limited to pasting together sentences taken out of context from a number of independently written articles.

The design of the TAC 2011 Guided Summarization task aims to address both of these issues simultaneously. By using topics that fall into template-like categories and contain highly predictable elements, and by explicitly guiding the creation of human reference summaries to contain all of these elements, the guided summarization task presents a specific, unified information model that automatic summarizers can emulate. At the same time, the emphasis on finding relevant content at the sub-sentential level enables the use of information extraction techniques and other meaning-oriented methods, and thus encourages a move towards abstractive summarization.

The guided summarization task is to write a 100-word summary of a set of 10 newswire articles for a given topic, where the topic falls into a predefined category. There are five topic categories:

  1. Accidents and Natural Disasters
  2. Attacks
  3. Health and Safety
  4. Endangered Resources
  5. Investigations and Trials

Participants (and human summarizers) are given a list of important aspects for each category, and a summary must cover all these aspects (if the information can be found in the documents). The summaries may also contain other information relevant to the topic.

Additionally, the guided summarization task has an "update" component: writing a 100-word "update" summary of a subsequent set of 10 newswire articles for the topic, under the assumption that the user has already read the earlier articles. (The update summarization task was run in the Summarization track of TAC 2008, TAC 2009, and TAC 2010.)

The goal of the update component in TAC Summarization is to train automatic summarization systems to recognize new (or non-redundant) information in the second set of documents on the same topic. This functionality is important when the subject event extends over time and the user requires periodic updates on a state of affairs about which they already have some knowledge.

Essentially, this problem is just another instance of a more general issue: preventing redundancy in multi-document summarization. The update component in TAC Guided Summarization is intertwined with the guided component: for the second set of documents on a given topic, participants should write the summary following the template for that topic, but the non-redundancy requirement takes precedence, i.e., the summary should not repeat any information present in the first set of documents. If most or all of the template information has already been covered in the first set of documents, the summarizer can include any other information deemed important to the topic. (The "other" aspect can also be used when summarizing the first document set, if there is some information that is non-standard yet essential to the topic.)

In previous years, participants' control of non-redundancy was evaluated only indirectly: the update summaries were compared to model update summaries written by human assessors, mirroring the evaluation process used for main summaries (Owczarzak and Dang, 2010). In effect, this measured only the relevant information in the summaries, not the redundant information. In 2011, update summaries will be judged against information extracted from both the initial and update model summaries; this will allow us to identify relevance and redundancy at the same time.

The test data for the guided summarization task will be available from the LDC on July 1, 2011. Submissions are due at NIST on or before July 17. Each team may submit up to two runs (submissions), and all runs will be judged. Runs must be fully automatic.

Test Data

The test dataset is composed of approximately 44 topics, divided into five categories: Accidents and Natural Disasters, Attacks, Health and Safety, Endangered Resources, Investigations and Trials. Each topic has a topic ID, category, title, and 20 relevant documents which have been divided into 2 sets: Document Set A and Document Set B. Each document set has 10 documents, and all the documents in Set A chronologically precede the documents in Set B. No topic narrative is provided; instead the category and its aspects define what information the reader is looking for.

All test data will be distributed by the LDC. See TAC 2011 Summarization Registration Information for how to register, submit required agreement forms, and obtain test data.

Documents

The documents for summarization come from the newswire portion of the TAC 2010 KBP Source Data (LDC Catalog Number: LDC2010E12). The collection spans the years 2007-2008 and consists of documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires.

Test Data Format

The topic statements and documents will be in a format similar to that used in the TAC 2010 Guided Summarization Task. Each topic belongs to one of five categories, and each topic's category ID is indicated in the topic tag. Sample topic statements and documents are included below:

Submission guidelines

System task

Given a topic, the task is to write 2 summaries (one for Document Set A and one for Document Set B) that describe the event indicated in the topic title, according to the list of aspects given for the topic category.

  1. The summary for Document Set A should be a straightforward query-focused summary.
  2. The update summary for Document Set B is also query-focused but should be written under the assumption that the user of the summary has already read the documents in Document Set A.

Each summary should cover all the aspects relevant to its category, and it may contain other relevant information as well. The categories, their aspects, and their numerical IDs are as follows:

1. Accidents and Natural Disasters:
1.1 WHAT: what happened
1.2 WHEN: date, time, other temporal placement markers
1.3 WHERE: physical location
1.4 WHY: reasons for accident/disaster
1.5 WHO_AFFECTED: casualties (death, injury), or individuals otherwise negatively affected by the accident/disaster
1.6 DAMAGES: damages caused by the accident/disaster
1.7 COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, other reactions to the accident/disaster

2. Attacks (Criminal/Terrorist):
2.1 WHAT: what happened
2.2 WHEN: date, time, other temporal placement markers
2.3 WHERE: physical location
2.4 PERPETRATORS: individuals or groups responsible for the attack
2.5 WHY: reasons for the attack
2.6 WHO_AFFECTED: casualties (death, injury), or individuals otherwise negatively affected by the attack
2.7 DAMAGES: damages caused by the attack
2.8 COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, other reactions to the attack (e.g. police investigations)

3. Health and Safety:
3.1 WHAT: what is the issue
3.2 WHO_AFFECTED: who is affected by the health/safety issue
3.3 HOW: how they are affected
3.4 WHY: why the health/safety issue occurs
3.5 COUNTERMEASURES: countermeasures, prevention efforts

4. Endangered Resources:
4.1 WHAT: description of resource
4.2 IMPORTANCE: importance of resource
4.3 THREATS: threats to the resource
4.4 COUNTERMEASURES: countermeasures, prevention efforts

5. Investigations and Trials (Criminal/Legal/Other):
5.1 WHO: who is a defendant or under investigation
5.2 WHO_INV: who is investigating, prosecuting, or judging
5.3 WHY: general reasons for the investigation/trial
5.4 CHARGES: specific charges against the defendant
5.5 PLEAD: defendant's reaction to charges, including admission of guilt, denial of charges, or explanations
5.6 SENTENCE: sentence or other consequences for the defendant

The categories and aspects were developed based on model summaries from past DUC and TAC summarization tasks. Examples of model summaries from TAC 2008 and TAC 2009, which have been annotated with the above aspects, can be downloaded here:

These examples are provided only to show possible quality and distribution of the aspects in a summary. Participants' summaries should not be annotated or tagged with the aspect labels.

Each summary should be well-organized, written in English, and use complete sentences. A blank line may be used to separate paragraphs, but no other formatting is allowed (such as bulleted points, tables, bold-face type, etc.). Each summary can be no longer than 100 words (whitespace-delimited tokens). Summaries over the size limit will be truncated.
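
As an illustration of how the 100-word limit is applied, the following minimal Python sketch counts whitespace-delimited tokens and truncates a summary the way an over-length submission would be cut. The function name and exact whitespace handling are assumptions for illustration only, not NIST's actual tooling.

    def truncate_summary(text, max_words=100):
        # Illustrative sketch only (assumption: NIST's actual truncation
        # may differ in detail): keep the first max_words whitespace-
        # delimited tokens of the summary text.
        tokens = text.split()
        if len(tokens) <= max_words:
            return text
        return " ".join(tokens[:max_words])

    # Example: a 120-token summary is cut back to its first 100 tokens.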

Within a topic, the document sets must be processed in chronological order; i.e., the summarizer cannot look at documents in Set B when generating the summary for Set A. However, the documents within a document set can be processed in any order.

All processing of documents and generation of summaries must be automatic. No changes can be made to any component of the summarization system or any resource used by the system in response to the current year's test data. Participants may use the list of categories and aspects in their generation process, but this is not obligatory; participants who are unable or do not wish to use the provided categories should still be able to produce query-focused summaries as in previous years.
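
For systems that do choose to use the category and aspect list, one possible internal representation is a simple mapping from category ID to the ordered aspect labels defined above. The Python dictionary below is only an illustrative sketch built from the list in this section; it is not a data format prescribed by the track.

    # Hypothetical representation of the TAC 2011 categories and their aspects,
    # copied from the guidelines above; systems may use any structure they like.
    CATEGORY_ASPECTS = {
        1: ["WHAT", "WHEN", "WHERE", "WHY", "WHO_AFFECTED",
            "DAMAGES", "COUNTERMEASURES"],                  # Accidents and Natural Disasters
        2: ["WHAT", "WHEN", "WHERE", "PERPETRATORS", "WHY",
            "WHO_AFFECTED", "DAMAGES", "COUNTERMEASURES"],  # Attacks
        3: ["WHAT", "WHO_AFFECTED", "HOW", "WHY",
            "COUNTERMEASURES"],                             # Health and Safety
        4: ["WHAT", "IMPORTANCE", "THREATS",
            "COUNTERMEASURES"],                             # Endangered Resources
        5: ["WHO", "WHO_INV", "WHY", "CHARGES",
            "PLEAD", "SENTENCE"],                           # Investigations and Trials
    }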

Submission format

Each team may submit up to two runs. NIST will evaluate all submitted runs.

A run will comprise exactly one file per summary, where the name of each summary file is the ID of its document set. Please include a file for each summary, even if the file is empty. Each file will be read and assessed as a plain text file, so no special characters or markup are allowed. The files must be in a directory whose name is the concatenation of the Team ID and a number (1-2) for the run. (For example, if the Team ID is "SYSX" then the directory name for the first run should be "SYSX1".) Please package the directory in a tarfile and gzip the tarfile before submitting it to NIST.
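
A minimal sketch of packaging a run as described above, assuming a Python environment; the directory name "SYSX1" comes from the example in this section, while the helper name and output path are illustrative assumptions.

    import tarfile

    def package_run(run_dir):
        # Illustrative sketch: gzip-compress a run directory (e.g. "SYSX1",
        # containing one plain-text summary file per document-set ID) into
        # "SYSX1.tar.gz". File names and layout here are assumptions.
        archive_name = run_dir + ".tar.gz"
        with tarfile.open(archive_name, "w:gz") as tar:
            tar.add(run_dir)  # add the whole run directory to the tarfile
        return archive_name

    # Example: package_run("SYSX1") produces SYSX1.tar.gz ready for upload.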

Submission procedure

LDC will release the test data on July 1, 2011 and results must be submitted to NIST by 11:59 p.m. (EDT) on July 17, 2011. Teams will need to use their TAC 2011 Team ID and Team Password to submit results through the TAC 2011 Summarization web page. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the duc_list@nist.gov mailing list before the test data is released. At that time, NIST will release a routine that checks for common errors in submission files including such things as invalid ID, missing summaries, etc. Participants may wish to check their runs with this script before submitting them to NIST because the automatic submission procedure will reject the submission if the script detects any errors.
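
The NIST checking routine itself is not reproduced here; the sketch below only illustrates, under stated assumptions, the kind of checks it performs (for example, a missing or empty summary file for an expected document-set ID). The function name and file-naming details are hypothetical.

    import os

    def check_run(run_dir, expected_docset_ids):
        # Illustrative sketch only (not NIST's actual checking script).
        # Assumes one plain-text file per document-set ID, named after that ID.
        problems = []
        for docset_id in expected_docset_ids:
            path = os.path.join(run_dir, docset_id)
            if not os.path.exists(path):
                problems.append("missing summary file: " + docset_id)
            elif os.path.getsize(path) == 0:
                problems.append("empty summary file: " + docset_id)
        return problems

    # Example (hypothetical IDs): check_run("SYSX1", ["D1101-A", "D1101-B"])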

Evaluation

All summaries will first be truncated to 100 words. NIST will then manually evaluate each submitted summary for:

  1. Content (based on Columbia University's Pyramid method)
  2. Readability/Fluency
  3. Overall responsiveness

Content: Multiple model summaries will be used in the Pyramid evaluation of summary content. Each topic statement and its 2 document sets will be given to 4 different NIST assessors. For each document set, the assessor will create a 100-word model summary covering all the aspects listed for the topic category (if such information can be found in the documents). The assessor can also include other information relevant to the topic. The assessors will be guided by the following:

In the Pyramid evaluation, the assessor will first extract Summary Content Units (SCUs) from the 4 model summaries for the document set, sorting the SCUs into aspect bins (one bin per aspect of a given category). Each SCU is assigned a weight equal to the number of model summaries in which it appears. Once all SCUs have been harvested from the model summaries, the assessor will determine which of these SCUs can be found in each of the peer summaries to be evaluated. Repetitive information is not rewarded, as each SCU contained in the peer summary is counted only once. The final Pyramid score for a peer summary is the sum of the weights of the SCUs contained in the summary, divided by the maximum sum of SCU weights achievable by a summary of average length (where the average length is determined by the mean SCU count of the model summaries for this document set).
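
A minimal sketch of this score computation, assuming the SCU annotation has already been done; the function name and data shapes are illustrative, and the "maximum achievable" sum is taken greedily over the highest-weight SCUs, following the description above.

    def pyramid_score(peer_scu_ids, pyramid_weights, avg_model_scu_count):
        # Illustrative sketch of the Pyramid score described above.
        #   peer_scu_ids: set of SCU ids found in the peer summary (each counted once)
        #   pyramid_weights: dict mapping SCU id -> weight (number of model
        #                    summaries containing that SCU)
        #   avg_model_scu_count: mean SCU count of the model summaries
        observed = sum(pyramid_weights[scu] for scu in peer_scu_ids)
        # Maximum achievable sum of weights for a summary of "average" size:
        # the avg_model_scu_count highest-weight SCUs in the pyramid.
        k = int(round(avg_model_scu_count))
        max_possible = sum(sorted(pyramid_weights.values(), reverse=True)[:k])
        return observed / max_possible if max_possible else 0.0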

In the update summarization component, peer summaries will be evaluated against SCUs harvested from both the update and initial model summaries. This will provide a direct measure of how much "old" information was unnecessarily repeated in the peer update summary.

For additional details, see:

The Pyramid evaluation will be adapted to provide detailed scores on the level of each category and each aspect. This way, participants can find out about their system's performance in extracting different types of information.

Readability/Fluency: The assessor will give a readability/fluency score to each summary. The score reflects the fluency and readability of the summary (independently of whether it contains any relevant information) and is based on factors such as the summary's grammaticality, non-redundancy, referential clarity, focus, and structure and coherence.

Overall Responsiveness: The assessor will give an overall responsiveness score to each summary. The overall responsiveness score is based on both content (coverage of all required aspects) and readability/fluency.

Readability and Overall Responsiveness will each be judged on the following 5-point scale:

    1 Very Poor
    2 Poor
    3 Barely Acceptable
    4 Good
    5 Very Good

TAC 2011 Workshop Presentations and Papers

Each team that submits runs for evaluation is requested to write a paper for the TAC 2011 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2011 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2011 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2011. Please see guidelines for papers and presentation proposals at http://tac.nist.gov/2011/reporting_guidelines.html.

Schedule

by May 1         TAC 2010 KBP Source Data available from the LDC
June 3           Deadline for TAC 2011 track registration
July 1           Release of test data (Guided task)
July 17          Deadline for participants' submissions (Guided task)
August 22        Release of test data (AESOP)
August 28        Deadline for participants' submissions (AESOP)
September 7      Release of individual evaluated results (Guided task, AESOP)
September 25     Deadline for TAC 2011 workshop presentation proposals
October 25       Deadline for system reports (workshop notebook version)
November 14-15   TAC 2011 Workshop

