TAC 2011 AESOP Task Guidelines

(Also see general TAC 2011 policies and guidelines at http://tac.nist.gov/2011/)

Contents:

Overview
Test data
Submission guidelines
Evaluation
Schedule

Overview

The purpose of the Automatically Evaluating Summaries of Peers (AESOP) task is to promote research and development of systems that automatically evaluate the quality of summaries in terms of their (1) content and (2) readability (i.e. linguistic quality).

Measuring content: In the TAC 2009 and TAC 2010 AESOP tasks, the focus was on developing automatic metrics that measure summary content at the system level (i.e., the average quality of a summarizer). In 2011, participating metrics will additionally be evaluated for their ability to measure summary content accurately at the level of individual summaries.

Measuring readability: For the first time in the AESOP task, participating metrics will also be evaluated for their ability to measure summary readability, at the level of both summarizers and individual summaries.

Participants can submit metrics that are designed to measure content, readability, or both; however, all metrics will be evaluated in all categories to provide a full picture of each metric's capabilities. Participants will run their automatic metrics on the data from the TAC 2011 Guided Summarization task and submit the results of their evaluations to NIST.

The output of automatic metrics will be compared against three manual metrics: the (Modified) Pyramid score, which measures summary content; Overall Readability, which measures linguistic quality; and Overall Responsiveness, which measures a combination of content and linguistic quality. NIST will calculate Pearson's, Spearman's, and Kendall's correlations between the scores produced by each automatic metric and the three manual metrics, at both the summarizer level and the summary level. Using a one-way ANOVA and a multiple comparison procedure, NIST will also test the discriminative power of the automatic metrics, i.e., the extent to which each automatic metric can detect statistically significant differences between summarizers. The assumption is that a good automatic metric will make the same significant distinctions between summarizers as the manual metrics (and possibly add more), but will neither contradict a manual ranking of two summarizers (i.e., infer that Summarizer X is significantly better than Summarizer Y when the manual metric infers that Summarizer Y is significantly better than Summarizer X) nor lose too many of the distinctions made by the manual metrics.

AESOP participants will receive all the test data from the TAC 2011 Guided Summarization task, plus the human-authored and automatic summaries from that task. The data will be available for download from the TAC 2011 Summarization Track web page on August 22, 2011. The deadline for submission of automatic evaluations is August 28, 2011. Participants may submit up to four runs, all of which will be evaluated by NIST. Runs must be fully automatic. No changes can be made to any component of the AESOP system or any resource used by the system in response to the current year's test data.

Test data

Test data for the AESOP task consists of all test data and summaries produced within the TAC 2011 Guided Summarization task:

Topic statements
Two sets of 10 documents for each topic
Four human-authored model summaries for each document set
Additional summaries evaluated in the TAC 2011 Guided Summarization task
A list of topic categories and important aspects, which summaries were asked to address

Source documents for summarization will come from the newswire portion of the TAC 2010 KBP Source Data (LDC Catalog Number: LDC2010E12). The collection spans the years 2007-2008 and consists of documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires.

Test source documents will be distributed by the LDC. The remaining test data will be distributed by NIST via the TAC 2011 Summarization web page. Teams will need to use their TAC 2011 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2011 team ID and password, teams must submit all required agreement forms, even if these forms were already submitted in previous TAC cycles. See TAC 2011 Summarization Registration Information for how to register, submit required agreement forms, and obtain AESOP test data.

Test Data Format

The topic statements and documents will be in a format similar to that of the TAC 2010 Guided Summarization task.

Sample topic statements
Sample document sets (Available as Past TAC Data: 2010 Guided Summarization)

The topic IDs have the following naming convention:

The summaries to be evaluated will include both human-authored summaries and automatic summaries. Each summary will be in a single file, with the following file naming convention:

<topic>-<docset>.M.100.<selectorID>.<summarizerID>

Model summaries have an alphabetic summarizerID (A-H, indicating a human summarizer from NIST); all other summarizerIDs are numeric. For example, the summary "D1114-B.M.100.H.38" is a multi-document summary written by automatic Summarizer 38 for document set D1114H-B, and the summary "D1114-B.M.100.H.C" was written by NIST Assessor C for the same document set. Each human summarizer from NIST will have summaries for only a subset of the topics, while each of the other summarizers will have one summary per document set.
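The naming convention above can be parsed mechanically. The following is a minimal Python sketch, not a NIST tool: the names SUMMARY_ID, SummaryId, and parse_summary_id are hypothetical, and the character classes in the pattern (a topic of "D" plus digits, a docset of A or B, a single uppercase selector letter) are assumptions inferred from the examples in this section rather than a stated specification.

```python
import re
from typing import NamedTuple

# Assumed shape of <topic>-<docset>.M.100.<selectorID>.<summarizerID>.
# Only the A-H range for model summarizers is stated in the guidelines;
# the other character classes are inferred from the example file names.
SUMMARY_ID = re.compile(
    r"(?P<topic>D\d+)-(?P<docset>[AB])\.M\.100\."
    r"(?P<selector>[A-Z])\.(?P<summarizer>[A-H]|\d+)"
)


class SummaryId(NamedTuple):
    topic: str
    docset: str
    selector: str
    summarizer: str
    is_model: bool  # True when the summarizer ID is alphabetic (human)


def parse_summary_id(name: str) -> SummaryId:
    """Split a summary file name into its components."""
    match = SUMMARY_ID.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized summary file name: {name!r}")
    summarizer = match.group("summarizer")
    return SummaryId(
        match.group("topic"),
        match.group("docset"),
        match.group("selector"),
        summarizer,
        summarizer.isalpha(),
    )
```

For instance, parse_summary_id("D1114-B.M.100.H.38") yields a record whose summarizer is "38" and whose is_model flag is false, while the same call on "D1114-B.M.100.H.C" sets is_model to true.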

Submission guidelines

System task

Given a set of peer summaries (model summaries and other summaries produced as part of the TAC 2011 Guided Summarization task), the goal of the AESOP task is to automatically produce summary-level and summarizer-level scores that will correlate with one or more of the following manual metrics from the TAC 2011 Guided Summarization task:

(Modified) Pyramid Score
Overall Readability
Overall Responsiveness

The actual AESOP task is to produce two sets of numeric summary-level scores:

All Peers case: a numeric score for each peer summary, including the model summaries. The "All Peers" case is intended to focus on whether an automatic metric can differentiate between human and automatic summarizers. If your metric uses model summaries as reference summaries, the scoring process should incorporate jackknifing, i.e., each automatic summary should be evaluated four times, each time against a different subset of three human model summaries. The final score for the automatic summary will be the mean of the four scores. This keeps the scores of human and automatic summaries comparable, because a human summary can be evaluated against only the three remaining model summaries.
No Models case: a numeric score for each peer summary, excluding the model summaries. The "No Models" case is intended to focus on how well an automatic metric can evaluate automatic summaries. If using model summaries as references, each automatic summary can be evaluated against all four references simultaneously.
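The two scoring cases above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not NIST's implementation: unigram_recall is a hypothetical toy metric standing in for a participant's real metric, and the function names and the dictionary layout (model summarizer ID mapped to summary text) are invented for the example.

```python
from statistics import mean
from typing import Callable, Mapping, Sequence

Metric = Callable[[str, Sequence[str]], float]


def unigram_recall(peer: str, references: Sequence[str]) -> float:
    """Hypothetical toy metric: mean fraction of each reference's word
    types that also appear in the peer summary."""
    peer_words = set(peer.lower().split())
    recalls = []
    for reference in references:
        ref_words = set(reference.lower().split())
        overlap = len(peer_words & ref_words)
        recalls.append(overlap / len(ref_words) if ref_words else 0.0)
    return mean(recalls)


def all_peers_score(peer_id: str, peer_text: str,
                    models: Mapping[str, str], metric: Metric) -> float:
    """Score one peer in the "All Peers" case.

    A model summary is scored against the other models only. An automatic
    summary is jackknifed: scored once per left-out model, then averaged.
    """
    if peer_id in models:
        references = [text for mid, text in models.items() if mid != peer_id]
        return metric(peer_text, references)
    scores = []
    for held_out in models:
        references = [text for mid, text in models.items() if mid != held_out]
        scores.append(metric(peer_text, references))
    return mean(scores)


def no_models_score(peer_text: str, models: Mapping[str, str],
                    metric: Metric) -> float:
    """Score an automatic summary against all model summaries at once."""
    return metric(peer_text, list(models.values()))
```

With four model summaries, all_peers_score calls the metric four times for an automatic summary (three references each time) and once for a model summary (against the other three).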

Different summarizers may have different numbers of summaries. NIST will assume that the final summarizer-level score is the mean of the summarizer's summary-level scores. Please contact the track coordinator if this is not the case for your metric and you use a different calculation to arrive at the final summarizer-level scores.
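The default aggregation described above (summarizer-level score as the mean of summary-level scores) can be sketched in Python as below. The function name summarizer_scores is hypothetical, and the sketch assumes that the summarizer ID is the last dot-separated field of the summary file name, as in the naming convention given under "Test Data Format".

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Mapping


def summarizer_scores(summary_scores: Mapping[str, float]) -> Dict[str, float]:
    """Average summary-level scores per summarizer.

    Keys of summary_scores are summary file names; the summarizer ID is
    taken to be the text after the final dot (an assumption based on the
    file naming convention).
    """
    grouped = defaultdict(list)
    for summary_id, score in summary_scores.items():
        summarizer_id = summary_id.rsplit(".", 1)[-1]
        grouped[summarizer_id].append(score)
    # Summarizers may have different numbers of summaries; each mean is
    # taken over however many summaries that summarizer produced.
    return {sid: mean(scores) for sid, scores in grouped.items()}
```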

Participants are allowed (but not required) to use the designated model summaries as reference summaries to score a peer summary; however, when evaluating a model summary S in the "All Peers" case, participants are not allowed to include S in the set of reference summaries (see the explanation of the "All Peers" case above).

All processing of test data and generation of scores must be automatic. No changes can be made to any component of the system or any resource used by the system in response to the current year's test data.

Submission format

Each team may submit up to four runs. NIST will evaluate all submitted runs.

A run consists of a single ASCII file containing two sets of summary-level scores produced by the participant's automatic evaluation metric. Each line of the file must be in the following format:

eval_case summary_id score

where

"eval_case" is either "AllPeers" or "NoModels",
"summary_id" is the name of the file containing the summary in the AESOP test data, and
"score" is the automatic score of the summary.

For example:

AllPeers D1001-A.M.100.C.1 0.5
AllPeers D1001-A.M.100.C.2 0.5
AllPeers D1001-A.M.100.C.3 0.5
AllPeers D1001-A.M.100.C.A 0.8
AllPeers D1001-A.M.100.C.B 0.8
AllPeers D1001-A.M.100.C.C 0.8
AllPeers D1044-B.M.100.H.1 0.3
AllPeers D1044-B.M.100.H.2 0.3
AllPeers D1044-B.M.100.H.3 0.3
AllPeers D1044-B.M.100.H.A 0.6
AllPeers D1044-B.M.100.H.B 0.6
AllPeers D1044-B.M.100.H.C 0.6
NoModels D1001-A.M.100.C.1 0.45
NoModels D1001-A.M.100.C.2 0.45
NoModels D1001-A.M.100.C.3 0.45
NoModels D1044-B.M.100.H.1 0.25
NoModels D1044-B.M.100.H.2 0.25
NoModels D1044-B.M.100.H.3 0.25
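A run file in the format above can be written and sanity-checked with a few lines of Python. The sketch below is only an illustration and is not the checking script that NIST distributes: format_run and parse_run are hypothetical names, and tolerating blank lines is an assumption rather than a stated rule. The check that "NoModels" lines contain no model summaries relies on model summaries having alphabetic summarizer IDs.

```python
from typing import Dict, Mapping


def format_run(all_peers: Mapping[str, float],
               no_models: Mapping[str, float]) -> str:
    """Render two {summary_id: score} mappings as run-file text."""
    lines = []
    for eval_case, scores in (("AllPeers", all_peers), ("NoModels", no_models)):
        for summary_id in sorted(scores):
            lines.append(f"{eval_case} {summary_id} {scores[summary_id]}")
    return "\n".join(lines) + "\n"


def parse_run(text: str) -> Dict[str, Dict[str, float]]:
    """Parse run-file text into {eval_case: {summary_id: score}}."""
    if not text.isascii():
        raise ValueError("run file must be ASCII")
    run: Dict[str, Dict[str, float]] = {"AllPeers": {}, "NoModels": {}}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are tolerated here (an assumption)
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        eval_case, summary_id, score = fields
        if eval_case not in run:
            raise ValueError(f"line {number}: unknown eval_case {eval_case!r}")
        if summary_id in run[eval_case]:
            raise ValueError(f"line {number}: duplicate entry for {summary_id}")
        summarizer_id = summary_id.rsplit(".", 1)[-1]
        if eval_case == "NoModels" and summarizer_id.isalpha():
            raise ValueError(f"line {number}: model summary in NoModels case")
        run[eval_case][summary_id] = float(score)  # raises on non-numeric scores
    return run
```

Parsing the output of format_run gives back the original mappings, which makes a simple round-trip check before submission.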

At the time of submission, participants should indicate whether the run is intended to correlate with the (Modified) Pyramid Score, Overall Responsiveness, Overall Readability, or any subset of the three metrics.

NIST will assume that for each run, the final score that the run is giving to a summarizer is the mean of the summarizer's summary-level scores.

Submission procedure

NIST will post the test data on the TAC Summarization web site on August 22, 2011, and results must be submitted to NIST by 11:59 p.m. (EDT) on August 28, 2011. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the [email protected] mailing list before the test data is released. At that time, NIST will also release a script that checks submission files for common errors, such as invalid IDs and missing summaries. Participants are advised to check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject a submission if the script detects any errors.

Evaluation

Each AESOP run will be evaluated for:

Correlation with the manual metric
Discriminative Power compared with the manual metric

Correlation: NIST will calculate the Pearson's, Spearman's, and Kendall's correlations between the summarizer-level scores produced by each submitted metric and the manual metrics (Overall Responsiveness, Overall Readability, and Pyramid). NIST will also calculate the Pearson's, Spearman's, and Kendall's correlations between the summary-level scores (within each topic) produced by each submitted metric and the manual metrics.
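For participants who want to estimate these correlations on past data, the three coefficients can be sketched in plain Python as below. This is not NIST's evaluation code. The function names are hypothetical, and two details are assumptions, since the guidelines do not specify how ties are handled: tied values receive average ranks in Spearman's coefficient, and the Kendall variant shown is tau-b.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's r; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson's r computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (concordant - discordant) pairs, normalized by the
    geometric mean of the pair counts that are untied in x and untied in y."""
    balance = untied_x = untied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            untied_x += dx != 0
            untied_y += dy != 0
            if dx * dy > 0:
                balance += 1
            elif dx * dy < 0:
                balance -= 1
    return balance / sqrt(untied_x * untied_y)
```

At the summarizer level, x and y would hold one automatic and one manual score per summarizer; at the summary level, the same functions would be applied to the summaries within each topic.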

Discriminative Power: NIST will conduct a one-way analysis of variance (ANOVA) on the scores produced by each metric (automatic or manual). The output from ANOVA will be submitted to MATLAB's multiple comparison procedure, using Tukey's honestly significant difference criterion. The multiple comparison procedure tests every pair of summarizers (X, Y) for a significant difference in their mean scores and infers whether:

X > Y, the mean score of Summarizer X is significantly higher than the mean score of Summarizer Y (Summarizer X is significantly better than Summarizer Y),
or
X < Y, the mean score of Summarizer X is significantly lower than the mean score of Summarizer Y (Summarizer X is significantly worse than Summarizer Y),
or
X = Y, the mean score of Summarizer X is not significantly different from the mean score of Summarizer Y (Summarizer X is not significantly better or worse than Summarizer Y).

The multiple comparison procedure will ensure that the probability of inferring that a pair is different when there is no real difference is no more than 0.05. NIST will report the number of pairwise comparisons where the automatic metric agrees or disagrees with the manual metric.
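Once the pairwise inferences are available for a manual metric and an automatic metric, the comparison between them amounts to counting. The Python sketch below illustrates one way to do that counting and is not NIST's procedure: the function name compare_distinctions and the four category labels are hypothetical, and the inferences themselves (">", "<", or "=" for each pair) are assumed to come from a separate ANOVA and multiple comparison step that is not shown.

```python
from collections import Counter
from typing import Mapping, Tuple

Pair = Tuple[str, str]  # (Summarizer X, Summarizer Y)


def compare_distinctions(manual: Mapping[Pair, str],
                         automatic: Mapping[Pair, str]) -> Counter:
    """Count how an automatic metric's pairwise inferences relate to the
    manual metric's. Each value is ">", "<", or "=" for the pair (X, Y)."""
    counts = Counter(agree=0, added=0, lost=0, contradict=0)
    for pair, manual_result in manual.items():
        automatic_result = automatic[pair]
        if automatic_result == manual_result:
            counts["agree"] += 1       # same inference from both metrics
        elif manual_result == "=":
            counts["added"] += 1       # automatic metric finds a difference
        elif automatic_result == "=":
            counts["lost"] += 1        # manual distinction is not detected
        else:
            counts["contradict"] += 1  # significant, but opposite direction
    return counts
```

Under the assumption stated in the Overview, a good automatic metric would show many "agree" pairs, few "lost" pairs, and no "contradict" pairs.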

Baselines

Baseline 1 (shallow): NIST will run ROUGE-1.5.5 to compute ROUGE-2 scores, with stemming and keeping stopwords.
Baseline 2 (shallow): NIST will run ROUGE-1.5.5 to compute ROUGE-SU4 scores, with stemming and keeping stopwords.
Baseline 3 (syntactic): NIST will run ROUGE-1.5.5 to compute Basic Elements (BE) scores. Summaries will be parsed with Minipar, and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.

Where sentences need to be identified for automatic evaluation, NIST will use a simple Perl script for sentence segmentation. Jackknifing will be implemented so that human and system scores can be compared.

TAC 2011 Workshop Presentations and Papers

Each team that submits runs for evaluation is requested to write a paper for the TAC 2011 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2011 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2011 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2011. Please see guidelines for papers and presentation proposals at http://tac.nist.gov/2011/reporting_guidelines.html.

Schedule

TAC 2011 AESOP Task Schedule
by May 1	TAC 2010 KBP Source Data available from the LDC
August 22	Release of test data (AESOP)
August 28	Deadline for participants' submissions (AESOP)
September 7	Release of individual evaluated results
September 25	Deadline for TAC 2011 workshop presentation proposals
October 25	Deadline for system reports (workshop notebook version)
November 14-15	TAC 2011 Workshop
