TAC 2010 AESOP Task Guidelines

(Also see general TAC 2010 policies and guidelines at http://tac.nist.gov/2010/)

Contents:

Overview
Test data
Submission guidelines
Evaluation
Schedule

Overview

The purpose of the Automatically Evaluating Summaries of Peers (AESOP) task is to promote research and development of systems that automatically evaluate the quality of summaries. As last year, the focus in 2010 is on developing automatic metrics that accurately measure summary content. Participants will run their automatic metrics on the data from the TAC 2010 Guided Summarization task and submit to NIST the results of their evaluations.

The output of automatic metrics will be compared against two manual metrics: the (Modified) Pyramid score, which measures summary content, and Overall Responsiveness, which measures a combination of content and linguistic quality. NIST will calculate Pearson's, Spearman's, and Kendall's correlations between scores produced by each automatic metric and the two manual metrics. Using a one-way ANOVA and the multiple comparison procedure, NIST will also test the discriminative power of the automatic metrics, i.e., the extent to which each automatic metric can detect statistically significant differences between summarizers. The assumption is that a good automatic metric will make the same significant distinctions between summarizers as the manual metrics (and possibly add more), but will not give a contradicting ranking to two summarizers (i.e., infer that Summarizer X is significantly better than Summarizer Y when the manual metric infers that Summarizer Y is significantly better than Summarizer X) or lose too many of the distinctions made by the manual metrics.

AESOP participants will receive all the test data from the TAC 2010 Guided Summarization task, plus the human-authored and automatic summaries from that task. The data will be available for download from the TAC 2010 Summarization Track web page on August 23, 2010. The deadline for submission of automatic evaluations is August 29, 2010. Participants may submit up to four runs, all of which will be evaluated by NIST. Runs must be fully automatic. No changes can be made to any component of the AESOP system or any resource used by the system in response to the current year's test data.

Test data

Test data for the AESOP task consists of all test data and summaries produced within the TAC 2010 Guided Summarization task:

Topic statements
Two sets of 10 documents for each topic
Four human-authored model summaries for each document set
Additional summaries evaluated in the TAC 2010 Guided Summarization task
A list of topic categories and important aspects, which summaries were asked to address

Test data will be distributed by NIST via the TAC 2010 Summarization web page. Teams will need to use their TAC 2010 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2010 team ID and password for the summarization track, teams must submit the following forms to NIST, even if these forms were already submitted in previous TAC cycles:

Agreement Concerning Dissemination of TAC Results
AQUAINT Organization form
AQUAINT-2 Organization form

Forms are available at the TAC User Agreements web page. When submitting forms, please also include the TAC 2010 team ID, the email address of the main TAC 2010 contact person for the team, and a comment saying that the form is from a TAC 2010 registered participant.

Test Data Format

The topic statements and documents will be in a format similar to that of the TAC 2009 Update Summarization Task, except that this year there is no topic narrative and a topic category has been added.

Sample topic statements
Sample document sets (Available as Past TAC Data: 2010 Guided Summarization)

The topic IDs have the following naming convention:

The summaries to be evaluated will include both human-authored summaries and automatic summaries. Each summary will be in a single file, with the following file naming convention:

<topic>-<docset>.M.100.<selectorID>.<summarizerID>

Model summaries have an alphabetic summarizerID (A-H, indicating a human summarizer from NIST); all other summarizerIDs are numeric. For example, the summary "D1014-B.M.100.H.38" is a multi-document summary written by automatic Summarizer 38 for document set D1014H-B, and the summary "D1014-B.M.100.H.C" was written by NIST Assessor C for the same document set. Each human summarizer from NIST will have summaries for only a subset of the topics, while each of the other summarizers will have one summary per document set.
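The file naming convention above can be split into its fields with a short parser. The following is a minimal sketch, not an official NIST tool. It assumes that the topic is "D" followed by digits, that the docset is "A" or "B", and that the selectorID is a single uppercase letter; these patterns are inferred from the examples in this document, and the names SummaryId and parse_summary_id are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern inferred from examples such as "D1014-B.M.100.H.38" (an assumption).
SUMMARY_ID = re.compile(
    r"(?P<topic>D\d+)-(?P<docset>[AB])\.M\.100\."
    r"(?P<selector>[A-Z])\.(?P<summarizer>[A-H]|\d+)"
)


class SummaryId(NamedTuple):
    topic: str
    docset: str
    selector: str
    summarizer: str

    @property
    def is_model(self) -> bool:
        # Alphabetic summarizerIDs (A-H) denote human-authored model summaries.
        return self.summarizer.isalpha()


def parse_summary_id(name: str) -> SummaryId:
    """Split a summary file name into its fields, or raise ValueError."""
    match = SUMMARY_ID.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid summary file name: {name!r}")
    return SummaryId(**match.groupdict())
```

For instance, parse_summary_id("D1014-B.M.100.H.C").is_model is true, while the same call on "D1014-B.M.100.H.38" yields a numeric summarizer field and is_model is false.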

Submission guidelines

System task

Given a set of peer summaries (model summaries and other summaries produced as part of the TAC 2010 Guided Summarization task), the goal of the AESOP task is to automatically produce a summarizer-level score that will correlate with one or both of the following manual metrics from the TAC 2010 Guided Summarization task:

(Modified) Pyramid Score
Overall Responsiveness

The actual AESOP task is to produce two sets of numeric summary-level scores:

All Peers case: a numeric score for each peer summary, including the model summaries. The "All Peers" case is intended to focus on whether an automatic metric can differentiate between human and automatic summarizers.
No Models case: a numeric score for each peer summary, excluding the model summaries. The "No Models" case is intended to focus on how well an automatic metric can evaluate automatic summaries.

Different summarizers may have different numbers of summaries. NIST will assume that the final summarizer-level score is the mean of the summarizer's summary-level scores. Please contact the track coordinator if this is not the case for your metric and you use a different calculation to arrive at the final summarizer-level scores.
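The default aggregation described above (the mean of a summarizer's summary-level scores) can be sketched as follows. This is an illustration only; the function name summarizer_means is hypothetical, and the summarizerID is taken to be the last dot-separated field of the summary file name.

```python
from collections import defaultdict
from statistics import mean


def summarizer_means(summary_scores):
    """Map each summarizerID to the mean of its summary-level scores.

    summary_scores maps a summary file name to its numeric score.
    """
    by_summarizer = defaultdict(list)
    for summary_id, score in summary_scores.items():
        # The summarizerID is the final field, e.g. "1" in "D1001-A.M.100.C.1".
        summarizer = summary_id.rsplit(".", 1)[-1]
        by_summarizer[summarizer].append(score)
    return {sid: mean(scores) for sid, scores in by_summarizer.items()}
```

Because the mean is taken per summarizer, summarizers with different numbers of summaries are handled without any special casing.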

Participants are allowed (but not required) to use the designated model summaries as reference summaries to score a peer summary; however, when evaluating a model summary S in the "All Peers" case, participants are not allowed to include S in the set of reference summaries.

All processing of test data and generation of scores must be automatic. No changes can be made to any component of the system or any resource used by the system in response to the current year's test data.

Submission format

Each team may submit up to four runs. NIST will evaluate all submitted runs.

A run consists of a single ASCII file containing two sets of summary-level scores produced by the participant's automatic evaluation metric. Each line of the file must be in the following format:

eval_case summary_id score

where

"eval_case" is either "AllPeers" or "NoModels",
"summary_id" is the name of the file containing the summary in the AESOP test data, and
"score" is the automatic score of the summary.

For example:

AllPeers D1001-A.M.100.C.1 0.5
AllPeers D1001-A.M.100.C.2 0.5
AllPeers D1001-A.M.100.C.3 0.5
AllPeers D1001-A.M.100.C.A 0.8
AllPeers D1001-A.M.100.C.B 0.8
AllPeers D1001-A.M.100.C.C 0.8
AllPeers D1044-B.M.100.H.1 0.3
AllPeers D1044-B.M.100.H.2 0.3
AllPeers D1044-B.M.100.H.3 0.3
AllPeers D1044-B.M.100.H.A 0.6
AllPeers D1044-B.M.100.H.B 0.6
AllPeers D1044-B.M.100.H.C 0.6
NoModels D1001-A.M.100.C.1 0.45
NoModels D1001-A.M.100.C.2 0.45
NoModels D1001-A.M.100.C.3 0.45
NoModels D1044-B.M.100.H.1 0.25
NoModels D1044-B.M.100.H.2 0.25
NoModels D1044-B.M.100.H.3 0.25
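A participant might sanity-check a run file against the line format above before submitting. The sketch below is not the NIST checking routine, whose exact checks are not described here; parse_run is a hypothetical name, and skipping blank lines is a leniency chosen for this illustration that the official procedure may not share.

```python
def parse_run(lines):
    """Parse run-file lines into {eval_case: {summary_id: score}}.

    Raises ValueError on malformed lines, unknown eval_case values,
    duplicate entries, or model summaries listed under "NoModels".
    """
    run = {"AllPeers": {}, "NoModels": {}}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated in this sketch
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        eval_case, summary_id, raw_score = fields
        if eval_case not in run:
            raise ValueError(f"line {number}: unknown eval_case {eval_case!r}")
        # Model summaries have an alphabetic summarizerID as their last field.
        is_model = summary_id.rsplit(".", 1)[-1].isalpha()
        if eval_case == "NoModels" and is_model:
            raise ValueError(f"line {number}: model summary in NoModels case")
        if summary_id in run[eval_case]:
            raise ValueError(f"line {number}: duplicate entry for {summary_id}")
        run[eval_case][summary_id] = float(raw_score)  # ValueError if not numeric
    return run
```

A typical use would be parse_run(open(path)) for a hypothetical run-file path, followed by a check that every summary in the test data has an entry in each applicable case.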

At the time of submission, participants should indicate whether the run is intended to correlate with the (Modified) Pyramid Score, or Overall Responsiveness, or both.

NIST will assume that for each run, the final score that the run is giving to a summarizer is the mean of the summarizer's summary-level scores.

Submission procedure

NIST will post the test data on the TAC Summarization web site on August 23, 2010, and results must be submitted to NIST by 11:59 p.m. (EDT) on August 29, 2010. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the [email protected] mailing list before the test data is released. At that time, NIST will release a script that checks submission files for common errors, such as invalid IDs and missing summaries. Participants may wish to check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject the submission if the script detects any errors.

Evaluation

Each AESOP run will be evaluated for:

Correlation with the manual metric
Discriminative Power compared with the manual metric

Correlation: NIST will calculate Pearson's, Spearman's, and Kendall's correlations between the summarizer-level scores produced by each submitted metric and each of the manual metrics (Overall Responsiveness and Pyramid).
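For participants who want to estimate these correlations on past data, the three coefficients can be sketched in plain Python as shown below. This is an illustrative implementation, not NIST's evaluation code. In particular, the choice of the tau-b variant of Kendall's coefficient and of average ranks for ties in Spearman's coefficient are assumptions; the guidelines do not state which tie-handling conventions are used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which adjusts the denominator for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in no term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))
```

Here x would hold the summarizer-level scores of the automatic metric and y the corresponding scores of one manual metric, listed in the same summarizer order.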

Discriminative Power: NIST will conduct a one-way analysis of variance (ANOVA) on the scores produced by each metric (automatic or manual). The output from ANOVA will be submitted to MATLAB's multiple comparison procedure, using Tukey's honestly significant difference criterion. The multiple comparison procedure tests every pair of summarizers (X, Y) for a significant difference in their mean scores and infers whether:

X > Y, the mean score of Summarizer X is significantly higher than the mean score of Summarizer Y (Summarizer X is significantly better than Summarizer Y),
or
X < Y, the mean score of Summarizer X is significantly lower than the mean score of Summarizer Y (Summarizer X is significantly worse than Summarizer Y),
or
X = Y, the mean score of Summarizer X is not significantly different from the mean score of Summarizer Y (Summarizer X is not significantly better or worse than Summarizer Y).

The multiple comparison procedure will ensure that the probability of inferring that a pair is different when there is no real difference is no more than 0.05. NIST will report the number of pairwise comparisons where the automatic metric agrees or disagrees with the manual metric.
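Once the pairwise decisions are available for both metrics, the comparison can be tallied as in the sketch below. It takes the ">", "<", and "=" decisions as given and does not implement ANOVA or the multiple comparison procedure itself. The function name and the four category labels ("agree", "added", "lost", "contradict") are hypothetical, chosen to mirror the distinctions discussed in the Overview; they are not NIST's reporting format.

```python
def compare_decisions(manual, automatic):
    """Tally pairwise agreement between two metrics' significance decisions.

    Both arguments map a summarizer pair (X, Y) to ">", "<", or "=".
    """
    tally = {"agree": 0, "added": 0, "lost": 0, "contradict": 0}
    for pair, manual_decision in manual.items():
        auto_decision = automatic[pair]
        if auto_decision == manual_decision:
            tally["agree"] += 1
        elif manual_decision == "=":
            # The automatic metric found a difference the manual metric did not.
            tally["added"] += 1
        elif auto_decision == "=":
            # The automatic metric missed a difference the manual metric found.
            tally["lost"] += 1
        else:
            # The two metrics rank the pair in opposite directions.
            tally["contradict"] += 1
    return tally
```

Under the assumption stated in the Overview, a good automatic metric would show many pairs under "agree", possibly some under "added", few under "lost", and none under "contradict".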

Baselines

Baseline 1 (shallow): NIST will run ROUGE-1.5.5 to compute ROUGE-2 scores, with stemming and keeping stopwords.
Baseline 2 (shallow): NIST will run ROUGE-1.5.5 to compute ROUGE-SU4 scores, with stemming and keeping stopwords.
Baseline 3 (syntactic): NIST will run ROUGE-1.5.5 to compute Basic Elements (BE) scores. Summaries will be parsed with Minipar, and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.

Where sentences need to be identified for automatic evaluation, NIST will use a simple Perl script for sentence segmentation. Jackknifing will be implemented so that human and system scores can be compared.
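One common way to implement jackknifing is sketched below; the exact procedure NIST uses is not specified here, so this should be read as an assumption. A model summary is scored against the other models, and any other peer is scored against every leave-one-out subset of the models, with the results averaged, so that every summary is always compared with the same number of references. The names jackknife_score and score_fn are hypothetical, and score_fn stands in for any metric that scores a peer text against a list of reference texts.

```python
from statistics import mean


def jackknife_score(peer_text, peer_id, models, score_fn):
    """Score a peer against N-1 of the N model summaries for its document set.

    models maps a model summarizerID (e.g. "A") to that model's summary text.
    """
    others = [text for sid, text in models.items() if sid != peer_id]
    if len(others) < len(models):
        # The peer is itself a model: score it once against the other models.
        return score_fn(peer_text, others)
    # Otherwise, average over every subset that leaves one model out.
    return mean(
        score_fn(peer_text, [text for sid, text in models.items() if sid != left_out])
        for left_out in models
    )
```

With four models per document set, both human and automatic summaries are thus always scored against three references.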

TAC 2010 Workshop Presentations and Papers

Each team that submits runs for evaluation is requested to write a paper for the TAC 2010 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2010 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2010 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 26, 2010, and the TAC Advisory Committee will select the groups who will present at the workshop. Please see guidelines for papers and presentation proposals at http://tac.nist.gov/2010/reporting_guidelines.html.

Schedule

TAC 2010 AESOP Task Schedule
August 23	Release of test data
August 29	Deadline for participants' submissions
September 7	Release of individual evaluated results
September 26	Deadline for TAC 2010 workshop presentation proposals
October 27	Deadline for systems' reports

Comments to: [email protected]