TAC 2009 AESOP Task Guidelines

(Also see general TAC 2009 policies and guidelines at http://tac.nist.gov/2009/)

Contents:

Overview
Test data
Submission guidelines
Evaluation
Schedule

Overview

The purpose of the Automatically Evaluating Summaries of Peers (AESOP) task is to promote research and development of systems that automatically evaluate the quality of summaries. The focus in 2009 is on developing automatic metrics that accurately measure summary content. Participants will run their automatic metrics on the data from the TAC 2009 Update Summarization task and submit to NIST the results of their evaluations.

The output of automatic metrics will be compared against two manual metrics: the (Modified) Pyramid score, which measures summary content, and Overall Responsiveness, which measures a combination of content and linguistic quality. NIST will calculate Pearson's, Spearman's, and Kendall's correlations between scores produced by each automatic metric and the two manual metrics. Using a one-way ANOVA and the multiple comparison procedure, NIST will also test the discriminative power of the automatic metrics, i.e., the extent to which each automatic metric can detect statistically significant differences between summarizers. The assumption is that a good automatic metric will make the same significant distinctions between summarizers as the manual metrics (and possibly add more), but will not give a contradicting ranking to two summarizers (i.e., infer that Summarizer X is significantly better than Summarizer Y when the manual metric infers that Summarizer Y is significantly better than Summarizer X) or lose too many of the distinctions made by the manual metrics.

AESOP participants will receive all the test data from the TAC 2009 Update Summarization task, plus the human-authored and automatic summaries from that task. The data will be available for download from the TAC 2009 Summarization Track web page on August 24, 2009. The deadline for submission of automatic evaluations is August 30, 2009. Participants may submit up to four runs, all of which will be evaluated by NIST. Runs must be fully automatic. No changes can be made to any component of the AESOP system or any resource used by the system in response to the current year's test data.

Test data

Test data for the AESOP task consists of all test data and summaries produced within the TAC 2009 Update Summarization task:

44 topic statements
88 document sets (two sets of 10 documents for each topic)
four human-authored model summaries for each document set
additional summaries evaluated in the TAC 2009 Update Summarization task

Test data will be distributed by NIST via the TAC 2009 Summarization web page. Teams will need to use their TAC 2009 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2009 team ID and password for the summarization track, teams must submit the following forms to NIST, even if these forms were already submitted in previous TAC cycles.

When submitting forms, please also include the TAC 2009 team ID, the email address of the main TAC 2009 contact person for the team, and a comment saying that the form is from a TAC 2009 registered participant.

Test Data Format

The topic statements and documents will be in the same format as the TAC 2008 Update Summarization topic statements and documents (sample given below):

Sample topic statements
Sample document sets (password-protected, gzipped tar file)

The IDs of document sets have the following naming convention:

The summaries to be evaluated will include both human-authored summaries and automatic summaries. Each summary will be in a single file, with the following file naming convention:

<topic>-<docset>.M.100.<selectorID>.<summarizerID>

Model summaries have an alphabetic summarizerID (A-H, indicating a human summarizer from NIST); all other summarizerIDs are numeric. For example, the summary "D0944-B.M.100.H.38" is a multi-document summary written by automatic Summarizer 38 for document set D0944H-B, and the summary "D0944-B.M.100.H.C" was written by NIST Assessor C for the same Document Set. Each human summarizer from NIST will have summaries for only 22 topics (total 44 summaries), while each of the other summarizers will have one summary per document set (total of 88 summaries).
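The file naming convention above can be split into its fields mechanically. The following is a minimal sketch, not an official NIST tool: the names `SummaryId` and `parse_summary_id` are hypothetical, and the regular expression assumes (from the examples in these guidelines only) that topics look like "D" plus four digits, that the docset is "A" or "B", and that the selectorID is a single uppercase letter.

```python
import re
from typing import NamedTuple


class SummaryId(NamedTuple):
    """Fields of a summary file name (hypothetical helper type)."""
    topic: str
    docset: str
    selector_id: str
    summarizer_id: str

    @property
    def is_model(self) -> bool:
        # Model (human-authored) summaries have an alphabetic summarizerID.
        return self.summarizer_id.isalpha()


# <topic>-<docset>.M.100.<selectorID>.<summarizerID>
# Field shapes are assumptions inferred from the examples in the guidelines.
_PATTERN = re.compile(r"^(D\d{4})-([AB])\.M\.100\.([A-Z])\.([A-H]|\d+)$")


def parse_summary_id(name: str) -> SummaryId:
    """Split a summary file name into its four variable fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid summary file name: {name!r}")
    return SummaryId(*match.groups())
```

For instance, `parse_summary_id("D0944-B.M.100.H.C").is_model` is true, while the same call on "D0944-B.M.100.H.38" yields a numeric summarizerID and `is_model` is false.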

Submission guidelines

System task

Given a set of peer summaries (model summaries and other summaries produced as part of the TAC 2009 Update Summarization task), the goal of the AESOP task is to automatically produce a summarizer-level score that will correlate with one or both of the following manual metrics from the TAC 2009 Update Summarization task:

(Modified) Pyramid Score
Overall Responsiveness

The actual AESOP task is to produce two sets of numeric summary-level scores:

All Peers case: a numeric score for each peer summary, including the model summaries. The "All Peers" case is intended to focus on whether an automatic metric can differentiate between human and automatic summarizers.
No Models case: a numeric score for each peer summary, excluding the model summaries. The "No Models" case is intended to focus on how well an automatic metric can evaluate automatic summaries.

Different summarizers may have different numbers of summaries. NIST will assume that the final summarizer-level score is the mean of the summarizer's summary-level scores. Please contact the track coordinator if this is not the case for your metric and you use a different calculation to arrive at the final summarizer-level scores.
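The default aggregation described above (summarizer-level score = mean of that summarizer's summary-level scores) can be sketched as follows. The function name `summarizer_scores` is hypothetical; the sketch assumes the summarizerID is the last dot-separated field of the summary file name.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarizer_scores(summary_scores: Dict[str, float]) -> Dict[str, float]:
    """Average summary-level scores per summarizer.

    Keys of summary_scores are summary file names; the summarizerID is
    taken to be the last dot-separated field of each name.
    """
    per_summarizer: Dict[str, List[float]] = defaultdict(list)
    for summary_id, score in summary_scores.items():
        summarizer_id = summary_id.rsplit(".", 1)[-1]
        per_summarizer[summarizer_id].append(score)
    # Summarizers may have different numbers of summaries; mean() handles that.
    return {sid: mean(scores) for sid, scores in per_summarizer.items()}
```

Because the mean is taken per summarizer, a human summarizer with 44 summaries and an automatic summarizer with 88 summaries are both reduced to a single comparable number.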

Participants are allowed (but not required) to use the designated model summaries as reference summaries to score a peer summary; however, when evaluating a model summary S in the "All Peers" case, participants are not allowed to include S in the set of reference summaries.
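The reference-set restriction can be enforced with a small filter. This is a sketch under stated assumptions: `references_for` is a hypothetical helper, and it assumes that all summaries of one document set share the same file-name prefix up to the final summarizerID field.

```python
from typing import List


def references_for(peer_id: str, model_ids: List[str]) -> List[str]:
    """Return the model summaries that may serve as references for peer_id.

    Only models for the same document set are kept, and a model summary
    is never used as a reference for itself.
    """
    docset_prefix = peer_id.rsplit(".", 1)[0]
    return [
        model_id
        for model_id in model_ids
        if model_id.rsplit(".", 1)[0] == docset_prefix and model_id != peer_id
    ]
```

With four models per document set, an automatic summary gets four candidate references, while a model summary gets the other three.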

All processing of test data and generation of scores must be automatic. No changes can be made to any component of the system or any resource used by the system in response to the current year's test data.

Submission format

Each team may submit up to four runs. NIST will evaluate all submitted runs.

A run consists of a single ASCII file containing two sets of summary-level scores produced by the participant's automatic evaluation metric. Each line of the file must be in the following format:

eval_case summary_id score

where

"eval_case" is either "AllPeers" or "NoModels",
"summary_id" is the name of the file containing the summary in the AESOP test data, and
"score" is the automatic score of the summary.

For example:

AllPeers D0901-A.M.100.C.1 0.5
AllPeers D0901-A.M.100.C.2 0.5
AllPeers D0901-A.M.100.C.3 0.5
AllPeers D0901-A.M.100.C.A 0.8
AllPeers D0901-A.M.100.C.B 0.8
AllPeers D0901-A.M.100.C.C 0.8
AllPeers D0944-B.M.100.H.1 0.3
AllPeers D0944-B.M.100.H.2 0.3
AllPeers D0944-B.M.100.H.3 0.3
AllPeers D0944-B.M.100.H.A 0.6
AllPeers D0944-B.M.100.H.B 0.6
AllPeers D0944-B.M.100.H.C 0.6
NoModels D0901-A.M.100.C.1 0.45
NoModels D0901-A.M.100.C.2 0.45
NoModels D0901-A.M.100.C.3 0.45
NoModels D0944-B.M.100.H.1 0.25
NoModels D0944-B.M.100.H.2 0.25
NoModels D0944-B.M.100.H.3 0.25
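Lines in this format can be produced and read back with a few lines of code. The sketch below is illustrative only and is not the checking script that NIST distributes; the function names `format_run_line` and `parse_run_line` are hypothetical. It assumes fields are separated by whitespace and that a summary is a model summary exactly when its last file-name field is alphabetic.

```python
from typing import Tuple

EVAL_CASES = ("AllPeers", "NoModels")


def format_run_line(eval_case: str, summary_id: str, score: float) -> str:
    """Build one 'eval_case summary_id score' line of a run file."""
    if eval_case not in EVAL_CASES:
        raise ValueError(f"unknown eval_case: {eval_case!r}")
    summarizer_id = summary_id.rsplit(".", 1)[-1]
    if eval_case == "NoModels" and summarizer_id.isalpha():
        # The "No Models" case excludes the human-authored model summaries.
        raise ValueError("model summaries are excluded from the NoModels case")
    return f"{eval_case} {summary_id} {score}"


def parse_run_line(line: str) -> Tuple[str, str, float]:
    """Split a run-file line into (eval_case, summary_id, score)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    eval_case, summary_id, score = fields
    if eval_case not in EVAL_CASES:
        raise ValueError(f"unknown eval_case: {eval_case!r}")
    return eval_case, summary_id, float(score)
```

Rejecting model summaries in the "NoModels" case at write time catches one likely mistake early, before the run is submitted.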

At the time of submission, participants should indicate whether the run is intended to correlate with the (Modified) Pyramid Score, or Overall Responsiveness, or both.

NIST will assume that for each run, the final score that the run is giving to a summarizer is the mean of the summarizer's summary-level scores.

Submission procedure

NIST will post the test data on the TAC Summarization web site on August 24, 2009, and results must be submitted to NIST by 11:59 p.m. (EDT) on August 30, 2009. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the [email protected] mailing list before the test data is released. At that time, NIST will also release a script that checks submission files for common errors, such as invalid IDs and missing summaries. Participants may wish to check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject any submission in which the script detects errors.

Evaluation

Each AESOP run will be evaluated for:

Correlation with the manual metric
Discriminative Power compared with the manual metric

Correlation: NIST will calculate Pearson's, Spearman's, and Kendall's correlations between the summarizer-level scores produced by each submitted metric and the summarizer-level scores of each manual metric (Overall Responsiveness and Pyramid).
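For participants who want to estimate these correlations on development data, the three coefficients can be computed with the standard library alone. This is a minimal sketch, not NIST's evaluation code. The guidelines do not say which Kendall variant or which tie handling is used; the sketch assumes average ranks for Spearman and the tau-b variant for Kendall. Inputs are two equal-length lists of summarizer-level scores with nonzero variance.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation (inputs must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson's correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b (an assumed variant), O(n^2) over all pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: counted in neither tie term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))
```

The quadratic Kendall loop is adequate here because the number of summarizers is small; a library implementation would be preferable for large inputs.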

Discriminative Power: NIST will conduct a one-way analysis of variance (ANOVA) on the scores produced by each metric (automatic or manual). The output from ANOVA will be submitted to MATLAB's multiple comparison procedure, using Tukey's honestly significant difference criterion. The multiple comparison procedure tests every pair of summarizers (X, Y) for a significant difference in their mean scores and infers whether:

X > Y, the mean score of Summarizer X is significantly higher than the mean score of Summarizer Y (Summarizer X is significantly better than Summarizer Y),
or
X < Y, the mean score of Summarizer X is significantly lower than the mean score of Summarizer Y (Summarizer X is significantly worse than Summarizer Y),
or
X = Y, the mean score of Summarizer X is not significantly different from the mean score of Summarizer Y (Summarizer X is not significantly better or worse than Summarizer Y).

The multiple comparison procedure will ensure that the probability of inferring that a pair is different when there is no real difference is no more than 0.05. NIST will report the number of pairwise comparisons where the automatic metric agrees or disagrees with the manual metric.
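Once each metric's pairwise inferences are available, tallying agreement is straightforward. The sketch below assumes the inferences have already been computed (for example by an ANOVA followed by a multiple comparison procedure) and are stored as ">", "<", or "=" for each summarizer pair. The function name `compare_distinctions` and the four category labels are hypothetical and are not NIST's reporting format.

```python
from collections import Counter
from typing import Dict, Tuple

Pair = Tuple[str, str]


def compare_distinctions(manual: Dict[Pair, str],
                         automatic: Dict[Pair, str]) -> Counter:
    """Tally how an automatic metric's pairwise inferences relate to a manual one.

    Both arguments map a summarizer pair (X, Y) to ">", "<", or "=".
    """
    counts: Counter = Counter()
    for pair, manual_sign in manual.items():
        auto_sign = automatic[pair]
        if manual_sign == auto_sign:
            counts["agree"] += 1          # same inference, including "="
        elif manual_sign == "=":
            counts["added"] += 1          # automatic metric adds a distinction
        elif auto_sign == "=":
            counts["lost"] += 1           # automatic metric loses a distinction
        else:
            counts["contradicted"] += 1   # opposite significant rankings
    return counts
```

Separating "lost" from "contradicted" mirrors the assumption stated in the Overview: a good metric may add distinctions, should lose few, and should not reverse a significant ranking.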

Baselines

Baseline 1 (shallow): NIST will run ROUGE-1.5.5 to compute ROUGE-SU4 scores, with stemming and keeping stopwords.
Baseline 2 (syntactic): NIST will run ROUGE-1.5.5 to compute Basic Elements (BE) scores. Summaries will be parsed with Minipar, and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.

Where sentences need to be identified for automatic evaluation, NIST will use a simple Perl script for sentence segmentation. Jackknifing will be implemented so that human and system scores can be compared.
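One common way to implement jackknifing is leave-one-out averaging over the reference set, so that human and automatic summaries are each scored against the same number of references. The sketch below illustrates that idea only; it is not NIST's or ROUGE's implementation. The names `jackknife_score` and `unigram_recall` are hypothetical, and `unigram_recall` is a toy stand-in for a real content metric.

```python
from typing import Callable, Dict, List


def unigram_recall(peer_text: str, reference_texts: List[str]) -> float:
    """Toy scorer: mean fraction of each reference's word types found in the peer."""
    peer_words = set(peer_text.lower().split())
    recalls = []
    for reference in reference_texts:
        ref_words = set(reference.lower().split())
        recalls.append(len(peer_words & ref_words) / len(ref_words))
    return sum(recalls) / len(recalls)


def jackknife_score(peer_id: str,
                    peer_text: str,
                    models: Dict[str, str],
                    score_fn: Callable[[str, List[str]], float]) -> float:
    """Score a peer against K-1 references, whether it is a model or not.

    models maps summarizerID to the model summary text for one document set.
    """
    if peer_id in models:
        # A model summary is scored once, against the other K-1 models.
        others = [text for mid, text in models.items() if mid != peer_id]
        return score_fn(peer_text, others)
    # A non-model summary is scored against every leave-one-out subset
    # of the K models, and the K resulting scores are averaged.
    ids = list(models)
    subset_scores = []
    for left_out in ids:
        subset = [models[mid] for mid in ids if mid != left_out]
        subset_scores.append(score_fn(peer_text, subset))
    return sum(subset_scores) / len(subset_scores)
```

Without this step, an automatic summary would be compared against K references and a model summary against only K-1, which would make their scores harder to compare.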

TAC 2009 Workshop Presentations and Papers

Each team that submits runs for evaluation is requested to write a paper for the TAC 2009 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2009 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2009 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2009, and the TAC Advisory Committee will select the groups who will present at the workshop. Please see guidelines for papers and presentation proposals at http://tac.nist.gov/2009/reporting_guidelines.html.

Schedule

TAC 2009 AESOP Task Schedule
August 24	Release of test data
August 30	Deadline for participants' submissions
September 4	Release of individual evaluated results
September 25	Deadline for TAC 2009 workshop presentation proposals
Mid-October	Deadline for systems' reports
