Workshop on Evaluation Metrics and System Comparison
for Automatic Summarization
June 8, 2012
Montreal, Quebec, Canada
05/07/2012: The workshop program has been posted. A significant part of the workshop will be devoted to presentation and discussion of a new community task on summarization of scientific literature, planned for late 2012 or early 2013.
03/29/2012: Drago Radev of the University of Michigan will give the invited talk on summarization of academic articles.
01/27/2012: Please make your hotel reservations early (now might be a good time), as hotel space will be extremely limited during the days surrounding the workshop due to people coming in for the Montreal Grand Prix Formula 1.
01/27/2012: The paper submission deadline has been extended to Sunday, April 1.
Interest in summarization research has been steadily growing in the past decade, with numerous new methods being proposed for generic and topic-focused summarization of news. Other genres and domains, most notably related to spoken input, have also become well established, including summarization of broadcast news, meetings, spoken conversations and lectures.
At the same time, development of evaluation metrics for summarization
and of resources for some genres and domains has lagged behind. Manual
evaluation protocols (Pyramid scores for content selection, scores for
linguistic quality and overall responsiveness) show considerable
disparity between human performance and the performance of systems for
multi-document summarization of news; however, the widely used suite
for automatic evaluation of content, ROUGE, shows a much narrower
difference between machine and human performance and even fails to
distinguish the two. For speech summarization, ROUGE also does not
properly reflect the difference between human and automatic
summarizers and, unlike for written news, has low correlations with
manual evaluation protocols. The challenge of automatic evaluation of
the linguistic quality of summaries has also only recently started to be
addressed.
It has also become harder to identify the most competitive approaches to summarization. This is partly due to confusing or inconsistent evidence that comes from different test sets. Evaluating the same system configuration against several test sets will enable a fairer comparison between methods and will further stimulate research on automatic evaluation metrics.
For this workshop, we seek submissions on a wide range of topics related to evaluation and system comparison in summarization. Topics of interest include:
- system comparison on several evaluation datasets. For example, for multi-document summarization, we seek systems evaluated on multiple years of DUC/TAC data, with emphasis on measuring statistically significant differences
- manual evaluation protocols for summarization in new genres where existing methods may not apply
- manual evaluation protocols for abstractive summarization that assess the text-to-text generation capabilities of the systems and reward successful generation
- automatic evaluation metrics of linguistic quality
- automatic evaluation metrics that better reflect the differences in human and machine performance
- automatic metrics that significantly outperform ROUGE in content selection evaluation for news summarization
- automatic metrics that perform evaluation without the use of human gold standards
- analyses of domain and genre differences that expose weaknesses of currently adopted evaluation metrics, and proposals for addressing these weaknesses
Submissions will consist of regular full papers of up to 8 pages, plus
additional pages for references. Shorter papers are also welcome.
All papers should be formatted following the NAACL-HLT 2012
guidelines. As the reviewing will be blind, the paper must not
include the authors' names and affiliations. Furthermore,
self-references that reveal the author's identity, e.g., "We
previously showed (Smith, 1991) ..." must be avoided. Instead, use
citations such as "Smith previously showed (Smith, 1991) ..."
We encourage individuals who are submitting papers on automatic
methods for summarization and evaluation to evaluate their approaches
using multiple publicly available datasets, such as those from DUC and the TAC Summarization track.
Both submission and review processes will be handled electronically
using the Softconf submission software (https://www.softconf.com/naaclhlt2012/WEAS2012/).
The submission deadline is Sunday, April 1, 2012, at 11:59 PM Pacific
Standard Time (GMT-8).
Apr 01: Paper due date (EXTENDED deadline)
Apr 25: Notification of acceptance
May 04: Camera-ready deadline
Jun 08: Workshop at NAACL-HLT 2012
John Conroy (IDA Center for Computing Sciences)
Hoa Dang (National Institute of Standards and Technology)
Ani Nenkova (University of Pennsylvania)
Karolina Owczarzak (National Institute of Standards and Technology)
Enrique Amigo (UNED, Madrid)
Giuseppe Carenini (University of British Columbia)
Katja Filippova (Google Research)
George Giannakopoulos (NCSR Demokritos)
Dan Gillick (University of California at Berkeley)
Min-Yen Kan (National University of Singapore)
Guy Lapalme (University of Montreal)
Yang Liu (University of Texas, Dallas)
Annie Louis (University of Pennsylvania)
Kathy McKeown (Columbia University)
Gabriel Murray (University of British Columbia)
Dianne O'Leary (University of Maryland)
Drago Radev (University of Michigan)
Steve Renals (University of Edinburgh)
Horacio Saggion (Universitat Pompeu Fabra)
Judith Schlesinger (IDA Center for Computing Sciences)
Josef Steinberger (European Commission Joint Research Centre)
Stan Szpakowicz (University of Ottawa)
Lucy Vanderwende (Microsoft Research)
Stephen Wan (CSIRO ICT Centre)
Xiaodan Zhu (National Research Council Canada)
Please contact us by email: email@example.com