SIXTH RECOGNIZING TEXTUAL ENTAILMENT CHALLENGE at TAC 2010
(http://www.nist.gov/tac/2010/RTE/)

The Recognizing Textual Entailment (RTE) task consists of developing a system that, given two text fragments, can determine whether the meaning of one text is entailed by, i.e. can be inferred from, the other text. Since its inception in 2005, RTE has enjoyed steadily growing popularity in the NLP community. After the first three highly successful PASCAL RTE Challenge campaigns held in Europe, in 2008 RTE became a track at the Text Analysis Conference (TAC), bringing it together with communities working on NLP applications. This interaction has provided the opportunity to apply RTE to specific application settings and move it towards more realistic scenarios. In particular, the RTE-5 Pilot Search Task represented a step forward: for the first time, textual entailment recognition was performed on a corpus instead of on isolated Text-Hypothesis (T-H) pairs, and in the setting of a real NLP application, namely Summarization.

Encouraged by the positive response obtained so far, the RTE Organizing Committee is glad to launch the Sixth Recognizing Textual Entailment Challenge, proposed for the third year as a track of TAC.

Organizations interested in participating in the RTE-6 Challenge are invited to submit a track registration form by May 21, 2010, at the TAC 2010 web site:
http://www.nist.gov/tac/2010/

WHAT IS NEW IN RTE-6

1) RTE-6 does not include the traditional RTE Main Task which was carried out in the first five RTE challenges, i.e. there will be no task to make entailment judgments over isolated T-H pairs drawn from multiple applications.

2) A new Main Task based only on the Summarization application setting is proposed, together with a subtask:

- Main Task: Recognizing Textual Entailment within a Corpus.
  A close variant of the Pilot Search Task in RTE-5, the RTE-6 Main Task differs significantly in two ways:
  * Unlike in RTE-5, where the Search Task was performed on the whole corpus, in RTE-6 a preliminary Information Retrieval filtering phase is performed using Lucene, in order to select for each H a subset of candidate entailing sentences to be judged by the participating systems.
  * In the RTE-6 data set, some of the H's have no entailing sentences.

- Novelty Detection subtask. This task has the same structure as the Main Task, but it is separated out as a subtask to allow participants to optimize their RTE engines for detecting novelty, i.e. judging whether the information contained in each H is novel with respect to the information contained in the corpus. A novel H is defined as one that has no entailing sentences in the set of candidate T's. Systems' outputs will have the same format as for the Main Task but will be scored using metrics specifically designed for assessing novelty detection.

3) A KBP Validation Pilot, set in the Knowledge Base Population scenario, is also proposed.

4) The exploratory effort on resource evaluation will also be extended to tools. Ablation tests for both knowledge resources and tools will be mandatory for participants in the new RTE-6 Main Task.

RTE-6 MAIN TASK - RECOGNIZING TEXTUAL ENTAILMENT WITHIN A CORPUS

In the RTE-6 Main Task, given a corpus, a hypothesis H, and a set of "candidate" entailing sentences for that H retrieved by Lucene from the corpus, RTE systems are required to identify all the sentences that entail H among the candidate sentences.

The RTE-6 Main data set is based on the data created for the TAC 2009 Update Summarization task, consisting of a number of topics, each containing two sets of documents, namely i) Cluster A, made up of the first 10 texts in chronological order of publication date, and ii) Cluster B, made up of the last 10 texts.
H's are standalone sentences taken from Cluster B documents, while candidate entailing sentences (T's) are the 100 top-ranked sentences retrieved for each H by Lucene from the Cluster A corpus, using H verbatim as the search query.

Although only the candidate entailing sentences must be judged for entailment, they are not to be treated as isolated texts: the entire Cluster A corpus, to which the candidate sentences belong, is to be taken into consideration in order to resolve discourse references and judge the entailment relation appropriately.

The example below presents a hypothesis referring to a given topic and some of the entailing sentences found in the subset of candidate sentences (the first entailing sentence entails H because "new hurricane" can be seen to resolve to "Hurricane Rita" from the context in which it occurs in its Cluster A document):

H: Rita barreled toward the Gulf of Mexico.

T: World oil prices fell further on Tuesday, despite a new hurricane powering towards oil facilities in the Gulf of Mexico, and as OPEC pledged to supply more crude from the start of October if required.

T: Hurricane Rita barreled near southern Florida islands and headed toward the Gulf of Mexico, threatening Texas and Louisiana with winds of 160 kilometers per hour (100 mph).

T: Hurricane Rita pounded the fragile Florida Keys islands Tuesday as it barreled toward the oil-rich Gulf of Mexico.

RTE-6 NOVELTY DETECTION SUBTASK

The Novelty Detection subtask is based on the Main Task and is aimed specifically at the interests of the Summarization community, in particular with regard to the Update Summarization task, focusing on the detection of novelty in Cluster B documents.

The task consists of judging whether the information contained in each H (drawn from the Cluster B documents) is novel with respect to the information contained in the set of Cluster A candidate entailing sentences.
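
Since a novel H is defined as one with no entailing sentences among its candidate T's, the novelty decision can be read directly off Main Task style output. The following is only a minimal sketch under stated assumptions: the function name, the dictionary layout, and the IDs are hypothetical and do not reflect the official submission format or scoring.

```python
from typing import Dict, List, Set


def novel_hypotheses(candidates: Dict[str, List[str]],
                     entailing: Dict[str, Set[str]]) -> Set[str]:
    """Return the IDs of hypotheses judged novel.

    candidates: hypothesis ID -> IDs of its candidate sentences (T's).
    entailing:  hypothesis ID -> IDs of sentences the system judged as
                entailing that hypothesis (may omit hypotheses with none).
    Both layouts are assumptions made for this illustration.
    """
    novel = set()
    for h_id, cand_ids in candidates.items():
        # Only judgments about this H's own candidate sentences count.
        found = entailing.get(h_id, set()) & set(cand_ids)
        if not found:
            novel.add(h_id)
    return novel


# Hypothetical IDs, for illustration only.
candidates = {"H1": ["T1", "T2", "T3"], "H2": ["T4", "T5"]}
entailing = {"H1": {"T2"}}  # the system found one entailing sentence for H1
print(sorted(novel_hypotheses(candidates, entailing)))  # ['H2']
```

No separate novelty judgment is produced by the system in this sketch; the novel set is derived entirely from the per-sentence entailment decisions.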
If one or more entailing sentences are found for a given H, the content of that H is not new. Conversely, if no entailing sentences are detected, the information contained in the H is regarded as novel.

The Novelty Detection Task requires the same output format as the Main Task, i.e. no additional type of decision is needed. Nevertheless, the Novelty Detection Task differs from the Main Task in the following ways:

1) The H's are only a subset of the H's used for the Main Task;
2) The system outputs are scored differently, using specific scoring metrics designed for assessing novelty detection.

The Main and Novelty Detection Task guidelines for participants, together with one sample topic taken from the Development Set, are available at the RTE-6 website (http://www.nist.gov/tac/2010/RTE/).

RTE-6 KBP VALIDATION PILOT TASK

Based on the TAC Knowledge Base Population (KBP) Slot-Filling task, the new KBP Validation Pilot task is to determine whether a given relation (Hypothesis) is supported in an associated document (Text). Each slot fill that is proposed by a system for the KBP Slot-Filling task would create one evaluation item for the RTE-KBP Validation Pilot: the Hypothesis would be a simple sentence created from the slot fill, while the Text would be the source document that was cited as supporting the slot fill.

The guidelines and the Development Set will be available by the end of April 2010 at the RTE-6 website (http://www.nist.gov/tac/2010/RTE/).

RESOURCE AND TOOL EVALUATION THROUGH ABLATION TESTS

The exploratory effort on resource evaluation started in RTE-5 will continue on the new RTE-6 Main Task and will be extended to tools. Ablation tests are required for systems participating in the new RTE-6 Main Task, in order to collect data on the impact of both the knowledge resources and the tools used by RTE systems and to evaluate their contribution to system performance.
An ablation test consists of removing one module from a complete system and rerunning the system on the test set with the remaining modules only. By comparing the results with those obtained by the complete system, it is possible to assess the practical contribution of the individual module.

THE RTE RESOURCE POOL AT ACLwiki
(http://www.aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool)

The RTE Resource Pool, set up for the first time during RTE-3, serves as a portal and forum for publicizing and tracking resources, and reporting on their use. All the RTE participants and other members of the NLP community who develop or use relevant resources are encouraged to contribute to this important resource.

The RTE Resource Pool has been updated with a section specifically dedicated to knowledge resources. The new page (http://www.aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool#Knowledge_Resources) contains a list of the "standard" RTE resources, i.e. those that have been most widely selected and used in the design of RTE systems during the RTE challenges held so far, together with links to the locations where they are made available. Furthermore, the results of the ablation tests carried out in RTE-5, together with their descriptions, are also provided.
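
As an illustration of the ablation procedure described earlier (run the complete system, then rerun it once per module with that module removed, and compare the scores), here is a minimal sketch. It is not the official evaluation procedure: the function names, the module names, and the toy scoring function with its made-up numbers are all hypothetical placeholders for a real system run and a real metric.

```python
from typing import Callable, Dict, FrozenSet


def ablation_deltas(evaluate: Callable[[FrozenSet[str]], float],
                    modules: FrozenSet[str]) -> Dict[str, float]:
    """Compute full-system score minus ablated score for each module.

    evaluate: runs the system on the test set with the given active
              modules and returns a single score (assumed interface).
    A positive delta means removing the module lowered the score.
    """
    full_score = evaluate(modules)
    return {
        module: full_score - evaluate(modules - {module})
        for module in sorted(modules)
    }


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Stand-in for a real test-set run; the numbers are invented."""
    gains = {"lexical_db": 0.03, "parser": 0.05, "unused_rule_set": 0.0}
    return 0.30 + sum(gains[m] for m in active)


all_modules = frozenset({"lexical_db", "parser", "unused_rule_set"})
for name, delta in ablation_deltas(toy_evaluate, all_modules).items():
    print(f"{name}: {delta:+.2f}")
```

Each module is removed one at a time from the otherwise complete system, so every delta is measured against the same full-system baseline.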
TENTATIVE SCHEDULE

April 23        KBP Validation Pilot: Release of Development Set
April 30        Main Task: Release of Development Set
May 21          Deadline for TAC 2010 track registration
September 2     Main Task: Release of Test Set
September 9     Main Task: Deadline for task submissions
September 10    KBP Validation Pilot: Release of Test Set
September 16    Main Task: Release of individual evaluated results
September 17    KBP Validation Pilot: Deadline for task submissions
September 24    Main Task: Deadline for ablation tests submissions
September 24    KBP Validation Pilot: Release of individual evaluated results
September 26    Deadline for TAC 2010 workshop presentation proposals
October 1       Main Task: Release of individual ablation test results
October 20      Deadline for systems' reports

TRACK COORDINATORS AND ORGANIZERS:

Luisa Bentivogli, CELCT and FBK, Italy (Track coordinator, bentivo@fbk.eu)
Danilo Giampiccolo, CELCT, Italy (Track coordinator, giampiccolo@celct.it)
Hoa Trang Dang, NIST, USA
Ido Dagan, Bar Ilan University, Israel
Peter Clark, Boeing, USA