RTE-4 GUIDELINES FOR PARTICIPANTS

INTRODUCTION

The Recognizing Textual Entailment (RTE) challenge is an annual exercise that provides a framework for the evaluation of textual entailment systems and promotes international research in this area. In this evaluation exercise, systems must recognize whether one piece of text entails another.

We define textual entailment as a directional relationship between two text fragments, which we term the Text (T) and the Hypothesis (H). We say that T entails H if, typically, a human reading T would infer that H is most likely true.

Entailment is judged given assumed common background knowledge of the relevant domain, such as business news. If H is not entailed by T, there are two possibilities: either H contradicts T, or the truth of H cannot be determined on the basis of T. Guidelines for all three cases are given below.

TASK DESCRIPTION

Textual entailment recognition is the task of deciding, given a T-H pair, whether T entails H.

The three-way RTE task is to decide whether:

  - T entails H (ENTAILMENT);
  - T contradicts H (CONTRADICTION); or
  - the truth of H cannot be determined on the basis of T (UNKNOWN).

The two-way RTE task is to decide whether:

  - T entails H (ENTAILMENT); or
  - T does not entail H (NO ENTAILMENT).

When T entails H

Following are some guidelines for deciding whether a Text entails a Hypothesis:

  1. The Hypothesis must be fully entailed by the Text. The judgment cannot be ENTAILMENT if the Hypothesis includes parts that cannot be inferred from the Text.
  2. Entailment is a directional relation. The Hypothesis must be fully entailed by the given Text, but the Text need not be entailed by the Hypothesis. From that perspective, it might help to first read the Hypothesis and understand what it states, and only then read the Text and see if it has sufficient information to entail the Hypothesis.
  3. Verb tense issues must be ignored, as the Text and Hypothesis may originate from documents written at different points in time. For instance, the Hypothesis Yahoo bought Overture is ENTAILED even if the Text reads Yahoo will conclude the acquisition of Overture next week.
  4. Common knowledge -- such as "a company has a CEO", "a CEO is an employee of the company", "an employee is a person", etc. -- is presupposed.
  5. T is considered to entail H even if the entailment is just very probable rather than certain. For example, John purchased the book should entail John paid for the book, even though it is theoretically possible to buy something without paying for it. On the other hand, Mary criticized the proposal should NOT entail Mary rejected the proposal unless there is a strong reason to believe that Mary did indeed reject it.

When T does not entail H: CONTRADICTION VS UNKNOWN

Following are some guidelines for determining whether T contradicts H, or the truth of H is unknown based on T:

  1. H contradicts T if, taking the Hypothesis as reliable, its assertions directly refute portions of the Text or show them to be false/wrong.


  2. A Text-Hypothesis pair reporting contradictory statements is marked as a CONTRADICTION if the reports are stated as facts (these cases can be seen as embedded contradictions).

  3. For something to be a contradiction, it does not have to be impossible for the Text and Hypothesis to be reconcilable; it just has to appear highly unlikely in the absence of further evidence. For instance, it is reasonable to regard the pair T-H6 as a contradiction (it is not very plausible to be able to determine that someone's throat was cut if the bodies were not found for over 18 months), but it does not seem prudent to regard the pair T-H7 as contradictory (despite a certain similarity in the reports, they could refer to different events and could easily both be true).


  4. Noun Phrase Co-reference: compatible noun phrases in the Text and the Hypothesis should be treated as co-referent in the absence of clear countervailing evidence. For example, in the pair T-H8 it should be assumed that the two references to "a woman" refer to the same woman.


  5. Event Co-reference: if two descriptions appear to overlap, rather than being completely unrelated, it should by default be assumed that the two passages describe the same context, and contradiction is evaluated on this basis. For example, if the details make it clear that the same event is being described, but one passage says it happened in 1985 and the other 1987, or one passage says two people met in Greece and the other in Italy, then the pair should be regarded as a contradiction. For instance, in the pair T-H9 it seems reasonable to regard "a ferry collision" and "a ferry sinking" as the same event, and the claims about casualties as contradictory.


  6. In other circumstances, it may be most reasonable to regard the two passages as describing different events. For instance, example T-H7 above was not marked as a contradiction, as it does not seem compelling to regard "another suicide bomb blast" and "a car bomb explosion" as referring to the same event.

(Many other examples can be found at http://nlp.stanford.edu/RTE3-pilot/, where the links to three-way annotated datasets from previous campaigns are provided. The current guidelines for contradiction annotation are based on the guidelines by Marie-Catherine de Marneffe and Christopher Manning, used for the evaluation of the pilot task at RTE-3; for the original version see http://nlp.stanford.edu/RTE3-pilot/contradictions.pdf.)

TEST SET FORMAT

The dataset of Text-Hypothesis pairs is collected by human annotators and consists of four subsets which correspond to different application settings: Information Extraction (IE), Information Retrieval (IR), Question Answering (QA), and Multi-Document Summarization (SUM).

The dataset is formatted as an XML file, as follows:

<pair id="id_num" entailment="ENTAILMENT|CONTRADICTION|UNKNOWN" task="IE|IR|QA|SUM">
      <t>the text...</t>
      <h>the hypothesis...</h>
</pair>

Where:

  - id_num is a unique numerical identifier of the T-H pair;
  - entailment is the entailment judgment for the pair (ENTAILMENT, CONTRADICTION, or UNKNOWN);
  - task specifies the application setting from which the pair was generated (IE, IR, QA, or SUM).
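For illustration, the following is a minimal sketch of reading such a file with Python's standard library. The file name rte4_test.xml is our own placeholder, and the pairs are assumed to sit under a single root element:

    import xml.etree.ElementTree as ET

    # Parse the dataset (the file name is a placeholder).
    tree = ET.parse("rte4_test.xml")

    # Each <pair> element carries its id, task, and entailment judgment
    # as attributes; the text and hypothesis are child elements.
    for pair in tree.getroot().iter("pair"):
        pair_id = pair.get("id")
        task = pair.get("task")            # IE | IR | QA | SUM
        judgment = pair.get("entailment")  # may be absent in unannotated data
        text = pair.findtext("t")
        hypothesis = pair.findtext("h")
        print(pair_id, task, judgment)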

RESULT SUBMISSION

There are two tasks in this year's RTE challenge:

  - the three-way task, in which each pair is classified as ENTAILMENT, CONTRADICTION, or UNKNOWN;
  - the two-way task, in which each pair is classified as ENTAILMENT or NO ENTAILMENT.

Teams can participate in either or both tasks. No partial submissions are allowed, i.e. each submission must cover the whole dataset. Each team is allowed to submit up to 6 runs (up to 3 runs for each task), which allows teams that attempt both 3-way and 2-way classification to optimize/train separately for each task. Teams that participate in the 3-way task and do not have a separate strategy for the 2-way task (other than automatically conflating CONTRADICTION and UNKNOWN to NO ENTAILMENT) should not submit separate runs for the 2-way task, because runs for the 3-way task will automatically be scored for both the 3-way task and the 2-way task.

Each run may optionally rank all the T-H pairs in the test set according to their entailment confidence, in decreasing order from the most certain entailment to the least certain: the more confident the system is that T entails H, the higher the pair should be ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Because the evaluation measure for confidence ranking applies only to the 2-way classification task, in the case of three-way runs the pairs tagged as CONTRADICTION and UNKNOWN will be conflated and automatically re-tagged as NO ENTAILMENT for scoring purposes, as sketched below.
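To make this conflation rule concrete, a minimal sketch in Python (the function name is our own):

    def conflate_to_two_way(judgment):
        # CONTRADICTION and UNKNOWN both count as NO ENTAILMENT when
        # a three-way run is scored on the two-way task.
        return "ENTAILMENT" if judgment == "ENTAILMENT" else "NO ENTAILMENT"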

Runs will be submitted using a password-protected online submission form on the RTE web page. The link to the submission form will be posted at the same time that the test data set is released. Only teams who have registered for the TAC 2008 RTE track and who have submitted the required Agreement Concerning Dissemination of TAC Results may access the test data and submit runs.

At the time of submission, each team will be asked to fill out the form stating:

NB: Analyses of the test set (whether manual or automatic) must not in any way influence the design or tuning of systems whose results are published on the RTE-4 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any processing that was performed specifically for the test set.

RESULT SUBMISSION FORMAT

Results will be submitted as one file per run. Each submitted file must be a plain ASCII file with one line for each T-H pair in the test set, in the following format:

pair_id judgment

Where:

  - pair_id is the identifier of the T-H pair, as given in the test set;
  - judgment is ENTAILMENT or NO ENTAILMENT for the two-way task, and ENTAILMENT, CONTRADICTION, or UNKNOWN for the three-way task.

If the run includes confidence ranking, then the pairs in the file should be ordered by decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely. Thus, in a ranked run, all the pairs classified as ENTAILMENT are expected to appear before all the pairs that are classified as NO ENTAILMENT (for the two-way task) or CONTRADICTION or UNKNOWN (for the three-way task).
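As an illustration of such a ranked run, the following sketch writes a submission file in the pair_id judgment layout described above, assuming the system produces (pair id, judgment, confidence) triples; the variable names, confidence values, and output file name are our own assumptions:

    # Hypothetical system output: (pair_id, judgment, confidence that T entails H).
    results = [
        ("1", "ENTAILMENT", 0.97),
        ("3", "CONTRADICTION", 0.08),
        ("2", "UNKNOWN", 0.41),
    ]

    # Order by decreasing entailment confidence, as required for ranked runs.
    results.sort(key=lambda r: r[2], reverse=True)

    with open("run1.txt", "w") as out:
        for pair_id, judgment, _ in results:
            out.write(f"{pair_id} {judgment}\n")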

EVALUATION MEASURES

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). For the two-way task, a judgment of "NO ENTAILMENT" in a submitted run is considered to match either "CONTRADICTION" or "UNKNOWN" in the Gold Standard. The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.

As a second measure, an Average Precision score will be computed for systems that provide as output a confidence-ranked list of all test examples. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence, in decreasing order from the most certain entailment to the least certain: the more confident the system is that T entails H, the higher the pair should be ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system's precision values at all points in the ranked list at which recall increases, that is, at all points in the ranked list for which the gold standard annotation is ENTAILMENT. More formally, it can be written as follows:

    AveP = (1/R) * Σ_{i=1}^{n} E(i) * (Σ_{j=1}^{i} E(j)) / i

where n is the number of pairs in the test set, R is the total number of ENTAILMENT pairs in the Gold Standard, E(i) is 1 if the i-th pair is marked as ENTAILMENT in the Gold Standard and 0 otherwise, and i ranges over the pairs in ranked order. As average precision is relevant only for a binary annotation, in the case of three-way judgment submissions the pairs tagged as CONTRADICTION and UNKNOWN will be conflated and re-tagged as NO ENTAILMENT.
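A sketch of both evaluation measures under the definitions above, assuming the gold labels are supplied in the system's ranked order (all names are our own):

    def accuracy(gold, predicted):
        # Fraction of pairs whose predicted judgment matches the gold one.
        # For two-way scoring, conflate gold CONTRADICTION/UNKNOWN to
        # NO ENTAILMENT first (see the sketch in RESULT SUBMISSION).
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    def average_precision(ranked_gold):
        # ranked_gold: gold labels of the pairs in the system's ranked
        # order, most confident entailment first. Precision is averaged
        # at every rank i whose gold label is ENTAILMENT.
        R = sum(g == "ENTAILMENT" for g in ranked_gold)
        correct, ap = 0, 0.0
        for i, g in enumerate(ranked_gold, start=1):
            if g == "ENTAILMENT":
                correct += 1
                ap += correct / i
        return ap / R if R else 0.0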

SYSTEM REPORTS

Participating teams are requested to write a paper for the TAC 2008 proceedings that describes how the submitted runs were produced. For more details see the TAC 2008 guidelines for participants' papers.

IMPORTANT DATES

Sept 8 Release of test data
Sept 15 Deadline for participants' submissions
Sept 18 Release of individual evaluated results
Sept 21 Deadline for submission of workshop presentation proposals
Oct 22 Deadline for submission of participants' notebook papers
