RTE-4 GUIDELINES FOR PARTICIPANTS

INTRODUCTION

The Recognizing Textual Entailment (RTE) challenge is an annual exercise that provides a framework for the evaluation of textual entailment systems and promotes international research in this area. In this evaluation exercise, systems must recognize whether one piece of text entails another.

We define textual entailment as a directional relationship between two text fragments, which we term the Text (T) and the Hypothesis (H). We say that T entails H if, typically, a human reading T would infer that H is most likely true.

Entailment is judged given assumed common background knowledge of the relevant domain, such as business news. If H is not entailed by T, there are two possibilities: either H contradicts T, or the truth of H cannot be determined on the basis of T. Guidelines for all three cases are given below.

TASK DESCRIPTION

Textual entailment recognition is the task of deciding, given a T-H pair, whether T entails H.

The three-way RTE task is to decide whether:

  - T entails H (ENTAILMENT);
  - T contradicts H (CONTRADICTION); or
  - the truth of H cannot be determined on the basis of T (UNKNOWN).

The two-way RTE task is to decide whether:

  - T entails H (ENTAILMENT); or
  - T does not entail H (NO ENTAILMENT).

When T entails H

Following are some guidelines for deciding whether a Text entails a Hypothesis:

  1. The Hypothesis must be fully entailed by the Text. The judgment cannot be ENTAILMENT if the Hypothesis includes parts that cannot be inferred from the Text.
  2. Entailment is a directional relation. The Hypothesis must be fully entailed by the given Text, but the Text need not be entailed by the Hypothesis. From that perspective, it might help to first read the Hypothesis and understand what it states, and only then read the Text and see if it has sufficient information to entail the Hypothesis.
  3. Verb tense issues must be ignored, as the Text and Hypothesis may originate from documents written at different points in time. For instance, the Hypothesis Yahoo bought Overture is ENTAILED even if the Text reads Yahoo will conclude the acquisition of Overture next week.
  4. Common knowledge -- such as "a company has a CEO", "a CEO is an employee of the company", "an employee is a person", etc. -- is presupposed.
  5. T is considered to entail H even if the entailment is just very probable rather than certain. For example, John purchased the book should entail John paid for the book, even though it is theoretically possible to buy something without paying for it. On the other hand, Mary criticized the proposal should NOT entail Mary rejected the proposal unless there is a strong reason to believe that Mary did indeed reject it.

When T does not entail H: CONTRADICTION VS UNKNOWN

Following are some guidelines for determining whether T contradicts H, or the truth of H is unknown based on T:

  1. H contradicts T if, taking the Hypothesis as reliable, its assertions directly refute portions of the Text or show them to be false/wrong.


  2. A Text-Hypothesis pair reporting contradictory statements is marked as a CONTRADICTION if the reports are stated as facts (these cases can be seen as embedded contradictions).

  3. For something to be a contradiction, it does not have to be impossible for the Text and Hypothesis to be reconcilable; it just has to appear highly unlikely in the absence of further evidence. For instance, it is reasonable to regard the pair T-H6 as a contradiction (it is not very plausible to be able to determine that someone's throat was cut if the bodies were not found for over 18 months), but it does not seem prudent to regard the pair T-H7 as contradictory (despite a certain similarity in the reports, they could refer to different events and could easily both be true).


  4. Noun Phrase Co-reference: compatible noun phrases in the Text and the Hypothesis should be treated as co-referent in the absence of clear countervailing evidence. For example, in the pair T-H8 it should be assumed that the two references to "a woman" refer to the same woman.


  5. Event Co-reference: if two descriptions appear to overlap, rather than being completely unrelated, it should by default be assumed that the two passages describe the same context, and contradiction is evaluated on this basis. For example, if the details make it clear that the same event is being described, but one passage says it happened in 1985 and the other 1987, or one passage says two people met in Greece and the other in Italy, then the pair should be regarded as a contradiction. For instance, in the pair T-H9 it seems reasonable to regard "a ferry collision" and "a ferry sinking" as the same event, and the claims about casualties as contradictory.


  6. In other circumstances, it may be most reasonable to regard the two passages as describing different events. For instance, example T-H7 above was not marked as a contradiction, as it does not seem compelling to regard "another suicide bomb blast" and "a car bomb explosion" as referring to the same event.

(Many other examples can be found at http://nlp.stanford.edu/RTE3-pilot/, where the links to three-way annotated datasets from previous campaigns are provided. The current guidelines for contradiction annotation are based on the guidelines by Marie-Catherine de Marneffe and Christopher Manning, used for the evaluation of the pilot task at RTE-3; for the original version see http://nlp.stanford.edu/RTE3-pilot/contradictions.pdf.)

TEST SET FORMAT

The dataset of Text-Hypothesis pairs is collected by human annotators and consists of four subsets which correspond to different application settings: Information Extraction (IE), Information Retrieval (IR), Question Answering (QA), and Multi-Document Summarization (SUM).

The dataset is formatted as an XML file, as follows:

<pair id="id_num" entailment="ENTAILMENT|CONTRADICTION|UNKNOWN" task="IE|IR|QA|SUM">
      <t>the text...</t>
      <h>the hypothesis...</h>
</pair>

Where:

  - id_num is a unique numerical identifier of the T-H pair;
  - entailment is the entailment judgment for the pair (ENTAILMENT, CONTRADICTION, or UNKNOWN);
  - task specifies the application setting from which the pair was generated (IE, IR, QA, or SUM).
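For illustration, the following is a minimal sketch of reading such a file with Python's standard library. The file name rte4_test.xml is our own placeholder, and the pairs are assumed to sit under a single root element:

    import xml.etree.ElementTree as ET

    # Parse the dataset (the file name is a placeholder).
    tree = ET.parse("rte4_test.xml")

    # Each <pair> element carries its id, task, and entailment judgment
    # as attributes; the text and hypothesis are child elements.
    for pair in tree.getroot().iter("pair"):
        pair_id = pair.get("id")
        task = pair.get("task")            # IE | IR | QA | SUM
        judgment = pair.get("entailment")  # may be absent in unannotated data
        text = pair.findtext("t")
        hypothesis = pair.findtext("h")
        print(pair_id, task, judgment)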

RESULT SUBMISSION

There are two tasks in this year's RTE challenge:

  - the three-way task, in which each pair is classified as ENTAILMENT, CONTRADICTION, or UNKNOWN;
  - the two-way task, in which each pair is classified as ENTAILMENT or NO ENTAILMENT.

Teams can participate in either or both tasks. No partial submissions are allowed, i.e. each submission must cover the whole dataset. Each team is allowed to submit up to 6 runs (up to 3 runs for each task), which allows teams that attempt both 3-way and 2-way classification to optimize/train separately for each task. Teams that participate in the 3-way task and do not have a separate strategy for the 2-way task (other than automatically conflating CONTRADICTION and UNKNOWN to NO ENTAILMENT) should not submit separate runs for the 2-way task, because runs for the 3-way task will automatically be scored for both the 3-way task and the 2-way task.

Each run may optionally rank all the T-H pairs in the test set according to their entailment confidence, in decreasing order from the most certain entailment to the least certain: the more confident the system is that T entails H, the higher the pair should be ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Because the evaluation measure for confidence ranking applies only to the 2-way classification task, in the case of three-way runs the pairs tagged as CONTRADICTION and UNKNOWN will be conflated and automatically re-tagged as NO ENTAILMENT for scoring purposes, as sketched below.
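To make this conflation rule concrete, a minimal sketch in Python (the function name is our own):

    def conflate_to_two_way(judgment):
        # CONTRADICTION and UNKNOWN both count as NO ENTAILMENT when
        # a three-way run is scored on the two-way task.
        return "ENTAILMENT" if judgment == "ENTAILMENT" else "NO ENTAILMENT"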

Runs will be submitted using a password-protected online submission form on the RTE web page. The link to the submission form will be posted at the same time that the test data set is released. Only teams who have registered for the TAC 2008 RTE track and who have submitted the required Agreement Concerning Dissemination of TAC Results may access the test data and submit runs.

At the time of submission, each team will be asked to fill out the form stating:

NB: Analyses of the test set (whether manual or automatic) must not in any way influence the design or tuning of systems whose results are published on the RTE-4 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any processing that was performed specifically for the test set.

RESULT SUBMISSION FORMAT

Results will be submitted as one file per run. Each submitted file must be a plain ASCII file with one line for each T-H pair in the test set, in the following format:

pair_id judgment

Where:

  - pair_id is the identifier of the T-H pair, as given in the test set;
  - judgment is ENTAILMENT or NO ENTAILMENT for the two-way task, and ENTAILMENT, CONTRADICTION, or UNKNOWN for the three-way task.

If the run includes confidence ranking, then the pairs in the file should be ordered by decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely. Thus, in a ranked run, all the pairs classified as ENTAILMENT are expected to appear before all the pairs that are classified as NO ENTAILMENT (for the two-way task) or CONTRADICTION or UNKNOWN (for the three-way task).
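As an illustration of such a ranked run, the following sketch writes a submission file in the pair_id judgment layout described above, assuming the system produces (pair id, judgment, confidence) triples; the variable names, confidence values, and output file name are our own assumptions:

    # Hypothetical system output: (pair_id, judgment, confidence that T entails H).
    results = [
        ("1", "ENTAILMENT", 0.97),
        ("3", "CONTRADICTION", 0.08),
        ("2", "UNKNOWN", 0.41),
    ]

    # Order by decreasing entailment confidence, as required for ranked runs.
    results.sort(key=lambda r: r[2], reverse=True)

    with open("run1.txt", "w") as out:
        for pair_id, judgment, _ in results:
            out.write(f"{pair_id} {judgment}\n")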

EVALUATION MEASURES

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). For the two-way task, a judgment of "NO ENTAILMENT" in a submitted run is considered to match either "CONTRADICTION" or "UNKNOWN" in the Gold Standard. The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.

As a second measure, an Average Precision score will be computed for systems that provide as output a confidence-ranked list of all test examples. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence, in decreasing order from the most certain entailment to the least certain: the more confident the system is that T entails H, the higher the pair should be ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system's precision values at all points in the ranked list at which recall increases, that is, at all points in the ranked list for which the gold standard annotation is ENTAILMENT. More formally, it can be written as follows:

    AveP = (1/R) * Σ_{i=1}^{n} E(i) * (Σ_{j=1}^{i} E(j)) / i

where n is the number of pairs in the test set, R is the total number of ENTAILMENT pairs in the Gold Standard, E(i) is 1 if the i-th pair is marked as ENTAILMENT in the Gold Standard and 0 otherwise, and i ranges over the pairs in ranked order. As average precision is relevant only for a binary annotation, in the case of three-way judgment submissions the pairs tagged as CONTRADICTION and UNKNOWN will be conflated and re-tagged as NO ENTAILMENT.
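A sketch of both evaluation measures under the definitions above, assuming the gold labels are supplied in the system's ranked order (all names are our own):

    def accuracy(gold, predicted):
        # Fraction of pairs whose predicted judgment matches the gold one.
        # For two-way scoring, conflate gold CONTRADICTION/UNKNOWN to
        # NO ENTAILMENT first (see the sketch in RESULT SUBMISSION).
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    def average_precision(ranked_gold):
        # ranked_gold: gold labels of the pairs in the system's ranked
        # order, most confident entailment first. Precision is averaged
        # at every rank i whose gold label is ENTAILMENT.
        R = sum(g == "ENTAILMENT" for g in ranked_gold)
        correct, ap = 0, 0.0
        for i, g in enumerate(ranked_gold, start=1):
            if g == "ENTAILMENT":
                correct += 1
                ap += correct / i
        return ap / R if R else 0.0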

SYSTEM REPORTS

Participating teams are requested to write a paper for the TAC 2008 proceedings that describes how the submitted runs were produced. For more details see the TAC 2008 guidelines for participants' papers.

IMPORTANT DATES

Sept 8 Release of test data
Sept 15 Deadline for participants' submissions
Sept 18 Release of individual evaluated results
Sept 21 Deadline for submission of workshop presentation proposals
Oct 22 Deadline for submission of participants' notebook papers
