Task Definition

The PASCAL Challenge introduces textual entailment as a generic evaluation framework for "practical" semantic inference in Natural Language Processing, Information Retrieval and Machine Learning applications.

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text can be inferred (is entailed) from the other. More concretely, textual entailment is defined as a directional relationship between pairs of text expressions, denoted by T - the entailing "Text", and H - the entailed "Hypothesis". We say that T entails H, denoted by T ⇒ H, if the meaning of H can be inferred from the meaning of T, as it would typically be interpreted by people. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. It is similar in spirit to the evaluation of applied tasks such as Question Answering, in which humans need to judge whether the correct answer can be inferred from a given retrieved text.

This generic task captures a range of inferences that are relevant for multiple applications. The challenge dataset includes Text-Hypothesis pairs that correspond to typical success and failure settings of specific applications, as detailed below, and require different levels of entailment reasoning, such as lexical, syntactic, morphological and logical inference.

Dataset Collection and Application Settings

The dataset of Text-Hypothesis pairs was collected by human annotators. It consists of seven subsets, which correspond to typical success and failure settings in different applications (as listed below). Within each application setting the annotators selected both positive entailment examples (judgment TRUE), where T does entail H, and negative examples (FALSE), where entailment does not hold, with a roughly 50%-50% split. Some T-H examples appear in Table 1 below; please consult the development data for more examples.

Information Retrieval (IR):

Annotators generated hypotheses that may correspond to meaningful IR queries expressing some concrete semantic relation (typically longer and more specific than a standard keyword query, thus representing a semantic-oriented variation within IR). The hypotheses were selected by examining prominent sentences in news stories and were then submitted to a web search engine. Candidate texts (T) were selected from the retrieved documents, picking both texts that do and texts that do not entail the hypothesis.

Comparable Documents (CD):

Annotators identified T-H pairs by examining comparable news articles that cover a common story and identifying "aligned" sentence pairs that share some lexical overlap but for which semantic entailment may or may not hold. (Matching sentences by lexical overlap is a common technique for processing comparable documents, used, for example, in applications such as multi-document summarization and paraphrase extraction.)

Reading Comprehension (RC):

This task corresponds to a typical reading comprehension exercise in language teaching, where students are asked to judge whether a particular assertion can be inferred from a given text story. The challenge annotators were asked to create such hypotheses for sentences taken from news documents, and were instructed to have in mind a reading comprehension test for high school students.

Question Answering (QA):

Using a newspaper-based corpus built for QA experiments, annotators selected questions and turned them into affirmative sentences with the correct answer "plugged in". These affirmative sentences serve as the hypotheses (H). The annotators then chose relevant text snippets (T) suspected to contain the correct answer, producing entailment pairs. For example, given the question "Who is Ariel Sharon?" and a candidate answer text "Israel's Prime Minister, Ariel Sharon, visited Prague" (T), the question is turned into the statement "Ariel Sharon is the Israeli Prime Minister" (H), producing a TRUE entailment pair.

Information Extraction (IE):

This task is inspired by the Information Extraction application, adapting the setting to pairs of texts rather than a text and a structured template.

Given a set of IE relations of interest (e.g., a management succession event), annotators identified candidate news story sentences in which the relation might (or might not) hold; these serve as the texts (T). As the hypothesis (H) they created a common natural language formulation of the IE relation, which is assumed to be easy for an IE system to identify. For example, given the text "Guerrillas killed a peasant in the city of Flores." (T) and the information extraction task of identifying killed civilians, the hypothesis "Guerrillas killed a civilian" is created, producing a TRUE entailment pair.

Machine Translation (MT):

Two translations of the same text, an automatic translation and a gold standard human translation, were compared and modified in order to obtain T-H pairs, where a correct translation corresponds to TRUE entailment. Automatic translations were sometimes adjusted grammatically, as they would otherwise have been unacceptable.

Paraphrase Acquisition (PP):

Similar meanings can be expressed in quite different ways, varying not only in their lexical items but also in the syntactic structure of the expressions. Paraphrase acquisition systems attempt to acquire pairs (or sets) of expressions that paraphrase each other. Annotators examined candidate pairs of paraphrase expressions produced by an automatic paraphrase acquisition system and collected pairs of similar T-H sentences in which one sentence contains one expression and the other contains its paraphrase.

Table 1: Example T-H pairs
ID | TEXT | HYPOTHESIS | TASK | ENTAILMENT
1 | iTunes software has seen strong sales in Europe. | Strong sales for iTunes in Europe. | IR | TRUE
2 | Cavern Club sessions paid the Beatles £15 evenings and £5 lunchtime. | The Beatles perform at Cavern Club at lunchtime. | IR | TRUE
3 | American Airlines began laying off hundreds of flight attendants on Tuesday, after a federal judge turned aside a union's bid to block the job losses. | American Airlines will recall hundreds of flight attendants as it steps up the number of flights it operates. | PP | FALSE
4 | The two suspects belong to the 30th Street gang, which became embroiled in one of the most notorious recent crimes in Mexico: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others. | Cardinal Juan Jesus Posadas Ocampo died in 1993. | QA | TRUE

Annotators were instructed to replace anaphors with the appropriate reference from preceding sentences where applicable, and, where needed, to shorten the sentences so as to keep the texts and hypotheses relatively short. Each T-H pair was first judged (as TRUE/FALSE) by the annotator who created it. The examples were then cross-evaluated by a second judge, who received only the text and hypothesis, without the original additional context. Pairs on which the judges disagreed were discarded, and the agreed judgments for the remaining examples constitute the gold standard for the evaluation.

Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):

Data Sets and Format

Both Development and Test sets are formatted as XML files. The template will be as follows:

<pair id="id_num" task="task_acronym" value="TRUE|FALSE">

   <t> the text... </t>

   <h> the hypothesis... </h>

</pair>

Where:

- id_num is the unique identifier of the T-H pair (the pair_id used when submitting results);
- task_acronym is the acronym of the application setting from which the pair was collected (IR, CD, RC, QA, IE, MT or PP);
- value is the gold standard entailment judgment (TRUE or FALSE).
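
To make the format concrete, the following sketch (in Python, using the standard xml.etree.ElementTree module) reads pairs in this format. The enclosing root element name (corpus) is an assumption for illustration only, since only the pair template is specified here; the example pair is taken from Table 1.

    # Sketch: reading T-H pairs from a dataset file in the format above.
    # Assumption: the <pair> elements are wrapped in a root element,
    # called <corpus> here purely for illustration.
    import xml.etree.ElementTree as ET

    sample = """<corpus>
    <pair id="1" task="IR" value="TRUE">
       <t> iTunes software has seen strong sales in Europe. </t>
       <h> Strong sales for iTunes in Europe. </h>
    </pair>
    </corpus>"""

    root = ET.fromstring(sample)
    for pair in root.findall("pair"):
        pair_id = pair.get("id")
        task = pair.get("task")          # application setting acronym
        value = pair.get("value")        # gold judgment (TRUE/FALSE)
        text = pair.find("t").text.strip()
        hypothesis = pair.find("h").text.strip()
        print(pair_id, task, value, "|", text, "=>", hypothesis)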

The data is split into a development set and a test set, to be released separately. The goal of the development set is to guide the development and tuning of participating systems. Notice that, since the task is unsupervised in nature, the development set is not expected to serve as a main resource for supervised training, given its limited (anecdotal) coverage. Rather, it is assumed that systems will use generic techniques and resources suitable for the news domain.

Submission

Systems should tag each T-H pair as either TRUE, predicting that entailment does hold for the pair, or as FALSE otherwise. Results will be submitted in a file with one line for each T-H pair in the test set, in the following format:

    pair_id<blank space>judgment<blank space>confidence_score

where:

- pair_id is the identifier of the T-H pair, as it appears in the test set;
- judgment is either TRUE or FALSE;
- confidence_score is a real number expressing the system's confidence in the judgment, with higher values indicating greater certainty (used only for the Confidence-Weighted Score described below).

The first lines of a run may look like this:

1 TRUE 0.348
2 FALSE 0.221
3 FALSE 0.873
4 TRUE 1
5 FALSE 0.003

Participating teams will be allowed to submit results of up to 2 systems. The corresponding result files should be named run1.txt (and run2.txt for a second submitted run).
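
As an illustration, here is a minimal sketch of writing a run file in the required format; the predictions list is hypothetical and stands in for a system's actual output.

    # Sketch: write hypothetical system output to run1.txt in the required format.
    predictions = [
        (1, "TRUE", 0.348),
        (2, "FALSE", 0.221),
        (3, "FALSE", 0.873),
    ]

    with open("run1.txt", "w") as out:
        for pair_id, judgment, confidence in predictions:
            # one line per T-H pair: pair_id, judgment and confidence_score,
            # separated by single blank spaces
            out.write(f"{pair_id} {judgment} {confidence}\n")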

The results files should be zipped and submitted via the submit form.

Systems should be developed based on the development data set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. If absolutely necessary, participants are asked to report any other changes made to their systems after downloading the test set.

Partial Coverage Submissions

In order to encourage systems and methods which do not cover all phenomena present in the test examples, we allow submission of partial coverage results, covering only part of the test examples. Any run that does not include judgments for all test examples will be considered a partial submission and will be evaluated according to its coverage (see next section). Naturally, the decision as to which examples the system abstains on must be made automatically by the system, with no manual involvement. Participants who provide partial coverage results are asked to include in their report a description and analysis of the types of examples their system covers, versus the types of examples that are not addressed.

Evaluation Measures

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.

As a second measure, a Confidence-Weighted Score (CWS, also known as Average Precision) will be computed. The judgments of the test examples are sorted by their confidence, in decreasing order from the most certain to the least certain, and the following measure is calculated:

    CWS = (1/n) * sum over i=1..n of (number of correct judgments among the first i pairs) / i

where n is the number of pairs in the test set and i ranges over the sorted pairs.

The Confidence-Weighted Score ranges between 0 (no correct judgments at all) and 1 (perfect score), and rewards the systems' ability to assign a higher confidence score to the correct judgments than to the wrong ones. This score will not be computed for systems not supplying confidence values.
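
For clarity, here is a sketch of how accuracy and CWS could be computed for a run against the gold standard; the two dictionaries are hypothetical stand-ins for the submitted judgments and the gold annotations.

    # Sketch: accuracy and Confidence-Weighted Score (CWS) for a run.
    # run:  {pair_id: (judgment, confidence)}  -- hypothetical system output
    # gold: {pair_id: judgment}                -- hypothetical gold standard
    def evaluate(run, gold):
        accuracy = sum(run[pid][0] == gold[pid] for pid in run) / len(run)

        # sort pairs by decreasing confidence, then average the precision
        # of the top-i judgments over all ranks i = 1..n
        ranked = sorted(run, key=lambda pid: run[pid][1], reverse=True)
        correct_so_far = 0
        cws = 0.0
        for i, pid in enumerate(ranked, start=1):
            if run[pid][0] == gold[pid]:
                correct_so_far += 1
            cws += correct_so_far / i
        cws /= len(ranked)
        return accuracy, cws

    run = {1: ("TRUE", 0.348), 2: ("FALSE", 0.221), 3: ("FALSE", 0.873)}
    gold = {1: "TRUE", 2: "TRUE", 3: "FALSE"}
    print(evaluate(run, gold))   # accuracy = 2/3, CWS ~ 0.889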

Partial Coverage Submission Evaluation

Partial coverage submissions will be evaluated separately. Accuracy and CWS (as described above), as well as coverage (number of examples in run / total number of examples), will be reported for each partial run. To visualize the relative performance of different systems we will plot the position of the various runs on accuracy/coverage and average precision/coverage graphs.
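
As a small sketch under the same assumptions as above, coverage for a partial run could be computed as:

    # Sketch: coverage of a partial run = judged examples / total test examples.
    def coverage(run, all_test_ids):
        return len(run) / len(all_test_ids)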

Final Notes

The goal of the challenge is to provide a first opportunity for presenting and comparing possible approaches for textual entailment recognition, aiming at an explorative rather than a competitive setting. Therefore, even though system results will be reported, there will not be an official ranking of systems.

The challenge is expected to be difficult, and obtaining relatively low results will not be surprising.

Nevertheless, it provides a benchmark for a novel task and will supply meaningful baselines and analyses of the performance of current systems. Participants are also encouraged to submit results of simple baseline techniques, either as a second or a sole run, in order to provide additional insight into the problem.

The setting of this challenge is somewhat biased, as we specifically chose non-trivial pairs for which some inference is needed, and also imposed a balance of TRUE and FALSE examples. For this reason, system performance in applicative settings might be higher than the figures for the challenge data, due to a more favourable distribution of examples in real applications.

Finally, the task definition and evaluation methodologies are clearly not mature yet. We expect them to change over time and hope that participants' contributions, observations and comments will help shape this evolving research direction.

Acknowledgements

The following sources were used in the preparation of the data:

We would like to thank the people and organizations that made these sources available for the challenge.

We'd also like to acknowledge the people (at ITC-Irst and Bar Ilan University) involved in creating and annotating the data: Danilo Giampiccolo, Tracy Kelly, Einat Barnoy, Alessandro Vallin, Ruthie Mandel, and Melanie Joseph.