Recognising Textual Entailment Challenges

Textual Entailment Recognition was recently proposed as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and (multi-)document summarization. The task requires recognising, given two text fragments, whether the meaning of one text is entailed (can be inferred) from the other.

The First Recognising Textual Entailment Challenge (RTE-1)

The first PASCAL Recognising Textual Entailment Challenge (15 June 2004 – 10 April 2005) provided the first benchmark for the entailment task. The challenge attracted considerable attention in the research community, drawing 17 submissions from research groups worldwide. The relatively low accuracy achieved by the participating systems suggests that the entailment task is indeed a challenging one, with wide room for improvement.

Challenge citation: Please use the following citation when referring to the RTE challenge:
Ido Dagan, Oren Glickman and Bernardo Magnini. The PASCAL Recognising Textual Entailment Challenge. In Quiñonero-Candela, J.; Dagan, I.; Magnini, B.; d’Alché-Buc, F. (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer, 2006.

Motivation

Recent years have seen a surge in research on text processing applications that perform semantically oriented inference about concrete text meanings and their relationships. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application-oriented manner. Consequently, it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL Challenge introduces textual entailment as a common task and evaluation framework for Natural Language Processing, Information Retrieval, and Machine Learning researchers, covering a broad range of semantically oriented inferences needed for practical applications. The task is therefore suitable for evaluating and comparing semantically oriented models in a generic manner. Eventually, work on textual entailment may promote the development of generic semantic “engines”, which would play a role analogous to that of generic syntactic analyzers across multiple applications.

Textual Entailment

Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from the other. This task generically captures a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question “Who killed Kennedy?”, the text “the assassination of Kennedy by Oswald” entails the expected answer form “Oswald killed Kennedy”. Similarly, in Information Retrieval (IR) the concept denoted by a query expression should be entailed from relevant retrieved documents. In multi-document summarization, a redundant sentence or expression to be omitted from the summary should be entailed from other expressions in the summary. In Information Extraction (IE), entailment holds between different text variants that express the same target relation. And in Machine Translation evaluation, a correct translation should be semantically equivalent to the gold-standard translation, so the two translations have to entail each other. Thus, in a similar spirit to Word Sense Disambiguation and Named Entity Recognition, which are recognized as generic tasks, modeling textual entailment may consolidate and promote broad research on applied semantic inference.

Task Definition

Participants in the evaluation exercise will be provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. The data set will include over 1000 English T-H pairs from the news domain (political, economical, etc.). Examples will be manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set (one third of the data) and a Test Set (two thirds of the data). Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and results will be compared to the manual gold standard.
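For concreteness, a submission can be thought of as one TRUE/FALSE decision per pair, scored by plain accuracy against the gold annotations. The Python sketch below illustrates this scoring; the Pair structure and field names are our own illustration, not the official data release format.

    # Sketch of scoring a system's entailment decisions against the
    # manually tagged gold standard. The Pair structure is illustrative;
    # it is not the official RTE distribution format.
    from dataclasses import dataclass

    @dataclass
    class Pair:
        text: str        # the Text (T)
        hypothesis: str  # the Hypothesis (H)
        gold: bool       # human annotation: does T entail H?

    def accuracy(pairs: list[Pair], decisions: list[bool]) -> float:
        """Fraction of T-H pairs on which the system's TRUE/FALSE
        decision matches the gold annotation."""
        hits = sum(p.gold == d for p, d in zip(pairs, decisions))
        return hits / len(pairs)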

The dataset will be collected with respect to different text processing applications, such as question answering, information extraction, information retrieval, multi-document summarization, paraphrase acquisition, and machine translation. Each portion of the dataset will include typical T-H examples that correspond to success and failure cases of actual applications. The examples will represent different levels of entailment reasoning, such as lexical, syntactic, morphological, and logical inference.

The goal of the challenge is to provide a first opportunity for presenting and comparing possible approaches for modeling textual entailment. In this spirit, we aim at an explorative rather than a competitive setting. While participants' results will be reported, there will not be an official ranking of systems. A development set will be released first to give an early impression of the different types of test examples. The test set will be released two months prior to the result submission date; reported systems are, of course, expected to be generic in nature. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs present in the test data, as long as the methodology is general and the cost of running the learning/acquisition procedure at full scale can be reasonably estimated.

Examples

TEXT: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
HYPOTHESIS: Yahoo bought Overture.
ENTAILMENT: TRUE

TEXT: Microsoft’s rival Sun Microsystems Inc. bought Star Office last month and plans to boost its development as a Web-based device running over the Net on personal computers and Internet appliances.
HYPOTHESIS: Microsoft bought Star Office.
ENTAILMENT: FALSE

TEXT: The National Institute for Psychobiology in Israel was established in May 1971 as the Israel Center for Psychobiology by Prof. Joel.
HYPOTHESIS: Israel was established in May 1971.
ENTAILMENT: FALSE

TEXT: Since its formation in 1948, Israel fought many wars with neighboring Arab countries.
HYPOTHESIS: Israel was established in 1948.
ENTAILMENT: TRUE

TEXT: Putting hoods over prisoners’ heads was also now banned, he said.
HYPOTHESIS: Hoods will no longer be used to blindfold Iraqi prisoners.
ENTAILMENT: TRUE

TEXT: The market value of U.S. overseas assets exceeds their book value.
HYPOTHESIS: The market value of U.S. overseas assets equals their book value.
ENTAILMENT: FALSE
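To make the decision task concrete, here is a deliberately naive word-overlap baseline (our own illustrative sketch, not a method proposed by the challenge): predict TRUE when most of the hypothesis' words also occur in the text. The second example above shows why such shallow matching fails.

    # Naive word-overlap baseline: predict TRUE when a large fraction of
    # the hypothesis' words also occur in the text. The 0.75 threshold
    # and the tokenisation are illustrative choices only.
    import re

    def tokens(s):
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    def entails(text, hypothesis, threshold=0.75):
        h = tokens(hypothesis)
        return len(h & tokens(text)) / len(h) >= threshold

    # The baseline wrongly answers TRUE here: every hypothesis word
    # appears in the text, yet it was Sun, not Microsoft, that bought
    # Star Office (gold answer: FALSE).
    print(entails(
        "Microsoft's rival Sun Microsystems Inc. bought Star Office last "
        "month and plans to boost its development as a Web-based device "
        "running over the Net on personal computers and Internet appliances.",
        "Microsoft bought Star Office."))  # -> True

Cases like this, where surface overlap contradicts the gold annotation, illustrate the levels of reasoning beyond lexical matching that the task is meant to probe.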

For registration, further information, and inquiries, contact Oren Glickman <glikmao@cs.biu.ac.il>.

Organizing Committee

Ido Dagan (Bar Ilan University, Israel)

Oren Glickman (Bar Ilan University, Israel)

Bernardo Magnini (ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, Italy)