Introduction

The goal of the proposed challenge is to assess the current state of Machine Learning (ML) algorithms for Information Extraction (IE) from documents, to identify future challenges and to foster additional research in the field. The aim is to:

  1. Define a methodology for fair comparison of ML algorithms for IE.
  2. Define a publicly available evaluation resource that will persist and be used beyond the lifetime of the challenge; the framework will be ML-oriented, rather than IE-oriented as in other similar evaluations proposed so far.
  3. Perform actual tests of different algorithms in controlled situations, so as to understand what works and what does not, and thereby identify new future challenges.

The results of the challenge will be discussed in a workshop at the end of the evaluation. Moreover, they will provide material for discussion at the Dagstuhl workshop on Learning for the Semantic Web that some of the proposers are organizing for February 2005. The goal of that workshop is to discuss strategies and challenges for Machine Learning for the Semantic Web; drafting a white paper for future research is among its objectives, and document annotation via IE is one of its topics. We are currently seeking sponsorship from Pascal for the event. We believe that this workshop will be a good venue for presenting the results of a Pascal activity.

The proposed challenge will be partly sponsored by the European Project IST Dot.Kom (www.dot-kom.org). Dot.Kom will fund the definition of the task, the annotation of the corpora and the implementation of the evaluation server. We seek reimbursement from Pascal for the parts that cannot legally be claimed from Dot.Kom because they cover activities specific to Pascal (mainly one person-month for running the evaluation and preparing the workshop).

IE, ML and Evaluation

Evaluation has a long history in Information Extraction (IE), mainly thanks to the MUC conferences, where most of the IE evaluation methodology (as well as most of the IE methodology as a whole) was developed (Hirschman 1998). In particular, the DARPA/MUC evaluations produced and made available annotated corpora that have been used as standard testbeds. More recently, a variety of other corpora have been shared by the research community, such as Califf's job postings collection (Califf 1998) and Freitag's seminar announcements, corporate acquisitions and university Web page collections (Freitag 1998). Such corpora are available in the RISE repository, which contains a number of disparate corpora without any specific common aim, since they were mainly defined by independent researchers for evaluating their own systems and then made available to others. Most of them are devoted to implicit relation extraction, i.e. the task mainly defined by the wrapper induction community, requiring the identification of implicit events and relations. For example, (Freitag 1998) defines the task of extracting speaker, start-time, end-time and location from a set of seminar announcements. No explicit mention of the event (the seminar) is made in the annotation. Implicit event extraction is simpler than full event extraction, but it has important applications whenever either there is just one event per text or it is easy to devise extraction strategies for recognizing the event structure from the document (Ciravegna and Lavelli 2004).

However, the definition of an evaluation methodology and the availability of standard annotated corpora do not guarantee that experiments performed with the different approaches and algorithms proposed in the literature can be reliably compared. Some of the problems are common to other NLP tasks (e.g., see (Daelemans et al., 2003)): the difficulty of exactly identifying the effects on performance of the data used (sample selection and sample size), of the information sources used (the features selected), and of the algorithm parameter settings.

One issue specific to IE evaluation is how leniently to assess inexact identification of filler boundaries. (Freitag 1998) proposes three different criteria for matching reference instances and extracted instances: exact, overlap and contains. Another question concerns the possibility of multiple fillers for a slot and how they are counted; this issue is often left implicit in papers. Finally, because of the complexity of the task, the limited availability of tools, and the difficulty of re-implementing published algorithms (usually quite complex and sometimes not fully described in papers), in IE there are very few comparative articles in the sense mentioned in (Daelemans et al., 2003). Most papers simply present the results of the newly proposed approach and compare them with the results reported in previous articles, rarely with any detailed analysis to ensure that the same methodology is used across the different experiments.
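
To make the boundary-matching question concrete, the following minimal Python sketch illustrates one common reading of the three criteria; the (start, end) character offsets and function names are ours and purely illustrative.

    # Illustrative reading of the three boundary-matching criteria (Freitag 1998).
    # Spans are hypothetical (start, end) character offsets.

    def exact_match(reference, extracted):
        """The extracted filler's boundaries coincide exactly with the reference."""
        return extracted == reference

    def contains_match(reference, extracted):
        """The extracted filler fully covers the reference filler."""
        return extracted[0] <= reference[0] and extracted[1] >= reference[1]

    def overlap_match(reference, extracted):
        """The extracted filler shares at least one character with the reference."""
        return extracted[0] < reference[1] and extracted[1] > reference[0]

    # A reference filler at offsets (10, 17) vs. an extraction that also captured
    # a preceding word, at offsets (7, 17):
    ref, ext = (10, 17), (7, 17)
    print(exact_match(ref, ext), contains_match(ref, ext), overlap_match(ref, ext))
    # -> False True True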

In this challenge, we propose a framework for evaluating algorithms for implicit relation extraction in which the details of the evaluation are set explicitly. The methodology will be applied to test different algorithms on a specific task (a public call will be issued to invite international research groups to participate in the evaluation), but it will be applicable to future evaluation tasks as well.

Expected Outcome

The outcome of the challenge will be:

  • A comparative evaluation of a number of state-of-the-art algorithms for ML-based IE;
  • An evaluation strategy applicable to future experiments after the end of the challenge and to other tasks;
  • An implemented evaluation framework for the field, including a scorer.

Difficulty that challengers will address:

Focus of challenge

The task will simulate human centred document annotation for Knowledge Management or the Semantic Web as found in tools like MnM (Vargas Vera et al. 2002), Melita (Ciravegna et al. 2002) and Ontomat (Handschuh et al. 2002). In these tools, the role of IE is to learn on-line from the user's annotations and present every new document with suggestions derived by generalizing from the previous annotations and/or analysing the unannotated part. In many applications, when the IE system has learnt enough, the user exits the annotation loop and automatic annotation is provided for the new documents.

Crucial points for adaptive IE in this kind of environment are the noisy and limited training material. The annotated material is noisy because humans are imperfect: annotation tends to be a tiring and therefore error-prone process. The annotated material is generally of limited size because in most cases users cannot, or are not willing to, provide more than a thousand annotations at most. In addition, users generally require that the IE system start learning very quickly and that suggestions be frequent and reliable (effectiveness and low intrusivity (Ciravegna et al. 2002)). It is therefore important to study adaptive algorithms in order to understand their behaviour when learning with limited amounts of training material before using them in such an annotation environment. The ability to learn quickly and reliably is one of the aspects that we will evaluate, together with the ability to perform automatic annotation after a reasonable amount of training.

The requirements of the human centred annotation mentioned above are common to other application areas, and therefore the evaluation will be representative of the applicability of ML-based IE to a number of other tasks.

The focus of the challenge will be on documents that in our experience are quite common in real-world applications, but that have so far been neglected by other evaluations of IE systems: semi-structured Web documents (web pages, emails, etc.), where information is conveyed through two channels: language and formatting. These documents are semi-structured in the sense that they tend to contain free text, but the use of formatting can make sentences choppy and semi-structured. An example is the main page of cnn.com, where there is a central panel containing free text and a number of side panels containing lists (up to a couple of words) and titles (choppy sentences of up to 10 words). In addition, much regularity can be found, especially in formatting, but such regularities are not as rigid as in pages produced from databases (as in cnn.com). For example, a personal bibliographic page such as www.dcs.shef.ac.uki/~fabio/cira-papers.html is highly structured internally, but its structure is different from that of any other personal bibliographic page. Examples of semi-structured documents are seminar announcements, job postings, conference announcement web pages, etc.

The characteristics above make this task completely different from other analogous tasks for a number of reasons:

  • Use of semi-structured texts: previous evaluations such as CoNLL and MUC have focused on newspaper documents. Semi-structured documents are a class of documents with completely different characteristics, and they are very important from the application point of view; as mentioned, there is currently no established evaluation methodology for this kind of document (Lavelli et al. 2004);
  • Evaluation of implicit relations: this task has never been addressed in any competitive evaluation, although it is an area with great application potential: some tools already use this approach in commercial applications (Ciravegna and Lavelli 2004), so an assessment of the IE capabilities of ML algorithms is quite timely;
  • Machine learning oriented study:
    • We will study how different algorithms behave in different situations, with different amounts of training material available, in order to simulate a wide variety of application situations. In addition to the classic task of extracting information given separate training and test sets, we will also focus on more ML-oriented evaluations, such as tracing the learning curve as the training material is increased progressively. We will also investigate the behaviour of algorithms able to exploit unannotated material. We believe that this kind of evaluation is much more interesting for the ML community than the standard IE scenario;
    • Most of the evaluations carried out so far took an IE perspective. For example, all the MUC tasks except Named Entity Recognition (NER) required the composition of a number of subtasks such as NER, coreference resolution and template filling. This composition risks obscuring the contribution of the ML algorithms to the IE task.

In our evaluation, we intend to assess ML algorithms on implicit relation recognition, a task that is more complex than generic NER but that allows the contribution of the ML algorithms to be stated clearly.

Describe the expected impact that the challenge could have on research in PASCAL fields?

The goal of the proposed challenge is to assess the current state of Machine Learning (ML) algorithms for Information Extraction (IE) from documents, to identify future challenges and to foster additional research in the field. The aim is to:

  1. Define a methodology for fair comparison of ML algorithms for IE.
  2. Define a publicly available evaluation resource that will persist and be used beyond the lifetime of the challenge; the framework will be ML-oriented, rather than IE-oriented as in other similar evaluations proposed so far.
  3. Perform actual tests of different algorithms in controlled situations, so as to understand what works and what does not, and thereby identify new future challenges.

The results of the challenge will be discussed in a workshop at the end of the evaluation. Moreover, they will provide material for discussion at the Dagstuhl workshop on Learning for the Semantic Web that some of the proposers are organizing for February 2005. The goal of that workshop is to discuss strategies and challenges for Machine Learning for the Semantic Web; drafting a white paper for future research is among its objectives, and document annotation via IE is one of its topics. We are currently seeking sponsorship from Pascal for the event. We believe that this workshop will be a good venue for presenting the results of a Pascal activity.

Number of teams which would be interested in participating in the challenge

In recent years, more than 30 international groups have presented results of ML algorithms tested on some of the currently available corpora for implicit relation extraction, such as the CMU Seminar Announcements (Freitag 1998), the Job Postings (Califf 1998), the Academic Web Pages (Freitag 1998) and the corpora for wrapper induction proposed so far (e.g. the Zagat corpus).

Groups include: the University of Sheffield (UK), University College Dublin (IRL), ITC-Irst (I), Roma Tor Vergata (I), the National Centre for Scientific Research Demokritos (G), University of Antwerp (B), Carnegie Mellon University (USA), University of Washington (USA), University of Utah (USA), Cornell University (USA), University of Illinois UC (USA), the University of Texas at Austin (USA), USC-Information Sciences Institute (USA) and Stanford University (USA). Some of them have already been contacted and have expressed their willingness to participate in this challenge. We believe that this challenge will have a high international profile and will be an excellent vehicle for reaching a quite large community.

How long will it take to collect and preprocess any associated datasets?

The collection of the corpus for the task will be done by querying Google on the Web (less than 1 day). Three annotators will manually annotate the 600 documents (less than 1 week). Preprocessing via GATE will require less than a day. Preparation of the evaluation server will require about two weeks of person time. Overall, we expect the preparation of the task to require about one month of elapsed time and about 8 person-weeks of effort.

On what date will these associated datasets be available?

End of June 2004

Timescale for organizing the challenge:

  • June 2004: availability of the formal definition of the task and of the annotated corpus, together with the evaluation server;
  • October 2004: formal evaluation;
  • November 2004: workshop;
  • February 2005: Dagstuhl workshop on Learning for the Semantic Web.

How will the results of the challenge be evaluated?

We will collect a corpus of 1,100 conference workshop calls for papers (CFPs) from the Web; 600 will be annotated and 500 will be left unannotated. Workshops from a variety of fields will be sampled, e.g. Computer Science, Biomedicine and Psychology; however, due to their prevalence on the Web, the majority of the documents are likely to be Computer Science based. The exact task will be defined during the preparation phase, but we expect to require extraction of the following fields (an illustrative record schema is sketched after the list):

  • Name of Workshop
  • Acronym of Workshop
  • Date of Workshop
  • Location of Workshop
  • Name of Conference
  • Date of Conference
  • Homepage of Conference
  • Location of Conference
  • Registration Date of Workshop
  • Submission Date of Workshop
  • Notification Date of Workshop
  • Camera Ready Copy Date of Workshop
  • Programme Chair/Co-chairs of Workshop (name plus affiliation)
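
Purely as an illustration of the target structure, a single annotated CFP might be represented as the following Python dictionary; the field names, placeholder values and the idea of one record per document are our assumptions, not the final task definition.

    # Hypothetical annotation schema for one CFP document (illustrative only).
    cfp_annotation = {
        "workshop_name": "<string>",
        "workshop_acronym": "<string>",
        "workshop_date": "<date string>",
        "workshop_location": "<string>",
        "conference_name": "<string>",
        "conference_date": "<date string>",
        "conference_homepage": "<URL>",
        "conference_location": "<string>",
        "workshop_registration_date": "<date string>",
        "workshop_submission_date": "<date string>",
        "workshop_notification_date": "<date string>",
        "workshop_camera_ready_date": "<date string>",
        "workshop_programme_chairs": ["<name, affiliation>"],  # possibly multiple fillers
    }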

In the preparation phase, we will define the exact experimental setup (both the numerical proportions between the training and test sets and the procedure adopted to select the documents). The experimental setup described in the following is representative of the direction of the work, but further discussion is still needed. We will also specify all of the following: (1) the set of fields to extract, (2) the legal number of fillers for each field, (3) the possibility of multiple varying occurrences of any particular filler and (4) how stringently matches are evaluated (exact, overlap or contains).

We will define and implement an evaluation server for preliminary testing and for testing the final results. This server will be based on the MUC scorer (Douthat 1998). We will define the exact matching strategies by providing the configuration file for each of the tasks selected. Finally, we will set up a public location where people will be able to deposit new corpora and expected results, together with the guidelines to be strictly followed for the evaluation. This will guarantee a reliable comparison of the performance of different algorithms even after the PASCAL competition is over, and it will enable further fair evaluation settings.
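
To give a flavour of the per-field figures (precision, recall and F-measure) that the server will report, here is a deliberately simplified sketch; it is a stand-in for the scorer, not a re-implementation of the MUC software, and it omits the scorer's alignment machinery. The match predicate is whichever boundary criterion (exact, overlap or contains) the task configuration fixes.

    # Simplified per-field scoring sketch (not the MUC scorer itself).
    # reference_fillers / extracted_fillers are lists of (start, end) spans for one
    # field over a document set; `match` is a boundary criterion such as exact_match.

    def score_field(reference_fillers, extracted_fillers, match):
        """Return (precision, recall, F1) for one field."""
        matched_ext = sum(1 for e in extracted_fillers
                          if any(match(r, e) for r in reference_fillers))
        matched_ref = sum(1 for r in reference_fillers
                          if any(match(r, e) for e in extracted_fillers))
        precision = matched_ext / len(extracted_fillers) if extracted_fillers else 0.0
        recall = matched_ref / len(reference_fillers) if reference_fillers else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1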

Corpora will be annotated using Melita (Ciravegna et al. 2002), an existing tool already in use in scientific and commercial evaluations. Inter-annotator agreement will be guaranteed by a procedure in which three annotators will be given overlapping sets of the 600 documents to annotate. Discrepancies in the annotations (computed automatically by a program) will be discussed among the annotators. Annotation will be performed in stages (e.g. 30, 100, 300, 600 documents), with discussion of strategies and discrepancies after every stage.
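
The automatic computation of discrepancies could look like the following sketch, where each annotator's work on a document is reduced to a set of (field, start, end) triples; this is an assumption about the procedure, not a description of the actual tool, and the exact-match agreement measure shown is only one possible choice.

    # Illustrative discrepancy detection between two annotators of the same document.
    # Each annotation set contains (field, start, end) triples (hypothetical format).

    def disagreements(annotations_a, annotations_b):
        """Return the spans marked by only one of the two annotators."""
        return annotations_a - annotations_b, annotations_b - annotations_a

    def agreement_rate(annotations_a, annotations_b):
        """Exact-match agreement: |A intersection B| / |A union B|."""
        union = annotations_a | annotations_b
        return len(annotations_a & annotations_b) / len(union) if union else 1.0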

Before the beginning of the evaluation, the corpora will be preprocessed using an existing NLP system: documents will be tokenized and annotated with POS tags, gazetteer information and named entities. The different algorithms will have to use this preprocessed data. This is in order to ensure that they all have access to the same information: in this way we believe that we will be able to measure each algorithm's ability on a fair and equal basis, as already done in other evaluations such as CoNLL. Moreover, this will allow researchers to concentrate on the task of learning without having to spend time on the linguistic pre-processor. We also believe that in this way we will enable the participation of researchers with limited or no knowledge of language analysis: they will not risk being penalised for their inability to define a good linguistic pre-processor. The pre-processing results will be provided as produced by the system; no human correction will be performed, so as to preserve the noise present in real application environments.
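
As a purely illustrative example, the preprocessed corpus could expose one token per line with its POS tag, gazetteer class and named-entity tag, in the spirit of the CoNLL column format; the format, tag sets and example sentence below are assumptions, not the format that will actually be distributed.

    # Hypothetical token-level view of a preprocessed CFP fragment
    # (token, POS tag, gazetteer class, named-entity tag).
    preprocessed_fragment = """
    The        DT   O          O
    workshop   NN   O          O
    will       MD   O          O
    be         VB   O          O
    held       VBN  O          O
    in         IN   O          O
    Sheffield  NNP  city_gaz   B-LOCATION
    on         IN   O          O
    July       NNP  month_gaz  B-DATE
    15         CD   O          I-DATE
    """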

Tasks

The corpus will consist of 600 annotated and 500 unannotated documents. In order to provide a realistic test scenario, the 200 most recent annotated CFPs will form the test set. The remaining 400 annotated documents will be divided into four 100-document partitions, enabling comparative cross-validation experiments to be performed (a sketch of the intended split follows).
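
A minimal sketch of this split, assuming the annotated documents can be sorted by date; the helper below is hypothetical, and the official partitions will be distributed with the corpus.

    # Hypothetical helper reproducing the intended corpus split.
    def split_corpus(annotated_docs):
        """annotated_docs: the 600 annotated documents, sorted oldest to newest.

        Returns (partitions, test_set): four 100-document folds for
        cross-validation and the 200 most recent documents as the test set.
        """
        assert len(annotated_docs) == 600
        test_set = annotated_docs[-200:]            # the 200 most recent CFPs
        training = annotated_docs[:400]             # the remaining 400 documents
        partitions = [training[i:i + 100] for i in range(0, 400, 100)]
        return partitions, test_set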

The three tasks described below will be evaluated. Each participant can decide to take part in any of the tasks, but participation in Task 1 is mandatory. Participants will be asked to use the preprocessed form of the corpus, but an optional task (evaluated separately) will also allow algorithms to use a different pre-processor.

TASK1: Full scenario: Learning to annotate implicit information.

Given the 400 annotated training documents, learn to extract information. For pre-competitive tests, each algorithm provides the results of a four-fold cross-validation experiment using the same document partitions. The main goal of this task is to evaluate the ability of a system to learn how to extract information given a closed world (the test set of the 200 most recent annotated documents). The task will measure the ability to generalize over a limited amount of training material in an environment with a large amount of sparse data.
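
The pre-competitive cross-validation protocol could be summarised as in the sketch below, under the assumption that participants plug in their own learner and reuse the evaluation-server scoring; train_system and evaluate are hypothetical stand-ins for those components.

    # Sketch of the four-fold cross-validation over the fixed 100-document partitions.
    def cross_validate(partitions, train_system, evaluate):
        """Return one score (e.g. a precision/recall/F1 triple) per held-out partition."""
        scores = []
        for held_out in range(len(partitions)):
            train_docs = [doc for i, part in enumerate(partitions)
                          if i != held_out for doc in part]
            model = train_system(train_docs)              # participant's learner
            scores.append(evaluate(model, partitions[held_out]))
        return scores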

TASK2: Active learning: Learning to select documents

In this task, the same corpus of 600 documents mentioned above will be used: 400 as training documents and 200 as test documents. Baseline: given fixed subsets of the training corpus of increasing size (e.g. 10, 20, 30, 50, 75, 100, 150, 200), show the learning ability on the full test corpus. Advanced: given an initial number of annotated documents as a seed (e.g. 10), select training subsets of increasing size (e.g. 20, 30, 50, 75, 100, 150, 200) in order to show the algorithm's ability to select the most suitable set of training documents from an unannotated pool.
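
The two protocols might be realised roughly as in the sketch below, with uncertainty-based selection as just one example of how the advanced setting could pick documents; train_system, evaluate and uncertainty are hypothetical stand-ins for a participant's own components.

    # Illustrative baseline and active-learning curves for TASK2.
    SIZES = [10, 20, 30, 50, 75, 100, 150, 200]

    def baseline_curve(training_docs, test_docs, train_system, evaluate):
        """Scores for fixed nested training subsets of increasing size."""
        return [evaluate(train_system(training_docs[:n]), test_docs) for n in SIZES]

    def active_curve(training_pool, test_docs, train_system, evaluate,
                     uncertainty, seed_size=10):
        """Let the learner choose which documents to have annotated next."""
        selected = list(training_pool[:seed_size])
        pool = list(training_pool[seed_size:])
        curve = []
        for target in SIZES[1:]:
            model = train_system(selected)
            # Request annotation of the documents the current model is least sure about.
            pool.sort(key=lambda doc: uncertainty(model, doc), reverse=True)
            while len(selected) < target and pool:
                selected.append(pool.pop(0))
            curve.append(evaluate(train_system(selected), test_docs))
        return curve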

Each algorithm's results will be plotted on a chart in order to study its learning curve and to allow a better understanding of the results obtained in TASK1. Moreover, the ability to reach reliable results quickly is an important feature of any adaptive IE system supporting annotation (Ciravegna et al. 2002), so the study of the learning curve will allow us to assess the suitability of each algorithm for online learning.

TASK3: Enriched Scenario

Same as the full scenario, but the algorithms will be able to use a richer set of information sources. In particular, we will focus on using the unannotated part of the corpus (500 documents). The goal is to study how unsupervised or semi-supervised methods can improve the results of supervised approaches. An interesting variant of this task could concern the use of unlimited resources, e.g. the Web.
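
One semi-supervised strategy participants might try is simple self-training over the unannotated pool, sketched below; the callables (train_system, annotate, confidence) and the confidence threshold are hypothetical, and this is only one of many possible approaches, not a prescribed method.

    # Illustrative self-training loop over the 500 unannotated documents.
    def self_train(labelled, unlabelled, train_system, annotate, confidence,
                   threshold=0.9, rounds=5):
        """labelled: list of (document, annotations) pairs; unlabelled: documents."""
        training = list(labelled)
        pool = list(unlabelled)
        for _ in range(rounds):
            model = train_system(training)
            confident, remaining = [], []
            for doc in pool:
                if confidence(model, doc) >= threshold:
                    confident.append((doc, annotate(model, doc)))  # trust model output
                else:
                    remaining.append(doc)
            if not confident:
                break
            training.extend(confident)
            pool = remaining
        return train_system(training)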

Bibliography

  • (Califf 1998) Califf, M. E. and R. Mooney, 2003. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177-210.
  • (Ciravegna et al. 2002) Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli and Yorick Wilks, 2002. User-system cooperation in document annotation based on information extraction. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02). Springer Verlag.
  • (Ciravegna and Lavelli 2004) Fabio Ciravegna and Alberto Lavelli, 2004. LearningPinocchio: Adaptive Information Extraction for Real World Applications. Journal of Natural Language Engineering, 10(2).
  • (Daelemans et al., 2003) Daelemans, Walter and Véronique Hoste, 2002. Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain.
  • (Douthat 1998) Douthat, A., 1998. The message understanding conference scoring software user's manual. In Proceedings of the 7th Message Understanding Conference (MUC-7). http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_sw/muc_sw_manual.html.
  • (Freitag 1998) Freitag, Dayne, 1998. Machine Learning for Information Extraction in Informal Domains. Ph.D. thesis, Carnegie Mellon University.
  • (Handschuh et al. 2002) S. Handschuh, S. Staab and F. Ciravegna, 2002. S-CREAM - Semi-automatic CREAtion of Metadata. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02). Springer Verlag.
  • (Hirschman 1998) Hirschman, Lynette, 1998. The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language, 12:281-305.
  • (Lavelli et al. 2004) A. Lavelli, M. E. Califf, F. Ciravegna, D. Freitag, C. Giuliano, N. Kushmerick and L. Romano, 2004. A Critical Survey of the Methodology for IE Evaluation. In Proceedings of the 3rd LREC Conference, Crete, May 2004.
  • (Vargas Vera et al. 2002) M. Vargas-Vera, Enrico Motta, J. Domingue, M. Lanzoni, A. Stutt and F. Ciravegna, 2002. MnM: Ontology driven semi-automatic or automatic support for semantic markup. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02). Springer Verlag.