Background: The idea of a multilingual world is becoming a reality: sophisticated monolingual, cross-lingual and multilingual language technologies have greatly improved translation quality and language/topic coverage in real-life situations.

However, a major challenge lies in respecting the diversity of all world regions, embodied in their languages and cultures, while avoiding the fragmentation of education, research and business linked to those same assets. Another is the relationship between the EU’s language technologies for the Digital Single Market and other language markets, which has also been a major opportunity for the field in the last couple of years.

Specific Challenge: The challenge is to facilitate multilingual online communication in developing countries, specifically in the domain of education, by removing existing language barriers with technologies developed in the EU, which currently leads this field. These barriers hamper the wider penetration of cross-border education, commerce, social communication and exchange of cultural content.

Additionally, current machine translation solutions typically perform well only for a limited number of target languages and for a given text type. The potential for a value-added global action in creating access to educational content, with machine translation acting as a bridge between national educational systems, is enormous.

Specific Solution: The Knowledge 4 All Foundation addressed the problem of mass translation in education by developing TransLexy, a robust service that provides translation from English into nine European and two BRIC languages, namely:

  1. English → Bulgarian (Български)
  2. English → Czech (Čeština)
  3. English → German (Deutsch)
  4. English → Greek (Ελληνικά)
  5. English → Croatian (Hrvatski)
  6. English → Italian (Italiano)
  7. English → Dutch (Nederlands)
  8. English → Polish (Polszczyzna)
  9. English → Portuguese (Português)
  10. English → Russian (Русский)
  11. English → Chinese (漢語, 汉语)

The platform is intended to overcome existing language barriers in education: it can handle huge volumes, a wide variety of languages and educational text styles, and deliver results in reasonable time (in most cases, instantly).

Moving forward: In partnership with the University of Edinburgh, the Foundation will add the following language pairs to its portfolio:

  • English → Afaan Oromo
  • English → Tigrinya
  • English → Igbo
  • English → Yoruba
  • English → Gujarati
  • English → Punjabi
  • Kurdish → English
  • North Korean → English
  • Hausa → English
  • Swahili → English

The Probabilistic Automata learning Competition (PAutomaC) is the first online challenge on learning non-deterministic probabilistic finite state machines (HMMs, PFAs, …).

The competition is over. In addition to the technical report describing the competition that was written at the outset, an article published in the proceedings of ICGI’12 is available; it contains many details about the competition and its results. In the same proceedings, one can find a short paper from the winning team. Two articles from other participants can also be found here (team Hulden) and here (team Kepler).


December 11th 2012: the code of the winning team is available here.

September 11th 2012: following the discussion during the workshop, the target and solution files for the competition-phase data are available in the download section.

September 7th 2012: the PAutomaC workshop at ICGI’12 is on.

July 3rd: After a storm postponed the end of the competition by two days, PAutomaC is finished. The winner is the team of Chihiro Shibata and Ryo Yoshinaka – congratulations to them! And thanks to all participants: this has been a great competition.

May 20th: Phase 2 is launched: the data for the real competition are available! The data sets from the training phase are still available, but you can no longer submit results for them. However, the files containing the true probabilities (obtained with the target automata) are available.

March 20th: 16 new data sets are available!

March 8th: The website is fully operational, the first data set is available.

Why this competition?

Finite state automata (or machines) are well-known models for characterizing the behaviour of systems or processes. They have been used for several decades in computer and software engineering to model the complex behaviours of electronic circuits and software. A nice feature of an automaton model is that it is easy to interpret; unfortunately, in many applications the original design of a system is unknown. That is why learning approaches have been used, for instance:

  • To model DNA or protein sequences in bioinformatics.
  • To find patterns underlying different sounds for speech processing.
  • To develop morphological or phonological rules for natural language processing.
  • To model unknown mechanical processes in physics.
  • To discover the exact environment of robots.
  • To detect anomalies, such as intrusions, in computer security.
  • To do behavioural modelling of users in applications ranging from web systems to the automotive sector.
  • To discover the structure of music styles for music classification and generation.

In all such cases, an automaton model is learned from observations of the system, i.e., a finite set of strings. As the data gathered from observations is usually unlabelled, the standard method of dealing with this situation is to assume a probabilistic automaton model, i.e., a distribution over strings. In such a model, different states can generate different symbols with different probabilities: the two main formalisms are Hidden Markov Models (HMM) and Probabilistic Finite State Automata (PFA).
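
As a toy illustration (the automaton below is made up, not one of the competition targets), the probability a PFA assigns to a string can be computed with the forward algorithm, summing over all state paths that generate the string:

```python
# Toy two-state PFA and the forward algorithm that computes the
# probability the automaton assigns to a string.

def pfa_string_probability(string, initial, transitions, final):
    """initial:     {state: P(start in state)}
    transitions: {(state, symbol): [(next_state, prob), ...]}
    final:       {state: P(stop in state)}
    """
    forward = dict(initial)  # P(prefix generated so far, currently in state)
    for symbol in string:
        nxt = {}
        for state, p in forward.items():
            for succ, tp in transitions.get((state, symbol), []):
                nxt[succ] = nxt.get(succ, 0.0) + p * tp
        forward = nxt
    return sum(p * final.get(s, 0.0) for s, p in forward.items())

# Per-state outgoing probabilities (including stopping) sum to one
initial = {0: 1.0}
transitions = {
    (0, "a"): [(0, 0.5), (1, 0.2)],
    (0, "b"): [(1, 0.1)],
    (1, "a"): [(1, 0.3)],
    (1, "b"): [(0, 0.4)],
}
final = {0: 0.2, 1: 0.3}

print(pfa_string_probability("ab", initial, transitions, final))  # ≈ 0.031
```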

This is what this competition is about.

How does it work?

We automatically generated PFAs in a way that is described here. We then generated two data sets from each PFA: a training set and a test set (from which duplicate strings have been removed). The idea is to train your learning algorithm on the training set in order to assign probabilities to the strings in the test set.
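
As a sketch of the task format (not of the official generation procedure or any winning method), one simple baseline is to train a smoothed bigram model on the training strings and use it to score the test strings; strings are treated as sequences of integer symbols, as in the competition data files:

```python
from collections import defaultdict

# Illustrative baseline: bigram counts with add-one smoothing.

def train_bigram(train_strings, alphabet_size):
    counts = defaultdict(lambda: defaultdict(int))
    for s in train_strings:
        seq = ["<s>"] + list(s) + ["</s>"]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts

def string_probability(s, counts, alphabet_size):
    vocab = alphabet_size + 1  # +1 for the end-of-string event
    seq = ["<s>"] + list(s) + ["</s>"]
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        total = sum(counts[prev].values())
        p *= (counts[prev][cur] + 1) / (total + vocab)
    return p
```

A real submission would score every test string this way and, if required by the submission format, normalize the scores into a distribution over the test set.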

Details about how to participate can be found in the participate section.

There will also be some real-world data sets during the second phase of the competition.

Probabilistic Automata Learning Competition

According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally: more people die annually from CVDs than from any other cause. An estimated 17.1 million people died from CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to coronary heart disease. Any method which can help to detect signs of heart disease could therefore have a significant impact on world health. This challenge is to produce methods to do exactly that. Specifically, we are interested in creating the first level of screening of cardiac pathologies both in a Hospital environment by a doctor (using a digital stethoscope) and at home by the patient (using a mobile device).

The problem is of particular interest to machine learning researchers as it involves classification of audio sample data, where distinguishing between classes of interest is non-trivial. Data is gathered in real-world situations and frequently contains background noise of every conceivable type. The differences between heart sounds corresponding to different heart symptoms can also be extremely subtle and challenging to separate. Success in classifying this form of data requires extremely robust classifiers. Despite its medical significance, to date this is a relatively unexplored application for machine learning.

Data has been gathered from two sources: (A) from the general public via the iStethoscope Pro iPhone app, provided in Dataset A, and (B) from a clinical trial in hospitals using the digital stethoscope DigiScope, provided in Dataset B.

CHALLENGE 1 – Heart Sound Segmentation

The first challenge is to produce a method that can locate the S1 (“lub”) and S2 (“dub”) sounds within audio data, segmenting the Normal audio files in both datasets. To train your machine learning method, we provide the exact locations of the S1 and S2 sounds for some of the audio files. Use them to identify and locate the S1 and S2 sounds of all the heartbeats in the unlabelled group. Sound locations are measured in audio samples for better precision; your method must use the same unit.
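
One plausible (unofficial) starting point for segmentation is to low-pass filter the audio, compute a smoothed amplitude envelope, and pick peaks; the window and threshold values below are illustrative assumptions, `scipy` is assumed to be available, and the 195 Hz cutoff follows the notes elsewhere on this page:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def heart_sound_peaks(audio, sample_rate):
    # Low-pass filter at 195 Hz, where most heart-sound energy lies
    b, a = butter(4, 195 / (sample_rate / 2), btype="low")
    filtered = filtfilt(b, a, audio)
    # Rectify and smooth to get an energy envelope (~50 ms window)
    window = int(0.05 * sample_rate)
    envelope = np.convolve(np.abs(filtered), np.ones(window) / window, mode="same")
    # Assume heart sounds are at least ~200 ms apart
    peaks, _ = find_peaks(envelope, distance=int(0.2 * sample_rate),
                          height=0.3 * envelope.max())
    return peaks  # candidate S1/S2 locations, in audio samples
```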

CHALLENGE 2 – Heart Sound Classification

The task is to produce a method that can classify real heart audio (also known as “beat classification”) into one of four categories for Dataset A:

  1. Normal
  2. Murmur
  3. Extra Heart Sound
  4. Artifact

and three classes for Dataset B:

  1. Normal
  2. Murmur
  3. Extrasystole

You may tackle either or both of these challenges. If you can solve the first challenge, the second will be considerably easier! The winner of each challenge will be the method best able to segment and/or classify two sets of unlabelled data into the correct categories after training on both datasets provided below. The creator of the winning method will receive a WiFi 32Gb iPad as the prize, awarded at a workshop at AISTATS 2012.


After downloading the data, please register your interest to participate in the challenge by clicking here.

There are two datasets:

Dataset A, containing 176 files in WAV format, organized as:

  • 14Mb, 31 files (download)
  • 17.3Mb, 34 files (download)
  • 6.9Mb, 19 files (download)
  • 22.5Mb, 40 files (download)
  • 24.6Mb, 52 files (download)

The same datasets are also available in aif format:

  • 13.2Mb, 31 files (download)
  • 16.4Mb, 34 files (download)
  • 6.5Mb, 19 files (download)
  • 20.9Mb, 40 files (download)
  • 23.0Mb, 52 files (download)

Segmentation data (updated 23 March 2012), giving locations of S1 and S2 sounds in Atraining_normal: Atraining_normal_seg.csv

Dataset B, containing 656 files in WAV format, organized as:

  • 13.8Mb, 320 files (containing subdirectory Btraining_noisynormal) (download)
  • 5.3Mb, 95 files (containing subdirectory Btraining_noisymurmur) (download)
  • 1.9Mb, 46 files (download)
  • 9.2Mb, 195 files (download)

The same datasets are also available in aif format:

  • 13.0Mb, 320 files (containing subdirectory Btraining_noisynormal) (download)
  • 5.1Mb, 95 files (containing subdirectory Btraining_noisymurmur) (download)
  • 2.1Mb, 46 files (download)
  • 8.7Mb, 195 files (download)

Segmentation data, giving locations of S1 and S2 sounds in Btraining_normal: Btraining_normal_seg.csv

Evaluation scripts, plus full details of the metrics and test procedure you must use in order to measure the effectiveness of your methods, are available here:

Challenge 1 involves segmenting the audio files using the training segmentations provided above.

Challenge 2 involves correctly labelling the sounds.

Please use the following citation if the data is used:

@misc{chsc-2011,
author = "Bentley, P. and Nordehn, G. and Coimbra, M. and Mannor, S.",
title = "The {PASCAL} {C}lassifying {H}eart {S}ounds {C}hallenge 2011 {(CHSC2011)} {R}esults",
howpublished = ""}

The audio files are of varying lengths, between 1 second and 30 seconds (some have been clipped to reduce excessive noise and provide the salient fragment of the sound).

Most information in heart sounds is contained in the low-frequency components, with noise in the higher frequencies. It is common to apply a low-pass filter at 195 Hz. Fast Fourier transforms are also likely to provide useful information about volume and frequency over time. More domain-specific knowledge about the differences between the categories of sounds is provided below.
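
As a sketch of such a time-frequency feature (the frame and hop sizes are illustrative, and the audio is assumed to have been loaded as a float array), one can frame the signal and keep only the magnitude-FFT bins below 195 Hz:

```python
import numpy as np

def lowband_spectrogram(audio, sample_rate, frame_len=1024, hop=512):
    # Windowed magnitude FFT per frame, keeping only bins below 195 Hz
    frames = []
    window = np.hanning(frame_len)
    for start in range(0, len(audio) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(audio[start:start + frame_len] * window))
        frames.append(spectrum)
    spec = np.array(frames)  # shape (n_frames, frame_len // 2 + 1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    keep = freqs < 195.0
    return spec[:, keep], freqs[keep]
```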

Normal Category
In the Normal category there are normal, healthy heart sounds. These may contain noise in the final second of the recording as the device is removed from the body. They may contain a variety of background noises (from traffic to radios). They may also contain occasional random noise corresponding to breathing, or brushing the microphone against clothing or skin. A normal heart sound has a clear “lub dub, lub dub” pattern, with the time from “lub” to “dub” shorter than the time from “dub” to the next “lub” (when the heart rate is less than 140 beats per minute). Note the temporal description of “lub” and “dub” locations over time in the following illustration:

…lub……….dub……………. lub……….dub……………. lub……….dub……………. lub……….dub…

In medicine we call the lub sound “S1” and the dub sound “S2”. Most normal heart rates at rest will be between about 60 and 100 beats (‘lub dub’s) per minute. However, since the data may have been collected from children or adults in calm or excited states, heart rates in the data may vary from 40 to 140 beats per minute or higher. Dataset B also contains noisy_normal data – normal data which includes a substantial amount of background noise or distortion. You may choose to use this or ignore it; however, the test set will include some equally noisy examples.
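
The timing rule above (at moderate heart rates, the S1→S2 gap is shorter than the S2→S1 gap) suggests a simple way to label an alternating sequence of detected peaks. This is an illustrative sketch, not part of the challenge materials; the sample positions are made up:

```python
# Label alternating heart-sound peaks (positions in audio samples) as S1 or
# S2, using the rule that the S1->S2 gap is shorter than the S2->S1 gap.

def label_s1_s2(peak_samples):
    gaps = [b - a for a, b in zip(peak_samples, peak_samples[1:])]
    # If the even-indexed gaps are the short ones, the sequence starts on S1
    even = sum(gaps[0::2])
    odd = sum(gaps[1::2])
    first = "S1" if even <= odd else "S2"
    second = "S2" if first == "S1" else "S1"
    return [first if i % 2 == 0 else second for i in range(len(peak_samples))]

# 300-sample systole (S1->S2) vs 500-sample diastole (S2->S1)
print(label_s1_s2([0, 300, 800, 1100, 1600]))
# -> ['S1', 'S2', 'S1', 'S2', 'S1']
```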

Murmur Category
Heart murmurs sound as though there is a “whooshing, roaring, rumbling, or turbulent fluid” noise in one of two temporal locations: (1) between “lub” and “dub”, or (2) between “dub” and “lub”. They can be a symptom of many heart disorders, some serious. There will still be a “lub” and a “dub”. One thing that confuses non-medically trained people is that murmurs happen between lub and dub or between dub and lub – not on lub and not on dub. Below, asterisks (*) mark the locations where a murmur may be heard.

…lub..****…dub……………. lub..****..dub ……………. lub..****..dub ……………. lub..****..dub …


…lub……….dub…******….lub………. dub…******….lub ………. dub…******….lub ……….dub…

Dataset B also contains noisy_murmur data – murmur data which includes a substantial amount of background noise or distortion. You may choose to use this or ignore it; however, the test set will include some equally noisy examples.

Extra Heart Sound Category (Dataset A)
Extra heart sounds can be identified because there is an additional sound, e.g. a “lub-lub dub” or a “lub dub-dub”. An extra heart sound may not be a sign of disease. However, in some situations it is an important sign of disease, which if detected early could help a person. Being able to detect the extra heart sound is important, as it cannot be detected well by ultrasound. Below, note the temporal description of the extra heart sounds:

…lub.lub……….dub………..………. lub. lub……….dub…………….lub.lub……..…….dub…….


…lub………. dub.dub………………….lub.……….dub.dub………………….lub……..…….dub. dub……

Artifact Category (Dataset A)
In the Artifact category there are a wide range of different sounds, including feedback squeals and echoes, speech, music and noise. There are usually no discernible heart sounds, and thus little or no temporal periodicity at frequencies below 195 Hz. This category is the most different from the others. It is important to be able to distinguish this category from the other three categories, so that someone gathering the data can be instructed to try again.

Extrasystole Category (Dataset B)
Extrasystole sounds may appear occasionally and can be identified because there is a heart sound that is out of rhythm, involving extra or skipped heartbeats, e.g. a “lub-lub dub” or a “lub dub-dub”. (This is not the same as an extra heart sound, as the event does not occur regularly.) An extrasystole may not be a sign of disease. It can happen normally in an adult and can be very common in children. However, in some situations extrasystoles can be caused by heart diseases. If these diseases are detected earlier, treatment is likely to be more effective. Below, note the temporal description of extrasystoles:

………..lub……….dub………..………. lub. ………..……….dub…………….lub.lub……..…….dub…….
…lub………. dub……………………….lub.…………………dub.dub………………….lub……..…….dub.……
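
One illustrative way to flag such out-of-rhythm events, given detected beat locations, is to compare each inter-beat interval with the median interval. The timestamps and the 25% tolerance below are made up for illustration:

```python
# Flag inter-beat intervals that deviate from the median interval by more
# than a tolerance fraction; beat positions are in audio samples.

def irregular_intervals(beat_samples, tolerance=0.25):
    gaps = [b - a for a, b in zip(beat_samples, beat_samples[1:])]
    median = sorted(gaps)[len(gaps) // 2]
    return [i for i, g in enumerate(gaps)
            if abs(g - median) > tolerance * median]

# A skipped beat doubles one interval, so gap index 2 is flagged
print(irregular_intervals([0, 800, 1600, 3200, 4000]))  # -> [2]
```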


To allow systems to be comparable, there are some guidelines that we would like participants to follow:

  1. Domain-specific knowledge as provided in this document may be freely used to enhance the performance of the systems.
  2. We provide both training and test data sets, but labels are omitted for the test data. We require the results to be produced in a specific format in a text file. A scoring script is provided so that participants can evaluate their results on both the test and training data.
  3. We expect to see results for both the training and test sets in the submissions. We also require the code for the method, which needs to include instructions for executing the system, to enable us to validate the submitted results if necessary.

See the evaluation scripts in the downloads section for details of how the accuracy of your results is calculated. You must use this script so that the systems can be compared.

In 2015, ChaLearn is organizing parallel challenge tracks on RGB data for Human Pose Recovery, action/interaction spotting, and cultural event classification. For each track, the awards for the first, second and third place winners will be 500, 300 and 200 dollars, respectively.

 The challenge features three quantitative tracks:

Track 1: Human Pose Recovery: More than 8,000 frames of continuous RGB sequences are recorded and labeled with the objective of performing human pose recovery by means of recognizing more than 120,000 human limbs of different people. Examples of labeled frames are shown in Fig. 1.

Track 2: Action/Interaction Recognition: 235 performances of 11 action/interaction categories are recorded and manually labeled in continuous RGB sequences of different people performing natural isolated and collaborative actions at random moments. Examples of labeled actions are shown in Fig. 1.

Track 3: Cultural event classification: More than 10,000 images corresponding to 50 different cultural event categories will be considered. In all the categories, garments, human poses, objects and context will be possible cues to be exploited for recognizing the events, while preserving the inherent inter- and intra-class variability of this type of images. Examples of cultural events will be Carnival, Oktoberfest, San Fermin, Maha-Kumbh-Mela and Aoi-Matsuri, among others, see Fig. 2. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for creating the baseline of this Track.


The goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

  • Person: person
  • Animal: bird, cat, cow, dog, horse, sheep
  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

There will be three main competitions: classification, detection, and segmentation; and three “taster” competitions: person layout, action classification, and ImageNet large scale recognition:

Segmentation Competition

  • Segmentation: Generating pixel-wise segmentations giving the class of the object visible at each pixel, or “background” otherwise.

Person Layout Taster Competition

  • Person Layout: Predicting the bounding box and label of each part of a person (head, hands, feet).

Action Classification Taster Competition

  • Action Classification: Predicting the action(s) being performed by a person in a still image.
    10 action classes + “other”
Classification and Detection Competitions

  1. Classification: For each of the twenty classes, predicting presence/absence of an example of that class in the test image.
  2. Detection: Predicting the bounding box and label of each object from the twenty target classes in the test image.

Participants may enter either (or both) of these competitions, and can choose to tackle any (or all) of the twenty object classes. The challenge allows for two approaches to each of the competitions:

  1. Participants may use systems built or trained using any methods or data excluding the provided test sets.
  2. Systems are to be built or trained using only the provided training/validation data.


To download the training/validation data, see the development kit.

The training data provided consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object in one of the twenty classes present in the image. Note that multiple objects from multiple classes may be present in the same image. Some example images can be viewed online. A subset of images are also annotated with pixel-wise segmentation of each object present, to support the segmentation competition. Some segmentation examples can be viewed online.

Annotation was performed according to a set of guidelines distributed to all annotators.

The data will be made available in two stages; in the first stage, a development kit will be released consisting of training and validation data, plus evaluation software (written in MATLAB). One purpose of the validation set is to demonstrate how the evaluation software works ahead of the competition submission.

In the second stage, the test set will be made available for the actual competition. As in the VOC2008-2010 challenges, no ground truth for the test data will be released.

The data has been split into 50% for training/validation and 50% for testing. The distributions of images and objects by class are approximately equal across the training/validation and test sets. In total there are 28,952 images. Further statistics are online.

Example images

Example images and the corresponding annotation for the classification/detection/segmentation tasks and the person layout taster can be viewed online:

Development Kit

The development kit consists of the training/validation data, MATLAB code for reading the annotation data, support files, and example implementations for each competition.

The development kit will be available according to the timetable.

Test Data

The test data is now available. Note that the only annotation in the data is for the layout/action taster competitions. As in 2008-2010, there are no current plans to release full annotation – evaluation of results will be provided by the organizers.

The test data can now be downloaded from the evaluation server. You can also use the evaluation server to evaluate your method on the test data.

Useful Software

Below is a list of software you may find useful, contributed by participants to previous challenges.


Timetable

  • May 2011: Development kit (training and validation data plus evaluation software) made available.
  • June 2011: Test set made available.
  • 13 October 2011 (Thursday, 2300 hours GMT): Extended deadline for submission of results. There will be no further extensions.
  • 07 November 2011: Challenge Workshop in association with ICCV 2011, Barcelona.

Submission of Results

Participants are expected to submit a single set of results per method employed. Participants who have investigated several algorithms may submit one result per method. Changes in algorithm parameters do not constitute a different method – all parameter tuning must be conducted using the training and validation data alone.

Results must be submitted using the automated evaluation server:

It is essential that your results files are in the correct format. Details of the required file formats for submitted results can be found in the development kit documentation. The results files should be collected in a single archive file (tar/tgz/tar.gz).

Participants submitting results for several different methods (noting the definition of different methods above) should produce a separate archive for each method.

In addition to the results files, participants will need to specify:

  • contact details and affiliation
  • list of contributors
  • description of the method (minimum 500 characters) – see below

New in 2011: we require all submissions to be accompanied by an abstract describing the method, of minimum length 500 characters. The abstract will be used in part to select invited speakers at the challenge workshop. If you are unable to submit a description, e.g. due to commercial interests or other issues of confidentiality, you must contact the organisers to discuss this. Below are two example descriptions, for classification and detection methods previously presented at the challenge workshop. Note these are our own summaries, not provided by the original authors.

  • Example Abstract: Object classification
    Based on the VOC2006 QMUL description of LSPCH by Jianguo Zhang, Cordelia Schmid, Svetlana Lazebnik, Jean Ponce in sec 2.16 of The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. We make use of a bag-of-visual-words method (cf Csurka et al 2004). Regions of interest are detected with a Laplacian detector (Lindeberg, 1998), and normalized for scale. A SIFT descriptor (Lowe 2004) is then computed for each detection. 50,000 randomly selected descriptors from the training set are then vector quantized (using k-means) into k=3000 “visual words” (300 for each of the 10 classes). Each image is then represented by the histogram of how often each visual word is used. We also make use of a spatial pyramid scheme (Lazebnik et al, CVPR 2006). We first train SVM classifiers using the chi^2 kernel based on the histograms of each level in the pyramid. The outputs of these SVM classifiers are then concatenated into a feature vector for each image and used to learn another SVM classifier based on a Gaussian RBF kernel.
  • Example Abstract: Object detection
    Based on “Object Detection with Discriminatively Trained Part Based Models”; Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, September 2010. We introduce a discriminatively-trained parts-based model for object detection. The model consists of a coarse “root” template of HOG features (Dalal and Triggs, 2006), plus a number of higher-resolution part-based HOG templates which can translate in a neighborhood relative to their default position. The responses of the root and part templates are combined by a latent-SVM model, where the latent variables are the offsets of the parts. We introduce a novel training algorithm for the latent SVM. We also make use of an iterative training procedure exploiting “hard negative” examples, which are negative examples incorrectly classified in an earlier iteration. Finally the model is scanned across the test image in a “sliding-window” fashion at a variety of scales to produce candidate detections, followed by greedy non-maximum suppression. The model is applied to all 20 PASCAL VOC object detection challenges.
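
As a toy illustration of the bag-of-visual-words representation described in the first abstract, descriptors can be quantized against a codebook (standing in here for the k-means centroids) and counted into a normalized histogram. The arrays and dimensions below are illustrative, not taken from any submission:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    # Assign each descriptor to its nearest visual word ...
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    # ... and count word occurrences, normalizing the histogram to sum to 1
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

In practice the descriptors would come from a SIFT-like detector and the codebook from k-means over a large descriptor sample, as the abstract describes.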

If you would like to submit a more detailed description of your method, for example a relevant publication, this can be included in the results archive.

Best Practice

The VOC challenge encourages two types of participation: (i) methods which are trained using only the provided “trainval” (training + validation) data; (ii) methods built or trained using any data except the provided test data, for example commercial systems. In both cases the test data must be used strictly for reporting of results alone – it must not be used in any way to train or tune systems, for example by running multiple parameter choices and reporting the best results obtained.

If using the training data we provide as part of the challenge development kit, all development, e.g. feature selection and parameter tuning, must use the “trainval” (training + validation) set alone. One way is to divide the set into training and validation sets (as suggested in the development kit). Other schemes e.g. n-fold cross-validation are equally valid. The tuned algorithms should then be run only once on the test data.

In VOC2007 we made all annotations available (i.e. for training, validation and test data) but since then we have not made the test annotations available. Instead, results on the test data are submitted to an evaluation server.

Since algorithms should only be run once on the test data, we strongly discourage multiple submissions to the server (indeed, the number of submissions for the same algorithm is strictly controlled), as the evaluation server should not be used for parameter tuning.

We encourage you to publish test results always on the latest release of the challenge, using the output of the evaluation server. If you wish to compare methods or design choices e.g. subsets of features, then there are two options: (i) use the entire VOC2007 data, where all annotations are available; (ii) report cross-validation results using the latest “trainval” set alone.

Policy on email address requirements when registering for the evaluation server

In line with the Best Practice procedures (above) we restrict the number of times that the test data can be processed by the evaluation server. To prevent any abuses of this restriction an institutional email address is required when registering for the evaluation server. This aims to prevent one user registering multiple times under different emails. Institutional emails include academic ones, such as, and corporate ones, but not personal ones, such as or

Publication Policy

The main mechanism for dissemination of the results will be the challenge webpage.

The detailed output of each submitted method will be published online e.g. per-image confidence for the classification task, and bounding boxes for the detection task. The intention is to assist others in the community in carrying out detailed analysis and comparison with their own methods. The published results will not be anonymous – by submitting results, participants are agreeing to have their results shared online.


If you make use of the VOC2011 data, please cite the following reference (to be prepared after the challenge workshop) in any publications:

@misc{pascal-voc-2011,
	author = "Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.",
	title = "The {PASCAL} {V}isual {O}bject {C}lasses {C}hallenge 2011 {(VOC2011)} {R}esults",
	howpublished = ""}

Database Rights

The VOC2011 data includes images obtained from the “flickr” website. Use of these images must respect the corresponding terms of use:

For the purposes of the challenge, the identity of the images in the database, e.g. source and name of owner, has been obscured. Details of the contributor of each image can be found in the annotation to be included in the final release of the data, after completion of the challenge. Any queries about the use or ownership of the data should be addressed to the organizers.


  • Mark Everingham (University of Leeds)
  • Luc van Gool (ETHZ, Zurich)
  • Chris Williams (University of Edinburgh)
  • John Winn (Microsoft Research Cambridge)
  • Andrew Zisserman (University of Oxford)


We gratefully acknowledge the following, who spent many long hours providing annotation for the VOC2011 database:

Yusuf Aytar, Jan Hendrik Becker, Ken Chatfield, Miha Drenik, Chris Engels, Ali Eslami, Adrien Gaidon, Jyri Kivinen, Markus Mathias, Paul Sturgess, David Tingdahl, Diana Turcsany, Vibhav Vineet, Ziming Zhang.

We also thank Sam Johnson for development of the annotation system for Mechanical Turk, and Yusuf Aytar for further development and administration of the evaluation server.


The preparation and running of this challenge is supported by the EU-funded PASCAL2 Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning.

We are grateful to Alyosha Efros for providing additional funding for annotation on Mechanical Turk.


Given two text fragments called ‘Text’ and ‘Hypothesis’, Textual Entailment Recognition is the task of determining whether the meaning of the Hypothesis is entailed (can be inferred) from the Text. The goal of the first RTE Challenge was to provide the NLP community with a benchmark to test progress in recognizing textual entailment, and to compare the achievements of different groups. Since its inception in 2004, the RTE Challenges have promoted research in textual entailment recognition as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and multi-document Summarization.

After the first three highly successful PASCAL RTE Challenges, RTE became a track at the 2008 Text Analysis Conference, which brought it together with communities working on NLP applications. The interaction has provided the opportunity to apply RTE systems to specific applications and to move the RTE task towards more realistic application settings.

RTE-7 pursues the direction taken in RTE-6, focusing on textual entailment in context, where the entailment decision draws on the larger context available in the targeted application settings.

RTE-7 Tasks

The RTE-7 tasks focus on recognizing textual entailment in two application settings: Summarization and Knowledge Base Population.

  1. Main Task (Summarization setting): Given a corpus and a set of “candidate” sentences retrieved by Lucene from that corpus, RTE systems are required to identify all the sentences from among the candidate sentences that entail a given Hypothesis. The RTE-7 Main Task is based on the TAC Update Summarization Task. In the Update Summarization Task, each topic contains two sets of documents (“A” and “B”), where all the “A” documents chronologically precede all the “B” documents. An RTE-7 Main Task “corpus” consists of 10 “A” documents, while Hypotheses are taken from sentences in the “B” documents.
  2. Novelty Detection Subtask (Summarization setting): In the Novelty Detection variant of the Main Task, systems are required to judge whether the information contained in each H (based on text snippets from B summaries) is novel with respect to the information contained in the A documents related to the same topic. If entailing sentences are found for a given H, the content of H is not new; if no entailing sentences are detected, the information contained in H is novel.
  3. KBP Validation Task (Knowledge Base Population setting): Based on the TAC Knowledge Base Population (KBP) Slot-Filling task, the KBP validation task is to determine whether a given relation (Hypothesis) is supported in an associated document (Text). Each slot fill that is proposed by a system for the KBP Slot-Filling task would create one evaluation item for the RTE-KBP Validation Task: The Hypothesis would be a simple sentence created from the slot fill, while the Text would be the source document that was cited as supporting the slot fill.
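The novelty-detection logic above reduces to a simple rule: H is novel iff no candidate sentence entails it. The sketch below illustrates this with a crude word-overlap stand-in for a real entailment system; the `entails` function, its threshold, and the example sentences are all illustrative assumptions, not part of the task definition.

```python
# Toy sketch of the Novelty Detection logic: a hypothesis H is novel iff no
# candidate sentence entails it. Word overlap is a crude stand-in for a real
# RTE system.

def entails(text: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Crude lexical-overlap proxy for textual entailment."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    if not h_words:
        return False
    return len(h_words & t_words) / len(h_words) >= threshold

def is_novel(hypothesis: str, candidate_sentences: list) -> bool:
    """H is novel w.r.t. the A documents iff no candidate sentence entails it."""
    return not any(entails(s, hypothesis) for s in candidate_sentences)

candidates = ["The company announced record profits in 2010",
              "Shares rose sharply after the announcement"]
print(is_novel("The company announced record profits", candidates))  # False
print(is_novel("The CEO resigned yesterday", candidates))            # True
```

A real system would replace `entails` with a full entailment engine; the novelty decision layer on top stays the same.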


Proposed RTE-7 Schedule
April 29 Main Task: Release of Development Set
April 29 KBP Validation Task: Release of Development Set
June 10 Deadline for TAC 2011 track registration
August 17 KBP Validation Task: Release of Test Set
August 29 Main Task: Release of Test Set
September 8 Main Task: Deadline for task submissions
September 15 Main Task: Release of individual evaluated results
September 16 KBP Validation Task: Deadline for task submissions
September 23 KBP Validation Task: Release of individual evaluated results
September 25 Deadline for TAC 2011 workshop presentation proposals
September 29 Main Task: Deadline for ablation tests submissions
October 6 Main Task: Release of individual ablation test results
October 25 Deadline for system reports (workshop notebook version)
November 14-15 TAC 2011 Workshop

Mailing List

The mailing list for the RTE Track is used to discuss and define the task guidelines for the track, as well as for general discussion related to textual entailment and its evaluation. To subscribe, send a message to the list server such that the body consists of the line:
subscribe rte <FirstName> <LastName>
In order for your messages to be posted to the list, you must send them from the email address used when you subscribed. To unsubscribe, send a message from the subscribed email address to the list server such that the body consists of the line:
unsubscribe rte
For additional information on how to use mailing lists hosted at NIST, send a message to the list server.

Organizing Committee

Luisa Bentivogli, CELCT and FBK, Italy
Peter Clark, Vulcan Inc., USA
Ido Dagan, Bar Ilan University, Israel
Hoa Trang Dang, NIST, USA
Danilo Giampiccolo, CELCT, Italy


The Pascal Exploration & Exploitation Challenge seeks to improve the relevance of content presented to visitors of a website, based on their individual interests.


In this challenge, the submitted algorithms have to predict which visitors of a website are likely to click on which piece of content. Visitors are characterised by a set of 120 features. Predicting clicks accurately, based on these features, is essential to presenting content that is relevant to visitors’ interests. It requires continuously learning what might be of interest (exploration), while using what has been learned to serve relevant content often enough (exploitation).


Algorithms are run online (i.e. they receive their input sequentially) on data provided by Adobe Omniture, which closely simulates an actual web campaign. Each visitor click gives a reward of 1, and the best algorithm is the one with the highest cumulative reward at the end. The challenge will be run in phases, between which participants will have the opportunity to update their algorithms based on previous observations.



From its experience in web analytics, Adobe Omniture has created a dataset that simulates the responses to a web campaign, with changes over time. The dataset comprises about 20 million {visitor feature vector, option id, binary clickthrough indicator} records, each representing a single visit to the website. For each visitor feature vector v in the dataset, and for each option i, the binary clickthrough indicator tells us whether v would click on i.
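The exploration/exploitation trade-off described above can be sketched with a simple epsilon-greedy policy over a simulated click stream. The option names, click probabilities, and simulation loop below are illustrative assumptions; the real challenge replays logged {visitor feature vector, option id, clickthrough} records instead of sampling clicks.

```python
import random

# Epsilon-greedy sketch: mostly serve the option with the best empirical
# clickthrough estimate (exploitation), but with probability epsilon serve
# a random option to keep learning (exploration). Click probabilities here
# are invented ground truth for the simulation.

random.seed(0)
CLICK_PROB = {"opt_a": 0.02, "opt_b": 0.05, "opt_c": 0.01}  # hidden ground truth

counts = {o: 0 for o in CLICK_PROB}
rewards = {o: 0.0 for o in CLICK_PROB}

def estimate(option):
    # Optimistic initial value makes sure every option is tried at least once.
    return rewards[option] / counts[option] if counts[option] else float("inf")

def choose(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(CLICK_PROB))   # explore
    return max(CLICK_PROB, key=estimate)         # exploit the best estimate

total_reward = 0
for _ in range(50_000):
    option = choose()
    reward = 1 if random.random() < CLICK_PROB[option] else 0  # a click pays 1
    counts[option] += 1
    rewards[option] += reward
    total_reward += reward

print({o: counts[o] for o in CLICK_PROB}, total_reward)
```

After enough visits, the policy concentrates its pulls on the option with the highest true clickthrough rate while still sampling the others occasionally.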

We are pleased to announce the 4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge. The LSHTC Challenge is a hierarchical text classification competition, using very large datasets. This year’s challenge focuses on interesting learning problems, such as multi-task and refinement learning.

Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue, despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.

The challenge consists of 3 tracks, involving different category systems with different data properties and focusing on different learning and mining problems. The challenge is based on two large datasets: one created from the ODP web directory (DMOZ) and one from Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges roughly from 13,000 to 325,000, and the number of documents from 380,000 to 2,400,000. More information regarding the tracks and challenge rules can be found on the “Datasets, Tracks, Rules and Guidelines” page.
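As a rough illustration of hierarchical classification, the sketch below routes a document top-down through a tiny category tree using one scoring function per node. The hierarchy, keyword sets, and scoring rule are toy stand-ins for trained classifiers over a DMOZ- or Wikipedia-scale hierarchy.

```python
# Toy top-down routing: a "classifier" at each internal node picks the best
# child until a leaf category is reached. Keyword overlap stands in for
# trained per-node models.

HIERARCHY = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
}
KEYWORDS = {
    "science": {"quantum", "cell", "theory"},
    "sports": {"match", "goal", "serve"},
    "physics": {"quantum", "theory"},
    "biology": {"cell"},
    "football": {"goal", "match"},
    "tennis": {"serve"},
}

def score(node: str, words: set) -> int:
    return len(KEYWORDS.get(node, set()) & words)

def classify(doc: str) -> str:
    words = set(doc.lower().split())
    node = "root"
    while node in HIERARCHY:  # descend until we reach a leaf category
        node = max(HIERARCHY[node], key=lambda child: score(child, words))
    return node

print(classify("quantum theory lecture"))       # physics
print(classify("late goal decides the match"))  # football
```

Top-down routing is one of the standard baselines for such hierarchies: it is fast, but an error near the root cannot be recovered lower down, which is one of the learning problems the challenge probes.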

Participants will be able to smoothly and continuously submit runs, in order to improve their systems.

In order to register for the challenge and gain access to the datasets you must have an account at the challenge Web site.


Massih-Reza Amini, LIG, Grenoble, France
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Nicolas Baskiotis, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR “Demokritos” & AUEB, Athens, Greece
George Paliouras, NCSR “Demokritos”, Athens, Greece
Ioannis Partalas, LIG, Grenoble, France

This challenge addresses a question of fundamental and practical interest in machine learning: the assessment of data representations produced by unsupervised learning procedures, for use in supervised learning tasks. It also addresses the evaluation of transfer learning methods capable of producing data representations useful across many similar supervised learning tasks, after training on supervised data from only one of them.

Classification problems are found in many application domains, including in pattern recognition (classification of images or videos, speech recognition), medical diagnosis, marketing (customer categorization), and text categorization (filtering of spam). The category identifiers are referred to as “labels”. Predictive models capable of classifying new instances (correctly predicting the labels) usually require “training” (parameter adjustment) using large amounts of labeled training data (pairs of examples of instances and associated labels). Unfortunately, little labeled training data may be available due to the cost or burden of manually annotating data. Recent research has focused on making use of the vast amounts of unlabeled data available at low cost, including: space transformations, dimensionality reduction, hierarchical feature representations (“deep learning”), and kernel learning. However, these advances tend to be ignored by practitioners, who continue using a handful of popular algorithms like PCA, ICA, k-means, and hierarchical clustering. The goal of this challenge is to perform an evaluation of unsupervised and transfer learning algorithms free of inventor bias, to help identify and popularize algorithms that have advanced the state of the art.

Five datasets from various domains are made available. The participants should submit on-line transformed data representations (or similarity/kernel matrices) on a validation set and a final evaluation set in a prescribed format. The data representations (or similarity/kernel matrices) are evaluated by the organizers on supervised learning tasks unknown to the participants. The results on the validation set are displayed on the leaderboard to provide immediate feedback. The results on the final evaluation set will be revealed only at the end of the challenge. To emphasize the capability of the learning systems to develop useful abstractions, the supervised learning tasks used to evaluate them make use of very few labeled training examples, and the classifier used is a simple linear discriminant classifier. The challenge will proceed in 2 phases:

  • Phase 1 — Unsupervised learning: There exist a number of methods that produce new data representations (or kernels) from purely unlabeled data. Such unsupervised methods are sometimes used as preprocessing to supervised learning procedures. In the first phase of the challenge, no labels will be provided to the participants. The participants are requested to produce data representations (or similarity/kernel matrices) that will be evaluated by the organizers on supervised learning tasks (i.e. using labeled data not available to the participants).
  • Phase 2 — Transfer learning: In other practical settings, it is desirable to produce data representations that are re-usable from domain to domain. We want to examine the possibility that a representation developed with one set of labels can be used to learn a new, similar task more easily. For example, in the handwriting recognition domain, labeled handwritten digits would be available for training. The evaluation task would then be the recognition of handwritten alphabetical letters. We call this setting “transfer learning”. In the second phase of the challenge, some labels will be provided to the participants for the same datasets used in the first phase. This will allow the participants to improve their data representations (or similarity/kernel matrices) using supervised tasks similar to (but different from) the task on which they will be tested.
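A minimal sketch of this evaluation protocol, assuming PCA as a stand-in for a participant's unsupervised method and a nearest-centroid rule as the organizers' simple linear classifier; the synthetic data, dimensions, and random seed are all illustrative:

```python
import numpy as np

# Phase-1 style evaluation sketch: fit an unsupervised transform on
# unlabeled data, then judge the resulting representation with a simple
# linear rule trained on very few labeled examples.

rng = np.random.default_rng(0)

# Unlabeled data: two hidden clusters in 20 dimensions.
X = np.vstack([rng.normal(3.0, 1.0, (200, 20)),
               rng.normal(-3.0, 1.0, (200, 20))])

# "Participant" step: an unsupervised transform (PCA via SVD, top 2 components).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ vt[:2].T

# "Organizer" step: a simple linear rule (nearest centroid) trained on only
# two labeled examples per class, mimicking the few-label evaluation tasks.
y = np.array([0] * 200 + [1] * 200)
c0 = Z[[0, 1]].mean(axis=0)         # centroid of class 0 from 2 examples
c1 = Z[[200, 201]].mean(axis=0)     # centroid of class 1 from 2 examples
pred = np.where(np.linalg.norm(Z - c0, axis=1) < np.linalg.norm(Z - c1, axis=1), 0, 1)
acc = float((pred == y).mean())
print(f"accuracy with 2 labels/class on the learned representation: {acc:.2f}")
```

The point of the protocol is visible here: a representation that exposes the cluster structure lets even a trivial classifier with two labels per class succeed.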

Competition Rules

  • Goal of the challenge: Given a data matrix of samples represented as feature vectors (p samples in rows and n features in columns), produce another data matrix of dimension (p, n’) (the transformed representation of n’ new features) or a similarity/kernel matrix between samples of dimension (p, p). The transformed representations (or similarity/kernel matrices) should provide good results on the supervised learning tasks used by the organizers to evaluate them. The labels of the supervised learning tasks used for evaluation purposes will remain unknown to the participants in phases 1 and 2, but other labels will be made available for transfer learning in phase 2.
  • Prizes: The winners of each phase will be awarded prizes; see the Prizes page for details.
  • Dissemination: The challenge is part of the competition program of the IJCNN 2011 conference, San Jose, California, July 31 – August 5, 2011. We are organizing a special session and a competition workshop at IJCNN 2011 to discuss the results of the challenge. We are also organizing a workshop at ICML 2011, Bellevue, Washington, July 2, 2011. There are three publication opportunities: in JMLR W&CP, in the IEEE proceedings of IJCNN 2011, and in the ICML proceedings.
  • Schedule:
    Dec. 25, 2010 Start of the development period. Phase 0: Registration and submissions open. Rules, toy data, and sample code made available.
    Jan. 3, 2011 Start of phase 1: UNSUPERVISED LEARNING. Datasets made available. No labels available.
    Feb. 1, 2011 IJCNN 2011 papers due (optional).
    March 3, 2011 End of phase 1, at midnight (0 h Mar. 4, server time — time indicated on the Submit page).
    March 4, 2011 Start of phase 2: TRANSFER LEARNING. Training labels made available for transfer learning.
    April 1, 2011 IJCNN paper decision notification.
    April 15, 2011 End of the challenge at midnight (0 h April 16, server time — time indicated on the Submit page). Submissions closed. [Note: the grace period until April 20 has been canceled]
    April 22, 2011 All teams must turn in fact sheets (compulsory). The fact sheets will be used as abstracts for the workshops. Reviewers and participants are given access to provisional rankings and fact sheets.
    April 29, 2011 ICML 2011 papers due, to be published in JMLR W&CP (optional).
    May 1, 2011 Camera ready copies of IJCNN papers due.
    May 20, 2011 Release of the official ranking. Notification of abstract and paper acceptance.
    July 2, 2011 Workshop at ICML 2011, Bellevue, Washington state, USA. Confirmed.
    July 31 – Aug. 5, 2011 Special session and workshop at IJCNN 2011, San Jose, California, USA. Confirmed.
    Aug. 7, 2011 Reviews of JMLR W&CP papers sent back to authors.
    Sep. 30, 2011 Revised JMLR W&CP papers due.
  • Challenge protocol: (1) Development: From the outset of the challenge, all unlabeled development and evaluation data will be provided to the participants. All data will be preprocessed into a feature representation, such that the patterns are not easily recognizable by humans, making it difficult to label data using human experts. During development the participants may make submissions of a feature-based representation (or a similarity/kernel matrix) for a subset of the evaluation data (called the validation set). They will receive on-line feedback on the quality of their representation (or similarity measure) with a number of scoring metrics. (2) Final evaluation: To participate in the final evaluation, the participants will have to (i) register as mutually exclusive teams; (ii) make one “final” correct submission of a feature-based representation (or similarity/kernel matrix) for the final evaluation data for all 5 datasets of the challenge; (iii) submit the answers to a questionnaire on their method (method fact sheet); and (iv) compete either in one of the two phases only or in both phases (it is not necessary to compete in both phases to earn prizes).
  • Baseline results: Results using baseline methods will be provided on the website of the challenge by the organizing team. Those results will be clearly marked as “ULref”. The most basic baseline result is obtained using the raw data. To qualify for prizes, the participants should exceed the performances on raw data for all the datasets of the challenge.
  • Eligibility of participation: Anybody complying with the rules of the challenge, with the exception of the organizers, is eligible to enter. To enter results and get on-line feedback, the participants must make themselves known to the organizers by registering and providing a valid email address so the organizers can communicate with them. However, the participants may remain anonymous to the outside world. To participate in the final test rounds, the participants will have to register as teams. No participant will be allowed to enter as part of several teams. The team leaders will be responsible for ensuring that the team respects the rules of the challenge. There is no commitment to deliver code, data or publish methods to participate in the development phase, but the team leaders will be requested to fill out fact sheets with basic information on their methods to be able to claim prizes. When the challenge is over and the results are known, the teams who want to claim a prize will have to reveal their true identity to the outside world.
  • Anonymity: All entrants must identify themselves to the organizers. However, only your “Workbench id” will be displayed in result tables and you may choose a pseudonym to hide your identity to the rest of the world. Your emails will remain confidential.
  • Data: Datasets from various domains and of varying difficulty are available to download from the Data page. No labels are made available during phase 1. Some labels will be made available for transfer learning at the beginning of phase 2. Reverse-engineering the datasets to gain information on the identity of the patterns in the original data is forbidden. If it is suspected that this rule was violated, the organizers reserve the right to organize post-challenge verifications with which the top-ranking participants will have to comply to earn prizes.
  • Submission method: The method of submission is via the form on the Submit page. To be ranked, submissions must comply with the Instructions. Robot submissions are permitted. If the system gets overloaded, the organizers reserve the right to limit the number of submissions per day per participant. We recommend not exceeding 5 submissions per day per participant. If you encounter problems with the submission process, please contact the Challenge Webmaster (see bottom of page).
  • Ranking: The method of scoring is posted on the Evaluation page. If the scoring method changes, the participants will be notified by email by the organizers.
    – During the development period (phase 1 and 2), the scores on the validation sets will be posted in the Leaderboard table. The participants are allowed to make multiple submissions on the validation sets.
    – The results on the final evaluation set will only be released after the challenge is over. The participants may make multiple submissions on the final evaluation sets, to avoid the last-minute rush. However, for each registered team, only ONE final evaluation set submission for each dataset of the challenge will be taken into account. These submissions will have to be grouped under the same “experiment” name. The team leader will designate which experiment should be taken into account for the final ranking. For each phase, the teams will be ranked on each individual dataset, and the winner will be determined by the best average rank over all datasets.
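The final ranking rule stated above (rank teams on each dataset, then take the best average rank) can be sketched as follows; the team names and scores are made up for illustration:

```python
# Winner determination by average rank: rank every team on each dataset
# (rank 1 = best score), average the ranks, lowest average wins.

scores = {  # higher score = better, one entry per dataset
    "team_x": [0.81, 0.70, 0.92],
    "team_y": [0.79, 0.75, 0.90],
    "team_z": [0.80, 0.60, 0.95],
}

n_datasets = 3
avg_rank = {}
for team in scores:
    ranks = []
    for d in range(n_datasets):
        better = sum(scores[t][d] > scores[team][d] for t in scores)
        ranks.append(better + 1)  # rank 1 = best on that dataset
    avg_rank[team] = sum(ranks) / n_datasets

winner = min(avg_rank, key=avg_rank.get)
print(winner, avg_rank)
```

Note how a team that never wins a single dataset outright can still take first place overall by being consistently near the top, which is exactly what averaging ranks rewards.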

This challenge addresses machine learning problems in which labeling data is expensive, but large amounts of unlabeled data are available at low cost. Such problems might be tackled from different angles: learning from unlabeled data or active learning. In the former case, the algorithms must make do with the limited amount of labeled data and capitalize on the unlabeled data with semi-supervised learning methods. In the latter case, the algorithms may place a limited number of queries to get labels. The goal in that case is to optimize the queries to label data, and the problem is referred to as active learning.

Much of machine learning and data mining has so far concentrated on analyzing data already collected, rather than on collecting data. While experimental design is a well-developed discipline of statistics, data collection practitioners often neglect to apply its principled methods. As a result, data collected and made available to data analysts, in charge of explaining them and building predictive models, are not always of good quality and are plagued by experimental artifacts. In reaction to this situation, some researchers in machine learning and data mining have started to become interested in experimental design to close the gap between data acquisition or experimentation and model building. This has given rise to the discipline of active learning. In parallel, researchers in causal studies have started raising awareness of the differences between passive observations, active sampling, and interventions. In this domain, only interventions qualify as true experiments capable of unraveling cause-effect relationships. However, most practical experimental designs start with sampling data in a way that minimizes the number of necessary interventions.
The Causality Workbench will propose in the next few months several challenges to evaluate methods of active learning and experimental design, which involve the data analyst in the process of data collection. From our perspective, to build good models, we need good data. However, collecting good data comes at a price. Interventions are usually expensive to perform and sometimes unethical or impossible, while observational data are available in abundance at a low cost. Practitioners must identify strategies for collecting data that are cost effective and feasible, resulting in the best possible models at the lowest possible price. Hence, both efficiency and efficacy are evaluation criteria in these challenges.
The setup of this first challenge, represented in the figure above, considers only sampling as an intervention of the data analyst or the learning machine, who may only place queries on the target values (labels) y. So-called “de novo queries”, in which new patterns x can be created, are not considered. We will consider them in an upcoming challenge on experimental design in causal discovery.
In this challenge, we propose several tasks of pool-based active learning, in which a large unlabeled dataset is available from the onset of the challenge and the participants can place queries to acquire labels for some amount of virtual cash. The participants will need to return prediction values for all the labels every time they want to purchase new labels. This will allow us to draw learning curves of prediction performance vs. the amount of virtual cash spent. The participants will be judged according to the area under the learning curves, forcing them to optimize both efficacy (obtain good prediction performance) and efficiency (spend little virtual cash).
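The loop described above (buy labels with virtual cash, predict after every purchase, score the area under the accuracy-vs-cash curve) can be sketched as follows. The 1-D threshold task, the price of 1 per label, and the uncertainty-sampling update rule are illustrative assumptions, not the challenge's actual datasets or pricing.

```python
import random

# Pool-based active learning sketch: each label purchase costs 1 unit of
# virtual cash; after every purchase we predict on the whole pool and record
# accuracy, then score the area under the accuracy-vs-cash learning curve.

random.seed(1)
pool = sorted(random.uniform(0, 1) for _ in range(200))
true_label = lambda x: int(x > 0.6)   # hidden labeling rule (unknown to learner)
labeled = {}                          # x -> purchased label
threshold, cash_spent, curve = 0.5, 0, []

for step in range(20):
    # Uncertainty sampling: buy the label of the point closest to the boundary.
    unlabeled = [x for x in pool if x not in labeled]
    query = min(unlabeled, key=lambda x: abs(x - threshold))
    labeled[query] = true_label(query)          # costs 1 unit of virtual cash
    cash_spent += 1
    # Re-fit the threshold from the labels bought so far.
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    if zeros and ones:
        threshold = (max(zeros) + min(ones)) / 2
    elif zeros:
        threshold = max(zeros) + 0.05           # no positives seen: move up
    else:
        threshold = min(ones) - 0.05            # no negatives seen: move down
    acc = sum((x > threshold) == (true_label(x) == 1) for x in pool) / len(pool)
    curve.append(acc)

alc = sum(curve) / len(curve)                   # area under the learning curve
print(f"cash spent: {cash_spent}, final accuracy: {curve[-1]:.2f}, ALC: {alc:.2f}")
```

Because the score is the area under the curve rather than the final accuracy alone, a learner that reaches good accuracy early (efficiency) beats one that reaches the same accuracy only after spending all its cash.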