Competition has always driven people to achieve results that are better than they might have achieved working alone. Knowledge 4 All Foundation runs its highly successful Challenges programme, enabling its members to create their own Machine Learning challenges for each other and to disseminate their results in K4A-sponsored workshops. The Challenges also enabled diverse real-world applications from other fields to be introduced to the machine learning community. More than twenty challenges were held during the lifetime of PASCAL2 and PASCAL, some now so established that they have steered research agendas across the world; their regular workshops are almost conferences in their own right. From sign language to mind-reading, and heart sounds to distorted galaxies, here are some of the challenges.

PASCAL Large Scale Learning Challenge

With the exceptional increase in computing power, storage capacity and network bandwidth over the past decades, ever-growing datasets are collected in fields such as bioinformatics (splice sites, gene boundaries), IT security (network traffic) and text classification (spam vs. non-spam), to name but a few. This growth leaves computational methods as the only viable way of dealing with the data, and it poses new challenges to ML methods. The Large Scale Learning challenge is concerned with the scalability and efficiency of existing ML approaches with respect to computational, memory and communication resources, e.g. resulting from high algorithmic complexity, from the size or dimensionality of the dataset, or from the trade-off between distributed resolution and communication costs. Many comparisons are presented in the literature, but they usually assess only a few algorithms or consider only a few datasets; moreover, they mostly involve different evaluation criteria, model parameters and stopping conditions. As a result it is difficult to determine how each method behaves and compares with the others on the practically relevant criteria: test error, training time and memory requirements. This challenge is designed to be fair and enables a direct comparison of current large-scale classifiers, aimed at answering the question: which learning method is the most accurate given limited resources? To this end we provide a generic evaluation framework tailored to the specifics of the competing methods. Using a wide range of datasets, each with specific properties, we evaluate the methods with performance figures displaying training time vs. test error, dataset size vs. test error and dataset size vs. training time.

Visual Object Classes Challenges 2008, 2009, 2010, 2011, 2012

The Visual Object Classes Challenge was organised for eight years in a row, with increasing success. For example, the VOC 2011 workshop took place at ICCV and there were approximately 200 attendees. There were very healthy numbers of entries: 19 for the classification task, 13 for detection, and 6 for segmentation. There was also a successful collaboration with ImageNet, organized by a team in the US, which held a second competition on their dataset with 1,000 categories (but with only one labelled object per image). It is safe to say that the PASCAL VOC challenges have become a major point of reference for the computer vision community when it comes to object category detection and segmentation. There are over 800 publications (using Google Scholar) which refer to the datasets and the corresponding challenges. The best student paper winner at CVPR 2011 made use of the VOC 2008 detection data; the prize-winning paper at ECCV 2010 made use of the VOC 2009 segmentation data; and prize-winning papers at ICCV 2009, CVPR 2008 and ECCV 2008 were based on the VOC detection challenge (using our performance measure in the loss functions).

The basic challenge is the recognition of objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects); there can be multiple objects in each image. There are typically three main competitions, with 20 object classes and around 10,000 images: classification (is an object of class X present?), detection (where is it and what is its size?) and segmentation (pixel-wise labelling). There were also “taster” competitions on subjects such as layout (predicting the bounding box and label of each part of a person) and human action recognition (e.g. riding a bike, taking a photo, reading). The goal of the layout and action recognition tasters was to provide a richer description of people in images than just bounding box/segmentation information. Our experience in 2010 and 2011 with Mechanical Turk annotation for the classification and detection challenges was that it was hard to achieve the level of quality we require from this pipeline. The focus for VOC 2012 annotation was thus to increase the labelled data for the segmentation and action recognition challenges. The segmentation data is of a very high quality not available elsewhere, and it is very valuable to provide more data of this nature. The legacy of the VOC challenges is the freely available data, starting with VOC 2007. We also extended our evaluation server to display the top k submissions for each of the challenges (a leaderboard feature), so that the likely performance increases after 2012 can be viewed on the web (similar to that available for the Middlebury evaluations).

PAutomaC Competition: Learning Probabilistic Automata and Hidden Markov Models

Finite state automata (or machines) are well-known models for characterizing the behaviour of systems or processes. They have been used for several decades in computer and software engineering to model the complex behaviours of electronic circuits and of software such as communication protocols. They are equivalent to Hidden Markov Models, which are used in a number of applications. The state of the art in learning either of these types of machines from strings is unclear, as there has never been a challenge, or even a benchmark, over which learning algorithms have been compared. The goal of PAutomaC is to provide an overview of which probabilistic automaton learning techniques work best in which settings, and to stimulate the development of new techniques for learning distributions over strings. Such an overview will be very helpful for practitioners of automata learning and will provide directions for future theoretical work and algorithm development. PAutomaC provides the first elaborate test suite for learning string distributions. The task is of interest to:

  • Grammatical inference theoreticians wanting to find out how good their ideas and algorithms really are;
  • Pattern recognition practitioners who have developed finely tuned EM-inspired techniques to estimate the parameters of HMMs or related models;
  • Statistical modelling experts who have to deal with strings or sequences.
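To make the object of study concrete, here is a minimal sketch of a PAutomaC-style probabilistic finite automaton and the forward computation of a string's probability; the two-state machine and its probabilities are invented for illustration, not taken from the competition data:

```python
from collections import defaultdict

# A tiny probabilistic finite automaton (PFA) over the alphabet {'a', 'b'}.
init = {0: 1.0}                  # initial state distribution
final = {0: 0.2, 1: 0.5}         # probability of stopping in each state
# trans[state][symbol] = list of (next_state, probability);
# outgoing probabilities plus the final probability sum to 1 in each state.
trans = {
    0: {'a': [(0, 0.5)], 'b': [(1, 0.3)]},   # 0.5 + 0.3 + 0.2 = 1
    1: {'a': [(0, 0.4)], 'b': [(1, 0.1)]},   # 0.4 + 0.1 + 0.5 = 1
}

def string_prob(s):
    """Probability that the PFA generates exactly string s (forward algorithm)."""
    alpha = dict(init)
    for sym in s:
        nxt = defaultdict(float)
        for q, p in alpha.items():
            for q2, tp in trans[q].get(sym, []):
                nxt[q2] += p * tp
        alpha = nxt
    return sum(p * final.get(q, 0.0) for q, p in alpha.items())

print(string_prob(""))   # 0.2  (start in 0, stop immediately)
print(string_prob("a"))  # 0.5 * 0.2 = 0.1
print(string_prob("b"))  # 0.3 * 0.5 = 0.15
```

A learned model of this kind defines a full distribution over strings, which is exactly what PAutomaC submissions are scored on.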

MEG mind reading challenge

November 2010 – June 2011

The goal of this challenge was to decode a natural stimulus from short time periods extracted from a continuous MEG signal (measured specifically for the challenge and made freely available), with the problem formulated as a classification task. The data consisted of 204 MEG channels measured under stimulation with different types of movies (football, feature film, etc.), and the mind-reading task was to decode the type of movie for the test data. The results of the challenge were presented at the ICANN 2011 conference, themed ‘Machine learning re-inspired by brain and cognition’. On the modelling side, the challenge built on earlier PASCAL workshops on learning from multiple sources organized at NIPS 2008 and 2009, aiming to provide public data useful for developing such models. The challenge provided information on the feasibility of decoding natural stimuli from continuous MEG signal, which is a novel task. It also provided data for future evaluation of multi-source learning models, useful for machine learning researchers outside MEG analysis as well.

Unsupervised Grammar Induction

The challenge considers the problem of inducing a grammar directly from natural language text. The resulting grammars can then be used to discriminate between strings that are part of the language (i.e., are grammatically well formed) and those that are not. This has long been a fundamental problem in Computational Linguistics and Natural Language Processing, drawing on theoretical Computer Science and Machine Learning. The popularity of the task is driven by two different motivations. Firstly, it can help us to better understand the cognitive process of language acquisition in humans. Secondly, it can help with the portability of NLP applications into new domains and new languages: most NLP algorithms rely on syntactic parse structure created by supervised parsers, but training data in the form of treebanks exists only for a few languages and specific domains, limiting the portability of these algorithms. The challenge aims to foster continuing research in grammar induction, while also opening up the problem to more ambitious settings, including a wider variety of languages, removing the reliance on part-of-speech tags and, critically, providing a thorough evaluation. The data we provided was collated from existing treebanks in a variety of different languages, domains and linguistic formalisms. This gives a diverse range of data upon which to test grammar induction algorithms, yielding deeper insight into the accuracy and shortcomings of different algorithms. Where possible, we compiled multiple annotations for the same sentences so that the effect of the choice of linguistic formalism or annotation procedure can be offset in the evaluation. Overall this test set forms a significant resource for the evaluation of parsers and grammar induction algorithms, and helps to reduce the NLP field’s continuing reliance on the Penn Treebank.
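A standard way to score an induced grammar against a treebank is unlabelled bracketing F1 over constituent spans. A minimal sketch (the example spans are invented, and real evaluations add refinements such as discounting trivial brackets):

```python
def f1_brackets(gold, pred):
    """Unlabelled bracketing F1: spans are (start, end) constituent pairs."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # correctly predicted spans
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {(0, 5), (0, 2), (3, 5), (3, 4)}   # treebank constituents
pred = {(0, 5), (1, 2), (3, 5)}           # induced constituents
# tp = 2, precision = 2/3, recall = 1/2, so F1 = 4/7 ≈ 0.571
print(round(f1_brackets(gold, pred), 3))
```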

Large Scale Hierarchical Text Classification 2009, 2012

May 2009 – December 2009
In recent years the problem of hierarchical text classification has been addressed in the machine learning literature, but its handling at large scale (i.e. involving several thousand categories) remains an open research issue. Combined with the increasing demand for practical systems of this kind, there seems to be a need for a significant push in this research activity. This was our motivation for this PASCAL challenge, which aimed at assessing models, methods and tools for classification in very large, hierarchically organized category systems. We prepared large datasets for experimentation, based on the ODP Web directory, as well as baseline classifiers based on k-NN and logistic regression. We used two of these datasets for the challenge: a very large one (around 30,000 categories) and a smaller one (around 3,000 categories). The participants were given the chance to dry-run their classification methods on the smaller dataset. They were then asked to train their system using the training and validation parts of the larger set, and provide their classification results on the test part. A two-sided evaluation of the participating methods was used: one side measuring classification performance and the other computational performance. Work on this challenge resulted in a new EU project, “BioASQ: A challenge on large-scale biomedical semantic indexing and question answering”, which started on October 1, 2012.

Learning Propagation Models from Social Networks

The emergence of social networks and social media sites has motivated a large amount of recent research. Different generic tasks are currently studied, such as social network analysis, social network annotation, community detection and link prediction. One classical question concerns the temporal propagation of information through this new type of media: how information propagates over a network. Many recent works are directly inspired by the literature in epidemiology or the social sciences. These works mainly propose different propagation models – independent cascade models or linear threshold models – and analyze properties of these models, such as the epidemic threshold. Recently, instead of analyzing how information spreads, different articles have addressed the problem of predicting future propagation ([4, 5]). This is a key problem with many applications, such as buzz prediction – predicting whether a particular piece of content will become a buzz – or opinion leader detection – detecting whether a node in a network will spread content widely. This challenge analyzed and compared the quality of propagation prediction methods. The challenge was organized so as to facilitate the participation of any interested researcher, by providing simple tools and easy-to-use datasets. We anticipate that the produced material can become the first large benchmark for propagation models.
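As a sketch of the kind of model involved, the independent cascade model mentioned above can be simulated in a few lines; the toy graph and the uniform activation probability `p` are illustrative assumptions:

```python
import random

def independent_cascade(graph, seeds, p=0.3, rng=None):
    """One run of the independent cascade model: each newly activated node
    gets a single chance to activate each inactive neighbour, succeeding
    independently with probability p."""
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

# A small directed graph; averaging the spread over many runs estimates
# the influence of a seed set, the quantity prediction methods target.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
runs = [independent_cascade(graph, {0}, p=0.5, rng=random.Random(i))
        for i in range(1000)]
avg_spread = sum(len(a) for a in runs) / len(runs)
print(f"average spread from seed {{0}}: {avg_spread:.2f}")
```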

Gesture Recognition

The competition on gesture recognition was organised in collaboration with the DARPA Deep Learning program. This challenge was part of a series of challenges on the theme of unsupervised and transfer learning. The goal is to push the state of the art in algorithms capable of learning data representations which may be re-used from task to task, using unlabelled data and/or labelled data from similar domains. In this challenge, the competitors were given a large database of videos of American Sign Language performed by native signers and videos of International Sign Language performed by non-native signers, which we collected using Amazon Mechanical Turk. Entrants each developed a system that was tested in a live competition on gesture recognition. The test was carried out on a small but new sign language vocabulary. The platform of the challenge remains open after the end of the competition and all the datasets are freely available for research in the PASCAL2 repository.

Recognising Textual Entailment

The RTE challenges have run annually for several rounds to great success. The task consists of recognizing that the meaning of a textual statement, termed H (the hypothesis), can be inferred from the content of a given text, termed T (the text). Given a set of pairs of Ts and Hs as input, the systems must recognize whether each T entails the corresponding H, classifying whether:

  • T entails H,
  • T contradicts H, i.e. shows it to be false, or
  • the veracity of H is unknown on the basis of T.

A human-annotated development set is first released to allow investigation, tuning and training of systems, which are then evaluated on a gold-standard test set. In later rounds of the challenge, the given texts were made substantially longer, usually corresponding to a coherent portion of the document such as a paragraph or a group of closely related sentences. Texts come from a variety of unedited sources. Thus, systems are required to handle real text forms that may include typographical errors and ungrammatical sentences. A novel Entailment Search pilot task was also introduced.
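A classic (and deliberately naive) RTE baseline is word overlap: if most hypothesis words also appear in the text, predict entailment. The sketch below is purely illustrative; the threshold is an arbitrary assumption, and since the method cannot detect contradiction it falls back to the "unknown" class:

```python
def overlap_entailment(text, hypothesis, threshold=0.75):
    """Word-overlap baseline: the fraction of hypothesis tokens that
    appear in the text decides between ENTAILMENT and UNKNOWN."""
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(h & t) / len(h) if h else 0.0
    return "ENTAILMENT" if overlap >= threshold else "UNKNOWN"

print(overlap_entailment("the cat sat on the mat", "the cat sat"))     # ENTAILMENT
print(overlap_entailment("the cat sat on the mat", "the dog barked"))  # UNKNOWN
```

Baselines like this set the floor that the competing systems, with their syntactic and semantic machinery, had to beat.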

GRavitational lEnsing Accuracy Testing (GREAT08, GREAT10 & GREAT3)

Gravitational lensing is the process whereby light from distant galaxies is bent by intervening mass in the Universe as it travels towards us. This bending causes the shapes of galaxies to appear distorted. By measuring the properties and statistics of this distortion we can measure the properties of both dark matter and dark energy. For the vast majority of galaxies the effect of gravitational lensing is simply to apply a matrix distortion to the whole galaxy image: the shears g1 and g2 determine the amount of stretching along the coordinate axes and along the diagonals, respectively. Since galaxies are not circular, we cannot tell whether any individual galaxy has been gravitationally lensed. We must statistically combine the measured shapes of many galaxies, marginalising over the (poorly known) intrinsic galaxy shape distribution, to extract information on dark matter and dark energy.
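In one common convention (sign conventions for the shear vary between analyses, so this is a sketch rather than the challenge's exact definition), the lensed coordinates (x', y') relate to the unlensed ones (x, y) by a linear map:

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
= \begin{pmatrix} 1 - g_1 & -g_2 \\ -g_2 & 1 + g_1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
```

so that $g_1$ stretches and compresses along the coordinate axes while $g_2$ does so along the diagonals.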

The GREAT challenges focussed on this unresolved and crucial problem, which is of paramount importance for current and future cosmological observations. Solving this statistical inference problem would allow the cosmological community to answer some of the most important questions in physics by revealing the nature of dark energy with the highest possible precision. This could rule out Einstein’s cosmological constant as a candidate for the dark energy and inspire a new theory to replace Einstein’s gravity.

For the challenges a suite of several million images was provided for download from a server at UCL, with multiple mirrors at other institutions. Each image contained one galaxy or star (convolution kernel image) roughly in its centre, labelled as star or galaxy. The images were divided into sets: each set contained a small number of star images from which the convolution kernel for that set could be obtained, and every galaxy image in a set had the same shear (and convolution kernel) applied. Participants then submitted a shear estimate for each set of images. A key problem is that, as in real life, no model describing the shapes of the stars or galaxies was provided: these had to be inferred, simultaneously with the shear, from noisy, incomplete and pixelised data.

The challenges have at least two key aspects that go beyond applications of machine learning. Firstly the estimation is required to be extremely accurate, something that contrasts with more traditional estimation tasks. Secondly the sizes of the data sets are very large. Both of these features have made the challenges of great interest to current developments in machine learning.

BCI Competitions

A Brain-Computer Interface (BCI) is a novel augmentative communication system that translates human intentions – reflected in suitable brain signals – into a control signal for an output device such as a computer application or a neuroprosthesis. Developing a BCI system involves many fields of research, such as classification, signal processing, neurophysiology, measurement technology, psychology and control theory. In recent EEG-based BCI research the role of machine learning (a BCI approach pioneered by Fraunhofer FIRST at NIPS*01) has become more and more important. In the literature, many machine learning and pattern classification algorithms have been reported to give impressive results when applied to BCI data in offline analyses. However, it is more difficult to evaluate their relative value for actual online use. Typically a different dataset is used in each publication, so that – given the high inter-subject variability with respect to BCI performance – a comparison between different methods is practically impossible. Furthermore, the offline evaluation of EEG classification methods holds many possible pitfalls that can lead to an overestimation of performance. BCI data competitions have been organized to provide objective, formal evaluations of alternative methods and thereby to foster the development of improved BCI technology through unbiased validation of a variety of data analysis techniques.

Five Brain-Computer Interface (BCI) competitions have been held, all to great success. The first BCI competitions addressed basic problems of BCI research (most tasks posed the problem of classifying short-term windows of defined mental states), while later competitions addressed advanced problems involving time-continuous feedback, classifiers that needed to be applied to sliding windows, and the integration of different measurement sources for generating BCI control signals. More than 200 submissions from more than 50 different labs were received; an overview article has appeared (IEEE Trans Neural Sys Rehab Eng, 14(2):153-159, 2006), as well as a special volume of Lecture Notes in Computational Science in 2010. Furthermore, individual articles by the competition winners have appeared in various journals.

Experimental Design Challenge

Much of machine learning and data mining has so far concentrated on analyzing data already collected, rather than on collecting data. While experimental design is a well-developed discipline of statistics, data collection practitioners often neglect to apply its principled methods. As a result, the data collected and made available to data analysts, who are in charge of explaining them and building predictive models, are not always of good quality and are plagued by experimental artifacts. In reaction to this situation, some researchers in machine learning and data mining have become interested in experimental design so as to close the gap between data acquisition or experimentation and model building. This has given rise to the discipline of active learning. In parallel, researchers in causal studies have started raising awareness of the differences between passive observations, active sampling and interventions. In this domain, only interventions qualify as true experiments capable of unravelling cause-effect relationships.

In this challenge, which follows on from two very successful earlier challenges (“Causation and Prediction” and “Competition Pot-luck”) sponsored by PASCAL, we evaluated methods of experimental design that involve the data analyst in the process of data collection. From our perspective, to build good models we need good data. However, collecting good data comes at a price. Interventions are usually expensive to perform and sometimes unethical or impossible, while observational data are available in abundance at low cost. For instance, in policy-making, one may want to predict the effect on a population’s health of forbidding the use of cell phones while driving, before passing a law to that effect. This is an example of an experiment which is possible but expensive, particularly if it turns out to have no effect. Practitioners must identify strategies for collecting data that are cost-effective and feasible, resulting in the best possible models at the lowest possible price. Hence, both efficiency and efficacy were factors of evaluation in this new challenge. Evaluating experimental design methods requires performing actual experiments. Because of the difficulty of experimenting on real systems in the context of our challenge, experiments were carried out on realistic simulators of real systems, trained on real data or incorporating real data. The tasks were taken from a variety of domains, including medicine, pharmacology, manufacturing, plant biology, sociology and marketing. Typical examples of tasks include: evaluating the therapeutic potential or the toxicity of a drug, optimizing the throughput of an industrial manufacturing process, and assessing the potential impact of a promotion on sales.

The participants carried out virtual experiments by intervening on the system, e.g. by clamping variables to given values. We made use of our recently developed Virtual Laboratory. In this environment, the participants pay a price in virtual cash to perform a given experiment, hence they must optimize their design to reach their goal at the lowest possible cost. This challenge contributed to bringing to light new methods to integrate modeling and experimental design in an iterative process and new methods to combine the use of observational and experimental data in modeling.

Zulu: an interactive learning competition

The challenge addressed the issue of active learning: a server can be accessed via the web and queries can be made. The goal for the participant is to obtain the best classifier while making use of a fixed number of queries. The classifiers used are DFAs (deterministic finite automata), allowing participants to use Angluin’s L* algorithm as a starting point. Zulu is both a web-based platform simulating an Oracle in a DFA learning task and a competition. As a web platform, Zulu allows users to generate tasks, to interact with the Oracle in learning sessions and to record their results. It provides users with a baseline algorithm written in Java, as well as the elements needed to build from scratch a new learning algorithm capable of interacting with the server. In order to rank the contestants, a two-dimensional grid was used: one dimension concerns the size (in states) of the automata, the other the size of the alphabet.
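The heart of this setup is an oracle that answers membership queries against a hidden DFA under a fixed query budget. A minimal sketch (the `Oracle` class, its budget handling and the even-number-of-a's target language are illustrative assumptions, not Zulu's actual API):

```python
class Oracle:
    """Simulates a Zulu-style server: answers membership queries for a
    hidden DFA, up to a fixed query budget."""
    def __init__(self, dfa, budget):
        self.dfa, self.budget, self.used = dfa, budget, 0

    def member(self, word):
        if self.used >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.used += 1
        state = self.dfa["start"]          # run the hidden DFA on the word
        for sym in word:
            state = self.dfa["delta"][state][sym]
        return state in self.dfa["accept"]

# Hidden target: words over {a, b} with an even number of 'a's.
dfa = {"start": 0, "accept": {0},
       "delta": {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}}
oracle = Oracle(dfa, budget=100)
print(oracle.member("abab"))  # True  (two 'a's)
print(oracle.member("a"))     # False (one 'a')
```

An L*-style learner would drive such an oracle with membership queries (and approximate equivalence queries built from them) while staying within the budget.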

KDD cup 2009: Fast Scoring on Large Databases

This challenge used important marketing problems to benchmark classification methods in a setting typical of large-scale industrial applications. Three large databases made available by the French telecom company Orange were used, each with tens of thousands of examples and variables. These data are unique in that they have both a large number of examples and a large number of variables, making the problem particularly challenging for many state-of-the-art machine learning algorithms. The problems used to illustrate this technical difficulty were the marketing problems of churn, appetency and up-selling. Churn is the propensity of customers to switch between service providers, appetency is the propensity of customers to buy a service, and up-selling is the success in selling additional goods or services to make a sale more profitable. The challenge participants were given customer records, and their goal was to predict whether a customer would switch provider (churn), buy the main service (appetency) or buy additional extras (up-selling), hence solving three 2-class classification problems simultaneously. Large prizes (10,000 Euros) were donated by Orange to encourage participation. Winners were designated for gold, silver and bronze prizes, sharing the total amount.

Semi-supervised and Unsupervised Morpheme Analysis — Morpho Challenge

The objective of the challenges was to design a statistical machine learning algorithm that discovers the morphemes (smallest individually meaningful units of language) that comprise words. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. The scientific goals are:

  • To learn about the phenomena underlying word construction in natural languages;
  • To discover approaches suitable for a wide range of languages;
  • To advance machine learning methodology.

The Morpho Challenges ran successfully in 2005, 2007, 2008, 2009 and 2010. They aimed to advance the field of machine learning by providing a concrete application challenge for both semi-supervised and unsupervised algorithms whose objective is to learn to provide morphological analyses for words. The algorithms were evaluated in information retrieval (IR) and statistical machine translation (SMT) tasks, using state-of-the-art evaluation systems and corpora, to see which algorithm performs best and whether it improves on the state of the art. The Morpho Challenges have attracted significant interest, both in terms of participation and of citations of the evaluation results.

WCCI 2010 Active Learning Challenge

This challenge addresses machine learning problems in which labelling data is expensive, but large amounts of unlabelled data are available at low cost. Examples include:

  • handwriting and speech recognition;
  • document classification (including Internet web pages);
  • vision tasks;
  • drug design using recombinant molecules or protein engineering.

Such problems might be tackled from different angles: learning from unlabelled data, or active learning. In the former case, the algorithms must make do with the limited amount of labelled data and capitalize on the unlabelled data with semi-supervised learning methods; several past challenges have addressed this problem. In the latter case, the algorithms may place a limited number of queries to obtain new sample labels. The goal in that case is to optimize the queries, and the problem is referred to as active learning. In most past challenges we organized, the same datasets were used during the development and test periods. In this challenge we used two sets of datasets, one for development and one for the final test, drawn from embryology, cancer diagnosis, chemoinformatics, handwriting recognition, text ranking, ecology and marketing.
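The benefit of optimizing queries is easiest to see in one dimension: learning a threshold classifier by always querying the most uncertain point needs only logarithmically many labels. A minimal sketch (the function and its names are illustrative, not part of the challenge protocol):

```python
def active_learn_threshold(labels_fn, lo=0.0, hi=1.0, tol=1e-3):
    """Binary-search active learning of a 1-D threshold classifier:
    each query asks for the label of the most informative point,
    the midpoint of the current uncertainty interval."""
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        queries += 1
        if labels_fn(mid):   # label True means "at or above the threshold"
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, queries

true_threshold = 0.3217
est, n_queries = active_learn_threshold(lambda x: x >= true_threshold)
print(f"estimated threshold {est:.4f} with {n_queries} queries")
```

Ten queries locate the threshold to within 1e-3, where passively labelling random points would need on the order of a thousand labels for the same precision.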

Separating and recognising speech in natural environments

In this challenge we consider the problem of separating and recognising speech in the cluttered acoustic backgrounds that characterise everyday listening conditions. In 2005, PASCAL sponsored a highly successful ‘Speech Separation Challenge’, which addressed the problem of recognising overlapping speech in single- and multiple-microphone scenarios. Although the challenge attracted much interest and culminated in the publication of a dedicated special issue of Computer Speech and Language, the focus on overlapping speech encouraged special-case solutions that do not necessarily generalise to real application scenarios. Five years on, the second challenge, in PASCAL2, built on this work by extending the problem in ways that better model the demands of real noise-robust speech processing systems. In particular we considered a ‘speech-driven home automation’ application that needs to recognise spoken commands within the ongoing complex mixture of background sounds found in a typical domestic environment. The task was to identify the target commands being spoken, given the binaural mixtures. Data was supplied first as isolated utterances (as is traditional for speech recognition evaluations) and then, more realistically, as sequences of utterances mixed intermittently into extended background recording sessions.

Exploration and Exploitation in content optimisation

The two challenges held during PASCAL2 built on the success of the Exploration vs Exploitation challenge run in PASCAL1. That challenge considered the standard bandit problem, but with response rates changing over time. Despite its apparent simplicity, it inspired a range of very important developments, including the UCT (Upper Confidence Tree) algorithm and its successful application to computer Go in the award-winning MoGo system. The earlier challenge included a £1000 award to the winner. The later challenges built on the earlier one in two important respects. Firstly, they considered so-called multi-variate bandits, that is, bandits where each visitor/arm combination has associated features that are expected to enable more accurate prediction of the response probability for that combination. Secondly, the data was drawn from a real-world dataset of advertisement (banner) placement on webpages, with the response corresponding to click-through by the user. The multi-variate bandit problem represents an important stepping stone towards more complex problems involving delayed feedback, such as reinforcement learning. It involves a single state, but the additional features take it significantly closer to standard supervised learning compared with the simple bandits considered in the first challenge. The ability to respond accurately and to bound performance for such systems is an important step towards a key component that can be integrated into cognitive systems, one of the major goals of the PASCAL network.
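The upper-confidence-bound family of algorithms behind this line of work (UCB1 being the simplest member, with UCT as its tree-search extension) can be sketched in a few lines; the Bernoulli arms and the horizon below are illustrative assumptions, not the challenge's actual data:

```python
import math, random

def ucb1(success_prob, horizon=10000, rng=None):
    """UCB1 on Bernoulli arms: pull the arm maximising
    empirical mean + sqrt(2 ln t / n), trading off exploration
    (rarely pulled arms) against exploitation (high-mean arms)."""
    rng = rng or random.Random(0)
    k = len(success_prob)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                    # play each arm once first
        else:
            arm = max(range(k), key=lambda a:
                      sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts, sum(sums)

counts, total = ucb1([0.2, 0.5, 0.8])
print("pulls per arm:", counts, " total reward:", total)
```

Over a long horizon the algorithm concentrates its pulls on the best arm while still occasionally checking the others, which is exactly the exploration/exploitation trade-off the challenges probed.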

UAI 2010 Approximate Inference Evaluation

Probabilistic graphical models are a powerful tool for representing complex multivariate distributions. They have been used with considerable success in many fields, from machine vision and natural language processing to computational biology and channel coding. One of the key challenges in using such models in practice is that the inference problem is computationally intractable in many cases of interest. This has prompted much research on algorithms that approximate the inference task. Examples of such algorithms are loopy belief propagation, the mean field method and Gibbs sampling. Given the wide use of graphical models, it is of key importance to design algorithms that work well in practice. Empirical evaluation is key here, since one does not expect approximation algorithms to work well on every problem (due to the theoretical intractability of inference).

The challenge was held as part of the Uncertainty in Artificial Intelligence (UAI) conference. The challenge involved several inference tasks (finding the MAP assignment, computing the probability of evidence, calculating marginals, and learning models using approximate inference). Participants provided inference algorithms and these were applied to models from the following domains: machine vision (e.g., segmentation and object detection), computational biology (e.g., protein design and pedigree analysis), constraint satisfaction, medical diagnosis and collaborative filtering, as well as some synthetic problems whose graph structure appears in real-world problems (e.g., 2D and 3D grids). Evaluating the state of the art in the field of approximate inference helps guide research in the field. It highlights which methods are particularly promising in which domains. Additionally, since running time was carefully evaluated, it indicates which methods can perform well on very large-scale data.
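
As a concrete illustration of the kind of algorithm evaluated, here is a minimal Gibbs sampler estimating a single-node marginal on a toy three-node binary model, checked against exact enumeration. The model and its potentials are invented for illustration:

```python
import itertools
import math
import random

# A toy pairwise binary model over a 3-node chain 0-1-2, x_i in {0, 1}:
# p(x) ∝ exp(sum_i h[i] x_i + sum_{(i,j)} J x_i x_j).  Small enough to
# check the sampler against exact enumeration.
h = [0.5, -0.2, 0.3]
J = 0.8
edges = [(0, 1), (1, 2)]

def log_weight(x):
    return sum(h[i] * x[i] for i in range(3)) + sum(J * x[i] * x[j] for i, j in edges)

def exact_marginal(node):
    num = den = 0.0
    for x in itertools.product([0, 1], repeat=3):
        w = math.exp(log_weight(x))
        den += w
        num += w * x[node]
    return num / den

def gibbs_marginal(node, sweeps=20000, seed=1):
    rng = random.Random(seed)
    x = [0, 0, 0]
    hits = 0
    for _ in range(sweeps):
        for i in range(3):
            # Conditional of x_i given its neighbours: sigmoid of the local field.
            field = h[i] + sum(J * x[b] for a, b in edges if a == i) \
                         + sum(J * x[a] for a, b in edges if b == i)
            x[i] = int(rng.random() < 1.0 / (1.0 + math.exp(-field)))
        hits += x[node]
    return hits / sweeps

approx, exact = gibbs_marginal(0), exact_marginal(0)
# The Gibbs estimate should closely match the exact marginal on this toy model.
```

On real challenge instances exact enumeration is of course impossible; that intractability is precisely why empirical evaluation of approximations matters.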

Unsupervised and Transfer Learning Challenge

This challenge addressed a question of fundamental and practical interest in machine learning: the assessment of data representations produced by unsupervised and transfer learning procedures. By unsupervised learning we mean learning strictly from unlabelled data. In contrast, transfer learning is concerned with learning from labelled data from tasks that are related but different from the task at hand. For instance, the task to be performed may be recognizing alphabetical letters and the training task may be recognizing digits. Several large datasets from various application domains were made available for the evaluation. The task of the participants was to produce a data representation on an evaluation dataset, given both a very large unlabelled development set and an unlabelled evaluation set.

For clarity of the scientific evaluation, a first phase of the challenge focussed strictly on unsupervised learning. It was then followed by a second phase on transfer learning, in which a few target values (labels) for tasks other than the evaluation task (training tasks) were provided on development data.

We used data from five different domains (handwriting recognition, object recognition from still images, action recognition in videos, text processing, sensor data) and required that participants make entries on all datasets, to demonstrate the versatility of the methods employed. The datasets were selected to meet a number of criteria: (1) being of medium difficulty, to provide a good separation of results obtained by different approaches; (2) having over 10,000 unlabelled examples; (3) having over 10,000 labelled examples, with more than 10 classes and a minimum of 100 examples per class.

We believe this challenge has helped advance methodology for evaluating unsupervised learning algorithms and channel research effort into the important new problem of transfer learning. Every year, dozens of papers on unsupervised space transformations, dimensionality reduction and clustering are published. Yet practitioners tend to ignore them and continue using a handful of popular algorithms like PCA, ICA, k-means and hierarchical clustering. An evaluation free of inventor bias may help identify and popularise algorithms that have advanced the state of the art. Another aspect of this challenge was to promote research on deep machine learning architectures, which use hierarchical feature representations.
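
One common scoring protocol for this kind of challenge can be sketched as follows: fit an unsupervised transform (here plain PCA) without using labels, then score the resulting representation with a very simple supervised classifier on top. All the data, dimensions and numbers below are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "development" data: two classes separated along one raw dimension,
# embedded in 20 noisy dimensions.
n, d = 500, 20
labels = rng.integers(0, 2, size=n)
signal = (4.0 * labels - 2.0)[:, None]            # class means at ±2
X = np.hstack([signal + 0.3 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, d - 1))])

def pca_representation(X, k):
    """Unsupervised step: project onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def nearest_centroid_accuracy(Z, y):
    """Supervised scoring step: classify each point by the nearer class centroid."""
    c0, c1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    pred = np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)
    return float((pred == y).mean())

Z = pca_representation(X, k=2)
acc = nearest_centroid_accuracy(Z, labels)
# A representation that captures the class direction scores far above chance.
```

The point of fixing a deliberately weak classifier on top is that differences in the score then reflect the quality of the learned representation, not of the classifier.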

Classifying Heart Audio Challenge

According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally: more people die annually from CVDs than from any other cause. An estimated 17.1 million people died from CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to coronary heart disease. Any method which can help to detect signs of heart disease could therefore have a significant impact on world health. This challenge was to produce methods to do exactly that.

The purpose of this challenge was to attempt to automate the expertise of cardiologists using machine learning, by classifying heart sounds into groups corresponding to specific medical conditions. The challenge was also to help us learn more about the distribution of heart sounds and how they might be most effectively clustered. The challenge exploited machine learning methods for a real world application involving data gathered by the general public and emailed via smart phones to a central server, and also data gathered by physicians using a digital stethoscope. The aim was to attempt to duplicate some of the cognitive processes of a trained cardiologist during the diagnosis process, when listening to heart sounds. This cognitive automation investigates the question: can an algorithm classify heart sounds into different categories of disease without the additional domain knowledge available to the trained cardiologist? The challenge results were presented at a workshop held at AISTATS 2012, the winner receiving an iPad. Since then the data gathered for the challenge has been used several times by other researchers.

GREAT08 Challenge

1 March – 30 December 2008

The GRavitational lEnsing Accuracy Testing 2008 (GREAT08) Challenge focuses on a problem that is of crucial importance for future observations in cosmology. The shapes of distant galaxies can be used to determine the properties of dark energy and the nature of gravity, because light from those galaxies is bent by gravity from the intervening dark matter. The observed galaxy images appear distorted, although only slightly, and their shapes must be precisely disentangled from the effects of pixelisation, convolution and noise.

Causality Challenge #1: Causation and Prediction Challenge

15 December 2007 – 30 April 2008

The focus of this challenge is on predicting the results of actions performed by an external agent. Examples of that problem are found, for instance, in the medical domain, where one needs to predict the effect of a drug prior to administering it, or in econometrics, where one needs to predict the effect of a new policy prior to issuing it. We focus on a given target variable to be predicted (e.g. health status of a patient) from a number of candidate predictive variables (e.g. risk factors in the medical domain). Under the actions of an external agent, variable predictive power and causality are tied together. For instance, both smoking and coughing may be predictive of lung cancer (the target) in the absence of external intervention; however, prohibiting smoking (a possible cause) may prevent lung cancer, but administering a cough medicine to stop coughing (a possible consequence) would not.
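
The smoking/coughing example can be made concrete with a toy structural model in which smoking causes lung cancer and lung cancer causes coughing. All the probabilities below are invented for illustration; the point is only that intervening on a cause changes the target, while intervening on a consequence does not:

```python
import random

def cancer_rate(n=100000, ban_smoking=False, cough_medicine=False, seed=0):
    """Toy structural model smoking -> lung cancer -> coughing."""
    rng = random.Random(seed)
    cancers = 0
    for _ in range(n):
        u1, u2, u3 = rng.random(), rng.random(), rng.random()
        smokes = (not ban_smoking) and u1 < 0.3
        cancer = u2 < (0.15 if smokes else 0.02)
        # Coughing is a downstream symptom: suppressing it cannot affect cancer.
        cough = (not cough_medicine) and cancer and u3 < 0.8
        cancers += cancer
    return cancers / n

baseline = cancer_rate()
after_ban = cancer_rate(ban_smoking=True)          # intervene on the cause
after_medicine = cancer_rate(cough_medicine=True)  # intervene on the symptom
# Banning smoking lowers the cancer rate; suppressing coughing leaves it unchanged.
```

Both smoking and coughing are correlated with cancer in observational data from this model, yet only one of the two interventions changes the target, which is exactly the distinction the challenge asks systems to predict.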

Human-machine comparisons of consonant recognition in noise challenge

1 December 2007 – 31 March 2008

Listeners outperform automatic speech recognition systems at every level of speech recognition, including the very basic level of consonant recognition. What is not clear is where the human advantage originates. Does the fault lie in the acoustic representations of speech or in the recogniser architecture, or in a lack of compatibility between the two? There have been relatively few studies comparing human and automatic speech recognition on the same task, and, of these, overall identification performance is the dominant metric. However, there are many insights which might be gained by carrying out a far more detailed comparison. The purpose of this challenge is to promote focused human-computer comparisons on a task involving consonant identification in noise, with all participants using the same training and test data. Training and test data and native listener and baseline recogniser results will be provided by the organisers, but participants are encouraged to also contribute listener responses.

Graph Labelling Challenge: Application to Web Spam Detection Challenge

1 January 2007 – 1 January 2008

The goal of the Web Spam Challenge series is to identify and compare Machine Learning (ML) methods for automatically labeling structured data represented as graphs. More precisely, we focus on the problem of labeling all nodes of a graph from a partial labeling of them. The application we study is Web Spam Detection, where we want to detect deliberate actions of deception aimed at the ranking functions used by search engines.
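
A simple baseline for this kind of node-labelling problem is label propagation: clamp the labelled nodes and let every other node repeatedly average its neighbours' scores. A sketch on a hypothetical five-page web graph:

```python
def propagate_labels(edges, seeds, iterations=50):
    """Clamp labelled seed nodes; every other node repeatedly takes the mean
    score of its neighbours (1.0 = spam, 0.0 = ham, 0.5 = unknown)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    scores = {node: seeds.get(node, 0.5) for node in adj}
    for _ in range(iterations):
        scores = {node: seeds[node] if node in seeds
                  else sum(scores[m] for m in nbrs) / len(nbrs)
                  for node, nbrs in adj.items()}
    return scores

# Hypothetical web graph: page "a" is labelled spam, page "e" is labelled ham.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
scores = propagate_labels(edges, {"a": 1.0, "e": 0.0})
# Pages closer to the spam seed receive higher spam scores.
```

This exploits the observation that spam pages tend to link to other spam pages, so the partial labelling spreads usefully along the link structure.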

Visual Object Classes Challenge 2007

1 January – 31 October 2007

The goal of this challenge is to recognise objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

  • Person: person
  • Animal: bird, cat, cow, dog, horse, sheep
  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

Unsupervised Morpheme Analysis — Morpho Challenge 2007

1 October 2006 – 21 September 2007

The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

  • To learn of the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology

Agnostic Learning vs. Prior Knowledge Challenge

1 October 2006 – 1 August 2007

“When everything fails, ask for additional domain knowledge” is the current motto of machine learning. Assessing the real added value of prior/domain knowledge is therefore a question both deep and practical. Most commercial data mining programs accept data pre-formatted as a table, each example being encoded as a fixed set of features. Is it worth spending time engineering elaborate features incorporating domain knowledge and/or designing ad hoc algorithms? Or can off-the-shelf programs, working on simple features encoding the raw data without much domain knowledge, put skilled data analysts out of business? In this challenge, the participants are allowed to compete in two tracks:

  • The “prior knowledge” track, for which they will have access to the original raw data representation and as much knowledge as possible about the data.
  • The “agnostic learning” track for which they will be forced to use a data representation encoding the raw data with dummy features.

Third Recognising Textual Entailment Challenge

1 December 2006 – 1 June 2007

RTE 3 follows the same basic structure of the previous campaign, in order to facilitate the participation of newcomers and to allow “veterans” to assess the improvements of their systems in a comparable test exercise. Nevertheless, the following innovations are introduced to make the challenge more stimulating and, at the same time, to encourage collaboration between system developers:

  • a limited number of longer texts, i.e. up to a paragraph, in order to move toward more comprehensive scenarios which incorporate the need for discourse analysis. However, the majority of examples will remain similar to those in the previous challenges, providing pairs with relatively short texts.
  • an RTE Resource Pool has been created where contributors have the possibility to share the resources they use.
  • an optional task, “Extending the Evaluation of Inferences from Texts”, which explores two other tasks closely related to textual entailment: differentiating unknown from false/contradicts and providing justifications for answers.

Learning when Test and Training Inputs have Different Distributions Challenge

1 June 2005 – 30 April 2007

The goal of this challenge is to attract the attention of the Machine Learning community to the problem where the input distributions, p(x), are different for test and training inputs. A number of regression and classification tasks are proposed, where the test inputs follow a different distribution than the training inputs. Training data (input-output pairs) are given, and the contestants are asked to predict the outputs associated with a set of validation and test inputs. Probabilistic predictions are strongly encouraged, though non-probabilistic “point” predictions are also accepted. The performance of the competing algorithms will be evaluated both with traditional losses that only take into account “point predictions” and with losses that evaluate the quality of the probabilistic predictions.
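
When the two input densities are known (or can be estimated), a standard correction for this covariate shift is importance weighting: each training point is reweighted by p_test(x)/p_train(x). A sketch on a one-dimensional toy problem with known Gaussian densities; all numbers are illustrative:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def shifted_mean_of_square(n=20000, seed=0):
    """Estimate E_test[x^2] using only training samples, by reweighting each
    training point with p_test(x) / p_train(x).  Densities are assumed known:
    p_train = N(0, 1), p_test = N(1, 1), so the true value is 1 + 1^2 = 2."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # draw from p_train
        w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)
        num += w * x * x
        den += w
    return num / den                                 # self-normalised estimate

estimate = shifted_mean_of_square()
# Should land close to the true test-distribution value of 2.0.
```

An unweighted average of x² over the training sample would converge to 1 rather than 2, which is exactly the bias the challenge tasks are designed to expose.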

Computer-Assisted Stemmatology Challenge

6 October 2006 – 14 April 2007

Stemmatology (a.k.a. stemmatics) studies relations among different variants of a document that have been gradually built from an original by copying and modifying earlier versions. The aim of such study is to reconstruct the family tree of the variants. We invite applications of established and, in particular, novel approaches, including but of course not restricted to hierarchical clustering, graphical modeling, link analysis, phylogenetics, string-matching, etc. The objective of the challenge is to evaluate the performance of various approaches. Several sets of variants for different texts are provided, and the participants should attempt to reconstruct the relationships of the variants in each data-set. This enables the comparison of methods usually applied in unsupervised scenarios.

Type I and Type II Errors for Multiple Simultaneous Hypothesis Testing Challenge

1 January 2006 – 1 February 2007

Multiple Simultaneous Hypothesis Testing is a main issue in many areas of information extraction:

  • rule extraction,
  • validation of gene influence,
  • validation of spatio-temporal pattern extraction (e.g. in brain imaging),
  • other forms of spatial or temporal data (e.g. spatial collocation rules),
  • other multiple hypothesis testing.

In all of the above frameworks, the goal is to extract patterns such that some quantity of interest is significantly greater than a given threshold.
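
A standard tool for this setting is the Benjamini-Hochberg procedure, which controls the false discovery rate across many simultaneous tests. A minimal sketch; the p-values below are made up:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    # Find the largest k with p_(k) <= (k/m) * alpha, then reject the
    # k hypotheses with the smallest p-values.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Ten made-up p-values: only the first two survive the FDR threshold.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
rejected = benjamini_hochberg(p)
```

Unlike a plain per-test threshold, which would accept five of these at 0.05, the procedure adapts the cut-off to the number of tests performed, trading Type I against Type II errors across the whole family.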

Letter-to-Phoneme Conversion Challenge

1 February 2006 – 31 January 2007

Letter-to-phoneme conversion is a classic problem in machine learning (ML), as it is both hard (at least for languages like English and French) and important. For non-linguists, a ‘phoneme’ is an abstract unit corresponding to the equivalence class of physical sounds that ‘represent’ the same speech sound. That is, members of the equivalence class are perceived by a speaker of the language as the ‘same’ phonemes: the word ‘cat’ consists of three phonemes, two of which are shared with the word ‘bat’. A phoneme is defined by its role in distinguishing word pairs like ‘bat’ and ‘cat’. Thus, /b/ and /k/ are different phonemes. But the /b/ in ‘bat’ and the /b/ in ‘tab’ are the same phoneme, in spite of their different acoustic realisations, because the difference between them is never used (in English) to signal a difference between minimally-distinctive word-pairs. Although we intend to give most prominence to letter-to-phoneme conversion, the community is challenged to develop and submit innovative solutions to these related problems.

Exploration vs. Exploitation Challenge

19 January – 15 November 2006

Touch Clarity provides real-time optimisation of websites, choosing from a number of options the most popular content to display on a page. This decision is made by tracking how many visitors respond to each of the options by clicking on them. This is a direct commercial application of the multi-armed bandit problem: each of the items which might be shown is a separate bandit arm, with a separate response rate. As in the multi-armed bandit problem, there is a trade-off between exploration and exploitation: it is necessary to sometimes serve items other than the most popular in order to measure their response rates with sufficient precision to correctly identify which is the most popular. However, in this application there is a further complication: typically the rates of response to each item will vary over time, so continuous exploration is necessary in order to track this variation as old knowledge becomes out of date. An extreme example might be choosing which news story to serve as the main story on a news page: interest in one story will decrease over time while interest in another increases. In addition, the interest in several stories might vary in a similar, coherent way, for example a general increase in interest in sports stories at weekends, or in political stories near an election. So there are typically two types of variation to consider: where response rates vary together, and where response rates vary completely independently.
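
One way to handle such time variation is to forget old evidence, for example by exponentially discounting the counts used by an upper-confidence policy. The sketch below is one plausible variant, not Touch Clarity's actual system; the response-rate schedule and discount factor are invented:

```python
import math
import random

def discounted_ucb(rate_schedule, gamma=0.98, seed=0):
    """Upper-confidence bandit with exponential forgetting: old observations
    are discounted so the policy keeps tracking whichever arm is currently best."""
    rng = random.Random(seed)
    n_arms = len(rate_schedule[0])
    counts = [1e-6] * n_arms          # discounted pull counts
    sums = [0.0] * n_arms             # discounted reward sums
    total = 0.0
    for t, rates in enumerate(rate_schedule, start=1):
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
               for a in range(n_arms)]
        a = max(range(n_arms), key=ucb.__getitem__)
        r = float(rng.random() < rates[a])
        counts = [c * gamma for c in counts]   # forget a little of everything
        sums = [s * gamma for s in sums]
        counts[a] += 1
        sums[a] += r
        total += r
    return total

# Two arms swap popularity halfway through (an invented schedule).
schedule = [[0.8, 0.2]] * 1000 + [[0.2, 0.8]] * 1000
total = discounted_ucb(schedule)
# A tracking policy should earn clearly more than the ~1000 expected reward
# of sticking with either single arm for the whole run.
```

The discount factor sets the effective memory (here roughly 1/(1-γ) = 50 recent plays), which is the lever that trades tracking speed against estimation noise.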

Visual Object Classes Challenge

1 January – 30 June 2006

The Visual Object Classes Challenge has the following objectives:

  • To compile a standardised collection of object recognition databases
  • To provide standardised ground truth object annotations across all databases
  • To provide a common set of tools for accessing and managing the database annotations
  • To run a challenge evaluating performance on object class recognition

Unsupervised Segmentation of Words into Morphemes Challenge

1 September 2005 – 12 April 2006

The objective of the Challenge is to design a statistical machine learning algorithm that segments words into the smallest meaning-bearing units of language, morphemes. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. The scientific goals are:

  • To learn of the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology

Second Recognising Textual Entailment Challenge

1 October 2005 – 10 April 2006

Textual Entailment Recognition has been proposed recently as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE) and (multi-)document summarisation. The task requires a system to recognise, given two text fragments, whether the meaning of one text is entailed by (can be inferred from) the other. By introducing a second challenge we hope to keep the momentum going, and to further promote the formation of a research community around the applied entailment task. As in the previous challenge, the main task is judging whether a hypothesis (H) is entailed by a text (T). One of the main goals for the RTE-2 dataset is to provide more “realistic” text-hypothesis examples, based mostly on outputs of actual systems. We focus on the four application settings mentioned above: QA, IR, IE and multi-document summarisation. Each portion of the dataset includes typical T-H examples that correspond to success and failure cases of such applications. The examples represent different levels of entailment reasoning, such as lexical, syntactic, morphological and logical.

XML Challenge

30 July 2005 – 1 April 2006

The objective of the challenge is to develop machine learning methods for structured data mining and to evaluate these methods on XML document mining tasks. The challenge is focused on classification and clustering for XML documents. Datasets coming from different XML collections and covering a variety of classification and clustering situations will be provided to the participants. One goal of this track is to build reference categorisation/clustering corpora of XML documents. The organisers are open to any suggestion concerning the construction of such corpora.

Performance Prediction Challenge

1 October 2005 – 1 March 2006

This project is dedicated to stimulating research and revealing the state of the art in “model selection” by organising a competition followed by a workshop. Model selection is a problem in statistics, machine learning and data mining. Given training data consisting of input-output pairs, a model is built to predict the output from the input, usually by fitting adjustable parameters. Many predictive models have been proposed to perform such tasks, including linear models, neural networks, trees and kernel methods. Finding methods to optimally select models that will perform best on new test data is the object of this project. The competition will help identify accurate methods of model assessment, which may include variants of the well-known cross-validation methods and novel techniques based on learning-theoretic performance bounds. Such methods are of great practical importance in pilot studies, for which it is essential to know precisely how well desired specifications are met.
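
The best known of these model assessment methods is k-fold cross-validation. A sketch of using it to select the degree of a polynomial regression model, on synthetic data whose true degree is 2:

```python
import numpy as np

def cv_mse(x, y, degree, k=5):
    """k-fold cross-validated mean squared error of a polynomial fit."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # hold out one fold at a time
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        errs.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.standard_normal(60)  # true degree: 2

scores = {d: cv_mse(x, y, d) for d in range(1, 9)}
best = min(scores, key=scores.get)
# CV should reject degree 1 (underfit) and not favour the highest degrees (overfit).
```

Training error alone would decrease monotonically with degree and always pick the most complex model; the held-out folds are what make the selection honest.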

Inferring Relevance from Eye Movements Challenge

1 March – 1 September 2005

In the Challenge we have an experimental setup where the test subject is first shown a question, followed by ten sentences. Five of the sentences are “relevant” to the question (they are on the same topic as the question) and five are irrelevant (they have no relation to the topic of the question). One of the relevant sentences is the correct answer to the question. The experimental setup is designed to resemble a real-life information retrieval scenario as closely as possible while at the same time retaining a controlled setup where the ground truth is known. Thus, in the Challenge the meaning of “relevant” is defined in terms of this experimental setup. The objective of the Challenge is to find the best methods and features for predicting relevance from the eye movement measurements.

BCI Competition III Challenge

12 December 2004 – 22 May 2005

The goal of the “BCI Competition III” is to validate signal processing and classification methods for Brain-Computer Interfaces (BCIs). Compared to the past BCI Competitions, new challenging problems are addressed that are highly relevant for practical BCI systems, such as:

  • session-to-session transfer (data set I),
  • small training sets, maybe to be solved by subject-to-subject transfer (data set IVa),
  • non-stationarity problems (data set IIIb, data set IVc),
  • multi-class problems (data set IIIa, data set V, data set II),
  • classification of continuous EEG without trial structure (data set IVb, data set V).

This BCI Competition also includes, for the first time, ECoG data (data set I), and one data set for which preprocessed features are provided (data set V), for competitors who would like to focus on the classification task rather than dive into the depths of EEG analysis.

PASCAL Ontology Learning Challenge

1 November 2004 – 30 April 2005

The aim of this challenge is to encourage work on automated construction and population of ontologies. For the purposes of this challenge, an ontology consists of a set of concepts and a set of instances. An instance can be assigned to one or more concepts. The concepts are connected into a hierarchy. Several types of tasks are included in this challenge:

  • Ontology construction: given a set of documents, construct an ontology with these documents as instances.
  • Ontology extension: given a partial ontology and a set of instances, extend the ontology with new concepts using the given instances.
  • Ontology population: given a partially populated hierarchy of concepts, develop a model that can assign new instances to concepts.
  • Concept naming: given a set of instances and the assignment of instances to concepts, suggest user-friendly labels for the concepts.

Evaluation is based on comparing the results to a “gold standard” ontology prepared by human editors.

Recognising Textual Entailment Challenge

1 June 2004 – 10 April 2005

Recent years have seen a surge in research of text processing applications that perform semantic-oriented inference about concrete text meanings and their relationships. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application oriented manner. Consequently it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL Challenge introduces textual entailment as a common task and evaluation framework for Natural Language Processing, Information Retrieval and Machine Learning researchers, covering a broad range of semantic-oriented inferences needed for practical applications. This task is therefore suitable for evaluating and comparing semantic-oriented models in a generic manner. Eventually, work on textual entailment may promote the development of generic semantic “engines”, which will play an analogous role to that of generic syntactic analyzers across multiple applications.

101 Visual Object Classes Challenge

1 September 2004 – 31 March 2005

The goal of this challenge is to recognise objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images will be provided. The four object classes that have been selected are:

  • motorbikes
  • bicycles
  • people
  • cars

There will be two main competitions:

  • For each of the 4 classes, predicting presence/absence of an example of that class in the test image.
  • Predicting the bounding box and label of each object from the 4 target classes in the test image.

Contestants may enter either (or both) of these competitions, and can choose to tackle any (or all) of the four object classes. The challenge allows for two approaches to each of the competitions:

  • Contestants may use systems built or trained using any methods or data excluding the provided test sets.
  • Systems are to be built or trained using only the provided training data.

The intention in the first case is to establish just what level of success can currently be achieved on these problems and by what method; in the second case the intention is to establish which method is most successful given a specified training set.

Assessing ML methodologies to Extract Implicit relations from documents Challenge

1 June 2004 – 28 February 2005

The goal of the proposed challenge is to assess the current situation concerning Machine Learning (ML) algorithms for Information Extraction (IE) from documents, to identify future challenges and to foster additional research in the field. The aim is to:

  • Define a methodology for fair comparison of ML algorithms for IE.
  • Define a publicly available resource for evaluation that will exist and be used beyond the lifetime of the challenge; such a framework will be ML-oriented, rather than IE-oriented as in other similar evaluations to date.
  • Perform actual tests of different algorithms in controlled situations so as to understand what works and what does not, and thereby identify new future challenges.

Large Hybrid Networks Challenge

1 July – 31 December 2004

Efficient approximate inference in large Hybrid Networks (graphical models with discrete and continuous variables) is one of the major unsolved problems in machine learning, and insight into good solutions would be beneficial in advancing the application of sophisticated machine learning to a wide range of real-world problems. Such research would potentially benefit applications in speech recognition, visual object tracking and machine vision, robotics, music scene analysis, analysis of complex time series, understanding and modelling complex computer networks, condition monitoring, and other complex phenomena. This theory challenge specifically addresses a central component area of PASCAL, namely Bayesian statistics and statistical modelling, and is also related to the other central areas of computational learning, statistical physics and optimisation techniques. One aim of this challenge is to bring together leading researchers in graphical models and related areas to develop and improve on existing methods for tackling the fundamental intractability of HNs. We do not believe that a single best approach will necessarily emerge, although we would expect successes in one application area to be transferable to related areas. Many leading machine learning researchers are currently working on applications that involve HNs, and we invite participants to suggest their own applications, ideally in the form of a dataset along the lines of other PASCAL challenges.

Evaluating Predictive Uncertainty Challenge

1 September – 12 December 2004

The goal of this challenge is to evaluate probabilistic methods for regression and classification problems. A number of regression and classification tasks are proposed. Training data (input-output pairs) are given, and the contestants are asked to predict the outputs associated with a set of validation and test inputs. These predictions are probabilistic and take the form of predictive distributions. The performance of the competing algorithms will be evaluated both with traditional losses that only take into account “point predictions” and with losses that evaluate the quality of the probabilistic predictions.
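
A common loss for scoring predictive distributions rather than point predictions is the negative log predictive density (NLPD). A minimal sketch under a Gaussian predictive distribution; the two predictors below are hypothetical:

```python
import math

def nlpd(y, mu, sigma):
    """Negative log predictive density of observation y under N(mu, sigma^2).
    Penalises bad point predictions AND badly calibrated uncertainty."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# Two hypothetical predictors give the same point prediction (mu = 0) but claim
# different uncertainties; the observation lands at y = 2.
overconfident = nlpd(2.0, mu=0.0, sigma=0.1)
well_calibrated = nlpd(2.0, mu=0.0, sigma=2.0)
# Squared error cannot distinguish them; NLPD heavily penalises overconfidence.
```

This is exactly the distinction between "point prediction" losses and probabilistic losses referred to above: the two predictors tie on squared error but differ sharply once the claimed uncertainty is scored.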