This challenge addressed a question of fundamental and practical interest in machine learning: the assessment of data representations produced by unsupervised and transfer learning procedures. By unsupervised learning we mean learning strictly from unlabelled data. In contrast, transfer learning is concerned with learning from labelled data for tasks that are related to, but different from, the task at hand. For instance, the task to be performed may be recognizing alphabetical letters while the training task is recognizing digits. Several large datasets from various application domains were made available for the evaluation. The task of the participants was to produce a data representation for an evaluation dataset, given both a very large unlabelled development set and the unlabelled evaluation set itself.
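
As a minimal sketch of this protocol (with placeholder arrays standing in for the challenge data, and PCA standing in for whatever unsupervised learner a participant might choose), a transformation is fit on the development set and then applied to the evaluation set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical placeholders for the challenge's development and evaluation sets.
rng = np.random.default_rng(0)
X_devel = rng.normal(size=(10000, 100))   # large unlabelled development set
X_eval = rng.normal(size=(2000, 100))     # unlabelled evaluation set

# Fit an unsupervised transformation on development data only,
# then map the evaluation set into the learned representation.
pca = PCA(n_components=20).fit(X_devel)
representation = pca.transform(X_eval)    # the kind of object participants submit
```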

For clarity of the scientific evaluation, a first phase of the challenge focussed strictly on unsupervised learning. It was followed by a second phase on transfer learning, in which a few target values (labels) for tasks other than the evaluation task (the training tasks) were provided on development data (details in the PDF file attached).

We used data from five different domains (handwriting recognition, object recognition from still images, action recognition in videos, text processing, sensor data) and required participants to make entries on all datasets to demonstrate the versatility of the methods employed. The datasets were selected according to a number of criteria: (1) being of medium difficulty, to provide a good separation of results obtained by different approaches; (2) having over 10,000 unlabeled examples; (3) having over 10,000 labeled examples, with more than 10 classes and a minimum of 100 examples per class.

We believe this challenge has helped advance methodology for evaluating unsupervised learning algorithms and channel research effort toward the important new problem of transfer learning. Every year, dozens of papers on unsupervised space transformations, dimensionality reduction and clustering get published. Yet practitioners tend to ignore them and continue using a handful of popular algorithms like PCA, ICA, k-means, and hierarchical clustering. An evaluation free of inventor bias might help identify and popularize algorithms that have advanced the state of the art. Another aspect of this challenge was to promote research on deep machine learning architectures, which use hierarchical feature representations.

According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally: more people die annually from CVDs than from any other cause. An estimated 17.1 million people died from CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to coronary heart disease. Any method that can help detect signs of heart disease could therefore have a significant impact on world health. The aim of this challenge was to produce methods that do exactly that.

The purpose of this challenge was to attempt to automate the expertise of cardiologists using machine learning, by classifying heart sounds into groups corresponding to specific medical conditions. The challenge was also intended to help us learn more about the distribution of heart sounds and how they might be most effectively clustered. It exploited machine learning methods for a real-world application involving data gathered by the general public and emailed via smartphones to a central server, as well as data gathered by physicians using a digital stethoscope. The aim was to duplicate some of the cognitive processes of a trained cardiologist when listening to heart sounds during diagnosis. This cognitive automation investigates the question: can an algorithm classify heart sounds into different categories of disease without the additional domain knowledge available to the trained cardiologist? The challenge results were presented at a workshop held at AISTATS 2012, with the winner receiving an iPad. Since then, the data gathered for the challenge have been used several times by other researchers.
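
A minimal sketch of the kind of pipeline this invites, assuming hypothetical audio files and labels (entries to the actual challenge used a wide range of segmentation and feature schemes):

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def mfcc_features(path):
    """Summarise a heart-sound recording as its mean MFCC vector."""
    audio, sr = librosa.load(path, sr=4000)           # heart sounds are low-frequency
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)                          # one fixed-length vector per file

# Hypothetical file lists and labels (e.g. "normal", "murmur", "extra sound").
train_files = ["a.wav", "b.wav"]
train_labels = ["normal", "murmur"]

X = np.vstack([mfcc_features(f) for f in train_files])
clf = RandomForestClassifier(n_estimators=200).fit(X, train_labels)
```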

The GRavitational lEnsing Accuracy Testing 2008 (GREAT08) Challenge focuses on a problem that is of crucial importance for future observations in cosmology. The shapes of distant galaxies can be used to determine the properties of dark energy and the nature of gravity, because light from those galaxies is bent by gravity from the intervening dark matter. The observed galaxy images appear distorted, although only slightly, and their shapes must be precisely disentangled from the effects of pixelisation, convolution and noise.
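
The measurement problem can be illustrated with a toy forward model (all parameter values illustrative): a slightly elliptical galaxy is blurred by a point-spread function and degraded with pixel noise, after which its ellipticity must be recovered, for example from quadrupole moments:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

n = 64
y, x = np.mgrid[:n, :n] - n / 2.0

# Elliptical Gaussian "galaxy": a small shear stretches it along one axis.
galaxy = np.exp(-(x**2 / (2 * 6.0**2) + y**2 / (2 * 5.0**2)))

# Convolution with the point-spread function, then pixel noise.
image = gaussian_filter(galaxy, sigma=2.0)
image += np.random.default_rng(0).normal(scale=0.01, size=image.shape)

# Ellipticity estimate from second (quadrupole) moments of the image.
w = np.clip(image, 0, None)
qxx = np.sum(w * x * x) / w.sum()
qyy = np.sum(w * y * y) / w.sum()
qxy = np.sum(w * x * y) / w.sum()
e1 = (qxx - qyy) / (qxx + qyy)
e2 = 2 * qxy / (qxx + qyy)
print(e1, e2)   # biased by the PSF and noise: the effects to disentangle
```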

The focus of this challenge is on predicting the results of actions performed by an external agent. Examples of that problem are found, for instance, in the medical domain, where one needs to predict the effect of a drug before administering it, or in econometrics, where one needs to predict the effect of a new policy before issuing it. We focus on a given target variable to be predicted (e.g. the health status of a patient) from a number of candidate predictive variables (e.g. risk factors in the medical domain). Under the actions of an external agent, the predictive power of a variable and its causal role become tied together. For instance, both smoking and coughing may be predictive of lung cancer (the target) in the absence of external intervention; however, prohibiting smoking (a possible cause) may prevent lung cancer, whereas administering a cough medicine to stop coughing (a possible consequence) would not.
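
The smoking example can be made concrete with a toy structural model (probabilities purely illustrative): smoking influences lung cancer, which in turn influences coughing, so intervening on coughing leaves the cancer rate unchanged while intervening on smoking changes it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_smoking=None, do_coughing=None):
    """Toy structural model: smoking -> cancer -> coughing."""
    smoking = (rng.random(n) < 0.3) if do_smoking is None else np.full(n, do_smoking)
    cancer = rng.random(n) < np.where(smoking, 0.10, 0.01)
    if do_coughing is None:
        coughing = rng.random(n) < np.where(cancer, 0.8, 0.1)
    else:
        coughing = np.full(n, do_coughing)
    return smoking, cancer, coughing

smoking, cancer, coughing = simulate()
print("P(cancer | coughing):", cancer[coughing].mean())  # predictive, yet not causal
print("baseline cancer rate:", cancer.mean())

_, cancer, _ = simulate(do_smoking=False)    # prohibit smoking: rate drops
print("do(smoking=0):", cancer.mean())

_, cancer, _ = simulate(do_coughing=False)   # suppress coughing: rate unchanged
print("do(coughing=0):", cancer.mean())
```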

Listeners outperform automatic speech recognition systems at every level of speech recognition, including the very basic level of consonant recognition. What is not clear is where the human advantage originates. Does the fault lie in the acoustic representations of speech, in the recogniser architecture, or in a lack of compatibility between the two? There have been relatively few studies comparing human and automatic speech recognition on the same task, and in those, overall identification performance has been the dominant metric. However, many insights might be gained by carrying out a far more detailed comparison. The purpose of this challenge is to promote focused human-computer comparisons on a task involving consonant identification in noise, with all participants using the same training and test data. Training and test data, together with native-listener and baseline-recogniser results, will be provided by the organisers, but participants are also encouraged to contribute listener responses of their own.
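
One form such a detailed comparison can take (with hypothetical response lists standing in for real listener and recogniser outputs) is a consonant confusion matrix computed identically for both, so that error patterns rather than just overall accuracy can be contrasted:

```python
from sklearn.metrics import confusion_matrix

consonants = ["b", "d", "g", "p", "t", "k"]            # illustrative subset
truth      = ["b", "d", "g", "p", "t", "k", "b", "d"]  # stimuli presented
human      = ["b", "d", "g", "b", "t", "k", "b", "d"]  # listener responses
machine    = ["p", "d", "g", "p", "t", "g", "b", "t"]  # recogniser outputs

print(confusion_matrix(truth, human, labels=consonants))
print(confusion_matrix(truth, machine, labels=consonants))
# Comparing the two matrices reveals *which* confusions differ,
# e.g. voicing errors (p/b) versus place-of-articulation errors (k/g).
```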

The goal of the Web Spam Challenge series is to identify and compare Machine Learning (ML) methods for automatically labeling structured data represented as graphs. More precisely, we focus on the problem of labeling all nodes of a graph from a partial labeling of them. The application we study is Web Spam Detection, where we want to detect deliberate actions of deception aimed at the ranking functions used by search engines.
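
A minimal sketch of this graph-labeling problem on a toy hyperlink graph (real entries also exploited content and link features of the hosts):

```python
import numpy as np

# Toy hyperlink graph: symmetric adjacency matrix over 5 hosts.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], float)
labels = np.array([1.0, np.nan, np.nan, 0.0, np.nan])  # 1=spam, 0=normal, nan=unknown

# Simple label propagation: repeatedly average neighbour scores,
# clamping the known labels after every pass.
scores = np.where(np.isnan(labels), 0.5, labels)
for _ in range(50):
    scores = A @ scores / A.sum(axis=1)
    known = ~np.isnan(labels)
    scores[known] = labels[known]
print(scores)   # unlabelled hosts inherit the labels of their neighbourhoods
```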

The goal of this challenge is to recognise objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:

  • Person: person
  • Animal: bird, cat, cow, dog, horse, sheep
  • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  • Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
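
As a rough illustration of one common baseline for the classification part of this problem (not the challenge's own method; names and parameters are illustrative), a pretrained backbone can be fine-tuned with one sigmoid output per class, since several objects may appear in the same realistic scene:

```python
import torch
import torch.nn as nn
from torchvision import models

VOC_CLASSES = 20   # the twenty classes listed above

# Replace the classifier head of an ImageNet-pretrained backbone.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, VOC_CLASSES)

criterion = nn.BCEWithLogitsLoss()           # independent presence/absence per class
images = torch.randn(4, 3, 224, 224)         # stand-in batch of images
targets = torch.randint(0, 2, (4, VOC_CLASSES)).float()
loss = criterion(model(images), targets)
loss.backward()
```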

The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (the smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

  • To learn about the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology
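
One family of approaches can be sketched on a toy vocabulary (actual systems, such as the Morfessor baseline, use richer statistical models): score each candidate split point by how often the resulting stem and ending recur across the vocabulary:

```python
from collections import Counter

words = ["walking", "talking", "jumping", "walked", "talked", "jumped"]

# How often each string occurs as a word-initial or word-final segment.
prefixes = Counter(w[:k] for w in words for k in range(1, len(w)))
suffixes = Counter(w[k:] for w in words for k in range(1, len(w)))

def split(word):
    """Split at the boundary whose stem and ending recur most often."""
    best = max(range(1, len(word)),
               key=lambda k: prefixes[word[:k]] * suffixes[word[k:]])
    return word[:best], word[best:]

for w in words:
    print(w, "->", split(w))   # e.g. walking -> ('walk', 'ing')
```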

“When everything fails, ask for additional domain knowledge” is the current motto of machine learning. Assessing the real added value of prior/domain knowledge is therefore a question of both deep and practical interest. Most commercial data mining programs accept data pre-formatted as a table, each example being encoded as a fixed set of features. Is it worth spending time engineering elaborate features that incorporate domain knowledge and/or designing ad hoc algorithms? Or can off-the-shelf programs, working on simple features that encode the raw data without much domain knowledge, put skilled data analysts out of business? In this challenge, the participants are allowed to compete in two tracks:

  • The “prior knowledge” track, for which they will have access to the original raw data representation and as much knowledge as possible about the data.
  • The “agnostic learning” track, for which they will be forced to use a data representation encoding the raw data with dummy features.
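
The contrast between the two tracks can be sketched on a stand-in dataset (sklearn's digits, used here purely for illustration; the challenge used its own datasets): the agnostic entry feeds anonymised raw features to an off-the-shelf pipeline, while the prior-knowledge entry exploits the fact that the features are 8x8 pixel images:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # stand-in for a challenge dataset

# "Agnostic" entry: anonymised raw features into an off-the-shelf pipeline.
agnostic = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(agnostic, X, y).mean())

# "Prior knowledge" entry: hand-engineered features that use the knowledge
# that the raw values are 8x8 images (row and column intensity profiles).
imgs = X.reshape(-1, 8, 8)
X_domain = np.hstack([imgs.sum(axis=1), imgs.sum(axis=2)])
informed = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(informed, X_domain, y).mean())
```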

RTE 3 follows the same basic structure as the previous campaign, in order to facilitate the participation of newcomers and to allow “veterans” to assess the improvements of their systems in a comparable test exercise. Nevertheless, the following innovations are introduced to make the challenge more stimulating and, at the same time, to encourage collaboration between system developers:

  • a limited number of longer texts (up to a paragraph), in order to move toward more comprehensive scenarios that incorporate the need for discourse analysis. However, the majority of examples will remain similar to those in the previous challenges, providing pairs with relatively short texts.
  • an RTE Resource Pool has been created, where contributors can share the resources they use.
  • an optional task, “Extending the Evaluation of Inferences from Texts”, which explores two other tasks closely related to textual entailment: differentiating unknown from false/contradicts and providing justifications for answers.