With the exceptional increase in computing power, storage capacity and network bandwidth over the past decades, ever-growing datasets are collected in fields such as bioinformatics (splice sites, gene boundaries, etc.), IT security (network traffic) or text classification (spam vs. non-spam), to name but a few. This growth leaves computational methods as the only viable way of dealing with the data, and it poses new challenges to ML methods. The Large-Scale Learning challenge is concerned with the scalability and efficiency of existing ML approaches with respect to computational, memory or communication resources, e.g. resulting from a high algorithmic complexity, from the size or dimensionality of the dataset, and from the trade-off between distributed resolution and communication costs. Many comparisons are indeed presented in the literature; however, these usually assess only a few algorithms or consider only a few datasets; further, they usually involve different evaluation criteria, model parameters and stopping conditions. As a result it is difficult to determine how each method behaves and compares with the others in terms of test error, training time and memory requirements, which are the practically relevant criteria. This challenge is designed to be fair and to enable a direct comparison of current large-scale classifiers, aimed at answering the question: which learning method is the most accurate given limited resources? To this end we provide a generic evaluation framework tailored to the specifics of the competing methods. Providing a wide range of datasets, each with specific properties, we evaluate the methods using performance figures that display training time vs. test error, dataset size vs. test error, and dataset size vs. training time.
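As an illustration of the kind of measurement the evaluation framework collects, the sketch below times a standard linear classifier and records its test error on synthetic data; the classifier, data and function names are placeholders for illustration, not part of the official framework.

```python
# Illustrative only: measures the two quantities the challenge plots against
# each other, training time and test error.
import time
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

def time_and_error(clf, X_train, y_train, X_test, y_test):
    """Return (training time in seconds, test error) for the given classifier."""
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = time.perf_counter() - start
    test_error = 1.0 - accuracy_score(y_test, clf.predict(X_test))
    return train_time, test_error

# Synthetic data stand in for the challenge datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 50))
y = (X[:, 0] + 0.1 * rng.normal(size=10000) > 0).astype(int)
t, e = time_and_error(SGDClassifier(max_iter=5), X[:8000], y[:8000], X[8000:], y[8000:])
print(f"training time: {t:.2f}s, test error: {e:.3f}")
```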

Much of machine learning and data mining has so far concentrated on analyzing data already collected, rather than on collecting data. While experimental design is a well-developed discipline of statistics, data collection practitioners often neglect to apply its principled methods. As a result, the data collected and made available to data analysts, who are in charge of explaining them and building predictive models, are not always of good quality and are plagued by experimental artifacts. In reaction to this situation, some researchers in machine learning and data mining have become interested in experimental design to close the gap between data acquisition or experimentation and model building. This has given rise to the discipline of active learning. In parallel, researchers in causal studies have started raising awareness of the differences between passive observations, active sampling, and interventions. In this domain, only interventions qualify as true experiments capable of unravelling cause-effect relationships.

In this challenge, which follows on from two very successful earlier challenges (“Causation and Prediction” and “Competition Pot-luck”) sponsored by PASCAL, we evaluated methods of experimental design, which involve the data analyst in the process of data collection. From our perspective, to build good models we need good data. However, collecting good data comes at a price. Interventions are usually expensive to perform and sometimes unethical or impossible, while observational data are available in abundance at a low cost. For instance, in policy-making, one may want to predict the effect on a population’s health status of forbidding the use of cell phones when driving, before passing a law to that effect. This example illustrates the case of an experiment that is possible but expensive, particularly if it turns out to have no effect. Practitioners must identify data-collection strategies that are cost-effective and feasible, resulting in the best possible models at the lowest possible price. Hence, both efficiency and efficacy are factors of evaluation considered in this new challenge. Evaluating experimental design methods requires performing actual experiments. Because of the difficulty of experimenting on real systems in the context of our challenge, experiments were carried out on realistic simulators of real systems, trained on real data or incorporating real data. The tasks were taken from a variety of domains, including medicine, pharmacology, manufacturing, plant biology, sociology, and marketing. Typical examples of tasks include: evaluating the therapeutic potential or the toxicity of a drug, optimizing the throughput of an industrial manufacturing process, and assessing the potential impact of a promotion on sales.

The participants carried out virtual experiments by intervening on the system, e.g. by clamping variables to given values. We made use of our recently developed Virtual Laboratory. In this environment, participants paid a price in virtual cash to perform a given experiment, hence they had to optimize their design to reach their goal at the lowest possible cost. This challenge contributed to bringing to light new methods for integrating modeling and experimental design in an iterative process, and new methods for combining the use of observational and experimental data in modeling.
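The sketch below is a purely hypothetical illustration of this style of interaction: a toy "virtual lab" that charges virtual cash for each intervention and returns an observation from a hidden mechanism. The class, its methods and the toy system are assumptions made for illustration only, not the actual Virtual Laboratory interface.

```python
# Hypothetical sketch: each intervention (clamping a variable) costs virtual cash,
# so the experimenter must trade off information gained against budget spent.
import numpy as np

class ToyVirtualLab:
    def __init__(self, budget=1000.0, cost_per_experiment=10.0, seed=0):
        self.budget = budget
        self.cost = cost_per_experiment
        self.rng = np.random.default_rng(seed)

    def experiment(self, clamp_x):
        """Clamp X to a value and observe the outcome Y of a toy system X -> Y."""
        if self.budget < self.cost:
            raise RuntimeError("out of virtual cash")
        self.budget -= self.cost
        return 2.0 * clamp_x + self.rng.normal()  # hidden mechanism: Y = 2X + noise

lab = ToyVirtualLab()
observations = [(x, lab.experiment(x)) for x in (0.0, 1.0, 2.0)]
print(observations, "remaining budget:", lab.budget)
```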

The emergence of social networks and social media sites has motivated a large amount of recent research. Several generic tasks are currently studied, such as social network analysis, social network annotation, community detection, and link prediction. One classical question concerns the temporal propagation of information through this new type of media, i.e. how information propagates over a network. Many recent works are directly inspired by the literature in epidemiology or in the social sciences. These works mainly propose different propagation models – independent cascade models or linear threshold models – and analyze different properties of these models, such as the epidemic threshold. Recently, instead of analyzing how information spreads, several articles have addressed the problem of predicting future propagation ([4, 5]). This is a key problem with many applications, such as buzz prediction – predicting whether a particular piece of content will become a buzz – or opinion leader detection – detecting whether a node in a network will spread content effectively. This challenge analyzed and compared the quality of propagation prediction methods. The challenge was organized so as to facilitate the participation of any interested researcher, by providing simple tools and easy-to-use datasets. We anticipate that the produced material can become the first large benchmark for propagation models.
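For concreteness, the sketch below implements the independent cascade model mentioned above on a toy graph; the graph, seed set and uniform activation probability are illustrative placeholders rather than challenge data.

```python
# Minimal sketch of the independent cascade model: starting from a set of seed
# nodes, each newly activated node gets one chance to activate each inactive
# neighbour, with a fixed probability per edge.
import random

def independent_cascade(graph, seeds, edge_prob, rng=random.Random(0)):
    """graph: dict node -> list of neighbours; returns the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < edge_prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

toy_graph = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
print(independent_cascade(toy_graph, seeds=[0], edge_prob=0.5))
```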

In recent years, the problem of hierarchical text classification has been addressed in the machine learning literature, but its handling at large scale (i.e. involving several thousand categories) remains an open research issue. Combined with the increasing demand for practical systems of this kind, there seems to be a need for a significant push of this research activity. This is our motivation for this PASCAL challenge, which aims at assessing models, methods and tools for classification in very large, hierarchically organized category systems. We prepared large datasets for experimentation, based on the ODP Web directory (www.dmoz.org), as well as baseline classifiers based on k-NN and logistic regression. We used two of these datasets for the challenge: a very large one (around 30,000 categories) and a smaller one (around 3,000 categories). The participants were given the chance to dry-run their classification methods on the smaller dataset. They were then asked to train their systems using the training and validation parts of the larger set, and to provide their classification results on the test part. A two-sided evaluation of the participating methods was used: one side measuring classification performance and the other computational performance. Work on this challenge resulted in a new EU project, “BioASQ: A challenge on large-scale biomedical semantic indexing and question answering”, which started on October 1, 2012.
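As a rough illustration of a flat logistic-regression baseline of the kind mentioned above, the sketch below trains on a handful of placeholder documents; the documents, category labels and feature pipeline are assumptions for illustration, not the official baseline code.

```python
# Sketch of a flat (non-hierarchical) logistic-regression text classifier over
# leaf categories; real challenge data would replace the toy documents below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["sports news about football", "new drug trial results", "stock market update"]
labels = ["Sports", "Health", "Business"]  # placeholder leaf categories

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(docs, labels)
print(baseline.predict(["football transfer rumours"]))
```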

The challenge considers the problem of inducing a grammar directly from natural language text. The resulting grammars can then be used to discriminate between strings that are part of the language (i.e., are grammatically well formed) and those that are not. This has long been a fundamental problem in computational linguistics and natural language processing, drawing on theoretical computer science and machine learning. The popularity of the task is driven by two different motivations. Firstly, it can help us to better understand the cognitive process of language acquisition in humans. Secondly, it can help with the portability of NLP applications into new domains and new languages. Most NLP algorithms rely on syntactic parse structures created by supervised parsers; however, training data in the form of treebanks exist only for a few languages and specific domains, thus limiting the portability of these algorithms. The challenge aims to foster continuing research in grammar induction, while also opening up the problem to more ambitious settings, including a wider variety of languages, removing the reliance on part-of-speech tags and, critically, providing a thorough evaluation. The data that we provided was collated from existing treebanks in a variety of different languages, domains and linguistic formalisms. This gives a diverse range of data upon which to test grammar induction algorithms, yielding deeper insight into the accuracy and shortcomings of different algorithms. Where possible, we intend to compile multiple annotations for the same sentences, such that the effect of the choice of linguistic formalism or annotation procedure can be offset in the evaluation. Overall this test set forms a significant resource for the evaluation of parsers and grammar induction algorithms, and helps to reduce the NLP field’s continuing reliance on the Penn Treebank.
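As an illustration of how induced parses can be scored against treebank annotations, the sketch below computes unlabelled bracketing F1 over constituent spans; the spans are toy examples, and the challenge's exact evaluation measure may differ in detail.

```python
# Sketch of the standard unlabelled-bracketing comparison: constituents are
# represented as (start, end) spans and compared set-wise against the gold tree.
def bracket_f1(gold_spans, predicted_spans):
    gold, predicted = set(gold_spans), set(predicted_spans)
    if not gold or not predicted:
        return 0.0
    overlap = len(gold & predicted)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold_spans = {(0, 5), (0, 2), (2, 5), (3, 5)}      # toy gold constituents
induced_spans = {(0, 5), (1, 3), (3, 5)}           # toy induced constituents
print(f"unlabelled bracket F1: {bracket_f1(gold_spans, induced_spans):.2f}")
```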

The goal of this challenge is to decode a natural stimulus from short time periods extracted from a continuous MEG signal (measured specifically for the challenge and made freely available); the problem is formulated as a classification task. The data consisted of 204 MEG channels measured under stimulation with different types of movies (football, feature film, etc.), and the mind-reading task was to decode the type of movie for the test data. The results of the challenge were presented at the ICANN 2011 conference, themed ‘Machine learning re-inspired by brain and cognition’. From the modelling side, the challenge built on earlier PASCAL workshops on learning from multiple sources organized at NIPS 2008 and 2009, aiming to provide public data useful for developing such models. The challenge provided information on the feasibility of decoding natural stimuli from a continuous MEG signal, which is a novel task. It also provided data for the future evaluation of multi-source learning models, useful also for machine learning researchers outside MEG analysis.
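A minimal sketch of such a decoding task is given below: short labelled windows of 204-channel data are reduced to per-channel averages and classified with logistic regression. The data here are synthetic and the feature choice is an assumption; the real signal, labels and protocol are those released with the challenge.

```python
# Toy version of the decoding task: classify short windows of 204-channel MEG
# data by movie type, using a crude per-channel mean-amplitude feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_windows, n_channels, n_samples = 200, 204, 100
X_raw = rng.normal(size=(n_windows, n_channels, n_samples))
y = rng.integers(0, 2, size=n_windows)      # e.g. football vs. feature film
X_raw[y == 1, :10, :] += 0.5                # inject a weak class-dependent signal

X = X_raw.mean(axis=2)                      # one feature per channel per window

clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```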

The Visual Object Classes Challenge was organised for eight years in a row, with increasing success. For example, the VOC 2011 workshop took place at ICCV and there were approximately 200 attendees. There were very healthy numbers of entries: 19 entries for the classification task, 13 for detection, and 6 for segmentation. There was also a successful collaboration with ImageNet (www.image-net.org) organized by a team in the US. They held a second competition on their dataset with 1000 categories (but with only one labelled object per image). It is safe to say that the PASCAL VOC challenges have become a major point of reference for the computer vision community when it comes to object category detection and segmentation. There are over 800 publications (using Google Scholar) which refer to the data sets and the corresponding challenges. The best student paper winner at CVPR 2011 made use of the VOC 2008 detection data; the prize winning paper at ECCV 2010 made use of the VOC 2009 segmentation data; and prize winning papers at ICCV 2009, CVPR 2008 and ECCV 2008 were based on the VOC detection challenge (using our performance measure in the loss functions).

The basic challenge is the recognition of objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). There can be multiple objects in each image. There are typically three main competitions, with 20 object classes and around 10,000 images: classification (is an object of class X present?), detection (where is it and what is its size?), and segmentation (pixel-wise labelling). There were also “taster” competitions on subjects such as layout (predicting the bounding box and label of each part of a person) and human action recognition (e.g. riding a bike, taking a photo, reading). The goal of the layout and action recognition tasters was to provide a richer description of people in images than just bounding box/segmentation information. Our experience in 2010 and 2011 with Mechanical Turk annotation for the classification and detection challenges was that it was hard to achieve the required level of quality from this pipeline. The focus of the VOC 2012 annotation was thus to increase the labelled data for the segmentation and action recognition challenges. The segmentation data is of a very high quality not available elsewhere, and it is very valuable to provide more data of this nature. The legacy of the VOC challenges is the freely available VOC 2007 data, which was ported to mldata.org. We also extended our evaluation server to display the top k submissions for each of the challenges (a leaderboard feature), so that likely performance increases after 2012 can be viewed on the web (similar to what is available for the Middlebury evaluations, see http://vision.middlebury.edu/stereo/eval/).
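For reference, the detection task judges a predicted bounding box by its area of overlap with the ground truth; the sketch below computes this intersection-over-union measure (a detection is conventionally accepted when the overlap exceeds 0.5). The coordinate convention here is simplified relative to the official development kit.

```python
# Sketch of the intersection-over-union (IoU) overlap between two boxes,
# each given as (xmin, ymin, xmax, ymax) in continuous coordinates.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # partial overlap, well below 0.5
```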

Finite state automata (or machines) are well-known models for characterizing the behaviour of systems or processes. They have been used for several decades in computer and software engineering to model the complex behaviours of electronic circuits and software such as communication protocols. Their probabilistic counterparts are equivalent to Hidden Markov Models, which are used in a number of applications. The state of the art in learning either of these types of machines from strings is unclear, as there has never been a challenge or even a benchmark over which learning algorithms have been compared. The goal of PAutomaC is to provide an overview of which probabilistic automaton learning technique works best in which setting and to stimulate the development of new techniques for learning distributions over strings. Such an overview will be very helpful for practitioners of automata learning and will provide directions for future theoretical work and algorithm development. PAutomaC will provide the first elaborate test suite for learning string distributions. The task is of interest to:

  • Grammatical Inference theoreticians wanting to find out how good their ideas and algorithms really are;
  • Pattern recognition practitioners who have developed finely tuned EM-inspired techniques to estimate the parameters of HMMs or related models;
  • Statistical modelling experts who have to deal with strings or sequences.
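As an illustration of the evaluation setting, the sketch below computes a perplexity-style score for a candidate string distribution on a small test set; the toy model and the exact formula are assumptions for illustration, and the official PAutomaC score differs in detail.

```python
# Sketch of a perplexity-style score for a learned distribution over strings.
import math

def perplexity(test_strings, prob):
    """prob(s) must return the candidate model's probability of string s (> 0)."""
    log_sum = sum(math.log2(prob(s)) for s in test_strings)
    return 2 ** (-log_sum / len(test_strings))

# Toy unigram-style model over strings of symbols 'a'/'b' (purely illustrative).
def toy_prob(s, p_a=0.6, p_end=0.2):
    p = p_end
    for symbol in s:
        p *= (1 - p_end) * (p_a if symbol == "a" else 1 - p_a)
    return p

print(perplexity(["ab", "aab", "b"], toy_prob))
```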

The competition on gesture recognition was organised in collaboration with the DARPA Deep Learning program. This challenge was part of a series of challenges on the theme of unsupervised and transfer learning. The goal is to push the state of the art in algorithms capable of learning data representations that may be re-used from task to task, using unlabelled data and/or labelled data from similar domains. In this challenge, the competitors were given a large database of videos of American Sign Language performed by native signers and videos of International Sign Language performed by non-native signers, which we collected using Amazon Mechanical Turk. Entrants each developed a system that was tested in a live competition on gesture recognition. The test was carried out on a small but new sign language vocabulary. The platform of the challenge remains open after the end of the competition, and all the datasets are freely available for research in the Pascal 2 repository.

The RTE challenges have been run annually for several rounds with great success. The task consists of recognizing whether the meaning of a textual statement, termed H (the hypothesis), can be inferred from the content of a given text, termed T (the text). Given a set of pairs of Ts and Hs as input, the systems must classify each pair according to whether:

  • T entails H,
  • T contradicts H, i.e. shows it to be false, or
  • the veracity of H is unknown on the basis of T.

A human-annotated development set is first released to allow investigation, tuning and training of systems, which are then evaluated on a gold-standard test set. In later rounds of the challenge, the given texts were made substantially longer, usually corresponding to a coherent portion of a document, such as a paragraph or a group of closely related sentences. Texts come from a variety of unedited sources, so systems are required to handle real text forms that may include typographical errors and ungrammatical sentences. A novel Entailment Search pilot task was also introduced.
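A minimal sketch of scoring such a three-way submission against gold labels by accuracy is given below; the label names and the helper function are illustrative, not the official evaluation script.

```python
# Sketch of three-way RTE scoring: compare predicted labels to gold labels.
LABELS = {"ENTAILMENT", "CONTRADICTION", "UNKNOWN"}

def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    assert all(p in LABELS for p in predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["ENTAILMENT", "UNKNOWN", "CONTRADICTION", "ENTAILMENT"]
pred = ["ENTAILMENT", "CONTRADICTION", "CONTRADICTION", "UNKNOWN"]
print(f"accuracy: {accuracy(gold, pred):.2f}")
```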