The reason for which learning languages by active learning is an important topic in the field of computational linguistics is that it is not always possible to rely on statistics. This is particularly the case for those languages of which large corpora are not available. This question is known sometimes as working on under-resourced languages.
Zulu is an active learning task. The participants are asked to build algorithms that can learn languages by querying a server via membership queries. The targets are regular languages, and for each task a fixed number of queries is allowed.
Yes. Anyone can. One should remember that the winner of the 1998 Abbadingo competition (Nick Price) was just a computer scientist not affiliated to any research lab! Typically, students that have learnt about Dana Angluin’s L* algorithm should be interested. But we hope that researchers in computational linguistics, machine learning or any field which needs grammatical inference will be interested.
A task is set when a new target automaton is created. During the training phase anyone can create a new task of which he becomes the task owner. The task owner is the first to try to solve a given task. His score appears on the web and can be used as a guideline by the other players.
The baseline algorithm is a simple adaptation of Angluin’s 1987 L* algorithm. It makes membership queries in order to obtain a table which is both closed and complete. At this point it samples some strings (using a distribution based on the size of the current hypothesis), and tests (again via membership queries) the labelling of the sampled strings against the labelling returned by the oracle. If all strings are perfectly labelled, the algorithm decides that (under the important assumption that the chosen distribution was good enough) the error cannot be too large. It can now test. If some example appears to be badly labelled, the algorithm can use this as a counterexample from an equivalence query.
More details concerning the baseline algorithm can be found here.
The baseline algorithm has two purposes. It is used to compute the number of queries allowed for the task, but it can also be used by any user as a starting point to participate in the competition.
No. The baseline is written in Java, but since the programme will run on your own machine, you can choose anything. Please refer to the Develop your algorithm in your language section.
Because this is only training phase. People might get things wrong and would cease to participate if their first trials were consistently beaten by everyone else. Furthermore, the second time one plays one cannot avoid different types of hill climbing: you could rely on the results from queries done by another algorithm, or you know the exact size.
But you are encouraged to discuss with the task-owner if you want to know more.
Because sometimes it can seem challenging to see how well our algorithm is progressing.
We are using here the same programs that were used by Gowachin. Details can be found on the webpage, and in the associated publications [lang1998rao]. To make things simple, we start by randomly drawing a number n at most as large as the parameter the task owner proposes. We then construct a random graph with slightly more elements than n and randomly decide that some states are final and others are not. Then minimisation takes place.
Again we follow the protocol from Abbadingo and Gowachin. A maximum length k is defined that depends on the thickness of the DFA. Then strings are randomly drawn from
This number is computed as follows: when a target is created, the baseline algorithm is run and when this algorithm is capable of doing less than 40% errors, the number of queries made up to that point is allocated to the task.
Two reasons: the first is that this is not about identification, and the second is that since L* makes use of equivalence queries, and we are not allowed these, he will have to do some sampling instead. But since he does not know what the real distribution is, this is a problem (and it is easy to show that in theory there is no good answer).
During the competition phase, the task owner will be the organisers. They will provide a series of tasks .
No. Obviously, before proclaiming who the winner is, the organisers will ask for the code and the sources. They will then run the winner’s code on their machines.
Because that would oblige us to select some very strict format. And here the idea is not to impose limitation on how the algorithms treat the information they gather, but only on how much information they gather.
The actual competition will probably provide a small financial prize for the winner. Furthermore, with the help of our scientific committee, we should be able to propose that the winner publishes his solution in the Machine Learning Journal.
We also hope to hold a workshop in the autumn of 2010.
There are a number of papers one might read: Angluin’s papers on query learning, her description of L*, the chapter in Vazirani’s book [Il faut des references]
This may come as a surprise, but it is. Here are some examples.
Simple: send us an e-mail. If the question (and the answer) are likely to be of interest to others, we will add it to this list.
Indeed, there is room for improvement. The feeling was that if active learning was an important question, we did not know enough about learning strategies, when the learner could interactively interrogate its environment. And in this case, the crucial resource is not the (CPU) time that the learning program is going to spend, but the number of times it uses the expensive resource, the oracle.
In all cases, since we have noted that this question is an important one, we will gladly discuss this in the forum.
Another good question. In preliminary experiments we have noted an (albeit uneven) exponential progression of the number of queries with the rate of success. For the moment, we fix the number of queries. If you can do better with less. Good, because it probably means that on another task you will score higher.
Lang, K.J. and Pearlmutter, B.A. and Price, R.A. (1998) Results of the abbadingo one DFA learning competition and a new evidence-driven state merging algorithm, Lecture Notes in Computer Science
The earliest task was that of map building in robotics: the (simplified) map is a graph and the outputs in each state are what the robot may encounter in a state. A particularity here is that the robot cannot be reset: the learning algorithm is to learn from just one very long string and all its prefixes.
References
R. L. Rivest and R. E. Schapire. Inference of finite automata sequences. Information and Computation , 103:299-347, 1993.
Thomas Dean , Dana Angluin, Kenneth Basye , Sean P. Engelson , Leslie Pack Kaelbling , Evangelos Kokkevis , Oded Maron : Inferring Finite Automata with Stochastic Output Functions and an Application to Map Learning. AAAI 1992 : 208-214
A task somehow related is that of discivering the rational strategy followed by an adversary. This line was looked into in a number of papers related to agent technologies.
References
D. Carmel and S. Markovitch. Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental and Theoretical Artificial Intelligence , 10(3):309-332, 1998.
D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems , 2(2):141-172, 1999.
One task on which grammatical inference is proving to be particularly useful is that of wrapper induction: The idea is to find in a web page (or various of the same type) all arguments of a special sort. In this context, the role of the Oracle is played by the human user.
References
J. Carme, R. Gilleron, A. Lemay, and J. Niehren. Interactive learning of node selecting tree transducer. In Ijcai Workshop on Grammatical Inference , 2005.
Julien Carme , Rémi Gilleron , Aurélien Lemay , Joachim Niehren: Interactive learning of node selecting tree transducer. Machine Learning 66 (1): 33-67 (2007)
In different tasks linked with checking if the specifications of a system or a piece of hardware are met, the item to be checked is used as an oracle. Queries are made in order to build a model of the item, and then the learned model can be checked against the specifications. Obviously, there is an issue with the distribution used to simulate the equivalence query in this context to check through interaction with the actual chips used as an Oracle that the specifications are met..
References
L. Bréhélin, O. Gascuel, and G. Caraux. Hidden Markov models with patterns to learn boolean vector sequences and application to the built-in self-test for integrated circuits. Pattern Analysis and Machine Intelligence , 23(9):997-1008, 2001.
T. Berg, O. Grinchtein, B. Jonsson, M. Leucker, H. Raffelt, and B. Steffen. On the correspondence between conformance testing and regular inference. In Proceedings of Fundamental Approaches to Software Engineering, 8th International Conference, FASE 2005 , volume 3442 of Lncs , pages 175-189. Springer-Verlag, 2005.
H. Raffelt and B. Steffen. Learnlib: A library for automata learning and experimentation. In Proceedings of Fase 2006 , volume 3922 of Lncs , pages 377-380. Springer-Verlag, 2006.
Other links
Dana Angluin: A Note on the Number of Queries Needed to Identify Regular Languages Information and Control 51 (1): 76-87 (1981)
Dana Angluin: Queries and Concept Learning. Machine Learning 2 (4): 319-342 (1987)
Dana Angluin: Learning Regular Sets from Queries and Counterexamples Inf. Comput. 75 (2): 87-106 (1987)
Dana Angluin: Equivalence Queries and Approximate Fingerprints. COLT 1989 : 134-145
Dana Angluin, Michael Kharitonov : When Won’t Membership Queries Help? (Extended Abstract) STOC 1991 : 444-454
Dana Angluin: Queries revisited. Theor. Comput. Sci. 313 (2): 175-194 (2004)
C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition , 38:1332-r1348, 2005.
C. de la Higuera. Data complexity issues in grammatical inference. In M. Basu and T. Kam Ho, editors, Data Complexity in Pattern Recognition , pages 153-172. Springer-Verlag, 2006.
C. de la Higuera. Ten open problems in grammatical inference. In Sakakibara et al. [SKS+ 06], pages 32-44.
M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory . MIT press, 1994.
We are pleased to announce an upcoming evaluation of probabilistic approximate inference algorithms, as part of the UAI conference this July. All researchers working on inference in graphical models are encouraged to participate.
The evaluation will focus on the following computational tasks:
Submitted algorithms can compete in some or all of these tasks.
We look forward to your participation,
The organizers
Given two text fragments called ‘Text’ and ‘Hypothesis’, Textual Entailment Recognition is the task of determining whether the meaning of the Hypothesis is entailed (can be inferred) from the Text. The goal of the first RTE Challenge was to provide the NLP community with a benchmark to test progress in recognizing textual entailment, and to compare the achievements of different groups. Since its inception in 2004, the PASCAL RTE Challenges have promoted research in textual entailment recognition as a generic task that captures major semantic inference needs across many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and multi-document Summarization.
After the first three highly successful PASCAL RTE Challenges, RTE became a track at the 2008 Text Analysis Conference, which brought it together with communities working on NLP applications. The interaction has provided the opportunity to apply RTE systems to specific applications and to move the RTE task towards more realistic application scenarios.
The RTE-6 tasks focus on recognizing textual entailment in two application settings: Summarization and Knowledge Base Population.
RTE-6 does not include the traditional RTE Main Task which was carried out in the first five RTE challenges; i.e., there will be no task to make entailment judgments over isolated T-H pairs drawn from multiple applications. Instead, the new Main Task for RTE-6 is based on only the Summarization application setting. The RTE-6 Main Task is similar to the RTE-5 Search Pilot, with the following changes:
RTE-6 Schedule | |
---|---|
April 30 | Main Task: Release of Development Set |
May 10 | KBP Validation Pilot: Release of Development Set |
May 21 | Deadline for TAC 2010 track registration |
August 17 | KBP Validation Pilot: Release of Test Set |
August 30 | Main Task: Release of Test Set |
September 9 | Main Task: Deadline for task submissions |
September 16 | Main Task: Release of individual evaluated results |
September 17 | KBP Validation Pilot: Deadline for task submissions |
September 24 | KBP Validation Pilot: Release of individual evaluated results |
September 26 | Deadline for TAC 2010 workshop presentation proposals |
September 30 | Main Task: Deadline for ablation tests submissions |
October 7 | Main Task: Release of individual ablation test results |
October 27 | Deadline for systems’ reports |
November 15-16 | TAC 2010 Workshop |
The mailing list for the RTE Track is rte@nist.gov. The list is used to discuss and define the task guidelines for the track, as well as for general discussion related to textual entailment and its evaluation. To subscribe, send a message to listproc@email.nist.gov such that the body consists of the line:
subscribe rte <FirstName> <LastName>
In order for your messages to get posted to the list, you must send them from the email address used when you subscribed to the list. To unsubscribe, send a message from the subscribed email address to listproc@email.nist.gov such that the body consists of the line:
unsubscribe rte
For additional information on how to use mailing lists hosted at NIST, send a message to listproc@email.nist.gov such that the body consists of the line:
HELP
The Challenge results are now available as well as the Challenge workshop program and slides.
The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.
The scientific goals are:
Morpho Challenge 2010 is a follow-up to our previous Morpho Challenge 2005, 2007, 2008 and 2009. The task in 2010 is similar to 2009, where the aim was to find the morpheme analysis of the word forms in the data. As a new task we will provide a possibility for semi-supervised learning using the available linguistic gold standard morpheme analysis.
Participation in the previous challenges is by no means a prerequisite for participation in Morpho Challenge 2010. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop. Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in a paper. Please read the formatting instructions in rules.
If you plan to participate in Morpho Challenge, please contact the organizers using the email address in contact and ask to be added in our mailing list. We will use this mailing list to provide news about the tasks, data and evaluations.
We are looking forward to an interesting challenge!
Mikko Kurimo, Krista Lagus, Sami Virpioja and Ville Turunen
Adaptive Informatics Research Centre, Aalto University School of Science and Technology (previously known as Helsinki University of Technology)
The organizers
Mathias Creutz and Krista Lagus (2005). Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March.
[ Article (PDF) ]Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Janne Pylkkönen, and Sami Virpioja (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish.Computer Speech and Language, Volume 20, Issue 4, October, pages 515–541.
[ Article (PDF) ]Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi (2007). Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of the Machine Translation Summit XI, pages 491–498.
[ Article (PDF) ]Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. Overview and results of Morpho Challenge 2009. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.
[ Article (PDF) ]Rules
Submission of large data files
Send an email to the organizers morphochallenge2010mail.cis.hut.fi and tell where they can download the data files. Small data files (but not larger than a few MBs) can be emailed directly. Please, follow carefully the format of the result files described in datasets.
Acceptance
The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.
Eligibility
Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.
Test languages
Data sets are provided for four languages: English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.
(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)
Task
The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.
The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.
New in 2010: A new category for semi-supervised learning algorithms using the available linguistic gold standard morpheme analysis. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.
Participants are also allowed to use additional text corpora from any source for training their algorithms.
Competitions
The segmentations will be evaluated in three complementary ways:
- Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.
- Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
- Competition 3: Machine Translation (MT) model is trained, where the words in the source language documents will be replaced by their proposed morpheme representations. The words in the source language evaluation data will then also be replaced by their proposed morpheme representations and the translation will be based on morphemes instead of words.
Competition 1 will include all four test languages. Winners will be selected separately for each language and category (unsupervised or semi-supervised). As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.
Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.
Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.
Workshop and publication
All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and 4-page extended abstract describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.
Extended abtracts
For the extended abstract you can use the two-column ACL format. For templates and general formatting instructions, see ACL 2010 Instructions for Preparing Camera-Ready Papers. The length of the paper should be around 4 pages. Email your extended abstract to the organizers by July 16.
Arbitration
In the case of disagreement the organizers will decide the final interpretation of the rules.
In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).
In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:
F-measure = 1/(1/Precision + 1/Recall).
Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file by random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The maximum number of points for one sampled word is normalized to one. The total number of points is then divided by the total number of sampled words.
For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed: Say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard the correct analyses of these words are: “abyss_N”, “abyss_N +PL”, “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountain” is incorrect (no morpheme in common). Precision for the word “abyss” is thus 1/2 = 50%.
Recall is calculated analogously to precision: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard by random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. Points per word is normalized to one and the total number of points is divided by the total number of words.
For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, normalization of the points is carried out. In short, an equal weight is given for each alternative analysis, as well as each word pair in an analysis. E.g., if a word has three alternative analyses, the first analysis has four morphemes, and the first word pair in that analysis has two morphemes in common, each of the two common morphemes will amount to 1/3*1/4*1/2=1/24 of the one point available for that word.
Small sets of the gold standards used are available on the datasets page for semi-supervised learning and parameter validation. Neither the labeled training set nor the development test set contain any of the word forms used in the final test set.
You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: eval_morphemes_v2.pl. The evaluation program is invoked as follows:
eval_morphemes_v2.pl [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile
Four files are given as arguments to eval_morphemes_v2.pl:
The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:
PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76) PART0. Recall: 25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345) PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22% # TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73% TOTAL. Recall: 25.59%; non-affixes: 49.78%; affixes: 10.78% TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.
The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.
Sampling word pairs for the calculation of an estimate of the precision
In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script eval_morphemes_v2.pl with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the gold standard labels that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the gold standard sample. There is still a problem that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.
Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:
cut -f1 goldstdfile > relevantwordsfile
Then sample word pairs for 300 words selected by random from your results file:
sample_word_pairs_v2.pl -n 300 -refwords relevantwordsfile < resultfile > wordpairsfile_result
You can use more than 300 words if you wish (the maximum is the amount of words in the gold standard file), or sample several word pair files with different random initializations (-rand switch), but remember that the estimate of precision that you will obtain is biased in any case, due to the limited amount of gold standard labels.
The necessary Perl program is sample_word_pairs_v2.pl. The output file wordpairsfile_result is used as input to eval_morphemes_v2.pl (see above).
Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, those participants who wish to submit morpheme analysis for words in their actual context (competition 2b), please contact the organizers for more information how to register to CLEF to obtain the full texts.
In the competition 2 (and 2b) the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit. The most common morphemes in each participant’s submission will be left out from the index. The size of this stoplist will be proportional to the amount of the text data in each language and the stoplist size will be the same for each participant’s submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for competitions 2 and 2b in each language.
In competition 3, the morpheme analyses proposed by the participants’ algorithm will be evaluated in a statistical machine translation (SMT) framework. The translation models will be trained to translate from a morphologically complex source language to English. The words of the source language will be replaced by their morpheme analyses before training. The translations from this morpheme-to-word model will be combined with translations from a standard word-to-word translation model. For all models, we will use a state-of-the-art phrase-based SMT system. Evaluation of the translations will be performed by applying an automatic metric such as BLEU on a held-out test set.
Competition 3 does not require any extra effort by the participants. The organizers will use the analyses provided by the participants in machine translation experiments. Data from the Europarl corpus will be used. Those participants who wish to submit morpheme analysis for words in their actual context, please contact the organizers for information on how to get the full corpus.
The goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:
There will be three main competitions: classification, detection, and segmentation; and three “taster” competition: person layout, action classification, and ImageNet large scale recognition:
20 classes | ||||||||||
Participants may enter either (or both) of these competitions, and can choose to tackle any (or all) of the twenty object classes. The challenge allows for two approaches to each of the competitions:
The intention in the first case is to establish just what level of success can currently be achieved on these problems and by what method; in the second case the intention is to establish which method is most successful given a specified training set.
Image | Objects | Class | |
Image | Person Layout | |
9 action classes | ||||||||||
The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training. Test images will be presented with no initial annotation – no segmentation or labels – and algorithms will have to produce labelings specifying what objects are present in the images. In this initial version of the challenge, the goal is only to identify the main objects present in images, not to specify the location of objects.
Further details can be found at the ImageNet website.
To download the training/validation data, see the development kit.
The training data provided consists of a set of images; each image has an annotation file giving a bounding box and object class label for each object in one of the twenty classes present in the image. Note that multiple objects from multiple classes may be present in the same image. Some example images can be viewed online. A subset of images are also annotated with pixel-wise segmentation of each object present, to support the segmentation competition. Some segmentation examples can be viewed online.
Annotation was performed according to a set of guidelines distributed to all annotators.
The data will be made available in two stages; in the first stage, a development kit will be released consisting of training and validation data, plus evaluation software (written in MATLAB). One purpose of the validation set is to demonstrate how the evaluation software works ahead of the competition submission.
In the second stage, the test set will be made available for the actual competition. As in the VOC2008/VOC2009 challenges, no ground truth for the test data will be released.
The data has been split into 50% for training/validation and 50% for testing. The distributions of images and objects by class are approximately equal across the training/validation and test sets. In total there are 21,738 images. Further statistics are online.
Example images and the corresponding annotation for the classification/detection/segmentation tasks, and and person layout taster can be viewed online:
The development kit consists of the training/validation data, MATLAB code for reading the annotation data, support files, and example implementations for each competition.
The development kit will be available according to the timetable.
The test data is now available. Note that the only annotation in the data is for the layout/action taster competitions. As in 2008/2009, there are no current plans to release full annotation – evaluation of results will be provided by the organizers.
The test data can now be downloaded from the evaluation server. You can also use the evaluation server to evaluate your method on the test data.
Below is a list of software you may find useful, contributed by participants to previous challenges.
Participants are expected to submit a single set of results per method employed. Participants who have investigated several algorithms may submit one result per method. Changes in algorithm parameters do not constitute a different method – all parameter tuning must be conducted using the training and validation data alone.
Results must be submitted using the automated evaluation server:
It is essential that your results files are in the correct format. Details of the required file formats for submitted results can be found in the development kit documentation. The results files should be collected in a single archive file (tar/tgz/tar.gz). An example of the correct format is available:
The format of your results must match that specified in the development kit and in the example file, including both file names and directory structure. If you are not entering a particular competition, just omit the corresponding files.
Participants submitting results for several different methods (noting the definition of different methods above) should produce a separate archive for each method.
In addition to the results files, participants will need to additionally specify:
If you would like to submit a more detailed description of your method, for example a relevant publication, this can be included in the results archive.
The VOC challenge encourages two types of participation: (i) methods which are trained using only the provided “trainval” (training + validation) data; (ii) methods built or trained using any data except the provided test data, for example commercial systems. In both cases the test data must be used strictly for reporting of results alone – it must not be used in any way to train or tune systems, for example by runing multiple parameter choices and reporting the best results obtained.
If using the training data we provide as part of the challenge development kit, all development, e.g. feature selection and parameter tuning, must use the “trainval” (training + validation) set alone. One way is to divide the set into training and validation sets (as suggested in the development kit). Other schemes e.g. n-fold cross-validation are equally valid. The tuned algorithms should then be run only once on the test data.
In VOC2007 we made all annotations available (i.e. for training, validation and test data) but since then we have not made the test annotations available. Instead, results on the test data are submitted to an evaluation server.
Since algorithms should only be run once on the test data we strongly discourage multiple submissions to the server (and indeed the number of submissions for the same algorithm is strictly controlled), as the evaluation server should not be used for parameter tuning.
We encourage you to publish test results always on the latest release of the challenge, using the output of the evaluation server. If you wish to compare methods or design choices e.g. subsets of features, then there are two options: (i) use the entire VOC2007 data, where all annotations are available; (ii) report cross-validation results using the latest “trainval” set alone.
The main mechanism for dissemination of the results will be the challenge webpage.
The detailed output of each submitted method will be published online e.g. per-image confidence for the classification task, and bounding boxes for the detection task. The intention is to assist others in the community in carrying out detailed analysis and comparison with their own methods. The published results will not be anonymous – by submitting results, participants are agreeing to have their results shared online.
If you make use of the VOC2010 data, please cite the following reference (to be prepared after the challenge workshop) in any publications:
@misc{pascal-voc-2010, author = "Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.", title = "The {PASCAL} {V}isual {O}bject {C}lasses {C}hallenge 2010 {(VOC2010)} {R}esults", howpublished = "http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html"}
The VOC2010 data includes images obtained from the “flickr” website. Use of these images must respect the corresponding terms of use:
For the purposes of the challenge, the identity of the images in the database, e.g. source and name of owner, has been obscured. Details of the contributor of each image can be found in the annotation to be included in the final release of the data, after completion of the challenge. Any queries about the use or ownership of the data should be addressed to the organizers.
We gratefully acknowledge the following, who spent many long hours providing annotation for the VOC2010 database:
Yusuf Aytar, Jan Hendrik Becker, Patrick Buehler, Miha Drenik, Chris Engels, Ali Eslami, Adrien Gaidon, Sam Johnson, Jyri Kivinen, Lubor Ladicky, Markus Mathias, Alastair Moore, Glenn Sheasby, Paul Sturgess, David Tingdahl, Josiah Wang.
We also thank Alexander Sorokin for production and support of the annotation systems for Mechanical Turk, Ivan Laptev for development of the action classification task, and Marcin Eichner, Ali Eslami, Lubor Ladicky, Marcin Marszalek, Arpit Mittal and Andrea Vedaldi for testing of the development kit and evaluation server. We are also most grateful to Yusuf Aytar for further development and administration of the evaluation server.
The preparation and running of this challenge is supported by the EU-funded PASCAL2 Network of Excellence on Pattern Analysis, Statistical Modelling and Computational Learning.
We are grateful to Alyosha Efros for providing additional funding for annotation on Mechanical Turk.
The main challenges have run each year since 2005. For more background on VOC, the following journal paper discusses some of the choices we made and our experience in running the challenge, and gives a more in depth discussion of the 2007 methods and results:
The PASCAL Visual Object Classes (VOC) Challenge
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A.
International Journal of Computer Vision, 88(2), 303-338, 2010
Bibtex source | Abstract | PDF
The table below gives a brief summary of the main stages of the VOC development.
Year | Statistics | New developments | Notes |
---|---|---|---|
2005 | Only 4 classes: bicycles, cars, motorbikes, people. Train/validation/test: 1578 images containing 2209 annotated objects. | Two competitions: classification and detection | Images were largely taken from exising public datasets, and were not as challenging as the flickr images subsequently used. This dataset is obsolete. |
2006 | 10 classes: bicycle, bus, car, cat, cow, dog, horse, motorbike, person, sheep. Train/validation/test: 2618 images containing 4754 annotated objects. | Images from flickr and from Microsoft Research Cambridge (MSRC) dataset | The MSRC images were easier than flickr as the photos often concentrated on the object of interest. This dataset is obsolete. |
2007 | 20 classes:
Train/validation/test: 9,963 images containing 24,640 annotated objects. |
|
This year established the 20 classes, and these have been fixed since then. This was the final year that annotation was released for the testing data. |
2008 | 20 classes. The data is split (as usual) around 50% train/val and 50% test. The train/val data has 4,340 images containing 10,363 annotated objects. |
|
|
2009 | 20 classes. The train/val data has 7,054 images containing 17,218 ROI annotated objects and 3,211 segmentations. |
|
|
2010 | 20 classes. The train/val data has 10,103 images containing 23,374 ROI annotated objects and 4,203 segmentations. |
|
|
The goal is to select/design a classifier (and any pre-processing systems, including a feature extractor) that correctly classifies EEG data into one of two classes. The winner will be the submission that maximizes the area under the ROC curve.
Anyone that has an interest in machine learning and that has access to Matlab®.
Registration is not required. However, if you wish to receive important updates on the competition by email then please send a request to hildk@bme.ogi.edu.
The training data may be downloaded from mlsp2010TrainingData.mat (44 MB). The training data, which is in Matlab® format, consist of EEG data collected while a subject viewed satellite images that were displayed in the center of an LCD monitor approximately 43 cm in front of them.
There are 64 channels of EEG data. The total number of samples is 176378. The sampling rate is 256 Hz. There are 75 blocks and 2775 total satellite images. Each block contains a total of 37 satellite images, each of which measures 500 x 500 pixels. All images within a block are displayed for 100 ms and each image is displayed as soon as the preceding image is finished. Each block is initiated by the subject after a rest period, the length of which was not specified in advance.
The subject was instructed to fixate on the center of the images and to press the space bar whenever they detected an instance of a target, where the targets are surface-to-air missile (SAM) sites. Subjects also needed to press the space bar to initiate a new block and to clear feedback information that was displayed to the subject after each block.
We expect a particular neural signature, the P300, to occur whenever the subject detects an instance of a target in one of the satellite images. The P300 gets its name from the fact that detection of a (rare, pre-defined) target causes a deflection of the EEG signal aproximately 300 ms after the visual stimulus is presented to the subject. The P300 is most prominent in the midline channels (e.g., Pz, Cz, Fz; see cap_64_layout_medium.jpg for the layout of the channels). In addition, there is a separate neural signature associated with the pressing of the space bar. For our data, this second neural signal occurs around 500-600 ms after the stimulus containing the target is presented.
The variables in the training data are defined as follows,
The test data are not made available to the participants. Instead, participants will submit Matlab® code, which the competition chairs will test by passing the test data to the submitted code.
Like before, there are 64 channels of EEG data, the sampling rate is 256 Hz, there is no delay between images within the same block, the subject rests between blocks for as long as they wish, and the subject pressed the space bar to signify that they detected a target, to initiate a new block, and to clear feedback information that was displayed after each block.
Unlike before, the test data consist of 890 blocks and 9891 satellite images, the total number of samples of EEG data is 1603334, every other image within a block is a mask image (mask images do not contain targets), the buttonTrigger variable is not available, and the imageTrigger variable takes only values of 0 or 1, where a 1 corresponds to the onset of each prospective target image (i.e., the satellite images) and 0 is used elsewhere. Another difference is that 4 different image durations are used in the test data. The image durations, which apply to both satellite and mask images, are (approximately) 50 ms, 100 ms, 150 ms, and 200 ms. All images within a given block have the same image duration and all blocks having a specified image duration are grouped together. Each block contains 22, 10, 7, and 5 prospective target images when the image duration is 50 ms, 100 ms, 150 ms, and 200 ms, respectively. Keep in mind that the time difference between successive prospective target images is twice the corresponding image duration due to the presence of the mask images. Hence, successive prospective target images within a block appear every (approximately) 100 ms, 200 ms, 300 ms, or 400 ms.
A successful submission consists of (1) the names of the team members (each person may belong to at most two teams and each team is allowed a single submission), (2) the name(s) of the host institutions of the researchers, (3) a 1-3 paragraph description of the approach used, and (4) Matlab® code, myFunction.m, which must conform to the requirements below.
The competition chairs will call the submitted code using,
>> [out] = myFunction(eegData,t,imageTrigger,eegLabel,eegCoord,thr);
where the first 5 input variables correspond to the test data, out is a (9891 x 1) vector consisting of 0’s and 1’s (each 0 corresponds to a predicted non-target and each 1 corresponds to a predicted target) and thr is a user-defined threshold that biases the predicted class (0 < thr < 1). When thr = 0, the code should predict that all prospective target images are non-targets and, when thr = 1, the code should predict that all prospective target images are targets. The thr variable will be used to construct an ROC curve. Example code that demonstrates the use of thr can be downloaded from myFunction.m. The code should not write to any drive, it must finish running in a reasonable time (on the order of minutes), and must consist only of regular, uncompiled Matlab® code (P-code, mex files, and compiled code are, e.g., forbidden). Parameter values for the trained classifier can be hardcoded into myFunction.m. Alternatively, entrants can read parameter values into myFunction.m from a single user-supplied *.mat file.
prior to submitting, we recommend testing the code on the training data (after performing the following: “f=find(imageTrigger==2); imageTrigger(f)=1;”).
Submissions must be emailed to hildk@bme.ogi.edu no later than April 8, 2010.
The MLSP 2010 proceedings will include a publication, written by the competition chairs, which describes the competition and shows the comparison of the submitted methods. The committee chairs will invite up to three teams to submit a two-page summary that discusses the method they used in the competition (the mathematical notation must be consistent with the paper authored by the competition chairs). Pending approval, the invited summaries will be published in the conference proceedings. The process of selecting teams to which invitations to publish will be sent is distinct from selecting the winner of the competition. Invitations to publish will be based on both (1) utilization of machine learning principles and (2) performance in terms of area under ROC curve.
Up to two N900 high-performance mobile computers from Nokia will be awarded as prizes. The 2010 MLSP Competition Committee will distribute one award to each team or teams that they select, where the selection is based on: (1) the performance of the submitted methods and (2) the requirement that at least one member of each selected team attend the 2010 MLSP Conference. Members of the 2010 MLSP Competition Committee (and everyone belonging to any of the labs of the 2010 MLSP Competition Committee) are not eligible for this award.
In addition to the Nokia mobile devices, the MLSP Competition Committee will award a maximum of 1500 Euros to the selected winning team(s). The monetary award(s), provided by the PASCAL2 Challenge Program, are intended to defray some of the expenses related to traveling to and attending the MLSP 2010 Conference. More details will be made available at a later date.
[1] K.B. Campbell, E. Courchesne, T.W. Picton, and K.C. Squires, “Evoked potential correlations of human information processing,” Biological Psychology, Vol. 8, pp. 45-68, 1979.
[2] T.W. Picton, “The P300 wave of the human event-related potential,” Journal of Clinical Neurophysiology, Vol. 9, No. 4, pp. 456-479, 1992.
[3] Barry S. Oken, “Evoked Potentials in Clinical Medicine,” edited by K.H. Chiappa, Lippincott-Raven, Philadelphia, PA, 1997.
The goal of the “BCI Competition IV” is to validate signal processing and classification methods for Brain-Computer Interfaces (BCIs). Compared to the past BCI Competitions, new challanging problems are addressed that are highly relevant for practical BCI systems, such as
The organizers are aware of the fact that by such a competition it is impossible to validate BCI systems as a whole. But nevertheless we envision interesting contributions to ultimately improve the full BCI and to provide a challenging data base for the research community.
For each data set specific goals are given in the respective description. Technically speaking, each data set consists of single-trials of spontaneous brain activity, one part labeled (calibration or training data) and another part unlabeled (evaluation or test data), and a performance measure. The goal is to infer labels (or their probabilities) for the evaluation data sets from calibration data that maximize the performance measure for the true (but to the competitors unknown) labels of the evaluation data. Results will be announced at a workshop of the NIPS 2008 conference and on this web site. For each data set, the competition winner gets a chance to publish the algorithm in an individual article that will appear in a volume devoted to this competition.
Data sets 1: ‹motor imagery, uncued classifier application› (description)
provided by the Berlin BCI group: Technische Universität Berlin (Machine Learning Laboratory) and Fraunhofer FIRST (Intelligent Data Analysis Group) (Klaus-Robert Müller, Benjamin Blankertz, Carmen Vidaurre, Guido Nolte), and Campus Benjamin Franklin of the Charité – University Medicine Berlin, Department of Neurology, Neurophysics Group (Gabriel Curio)
EEG, motor imagery (2 classes of left hand, right hand, foot); evaluation data is continuous EEG which contains also periods of idle state
[64 EEG channels (0.05-200Hz), 1000Hz sampling rate, 2 classes (+ idle state), 7 subjects]
Data sets 2a: ‹4-class motor imagery› (description)
provided by the Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology, (Clemens Brunner, Robert Leeb, Gernot Müller-Putz, Alois Schlögl, Gert Pfurtscheller)
EEG, cued motor imagery (left hand, right hand, feet, tongue)
[22 EEG channels (0.5-100Hz; notch filtered), 3 EOG channels, 250Hz sampling rate, 4 classes, 9 subjects]
Data sets 2b: ‹motor imagery› (description)
provided by the Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology, (Robert Leeb, Clemens Brunner, Gernot -Müller-Putz, Alois Schlögl, Gert Pfurtscheller)
EEG, cued motor imagery (left hand, right hand)
[3 bipolar EEG channels (0.5-100Hz; notch filtered), 3 EOG channels, 250Hz sampling rate, 2 classes, 9 subjects]
Data sets 3: ‹hand movement direction in MEG› (description)
provided by the Brain Machine Interfacing Initiative, Albert-Ludwigs-University Freiburg, the Bernstein Center for Computational Neuroscience Freiburg and the Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen (Stephan Waldert, Carsten Mehring, HubertPreissl, Christoph Braun)
The data set contains directionally modulated low-frequency MEG activity that was recorded while subjects performed wrist movements in four different directions.
[10 MEG channels (filtered to 0.5-100Hz), 400Hz sampling rate, 4 classes, 2 subjects]
Data sets 4: ‹finger movements in ECoG› (description)
provided by Departments of Physics and Medicine of the University of Washington, Seattle (Kai J. Miller) and Wadsworth Center, NYS Department of Health (Gerwin Schalk)
ECoG data during individual flexions of the five fingers; movements acquired with a data glove.
[48 – 64 ECoG channels (0.15-200Hz), 1000Hz sampling rate, 5 classes, 3 subjects]
[ top ]
July 3rd 2008: launching of the competition
November 21st 2008, midnight CET to Nov 22nd: deadline for submissions
December 12th 2008: announcement of the results at a workshop of NIPS 2008 and on this web site
[ top ]
Submissions to a data set are to be sent to the responsible contact person as stated in the data set description. The submission has to comprise the estimated labels, names and affiliations of all involved researchers and a short note on the involved processing techniques. We send confirmations for each submission we get. If you do not receive a confirmation within 2 days please resend your email with CC to the other organizing committee members, ‹benjamin.blankertz@tu-berlin.de›, ‹alois.schloegl@tugraz.at›, ‹waldert@bccn.uni-freiburg.de›, ‹kjmiller@u.washington.edu›.
One researcher may NOT submit multiple results to one data set. She/he has to decide for her/his favorite one. However: From one research group multiple submissions to one data set are possible. The sets of involved researchers do not have to be disjoint, but (1) the ‘first author’ (contributor) should be distinct, and (2) approaches should be substantially different.
For details on how to submit your results please refer to the description of the respective data set. If questions remain unanswered send an email to the responsable contact person for the specific data set which is indicated in the description.
Submissions are evaluated for each group of data sets separately. There is no need to submit for all data sets of the competition in order to participate, however each submission must provide results for all data sets of one group (e.g., for all subjects provided in Data sets 2a).
Each participant agrees to deliver an extended description (4-6 pages) of the used algorithm for publication until Februar 1st 2009 in case she/he is the winner for one of the data sets.
Berlin: Benjamin Blankertz, Carmen Vidaurre, Michael Tangermann, Klaus-Robert Müller
Graz: Clemens Brunner, Robert Leeb, Gernot Müller-Putz, Alois Schlögl, Gert Pfurtscheller
Freiburg/Tübingen: Stephan Waldert, Carsten Mehring, Ad Aertsen, Niels Birbaumer
Washington/Albany: Kai J. Miller, Gerwin Schalk
[ top ]
The first BCI Competition was announced at NIPS 2001, and the second at NIPS 2002. The first competition was a first try to see how such a thing would work and it was only announced in a smaller community. Accordingly there were not such much submissions, but nevertheless many researchers showed great interest when the results were published (first in the internet and then in IEEE Trans Neural Sys Rehab Eng, 2003, vol 11(2), pp. 184-185 [ draft ]). For the second competition data sets were provided by four of the leading groups in EEG-based BCIs. Here we received 59 submissions. A review of the 2nd competition appeared in IEEE Trans Biomed Eng, 51(6):1044-1051, 2004 [ draft ] and articles of all winning teams of the competition were published in the same issue which provides a good overview of the state of art in classification techniques for BCI. The 3rd BCI Competition involved data sets from five BCI labs and we received 99 submissions. It was reviewed in IEEE Trans Neural Sys Rehab Eng, 14(2):153-159, 2006 [ draft ] and individual articles of the competition winners appeared in different journals.
can be found here. Please help us to make the list complete and keep it up to date by reporting unlisted papers to ‹benjamin.blankertz@tu-berlin.de›, preferably PubMed ID (PMID) or in BibTex format.
Links to General Interest BCI Sites.
We are pleased to announce the launch of the Large Scale Hierarchical Text classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text classification competition using large datasets based on the ODP Web directory data (www.dmoz.org).
Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories are an example. Along with their widespread use, comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data. The reasons for this are the simultaneous increase in the number of classes and their hierarchical organization. The latter leads to a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for the learning methods.
The challenge will consist of four tasks with partially overlapping data. Information regarding the tasks and the challenge rules can be found at the “Tasks, Rules and Guidelines” link.
We plan a two-stage evaluation of the participating methods: one measuring classification performance and one computational performance. It is important to measure both, as they are dependent. The results will be included in a final report about the challenge and we also aim at organizing a special workshop.
In order to register for the challenge and gain access to the datasets, please create a new account at http://lshtc.iit.demokritos.gr/.
Organisers:
Eric Gaussier, LIG, Grenoble, France
George Paliouras, NCSR “Demokritos” , Athens, Greece
Aris Kosmopoulos, NCSR “Demokritos” and AUEB, Athens, Greece
Sujeevan Aseervatham, LIG, Grenoble, France