The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

  • To learn of the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology

Morpho Challenge 2009 is a follow-up to our previous Morpho Challenges of 2005, 2007 and 2008. The task of Morpho Challenge 2009 is similar to that of Morpho Challenge 2008, where the aim was to find the morpheme analysis of the word forms in the data. For this challenge, new Machine Translation tasks (from Finnish to English and from German to English) are added to evaluate the performance of the morpheme analysis.

Participation in the previous challenges is by no means a prerequisite for participation in Morpho Challenge 2009. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop. Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in a paper. Please read the formatting instructions in rules.

If you plan to participate in Morpho Challenge, please contact the organizers using the email address in contact and ask to be added to our mailing list. We will use this mailing list to provide news about the tasks, data and evaluations.

The results from the evaluation runs are now on the Results page.
The Workshop was held on September 30, 2009.


Mathias Creutz and Krista Lagus (2005). Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March.
[ Article (PDF) ]

Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, Mikko Kurimo, Janne Pylkkönen, and Sami Virpioja (2006). Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech and Language, Volume 20, Issue 4, October, pages 515–541.
[ Article (PDF) ]

Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi (2007). Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of the Machine Translation Summit XI, pages 491–498.
[ Article (PDF) ]

Mikko Kurimo and Matti Varjokallio (2008). Unsupervised morpheme analysis evaluation by a comparison to a linguistic Gold Standard – Morpho Challenge 2008. In Working Notes for the CLEF 2008 Workshop.
[ Article (PDF) ]

Mikko Kurimo and Ville Turunen (2008). Unsupervised morpheme analysis evaluation by IR experiments – Morpho Challenge 2008. In Working Notes for the CLEF 2008 Workshop.
[ Article (PDF) ]


Submission of large data files

Send an email to the organizers morphochallenge2008<> and tell them where they can download the data files. Small data files (no larger than a few MB) can be emailed directly. Please follow carefully the format of the result files described in datasets.


The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.


Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which three will be accepted.

Test languages

Data sets are provided for five languages: Arabic, English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)


The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

Solutions, in which a large number of parameters must be “tweaked” separately for each test language, are of little interest. This challenge aims at the unsupervised (or very minimally supervised) morpheme analysis of words. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.


The segmentations will be evaluated in three complementary ways:

  • Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.
  • Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
  • Competition 3: A machine translation (MT) model will be trained, where the words in the source language documents will be replaced by their proposed morpheme representations. The words in the source language evaluation data will then also be replaced by their proposed morpheme representations and the translation will be based on morphemes instead of words.

Competition 1 will include all five test languages. Winners will be selected separately for each language. As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.
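The winner-selection rule above can be sketched in a few lines of Python. This is only an illustration: the `Score` record, the team names, and the numbers are hypothetical, and rounding the F-measure to two decimals before comparing is an assumption about what counts as "the same F-measure".

```python
from typing import NamedTuple


class Score(NamedTuple):
    team: str          # hypothetical submission name
    f_measure: float   # in percent
    precision: float   # in percent


def pick_winner(scores):
    """Highest F-measure wins; equal F-measures are broken by higher precision."""
    # Assumption: F-measures are compared after rounding to two decimals.
    return max(scores, key=lambda s: (round(s.f_measure, 2), s.precision))


scores = [
    Score("method-A", 48.00, 40.00),
    Score("method-B", 48.00, 60.00),
    Score("method-C", 45.10, 70.00),
]
winner = pick_winner(scores)
print(winner.team)
```

Here the two top submissions tie on F-measure, so the one with the higher precision is selected.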

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and a paper of at most 10 pages describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.

Workshop papers

For your paper submission (due August 15), please use the single-column CLEF 2007 Notebook Proceedings format. Here are a sample PDF file and a template LaTeX document. Email your paper submission to the organizers.
Formatting instructions:

size: A4
format: pdf (if difficult, ps or MS Word (rtf) are acceptable) Do NOT lock the pdf file.
borders: top, left, right 2.5 cm; bottom 3 cm
text size: 16 x 24 cm
length: 10 pages maximum
title: Times 14 pt bold centered
author(s): Times 10 pt centered
abstract: Times 10 pt justified
ACM Categories and Subject Descriptors: Times 10 pt left aligned
Free Keywords: Times 10 pt left aligned
body text: Times 10 pt justified
Section Headings: Times 12 pt bold left aligned
Emphasis: Times 10 pt italic


In the case of disagreement the organizers will decide the final interpretation of the rules.

Competition 1

NEW: The evaluation measures of Competition 1 have been updated for Morpho Challenge 2009. Some bugs related to the handling of alternative analyses have been fixed in the scripts, and points are now measured as one per word, not one per word pair. The new evaluation scripts are now available:



The old scripts can be found in Challenge 2008.

In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Samples of the gold standards used are available for download on the datasets page.

Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).

In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:

F-measure = 2/(1/Precision + 1/Recall).
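As a quick check of the harmonic mean, the following Python sketch computes the F-measure from a precision and a recall value. The function name is illustrative; the input figures are the precision and recall percentages shown in the sample evaluation output on this page.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if either value is zero."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)


# Precision 69.00% and recall 25.59% (all morphemes) from the sample output.
print(f"{f_measure(69.00, 25.59):.2f}")  # 37.33
```

The result agrees with the F-measure reported on the same line of the sample output.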

Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen at random from the result file (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The maximum number of points for one sampled word is normalized to one. The total number of points is then divided by the total number of sampled words.

For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed. Say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard the correct analyses of these words are: “abyss_N”, “abyss_N +PL”, “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountains” is incorrect (no morpheme in common). Precision for the word “abyss” is thus 1/2 = 50%.
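The “abyss” example can be reproduced with a small Python sketch. It is a simplification, not the organizers' evaluation script: each word has a single analysis, every word pair counts equally, and the function and variable names are made up for illustration.

```python
def shares_morpheme(analysis_a, analysis_b):
    """True if the two analyses have at least one morpheme label in common."""
    return bool(set(analysis_a) & set(analysis_b))


def word_precision(pairs, gold):
    """Fraction of the sampled pairs that also share a morpheme in the gold standard.

    pairs: (sampled_word, partner_word) tuples built from the proposed analyses.
    gold:  dict mapping each word to its gold-standard morpheme labels.
    """
    if not pairs:
        return None  # no partner word was found for any morpheme
    hits = sum(1 for word, partner in pairs
               if shares_morpheme(gold[word], gold[partner]))
    return hits / len(pairs)


# Gold-standard analyses quoted in the example above.
gold = {
    "abyss": ["abyss_N"],
    "abysses": ["abyss_N", "+PL"],
    "mountains": ["mountain_N", "+PL"],
}
# Pairs produced by the proposed analysis "abys +s".
pairs = [("abyss", "abysses"), ("abyss", "mountains")]

print(word_precision(pairs, gold))  # 0.5
```

Only the first pair shares a gold-standard morpheme, which gives the 50% stated in the example.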

Recall is calculated analogously to precision: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen at random from the gold standard (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. The points per word are normalized to one and the total number of points is divided by the total number of words.

For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, normalization of the points is carried out. In short, an equal weight is given for each alternative analysis, as well as each word pair in an analysis. E.g., if a word has three alternative analyses, the first analysis has four morphemes, and the first word pair in that analysis has two morphemes in common, each of the two common morphemes will amount to 1/3*1/4*1/2=1/24 of the one point available for that word.
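The weight in this example can be verified with exact fractions. The sketch below only restates the arithmetic of the paragraph; the function and parameter names are illustrative.

```python
from fractions import Fraction


def common_morpheme_weight(n_alternatives, n_morphemes, n_common):
    """Share of a word's single point carried by one common morpheme.

    n_alternatives: number of alternative analyses of the word
    n_morphemes:    number of morphemes (hence word pairs) in the analysis
    n_common:       number of morphemes the word pair has in common
    """
    return (Fraction(1, n_alternatives)
            * Fraction(1, n_morphemes)
            * Fraction(1, n_common))


print(common_morpheme_weight(3, 4, 2))  # 1/24
```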

Evaluation of a sample (development test set)

You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: The evaluation program is invoked as follows: [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile

Four files are given as arguments to the evaluation script:

  1. wordpairsfile_goldstd: this is the “random word pairs file” available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
  2. wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
  3. goldstdfile: this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses for circa 500 words.
  4. resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses.

The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:

PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%

Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.
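If you want to collect the headline figures automatically, the output shown above can be parsed with a short Python sketch such as the following. The regular expression is an assumption based only on the sample lines on this page, and the function names are illustrative.

```python
import re
from statistics import mean

SAMPLE = """\
PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
"""

# Captures the block name, the measure, and the first percentage on the line.
LINE = re.compile(r"^(PART\d+|TOTAL)\.\s+(Precision|Recall|F-measure):\s+([\d.]+)%")


def headline_scores(output):
    """Map each block (PART0, ..., TOTAL) to its first value per measure."""
    scores = {}
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            block, measure, value = match.groups()
            scores.setdefault(block, {})[measure] = float(value)
    return scores


def mean_over_partitions(scores, measure):
    """Average one measure over the PART blocks only."""
    return mean(v[measure] for k, v in scores.items() if k.startswith("PART"))


scores = headline_scores(SAMPLE)
print(scores["TOTAL"]["F-measure"])
print(mean_over_partitions(scores, "Precision"))
```

With a single partition, the mean over the partitions coincides with the PART0 value.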

The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.
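The plus-sign convention is easy to apply when writing out your analyses. A minimal Python sketch with illustrative function names:

```python
def is_affix(label: str) -> bool:
    """A morpheme counts as an affix if its label starts with a plus sign."""
    return label.startswith("+")


def split_affixes(morphemes):
    """Separate an analysis into (non-affixes, affixes), keeping the order."""
    non_affixes = [m for m in morphemes if not is_affix(m)]
    affixes = [m for m in morphemes if is_affix(m)]
    return non_affixes, affixes


print(split_affixes(["abyss_N", "+PL"]))
```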

Sampling word pairs for the calculation of an estimate of the precision

In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the 500-word gold standard that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the 500-word gold standard sample. The problem here is that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.

Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:

cut -f1 goldstdfile > relevantwordsfile

Then sample word pairs for 100 words selected by random from your results file: -refwords relevantwordsfile < resultfile > wordpairsfile_result

The necessary Perl program is The output file wordpairsfile_result is used as input to (see above).
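To make the sampling procedure concrete, here is a rough Python sketch of the same idea. It is not the organizers' Perl program. It assumes that the gold standard sample has the word in the first tab-separated field (as the cut command above implies) and that the proposed analyses are already loaded into a dict from word to morpheme list; all function names are hypothetical.

```python
import random


def relevant_words(gold_lines):
    """First tab-separated field of each non-empty line (like `cut -f1`)."""
    return {line.rstrip("\n").split("\t", 1)[0]
            for line in gold_lines if line.strip()}


def sample_word_pairs(analyses, relevant, n_words=100, seed=0):
    """Sample words, then pair each of their morphemes with another relevant word.

    analyses: dict word -> list of proposed morphemes (one analysis per word)
    relevant: set of words that occur in the gold standard sample
    Returns (word, morpheme, partner_word) triples.
    """
    rng = random.Random(seed)
    # Index: morpheme -> relevant words whose proposed analysis contains it.
    index = {}
    for word in sorted(analyses):
        if word in relevant:
            for morpheme in set(analyses[word]):
                index.setdefault(morpheme, []).append(word)
    candidates = sorted(w for w in analyses if w in relevant)
    sampled = rng.sample(candidates, min(n_words, len(candidates)))
    pairs = []
    for word in sampled:
        for morpheme in analyses[word]:
            partners = [w for w in index.get(morpheme, []) if w != word]
            if partners:  # skip morphemes that occur in no other relevant word
                pairs.append((word, morpheme, rng.choice(partners)))
    return pairs


gold_lines = ["abyss\tabyss_N\n", "abysses\tabyss_N +PL\n", "mountains\tmountain_N +PL\n"]
analyses = {
    "abyss": ["abys", "+s"],
    "abysses": ["abys", "+ses"],
    "mountains": ["mountain", "+s"],
    "zebra": ["zebra"],  # not in the gold standard sample, so never used
}
pairs = sample_word_pairs(analyses, relevant_words(gold_lines))
print(sorted(pairs))
```

Because every word in a pair comes from the gold standard sample, each pair can actually be checked, at the cost of fewer observations.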

Competition 2

Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, participants who wish to submit morpheme analyses for words in their actual context (Competition 2b) should contact the organizers for information on how to register with CLEF to obtain the full texts.

In Competition 2 (and 2b), the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit). The most common morphemes in each participant’s submission will be left out from the index. The size of this stoplist will be proportional to the amount of the text data in each language and the stoplist size will be the same for each participant’s submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for Competitions 2 and 2b in each language.
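For readers unfamiliar with the criterion, uninterpolated average precision for a single query can be sketched in Python as below. This follows the standard textbook definition; the organizers' tooling may differ in details such as result-list cutoffs. The document identifiers are made up.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one query.

    Precision is taken at the rank of every relevant document retrieved and
    the sum is divided by the total number of relevant documents, so relevant
    documents that are never retrieved contribute zero.
    Assumes the ranked list contains no duplicate document ids.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


ranked = ["d3", "d1", "d7", "d2"]   # hypothetical retrieval result
relevant = {"d1", "d2", "d9"}       # hypothetical relevance judgments
ap = average_precision(ranked, relevant)
print(round(ap, 4))
```

Here relevant documents appear at ranks 2 and 4, giving (1/2 + 2/4) / 3. Averaging this value over all queries gives the mean figure usually reported for a run.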

Competition 3

In competition 3, the morpheme analyses proposed by the participants’ algorithm will be evaluated in a statistical machine translation (SMT) framework. The translation models will be trained to translate from a morphologically complex source language to English. The words of the source language will be replaced by their morpheme analyses before training. The translations from this morpheme-to-word model will be combined with translations from a standard word-to-word translation model. For all models, we will use a state-of-the-art phrase-based SMT system. Evaluation of the translations will be performed by applying an automatic metric such as BLEU on a held-out test set.
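The replacement step can be pictured with a small Python sketch. Whitespace tokenization, the fallback for words without an analysis, and the Finnish analysis used here are all illustrative assumptions, not the organizers' preprocessing.

```python
def to_morphemes(sentence, analyses):
    """Replace each word by its proposed morphemes; unknown words stay as they are."""
    tokens = []
    for word in sentence.split():
        tokens.extend(analyses.get(word, [word]))
    return " ".join(tokens)


# Hypothetical proposed analysis for one Finnish word form.
analyses = {"talossa": ["talo", "+ssa"]}
print(to_morphemes("talossa on kissa", analyses))  # talo +ssa on kissa
```

The same kind of replacement applies to the queries and documents in Competition 2.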

Data is from the Europarl corpus. The participants should apply their algorithms to the list of the word forms in the corpus. It is also possible to use the context information of the words by downloading the full corpus. (See datasets for details.)