The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.

The scientific goals are:

  • To learn of the phenomena underlying word construction in natural languages
  • To discover approaches suitable for a wide range of languages
  • To advance machine learning methodology

Morpho Challenge 2007 is a follow-up to our previous Morpho Challenge 2005 (Unsupervised Segmentation of Words into Morphemes). The task of Morpho Challenge 2007 is more general in that we are not necessarily looking for an explicit segmentation of words this time, but a morpheme analysis of the word forms in the data. (For instance, the English words “boot, boots, foot, feet” might obtain the analyses “boot, boot + plural, foot, foot + plural”, respectively.)

Participation in the previous challenge is by no means a prerequisite for participation in Morpho Challenge 2007. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop arranged in conjunction with CLEF 2007 (Cross-Language Evaluation Forum). Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in an extended abstract. Please read the formatting instructions in rules.

The result tables from the latest evaluation runs are now in the Results page.

We are looking forward to an interesting competition!

Mikko Kurimo, Mathias Creutz and Matti Varjokallio
Adaptive Informatics Research Centre, Helsinki University of Technology
The organizers

Program committee

Levent Arslan, Boğaziçi University
Eric Atwell, University of Leeds
Samy Bengio, Google
Tolga Cilogu, Middle-East Technical University
Kadri Hacioglu, Colorado University
Colin de la Higuera, Jean Monnet University, Saint-Etienne
Chun Yu Kit, City University of Hong Kong
Dietrich Klakow, Saarland University
James Martin, University of Colorado at Boulder
Jan Nouza,Technical University of Liberec
Erkki Oja, Helsinki University of Technology
Murat Saraçlar, Boğaziçi University
Richard Sproat, University of Illinois, Urbana-Champaign
Richard Wicentowski, Swarthmore College



The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.


Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.

Test languages

Data sets are provided for four languages: English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)


The task is the unsupervised morpheme analysis of every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

Solutions, in which a large number of parameters must be “tweaked” separately for each test language, are of little interest. This challenge aims at the unsupervised (or very minimally supervised) morpheme analysis of words. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.


The segmentations will be evaluated in two complementary ways:

  • Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.
  • Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.

Competition 1 will include all four test languages. Winners will be selected separately for each language. As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the program committee based on the results and an extended abstract of at most 6 pages.

Extended abtracts

For the extended abstract you can use the two-column ACL/COLING format (figures and tables may still span the whole width of a page). Detailed formatting instructions can be found here: PDF or PS. You can use the following files: Latex style file, Latex bibliography file, and a template Latex document. The maximum length of your paper is 6 pages (including references and figures). Email your extended abstract to the organizers by May 31.

Camera-ready submission

The final camera-ready submissions use a different format than the papers submitted for review. We are sorry about the inconvenience of your having to reformat your documents. For your final paper submission (due August 15th), please use the single-column CLEF 2007 Notebook Proceedings format. Here are a sample PDF file and a template Latex document. Detailed formatting instructions can be requested from the Morpho Challenge Organizers. The maximum length of your paper is 10 pages (including references and figures), but you can ask for a couple of extra pages if you think it improves the paper. Email your final paper to the organizers.


In the case of disagreement the organizers will decide the final interpretation of the rules.


There are a number of data files involved in this challenge. Each type of file is available for each language.

Word list (input)

First and foremost, there is a list of word forms. The words have been extracted from a text corpus, and each word in the list is preceded by its frequency in the corpus used.

For instance, a subset of the supplied English word list looks like this:

1 barefoot's
2 barefooted
6699 feet
653 flies
2939 flying
1782 foot
64 footprints

Result file (output, i.e., what to submit)

The participants’ task is to return a list containing exactly the same words as in the input, with morpheme analyses provided for each word. The list returned shall not contain the word frequency information.

A submission for the above English words may look like this:

barefoot's      BARE FOOT +GEN
barefooted      BARE FOOT +PAST
feet            FOOT +PL
flies           FLY_N +PL, FLY_V +3SG
flying          FLY_V +PCP1
foot            FOOT
footprints      FOOT PRINT +PL

There are a number of things to note about the result file: Each line of the file contains a word (e.g., “feet”) separated from its analysis (e.g., “FOOT +PL”) by one TAB character. The word needs to look exactly as it does in the input; no capitalization or change of character encoding is allowed. The analysis contains morpheme labels separated using space. The order in which the labels appear does not matter; e.g., “FOOT +PL” is equivalent to “+PL FOOT”. The labels are arbitrary: e.g., instead of using “FOOT” you might use “morpheme784” and instead of “+PL” you might use “morpheme2”. However, we strongly recommend you to use intuitive labels, when possible, since they make it easier for anyone to get an idea of the quality of the result by looking at it.

If a word has several interpretations, all interpretations should be supplied: e.g., the word “flies” may be the plural form of the noun “fly” (insect) or the third person singular present tense form of the verb “to fly”. The alternative analyses must be separated using a comma, as in: “FLY_N +PL, FLY_V +3SG”. The existence of alternative analyses makes the task challenging, and we leave it to the participants to decide how much effort they will put into this aspect of the task. In English, for instance, in order to get a perfect score, it would be necessary to distinguish the different functions of the ending “-s” (plural or person ending) as well as the different parts-of-speech of the stem “fly” (noun or verb). As the results will be evaluated against reference analyses (our so-called gold standard), it is worth reading about the guiding principles used when constructing the gold standard.

As far as we understand, you can use any characters in your morpheme labels except whitespace and comma (,). However, we cannot guarantee that the evaluation scripts will work properly, if your labels contain some “strange” characters.

Text corpus

The word list (input data) has been constructed by collecting word forms occurring in a text corpus. The text corpora have been obtained from the Wortschatz collection at the University of Leipzig (Germany). We used the plain text files (sentences.txt for each language); the corpus sizes are 3 million sentences for English, Finnish and German, and 1 million sentences for Turkish. For English, Finnish and Turkish we use preliminary corpora, which have not yet been released publicly at the Wortschatz site. The corpora have been preprocessed for the Morpho Challenge (tokenized, lower-cased, some conversion of character encodings).

If the participants like to do so, they can use the corpora in order to get information about the context in which the different words occur.

We are most grateful to the University of Leipzig for making these resources available to the Challenge, and in particular we thank Stefan Bordag for his kind assistance.

Gold standard morpheme analyses

The desired “correct” analyses for a random sample of circa 500 words are supplied for each language. These samples can be used for visual inspection and as a development test set (in order to get a rough estimate of the performance of the participants’ morpheme-analyzing algorithm).

The format of the gold standard file is exactly the same as that of the result file to be submitted. That is, each line contains a word and its analysis. The word is separated from the analysis by a TAB character. Morpheme labels in the analysis are separated from each other by a space character. For some words there are multiple correct analyses. These alternative analyses are separated by a comma (,). Examples:

Language Examples
English baby-sitters       baby_N sit_V er_s +PL
indoctrinated      in_p doctrine_N ate_s +PAST
Finnish linuxiin           linux_N +ILL
makaronia          makaroni_N +PTV
German choreographische   choreographie_N isch +ADJ-e
zurueckzubehalten  zurueck_B zu be halt_V +INF
Turkish kontrole           kontrol +DAT
popUlerliGini      popUler +DER_lHg +POS2S +ACC, popUler +DER_lHg +POS3 +ACC3

The English and German gold standards are based on the CELEX data base. The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The Turkish gold-standard analyses have been obtained from a morphological parser developed at Boğaziçi University; it is based on Oflazer’s finite-state machines, with a number of changes. We are indebted to Ebru Arısoy for making the Turkish gold standard available to us.

The morphological analyses are morpheme analyses. This means that only grammatical categories that are realized as morphemes are included. For instance, for none of the languages will you find a singular morpheme for nouns or a present-tense morpheme for verbs, because these grammatical categories do not alter or add anything to the word form, in contrast to, e.g., the plural form of a noun (house vs. house+s), or the past tense of verbs (help vs. help+ed, come vs. came).

The morpheme labels that correspond to inflectional (and sometimes also derivational) affixes have been marked with an initial plus sign (e.g., +PL, +PAST). This is due to a feature of the evaluation script: in addition to the overall performance statistics, evaluation measures are also computed separately for the labels starting with a plus sign and those without an initial plus sign. It is thus possible to make an approximate assessment of how accurately affixes are analyzed vs. non-affixes (mostly stems). If you use the same naming convention when labeling the morphemes proposed by your algorithm, this kind of statistics will be available for your output (see the evaluation page for more information).

The morpheme labels that have not been marked as affixes (no initial plus sign) are typically stems. These labels consist of an intuitive string, usually followed by an underscore character (_) and a part-of-speech tag, e.g., “baby_N”, “sit_V”. In many cases, especially in English, the same morpheme can function as different parts-of-speech; e.g., the English word “force” can be a noun or a verb. In the majority of these cases, however, if there is only a difference in syntax (and not in meaning), the morpheme has been labeled as either a noun or a verb, throughout. For instance, the “original” part-of-speech of “force” is a noun, and consequently both noun and verb inflections of “force” contain the morpheme “force_N”:

force force_N
force's force_N GEN
forced force_N +PAST
forces force_N +3SG, force_N +PL
forcing force_N +PCP1

Thus, there is not really a need for your algorithm to distinguish between different meanings or syntactic roles of the discovered stem morphemes. However, in some rare cases, if the meanings of the different parts-of-speech do differ clearly, there are two variants, e.g., “train_N” (vehicle), “train_V” (to teach), “fly_N” (insect), “fly_V” (to move through the air). But again, if there are ambiguous meanings within the same part-of-speech, these are not marked in any way, e.g., “fan_N” (device for producing a current of air) vs. “fan_N” (admirer). This notation is a consequence of using CELEX and FINTWOL as the sources for our gold standards. We could have removed the part-of-speech tags, but we decided to leave them there, since they carry useful information without significantly making the task more difficult. There are no part-of-speech tags in the Turkish gold standard.

Random word pairs file

If you want to carry out a small-scale evaluation yourself using the gold standard sample, you need to download a randomly generated so-called word pairs file for each language to be tested. Read more about this on the evaluation page.

Character encoding

In the source data used for the different languages, there is variation in how accurately certain distinctions are made when letters are rendered. This makes it hard to apply a unified character encoding scheme for all the languages (such as UTF-8). Thus, the following encodings have been used, in which all letters are encoded as one-byte (8-bit) characters:

Standard text. All words are lower-cased, also proper names.
ISO Latin 1 (ISO 8859-1). The Scandinavian special letters å, ä, ö (as well as other letters occuring in loan words, e.g., ü, é, à) are rendered as one-byte characters. All words are lower-cased, also proper names.
Standard text. All words are lower-cased, also all nouns. The German umlaut letters are rendered as the corresponding non-umlaut letter followed by “e”, e.g., “laender” (Länder), “koennte” (könnte), “fuer” (für). Double-s is rendered as “ss”, e.g., “strasse” (Straße). This coarse encoding is due to the fact that CELEX, the source for the morphological gold standard, utilizes this scheme. Note, however, that in the data you may see special letters encoded using ISO Latin 1 in some loan words, e.g., “société”, “l’unità” (these words are not included in CELEX and their analyses will not be evaluated).
Standard text. All words are lower-cased. The letters specific to the Turkish language are replaced by capital letters of the standard Latin alphabet, e.g., “açıkgörüşlülüğünü” is spelled “aCIkgOrUSlUlUGUnU”.


Language Word list Text corpus Sample of gold standard Random word pairs file
English Text Text gzipped Text gzipped Text Text
Finnish Text Text gzipped Text gzipped Text Text
German Text Text gzipped Text gzipped Text Text
Turkish Text Text gzipped Text gzipped Text Text

Instead of downloading each file separately, you can download the whole package, either as a tar/gzip file: morphochal07data.tgz (530 MB; unpack using “tar xzf“) or as a zip file: (540 MB).

Competition 1

In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Samples of the gold standards used are available for download on the datasets page.

Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).

In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:

F-measure = 1/(1/Precision + 1/Recall).

Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file by random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The total number of points is then divided by the total number of word pairs.

For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed: Say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard the correct analyses of these words are: “abyss_N”, “abyss_N +PL”, “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountain” is incorrect (no morpheme in common). Precision here is thus 1/2 = 50%.

Recall is calculated analogously to recall: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard by random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. The total number of points is then divided by the total number of word pairs.

For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, some normalization of the points is carried out in order not to give these words considerably more weight in the evaluation than “less complex” words. We will spare the participants from the gory details. (The passionately interested reader may have a look at the source code of the evaluation script.)

Evaluation of a sample (development test set)

You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: The evaluation program is invoked as follows: [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile

Four files are given as arguments to

  1. wordpairsfile_goldstd: this is the “random word pairs file” available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
  2. wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
  3. goldstdfile:this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses for circa 500 words.
  4. resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses.

The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:

PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%

Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.

The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.

Sampling word pairs for the calculation of an estimate of the precision

In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the 500-word gold standard that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the 500-word gold standard sample. The problem here is that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.

Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:

cut -f1 goldstdfile > relevantwordsfile

Then sample word pairs for 100 words selected by random from your results file: -refwords relevantwordsfile < resultfile > wordpairsfile_result

The necessary Perl program is The output file wordpairsfile_result is used as input to (see above).

Competition 2

Competition 2 does not require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used. It is possible that no evaluation data for Turkish will be available for Competition 2.

The words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the evaluation using a state-of-the-art retrieval method (the latest version of the freely available LEMUR toolkit, and the evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for each language and these 4 different indexing runs in each language:

1. All morphemes from the training data are used as index terms, Tfidf (BM25) weighting for all
2. Additional morphemes from the IR data used, too, Tfidf (BM25) weighting for all
3. All morphemes from the training data are used as index terms, Okapi (BM25) weighting for all except the most common ones (stoplist)
4. Additional morphemes from the IR data used, too, Okapi (BM25) weighting for all except the most common ones (stoplist)


This page contains only the result tables from the latest evaluation runs. The full evaluation report was published at the Workshop.

Competition 1

The segmentation with the highest F-measure is the best. The winner is selected separately for each language.
Download the report on the final results from here.


Competition 2

In Competition 2, the organizers applied the analyses provided by the participants in information retrieval experiments. The words in the queries and source documents were replaced by the corresponding morpheme analyses provided by the participants, and the search was then based on morphemes instead of words. The evaluation was perfomed using a state-of-the-art retrieval method (the latest version of the freely available LEMUR toolkit. The evaluation criterion was Uninterpolated Average Precision There were several different categories in Competition 2 and the winner with the highest Average Precision was selected separately for each language and each category:

  1. All morpheme analysis from the training data are used as index terms “withoutnew” vs. additionally using also the morpheme analysis for new words that existed in the IR data but not in the training data “withnew”.
  2. Tfidf (BM25) term weighting was utilized for all index terms without any stoplists vs. Okapi (BM25) term weighting for all index terms excluding an automatic stoplist consisting of 75,000 (Finnish) or 150,000 (German and English) most common terms. The stoplist was used with the Okapi weighting, because it did not perform well with the indexes that had many very common terms.

Download the report on the final results from here.



The proceedings are available in CLEF working notes.
The manuscripts and abstracts can be downloaded from here.

Reference methods

To study how the different morpheme analysis performed in the IR tasks, we attempted the same tasks with different reference methods. This also revealed us whether the unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word based indexing.

  1. Morfessor Categories-Map: The same Morfessor Categories-Map as described in Competition 1 was used for the unsupervised morpheme analysis. The stem vs. suffix tags were kept, but did not receive any special treatment in the indexing as we wanted to keep the IR evaluation as unsupervised as possible.
  2. Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis. This means that the obtained subword units were directly used as index terms as such. This was performed using the Morfessor Baseline algorithm as in Morpho Challenge 2005. We expected that this would not be optimal for IR, but because the unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
  3. dummy: No words were split nor any morpheme analysis provided. This means the all were directly used as index terms as such without any stemming or tags. We expected that although the morpheme analysis should provide helpful information for IR, all the submissions would not probably be able to beat this brute force baseline. However, if some morpheme analysis method would consistently beat this baseline in all languages and task, it would mean that the method were probably useful in a language and task independent way.
  4. grammatical: The words were analysed using the gold standard in each language that were utilised as the “ground truth” in the Competition 1. Besides the stems and suffixes, the gold standard analyses typically consist of all kinds of grammatical tags which we decided to simply include as index terms, as well. For many words the gold standard analyses included several alternative interpretations that were all included in the indexing. However, we decided to also try the method adopted in the morpheme segmentation for Morpho Challenge 2005 that only the first interpretation of each word is applied. This was here called “grammatical first” whereas the default was called “grammatical all”. Because our gold standards are quite small, 60k (English) – 600k (Finnish), compared to the amount of words that the unsupervised methods can analyse, we did not expect “grammatical” to perform particularly well, even though it would probably capture some useful indexing features to beat the “dummy”, at least.
  5. Porter: No real morpheme analysis was performed, but the words were stemmed by the Porter stemming, an option provided by the Lemur toolkit. Because this is quite standard procedure in IR, especially for English text material, we expected this to provide the best results, at least for English. For the other languges the default Porter stemming was not likely to perform very well.
  6. Tepper: A hybrid method developed by Michael Tepper was utilized to improve the morpheme analysis reference obtained by our Morfessor Categories-MAP. Based on the obtained performance in Competition 1, we expected that this could provide some interesting results here, as well.