The objective of the Challenge is to design a statistical machine learning algorithm that discovers which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling.
The scientific goals are:
- To learn of the phenomena underlying word construction in natural languages
- To discover approaches suitable for a wide range of languages
- To advance machine learning methodology
Morpho Challenge 2008 is a follow-up to our previous Morpho Challenge 2005 (Unsupervised Segmentation of Words into Morphemes) and Morpho Challenge 2007 (Unsupervised Morpheme Analysis). The task of Morpho Challenge 2008 is similar to the Morpho Challenge 2007, where the aim is to find the morpheme analysis of the word forms in the data. For this challenge, Arabic is added as one of the evaluated languages. For the IR task, there is also a new possibility to provide the morpheme analysis of the words in their context.
Participation in the previous challenges is by no means a prerequisite for participation in Morpho Challenge 2008. Everyone is welcome and we hope to attract many participating teams. The results will be presented in a workshop. Please read the rules and see the schedule. The datasets are available for download. Submit your analyses (result files) by sending them by email to the organizers, or by indicating a location where the organizers can download your files. Remember also to describe your algorithm in a paper. Please read the formatting instructions in rules.
Mikko Kurimo, Ville Turunen and Matti Varjokallio
Adaptive Informatics Research Centre, Helsinki University of Technology
The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.
Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.
Data sets are provided for five languages: Arabic, English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.
(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)
The task is the unsupervised morpheme analysis of every word form contained in a word list supplied by the organizers for each test language.
The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.
Solutions, in which a large number of parameters must be “tweaked” separately for each test language, are of little interest. This challenge aims at the unsupervised (or very minimally supervised) morpheme analysis of words. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.
The segmentations will be evaluated in two complementary ways:
- Competition 1: The proposed morpheme analyses will be compared to a linguistic “gold standard”.
- Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
Competition 1 will include all five test languages. Winners will be selected separately for each language. As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.
Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.
Workshop and publication
All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and a paper of at most 10 pages describing the algorithm and the data submission.
For your paper submission (due August 1st), please use the single-column CLEF 2007 Notebook Proceedings format. Here are a sample PDF file and a template Latex document. Detailed formatting instructions can be requested from the Morpho Challenge organizers. The maximum length of your paper is 10 pages (including references and figures). Email your paper submission to the organizers.
In the case of disagreement the organizers will decide the final interpretation of the rules.
There are a number of data files involved in this challenge. Each type of file is available for each language.
Word list (input)
First and foremost, there is a list of word forms. The words have been extracted from a text corpus, and each word in the list is preceded by its frequency in the corpus used.
For instance, a subset of the supplied English word list looks like this:
... 1 barefoot's 2 barefooted 6699 feet 653 flies 2939 flying 1782 foot 64 footprints ...
Result file (output, i.e., what to submit)
The participants’ task is to return a list containing exactly the same words as in the input, with morpheme analyses provided for each word. The list returned shall not contain the word frequency information.
A submission for the above English words may look like this:
... barefoot's BARE FOOT +GEN barefooted BARE FOOT +PAST feet FOOT +PL flies FLY_N +PL, FLY_V +3SG flying FLY_V +PCP1 foot FOOT footprints FOOT PRINT +PL ...
There are a number of things to note about the result file: Each line of the file contains a word (e.g., “feet”) separated from its analysis (e.g., “FOOT +PL”) by one TAB character. The word needs to look exactly as it does in the input; no capitalization or change of character encoding is allowed. The analysis contains morpheme labels separated using space. The order in which the labels appear does not matter; e.g., “FOOT +PL” is equivalent to “+PL FOOT”. The labels are arbitrary: e.g., instead of using “FOOT” you might use “morpheme784” and instead of “+PL” you might use “morpheme2”. However, we strongly recommend you to use intuitive labels, when possible, since they make it easier for anyone to get an idea of the quality of the result by looking at it.
If a word has several interpretations, all interpretations should be supplied: e.g., the word “flies” may be the plural form of the noun “fly” (insect) or the third person singular present tense form of the verb “to fly”. The alternative analyses must be separated using a comma, as in: “FLY_N +PL, FLY_V +3SG”. The existence of alternative analyses makes the task challenging, and we leave it to the participants to decide how much effort they will put into this aspect of the task. In English, for instance, in order to get a perfect score, it would be necessary to distinguish the different functions of the ending “-s” (plural or person ending) as well as the different parts-of-speech of the stem “fly” (noun or verb). As the results will be evaluated against reference analyses (our so-called gold standard), it is worth reading about the guiding principles used when constructing the gold standard.
As far as we understand, you can use any characters in your morpheme labels except whitespace and comma (,). However, we cannot guarantee that the evaluation scripts will work properly, if your labels contain some “strange” characters.
Text corpus for English, Finnish, German and Turkish
The word list (input data) has been constructed by collecting word forms occurring in a text corpus. The text corpora have been obtained from the Wortschatz collection at the University of Leipzig (Germany). We used the plain text files (sentences.txt for each language); the corpus sizes are 3 million sentences for English, Finnish and German, and 1 million sentences for Turkish. For English, Finnish and Turkish we use preliminary corpora, which have not yet been released publicly at the Wortschatz site. The corpora have been preprocessed for the Morpho Challenge (tokenized, lower-cased, some conversion of character encodings).
If the participants like to do so, they can use the corpora in order to get information about the context in which the different words occur.
We are most grateful to the University of Leipzig for making these resources available to the Challenge, and in particular we thank Stefan Bordag for his kind assistance.
Text corpus for Arabic
The text data (135K sentences with 3.9M words) used same as in this paper:
Habash, Nizar and Fatiha Sadat. Arabic Preprocessing Schemes for Statistical Machine Translation, In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), New York, 2006. [PDF]
Unfortunately this text data is not freely available.
Gold standard morpheme analyses
The desired “correct” analyses for a random sample of circa 500 words are supplied for each language. These samples can be used for visual inspection and as a development test set (in order to get a rough estimate of the performance of the participants’ morpheme-analyzing algorithm).
The format of the gold standard file is exactly the same as that of the result file to be submitted. That is, each line contains a word and its analysis. The word is separated from the analysis by a TAB character. Morpheme labels in the analysis are separated from each other by a space character. For some words there are multiple correct analyses. These alternative analyses are separated by a comma (,). Examples:
|baby-sitters baby_N sit_V er_s +PL
indoctrinated in_p doctrine_N ate_s +PAST
|linuxiin linux_N +ILL
makaronia makaroni_N +PTV
|choreographische choreographie_N isch +ADJ-e
zurueckzubehalten zurueck_B zu be halt_V +INF
|kontrole kontrol +DAT
popUlerliGini popUler +DER_lHg +POS2S +ACC, popUler +DER_lHg +POS3 +ACC3
|Algbn gabon POS:N Al+ +SG +MASC
The English and German gold standards are based on the CELEX data base. The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The Turkish gold-standard analyses have been obtained from a morphological parser developed at Boğaziçi University; it is based on Oflazer’s finite-state machines, with a number of changes. We are indebted to Ebru Arısoy for making the Turkish gold standard available to us.
For Arabic the gold standard analyses are based on representation of lexeme and features used the Aragen system (a wrapper using publicly availble BAMA-1 databases):
Habash, Nizar. Large Scale Lexeme Based Arabic Morphological Generation. In Proceedings of Traitement Automatique du Langage Naturel (TALN-04). Fez, Morocco, 2004. [PDF]
The first part of an analysis is a lexeme followed by list of features.
The morphological analyses are morpheme analyses. This means that only grammatical categories that are realized as morphemes are included. For instance, for none of the languages will you find a singular morpheme for nouns or a present-tense morpheme for verbs, because these grammatical categories do not alter or add anything to the word form, in contrast to, e.g., the plural form of a noun (house vs. house+s), or the past tense of verbs (help vs. help+ed, come vs. came).
The morpheme labels that correspond to inflectional (and sometimes also derivational) affixes have been marked with an initial plus sign (e.g., +PL, +PAST). This is due to a feature of the evaluation script: in addition to the overall performance statistics, evaluation measures are also computed separately for the labels starting with a plus sign and those without an initial plus sign. It is thus possible to make an approximate assessment of how accurately affixes are analyzed vs. non-affixes (mostly stems). If you use the same naming convention when labeling the morphemes proposed by your algorithm, this kind of statistics will be available for your output (see the evaluation page for more information).
The morpheme labels that have not been marked as affixes (no initial plus sign) are typically stems. These labels consist of an intuitive string, usually followed by an underscore character (_) and a part-of-speech tag, e.g., “baby_N”, “sit_V”. In many cases, especially in English, the same morpheme can function as different parts-of-speech; e.g., the English word “force” can be a noun or a verb. In the majority of these cases, however, if there is only a difference in syntax (and not in meaning), the morpheme has been labeled as either a noun or a verb, throughout. For instance, the “original” part-of-speech of “force” is a noun, and consequently both noun and verb inflections of “force” contain the morpheme “force_N”:
|force_N +3SG, force_N +PL
Thus, there is not really a need for your algorithm to distinguish between different meanings or syntactic roles of the discovered stem morphemes. However, in some rare cases, if the meanings of the different parts-of-speech do differ clearly, there are two variants, e.g., “train_N” (vehicle), “train_V” (to teach), “fly_N” (insect), “fly_V” (to move through the air). But again, if there are ambiguous meanings within the same part-of-speech, these are not marked in any way, e.g., “fan_N” (device for producing a current of air) vs. “fan_N” (admirer). This notation is a consequence of using CELEX and FINTWOL as the sources for our gold standards. We could have removed the part-of-speech tags, but we decided to leave them there, since they carry useful information without significantly making the task more difficult. There are no part-of-speech tags in the Turkish gold standard.
Random word pairs file
If you want to carry out a small-scale evaluation yourself using the gold standard sample, you need to download a randomly generated so-called word pairs file for each language to be tested. Read more about this on the evaluation page.
In the source data used for the different languages, there is variation in how accurately certain distinctions are made when letters are rendered. This makes it hard to apply a unified character encoding scheme for all the languages (such as UTF-8). Thus, the following encodings have been used, in which all letters are encoded as one-byte (8-bit) characters:
- Standard text. All words are lower-cased, also proper names.
- ISO Latin 1 (ISO 8859-1). The Scandinavian special letters å, ä, ö (as well as other letters occuring in loan words, e.g., ü, é, à) are rendered as one-byte characters. All words are lower-cased, also proper names.
- Standard text. All words are lower-cased, also all nouns. The German umlaut letters are rendered as the corresponding non-umlaut letter followed by “e”, e.g., “laender” (Länder), “koennte” (könnte), “fuer” (für). Double-s is rendered as “ss”, e.g., “strasse” (Straße). This coarse encoding is due to the fact that CELEX, the source for the morphological gold standard, utilizes this scheme. Note, however, that in the data you may see special letters encoded using ISO Latin 1 in some loan words, e.g., “société”, “l’unità” (these words are not included in CELEX and their analyses will not be evaluated).
- Standard text. All words are lower-cased. The letters specific to the Turkish language are replaced by capital letters of the standard Latin alphabet, e.g., “açıkgörüşlülüğünü” is spelled “aCIkgOrUSlUlUGUnU”.
- All words are presented in Buckwalter transliteration.
Download data for Competition 1
|Sample of gold standard
|Random word pairs file
Download data for Competition 2
Participation in competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants for competition 1 in information retrieval experiments. Data from CLEF will be used.
However, because the information retrieval evaluation texts are different from the training texts of competition 1, a slightly better IR performance may be obtained, by submitting also the analyses of the words that do not exist in the word lists of competition 1. The joined word lists can be downloaded below.
|See the paragraph below
|See the paragraph below
|See the paragraph below
Those participants who wish to use the full text corpora in order to get information about the context in which the different words occur, please contact the organizers for more information how to register to CLEF to obtain the full texts. If there are participants who wish to submit morpheme analysis for words in their actual context (competition 2b), they will need to request the full texts, too. If you need the full texts, please contact the organizers for details how to fill in and submit the CLEF Registration Form and CLEF End-User Agreement. The DL for this registration is 1 May, 2008.
NOTE: If you do not participate in competition 2b and do not need the full texts for to submit the unsupervised morpheme analysis for competition 2, it is enough to just download the data available at this page.
In Competition 1, for each language, the morpheme analyses proposed by the participants’ algorithm will be compared against a linguistic gold standard. Samples of the gold standards used are available for download on the datasets page.
Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants’ algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words “foot” and “feet” both contain the morpheme “foot_N”. It is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both these word forms (be it called “FOOT”, “morpheme784”, “foot” or something else).
In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:
Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file by random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The total number of points is then divided by the total number of word pairs.
For instance, assume that the proposed analysis of the English word “abyss” is: “abys +s”. Two word pairs are formed: Say that “abyss” happens to share the morpheme “abys” with the word “abysses”; we thus obtain the word pair “abyss – abysses”. Also assume that “abyss” shares the morpheme “+s” with the word “mountains”; this produces the pair “abyss – mountains”. Now, according to the gold standard the correct analyses of these words are: “abyss_N”, “abyss_N +PL”, “mountain_N +PL”, respectively. The pair “abyss – abysses” is correct (common morpheme: “abyss_N”), but the pair “abyss – mountain” is incorrect (no morpheme in common). Precision here is thus 1/2 = 50%.
Recall is calculated analogously to recall: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard by random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants’ algorithm. The total number of points is then divided by the total number of word pairs.
For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, some normalization of the points is carried out in order not to give these words considerably more weight in the evaluation than “less complex” words. We will spare the participants from the gory details. (The passionately interested reader may have a look at the source code of the evaluation script.)
Evaluation of a sample (development test set)
You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: eval_morphemes.pl. The evaluation program is invoked as follows:
eval_morphemes.pl [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile
Four files are given as arguments to eval_morphemes.pl:
- wordpairsfile_goldstd: this is the “random word pairs file” available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
- wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
- goldstdfile:this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses for circa 500 words.
- resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses.
The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:
PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76) PART0. Recall: 25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345) PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22% # TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73% TOTAL. Recall: 25.59%; non-affixes: 49.78%; affixes: 10.78% TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants’ algorithms.
The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., “+PL”, “+PAST”. This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.
Sampling word pairs for the calculation of an estimate of the precision
In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script eval_morphemes.pl with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the 500-word gold standard that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the 500-word gold standard sample. The problem here is that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.
Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:
cut -f1 goldstdfile > relevantwordsfile
Then sample word pairs for 100 words selected by random from your results file:
sample_word_pairs.pl -refwords relevantwordsfile < resultfile > wordpairsfile_result
Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, those participants who wish to submit morpheme analysis for words in their actual context (competition 2b), please contact the organizers for more information how to register to CLEF to obtain the full texts.
In the competition 2 (and 2b) the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit. The most common morphemes in each participant’s submission will be left out from the index. The size of this stoplist will be proportional to the amount of the text data in each language and the stoplist size will be the same for each participant’s submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for competitions 2 and 2b in each language.
The full evaluation reports and the descriptions of the participating methods have been published at the Workshop.
The segmentation with the highest F-measure is the best. The winner is selected separately for each language.
These results are preliminary. Download the report manuscript on the final results from here.
In Competition 2, the organizers applied the analyses provided by the participants in information retrieval experiments. The words in the queries and source documents were replaced by the corresponding morpheme analyses provided by the participants, and the search was then based on morphemes instead of words. The evaluation was perfomed using a state-of-the-art retrieval method (the latest version of the freely available LEMUR toolkit. The evaluation criterion was Uninterpolated Average Precision. The evaluation was performed only for the parcipitants that submitted the segmentation for the extended wordlists gathered from the IR evaluation corpora. Okapi (BM25) term weighting was used for all index terms excluding an automatic stoplist consisting of terms that have collection frequency higher than 75000 (Finnish) or 150000 (German and English). The stoplist was used with the Okapi weighting, because it did not perform well with the indexes that had many very common terms.
Download the report manuscript on the final results from here.
To study how the different morpheme analysis performed in the IR tasks, we attempted the same tasks with different reference methods. This also revealed us whether the unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word based indexing.
- Morfessor Categories-Map: The same Morfessor Categories-Map as described in Competition 1 was used for the unsupervised morpheme analysis. The stem vs. suffix tags were kept, but did not receive any special treatment in the indexing as we wanted to keep the IR evaluation as unsupervised as possible.
- Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis. This means that the obtained subword units were directly used as index terms as such. This was performed using the Morfessor Baseline algorithm as in Morpho Challenge 2005. We expected that this would not be optimal for IR, but because the unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
- dummy: No words were split nor any morpheme analysis provided except hyphens were replaced by spaces so that hyphenated words were indexed as separate words (changed from last year). This means words were directly used as index terms as such without any stemming or tags. We expected that although the morpheme analysis should provide helpful information for IR, all the submissions would not probably be able to beat this brute force baseline. However, if some morpheme analysis method would consistently beat this baseline in all languages and task, it would mean that the method were probably useful in a language and task independent way.
- grammatical: The words were analysed using the gold standard in each language that were utilised as the “ground truth” in the Competition 1. Besides the stems and suffixes, the gold standard analyses typically consist of all kinds of grammatical tags which we decided to simply include as index terms, as well. For many words the gold standard analyses included several alternative interpretations that were all included in the indexing. However, we decided to also try the method adopted in the morpheme segmentation for Morpho Challenge 2005 that only the first interpretation of each word is applied. This was here called “grammatical first” whereas the default was called “grammatical all”. Words that were not in the gold standard segmentation were indexed as such. Because our gold standards are quite small, 60k (English) – 600k (Finnish), compared to the amount of words that the unsupervised methods can analyse, we did not expect “grammatical” to perform particularly well, even though it would probably capture some useful indexing features to beat the “dummy” method, at least.
- snowball: No real morpheme analysis was performed, but the words were stemmed by stemming algorithms provided by snowball libstemmer library. Porter stemming algoritm was used for English. Finnish and German stemmers were used for the other languages. Hyphenated words were first split to parts that were then stemmed separately. Stemming is expected to perform very well for English but not necessarily for the other languages because it is harder to find good stems.
- TWOL: Two-level morphological analyzer was used to find the normalized forms of the words. These forms were then used as index terms. Some words may have several alternative normalized forms and two cases were studied similarly to the grammatical case. Either all alternatives were used (“all”) or only the first one (“first”). Compound words were split to parts. Words not recognized by the analyzer were indexed as such. German analyzer was not available.