Description
The goal of the challenge is to identify the different Machine Learning (ML) methods proposed so far for structured data, to assess the potential of these methods for dealing with generic ML tasks in the structured domain, to identify the new challenges of this emerging field and to foster research in this domain. Structured data appears in many different domains. We will focus here on graph document collections, and we are organizing this challenge in cooperation with the INEX initiative. This challenge aims at gathering ML, Information Retrieval (IR) and Data Mining researchers in order to:
- Define the new challenges for structured data mining with ML techniques.
- Build interlinked document collections, define evaluation methodologies and develop software that will be used to evaluate the classification of documents in a graph.
- Compare existing methods on different datasets.
Results of the track will be presented at the INEX workshop.
Task: Graph (Semi-)Supervised Classification
Dealing with XML document collections is a particularly challenging task for ML and IR. XML documents are defined by their logical structure and their content (hence the name semi-structured data). Moreover, in a large majority of cases (Web collections, for example), XML document collections are also structured by links between documents (hyperlinks, for example). These links can be of different types and correspond to different information: for example, one collection can provide hierarchical links, hyperlinks, citations, etc.
Earlier models developed in the field of XML categorization/clustering simultaneously use the content information and the internal structure of XML documents, but they rarely use the external structure of the collection, i.e., the links between documents.
We focus here on the problem of classification of XML documents organized in a graph. More precisely, the participants of the task have to classify the documents of a partially labelled graph.
Tasks
Collection
The corpus used this year will be a subset of the Wikipedia XML Corpus of INEX 2009. This subset will be different from the one used last year. Mainly:
- Each document will belong to one or more categories
- Each document will be an XML document
- The different documents will be organized in a graph of documents where each link corresponds to a hyperlink (or wiki link) between two documents
The corpus proposed is a graph of XML documents.
Semi-supervised classification
In this track, the goal is to classify each node of a graph (a node corresponds to a document) given a set of already labelled nodes (the training documents). From the ML point of view, the track proposed here is a transductive (or semi-supervised) classification task.
The following figure gives an example of a classification task.
Training set: The training set is composed of XML documents organized in a graph. The red nodes correspond to documents in category 1, the blue nodes correspond to documents in category 2. The white nodes correspond to documents whose category is hidden. The goal of the categorization task is to find the categories of the white nodes.
The goal of the categorization models is to find the color of the unlabelled nodes of the training graph.
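To make the setup concrete, here is a minimal Python sketch of a transductive baseline: each unlabelled node takes the majority category of its already labelled neighbours, repeated for a few rounds. This is not a method prescribed by the track; the function name `propagate_labels`, the toy graph, and the simplification to one category per document (the corpus allows several) are all assumptions made for illustration.

```python
from collections import Counter


def propagate_labels(edges, known_labels, rounds=5):
    """Label unlabelled nodes by majority vote of their labelled neighbours."""
    # Build an undirected adjacency list (link direction is ignored here).
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)

    labels = dict(known_labels)  # training labels are never overwritten
    for _ in range(rounds):
        updates = {}
        for node in sorted(neighbours):
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            if votes:
                # Highest count wins; ties are broken by category name.
                best = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
                updates[node] = best
        if not updates:
            break  # nothing left to label, or no labelled neighbours remain
        labels.update(updates)
    return labels


# Toy chain a - b - c - d - e with the two end nodes labelled.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
known = {"a": "red", "e": "blue"}
predicted = propagate_labels(edges, known)
print(predicted)
```

The sketch uses only the link structure; an actual submission would normally combine it with the content and internal structure of each XML document.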
The evaluation measures for categorization will be ROC curves and the F1 measure.
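As an illustration of these two measures, the following Python sketch computes the F1 measure and the area under the ROC curve for a single category. It is not the official evaluation code; the function names, the 0.5 decision threshold, and the toy scores are assumptions chosen for the example.

```python
def f1_score(true_labels, predicted_labels):
    """F1 = harmonic mean of precision and recall for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(true_labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as one half."""
    pos = [s for t, s in zip(true_labels, scores) if t]
    neg = [s for t, s in zip(true_labels, scores) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four documents, one category.
truth = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
decisions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

print(f1_score(truth, decisions))  # 0.5
print(roc_auc(truth, scores))      # 0.75
```

The pairwise formulation of the area under the curve is quadratic in the number of documents, which is fine for a sketch but too slow for a large collection; a rank-based computation would be preferable there.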
Results by team
The measures computed are:
- Measures computed over the categories (micro and macro):
  - ACC = Accuracy
  - ROC = Area under the ROC curve
  - PRF = F1 measure
- Measure computed over the documents:
  - Mean average precision by document
- University of Wollongong
- University of Peking
- Xerox Research Center
- University of Saint Etienne
- University of Granada
Package for computing performances
To use the package, run:
perl compute.pl all_categories.txt train_categories.txt yourSubmissionFile
If the software finds negative scores in the file, it normalizes the scores by applying a logistic function to them.
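The normalization step can be pictured with the short Python sketch below. It assumes the standard logistic function 1 / (1 + e^(-x)) applied to every score as soon as one negative score is present; the exact behaviour of compute.pl may differ (for example, in whether it normalizes per file or per category), and the function names here are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_scores(scores):
    """Map scores into (0, 1) only if at least one of them is negative."""
    if any(s < 0 for s in scores):
        return [logistic(s) for s in scores]
    return list(scores)  # already non-negative: left unchanged


print(normalize_scores([0.2, 0.8]))        # unchanged
print(normalize_scores([-2.0, 0.0, 2.0]))  # squashed into (0, 1)
```

Because the logistic function is strictly increasing, this normalization preserves the ranking of documents, so ranking-based measures such as the area under the ROC curve are unaffected by it.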