AI4D blog series: The First Tunisian Arabizi Sentiment Analysis Dataset

Motivation

On social media, Arabic speakers tend to express themselves in their own local dialect. To do so, Tunisians use “Tunisian Arabizi”, which is written in the Latin script supplemented with numerals rather than in the Arabic alphabet.

On the African continent, analytical studies based on deep learning are data hungry. To the best of our knowledge, no annotated Tunisian Arabizi dataset existed before this one.

Twitter, Facebook and other micro-blogging systems are becoming a rich source of feedback in several vital sectors, such as politics, economics, sports and other matters of general interest. Our dataset is drawn from people expressing themselves in their own Tunisian dialect using Tunisian Arabizi.

TUNIZI consists of a single type of instance: text comments collected from social media, each annotated as positive, negative or neutral. The data does not include any confidential information. However, negative comments may include offensive or insulting content.

The TUNIZI dataset is used in all iCompass products that work with the Tunisian dialect. It is used in a sentiment analysis project dedicated to e-reputation, and in all Tunisian chatbots that understand Tunisian Arabizi and reply in it.

Team

The TUNIZI dataset was collected, preprocessed and annotated by the iCompass team. iCompass is a Tunisian startup specialized in NLP/NLU. Its team of academics and engineers specialized in information technology, mathematics and linguistics was fully dedicated to ensuring the success of the project. iCompass can be contacted by email or through its website: www.icompass.tn

Implementation

  1. Data Collection: TUNIZI was collected from comments on social media platforms. All data was directly observable and did not need to be inferred from other data. The dataset is drawn from people expressing themselves in their own Tunisian dialect using Arabizi, and it relates directly to Tunisians of different regions, ages and genders. It was collected anonymously and contains no information about users' identities.
  2. Data Preprocessing & Annotation: TUNIZI was preprocessed by removing links, emoji symbols and punctuation. Annotation was then performed by five native Tunisian speakers, three men and two women, all with higher education (Master's/PhD).
  3. Distribution and Maintenance: The TUNIZI dataset is publicly available on GitHub for all upcoming research and development activities. It is maintained by the iCompass team, which can be contacted by email or through the GitHub repository. Updates will be available at the same GitHub link.
  4. Conclusion: As interest in Natural Language Processing grows, particularly for African languages, a natural next step would be to build Arabizi datasets for other underrepresented North African dialects such as Algerian and Moroccan.
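
The cleaning step in item 2 (removing links, emoji symbols and punctuation) can be illustrated with a minimal Python sketch. The exact rules used for TUNIZI are not published in this post, so the regular expressions, the emoji ranges and the function name `clean_comment` below are assumptions chosen for illustration. Digits are deliberately kept, since Arabizi uses numerals as letters.

```python
import re
import string

# Hypothetical patterns: the post does not specify the exact cleaning rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_comment(text: str) -> str:
    """Remove links, emoji and ASCII punctuation; keep letters and digits."""
    text = URL_RE.sub(" ", text)    # links first, before their punctuation is stripped
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split())   # collapse the whitespace left behind


# Made-up comment, for illustration only (not taken from the dataset).
print(clean_comment("3asslema! chouf https://example.com 😍 barcha 7aja behia..."))
# prints: 3asslema chouf barcha 7aja behia
```

Links are removed before punctuation so that a URL is dropped as a whole instead of being broken into stray word fragments.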

AI4D blog series: Preservation of Indigenous Languages

Context

In most African countries, perhaps more so than elsewhere, the majority of the population does not speak the official languages; instead, people speak traditional languages. In some countries, this proportion is as high as 80%. Because of this language barrier, this large part of the population is practically excluded from the life of society: they have no access to information or education and cannot really participate in debates on the socio-economic development of their country.

From another point of view, our values, cultures, knowledge of all kinds and history are conveyed orally in these languages and thus remain inaccessible to the rest of the world.

Objectives

The main objective of the Preservation of Indigenous Languages project is to contribute to the preservation of local languages and the enhancement of local language content through (1) archiving, (2) promotion and (3) popularization of local language content. Archiving will make it possible to preserve content and knowledge in local languages; we will collect and use existing data in local languages for this purpose. Promotion will be achieved by exploiting the richness of this local language content. Popularization will be made possible by making this content accessible in the official languages. To achieve these objectives, our project is divided into three parts, all of which depend on an important upstream data collection and pre-processing stage:

  • Transcription of audio in local languages into text in local languages
  • Translation from local languages to the official language (French) and vice versa
  • Voice synthesis of texts in local languages into audio in local languages.

Team

To successfully carry out the project, we have set up a dedicated team of 10 people:

  • A research mentor with a background in AI,
  • Two practice mentors with a background in local languages. The first is a specialist in education in local languages, and the second has produced various translations from French to Mooré, the main local language in Burkina Faso.
  • A research assistant with a background in linguistics. In this case, the assistant was a student whose responsibility was to help with collecting local-language content and pre-processing the data,
  • Three computer programmers. In this case, the programmers were computer science students (Master's and PhD students). Each of them is in charge of one of the three parts of the project, plus some pre-processing tasks.

Implementation

For this project, we limited ourselves to one local language, Mooré. This language is the main language of Burkina Faso and is spoken by more than half of the population. There are also many sources of data in this language and important work has already been done on translations from French into this language, especially in the educational and religious fields.

(0) Data Collection: As noted above, data collection is an important and necessary step for the different parts of the project. It is also one of the most difficult steps, since open data is not yet mandatory in our countries.

With the invaluable help of practice mentors, meetings were organised with the main institutions, both public and private, to explore existing data and the extent to which these data could be exploited.

Among the institutions that were contacted, the main ones are the following:

  • Fondation pour le Développement Communautaire/Burkina Faso (FDC-BF);
  • The Biblical Alliance of Burkina Faso;
  • Fonds pour l’alphabétisation et l’éducation non formelle (FONAENF);
  • The Directorate of Research in Non-Formal Education (DRENF);
  • The DPDMT;
  • Ecole et langue nationale en Afrique (ELAN);
  • Savane Media.

We were thus able to access a certain amount of data, but it was not always in digital format and not always complete. This required an enormous amount of pre-processing work, either to digitize the data or to complete it with translations or transcriptions.

One of the first sources of data we had access to was the Mooré Bible in text and audio. After pre-processing (cutting the audio sentence by sentence or verse by verse, and aligning the Mooré and French texts), this source was used for the first tests in the different parts of the project.
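
To make the text alignment step concrete, here is a minimal Python sketch of verse-by-verse alignment. It assumes that each version of the text has already been loaded into a dictionary keyed by a (book, chapter, verse) tuple. That key format, the function name `align_verses` and the placeholder strings are illustrative assumptions, not the project's actual pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical key format: (book, chapter, verse).
VerseId = Tuple[str, int, int]


def align_verses(
    moore: Dict[VerseId, str], french: Dict[VerseId, str]
) -> List[Tuple[str, str]]:
    """Return (Mooré, French) pairs for verses present and non-empty in both texts."""
    pairs = []
    for key in sorted(moore.keys() & french.keys()):  # verses found in both versions
        src, tgt = moore[key].strip(), french[key].strip()
        if src and tgt:  # skip verses that are empty on either side
            pairs.append((src, tgt))
    return pairs


# Placeholder strings stand in for the real verse texts.
moore_text = {("GEN", 1, 1): "<Mooré verse 1>", ("GEN", 1, 2): "<Mooré verse 2>"}
french_text = {("GEN", 1, 1): "<French verse 1>", ("GEN", 1, 3): "<French verse 3>"}
print(align_verses(moore_text, french_text))
# prints: [('<Mooré verse 1>', '<French verse 1>')]
```

Keeping only the verses that exist in both versions means incomplete sources produce fewer pairs, not misaligned ones.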

The collection and pre-processing work is still in progress to enrich our data sources and improve our models.

(1) Transcription: Since writing is not yet very popular in our local languages, we have a large amount of data in local languages in audio format. In addition, people who cannot write will always use oral communication to express themselves. The step of transcribing the audio content into local languages is an essential step to not only collect existing information but also to gather what people have to say.

After reviewing the state of the art and testing existing transcription tools, the student in charge of this part implemented his transcription model based on the DeepSpeech tool. He uses data from the Bible for these tests. In addition to the pre-processing workload and the working conditions made somewhat difficult by the COVID-19 pandemic, we unfortunately ran into problems with computing capacity, and we are working with one of the partners to increase the capacity of the leased virtual machines.
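
For readers who want to try a similar setup: Mozilla DeepSpeech's training scripts read CSV manifests with the columns `wav_filename`, `wav_filesize` and `transcript`, together with an alphabet file listing every character that appears in the transcripts. The Python sketch below shows one way to produce both from a list of clips. The helper names `write_manifest` and `build_alphabet`, the lowercasing step and the sample clip are assumptions made for illustration, not the student's actual code.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

# Each clip: (path to WAV file, file size in bytes, transcript text).
Clip = Tuple[str, int, str]


def normalize(transcript: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(transcript.lower().split())


def write_manifest(clips: Iterable[Clip], out: TextIO) -> None:
    """Write a DeepSpeech-style CSV manifest to an open text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path, size_bytes, transcript in clips:
        writer.writerow([path, size_bytes, normalize(transcript)])


def build_alphabet(transcripts: Iterable[str]) -> List[str]:
    """Collect the sorted set of characters used in the normalized transcripts."""
    return sorted({ch for t in transcripts for ch in normalize(t)})


# Hypothetical clip with placeholder text, for illustration only.
clips = [("clips/verse_0001.wav", 1234, "  Placeholder  Text ")]
buffer = io.StringIO()
write_manifest(clips, buffer)
print(buffer.getvalue())
print(build_alphabet(t for _, _, t in clips))
```

In practice the file size would come from `os.path.getsize` on each verse-level audio clip, and the alphabet would need to cover the accented characters used in written Mooré.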

(2) Translation: Translation is at the heart of this project. It aims to make official language information accessible to people in rural areas but also to provide access to the wealth of local language content.

After reviewing the state of the art in translation approaches, the student in charge of this component applied classical neural machine translation techniques to the Bible data using OpenNMT. As one could expect given the lack of training data, the results were not very good. He is therefore now implementing meta-learning using the Meta-NMT tool. Meta-learning has been reported in the literature to outperform the classical approach when little data is available.

Here, too, in addition to the need for more data, we face a need for computing capacity that should also be resolved with the provision of VMs.

(3) Voice synthesis: After translation from the official languages into local languages, voice synthesis will make the content available in audio format to populations who cannot read. The student in charge of this part also reviewed the state of the art of existing tools in this field. He is currently testing different tools and studying different models. He unfortunately started with a slight delay, but he will continue his work so that he can adapt a model and test it on the collected data, with the goal of synthesizing Mooré audio from text.

Results

At this stage, having just passed the midpoint of the project, we can report that a number of milestones have been achieved:

  • Data collection has been done and is still ongoing.
  • Pre-processing of audio and text content has been performed, including mapping audio to text in Mooré and aligning Mooré text with its French counterpart.
  • A transcription model for Mooré to French based on DeepSpeech has been implemented.
  • Classical neural machine translation has been implemented and tested on the Bible dataset.

Main challenges

Access to Data

After approaching about ten organizations, we ran into the problem of resource availability. Indeed, apart from the Bible, some training materials and some translated official documents, there were very few documents available in both Mooré and French.

The organizations that produce Mooré content most often do so for training or awareness-raising among the illiterate population. As a result, they do not produce the same content in French. As for radio and television channels, they broadcast directly in Mooré, without written notes, even for the television news.

However, we found a lot of printed material, without digital versions and only in Mooré. For this phase of the project, we collected and aligned the data that already existed in both languages in digital format. This allowed us to test the model, and although the results were not conclusive, the exercise did highlight the problem of data availability. For further work, we plan to translate the existing documents into Mooré so that we have both versions to continue the work. We are aware that this is a long-term effort, but it is indispensable if we are to have enough data for the algorithms to produce interesting results.

Copyright

A second problem we encountered was copyright. Indeed, we do not always have direct access to the authors, and the holders of the documents are reluctant to share them without the authors' agreement. In other cases, the documents had been commissioned by international organizations, so our local contacts needed the agreement of these institutions before giving us access to the data. This takes time and has delayed access to the working data.

In the long term, we plan to bring together a group of authors to raise their awareness of the project so that they can facilitate advocacy for the project.

Computing capacity

We unfortunately do not have a laboratory equipped with servers powerful enough to run our models. Our partnership with Anptic was supposed to give us access to higher-capacity VMs to speed up testing, but administrative delays have also held up the availability of the VMs.

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”.