Knowledge 4 All Foundation Acknowledged by Masakhane Research Foundation in Groundbreaking NLP Publications

The Knowledge 4 All Foundation is proud to have been acknowledged by the Masakhane Research Foundation in their recent influential publications advancing Natural Language Processing (NLP) for African languages. These publications highlight the Foundation’s pivotal contributions to developing datasets and fostering AI innovation across the continent.

  1. “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation” (NAACL 2022)
    This paper explores how a small number of translations can significantly enhance pre-trained models for African news translation, addressing the scarcity of African-language datasets. Read the full paper here.
  2. “MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition” (EMNLP 2022)
    This work presents MasakhaNER 2.0, which applies Africa-centric transfer learning to Named Entity Recognition (NER) in African languages, providing a vital resource for African NLP tasks. Read the full paper here.
  3. “MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages” (ACL 2023)
    This research introduces MasakhaPOS, which addresses the challenges of Part-of-Speech (POS) tagging in the diverse and underrepresented African linguistic landscape. Read the full paper here.

These projects, made possible through collaborative efforts and the contributions of Knowledge 4 All Foundation, have significantly advanced NLP for African languages, paving the way for inclusive and representative AI solutions.

The Foundation expresses its gratitude to Masakhane Research Foundation and remains committed to supporting initiatives that promote linguistic diversity, inclusivity, and technological progress for African communities. Together, these partnerships exemplify the power of global collaboration in driving impactful AI research and development.

AI4D blog series: The First Tunisian Arabizi Sentiment Analysis Dataset

Motivation

On social media, Arabic speakers tend to express themselves in their own local dialect. To do so, Tunisians write in “Tunisian Arabizi”, which uses the Latin script supplemented with numerals rather than the Arabic alphabet; for example, the numeral “3” commonly stands in for the Arabic letter ع.

On the African continent, analytical studies based on Deep Learning are data hungry. To the best of our knowledge, no annotated Tunisian Arabizi dataset existed before this work.

Twitter, Facebook and other micro-blogging systems are becoming a rich source of feedback information in several vital sectors, such as politics, economics, sports and other matters of general interest. Our dataset is taken from people expressing themselves in their own Tunisian Dialect using Tunisian Arabizi.

TUNIZI is composed of text comments collected from social media, each annotated as positive, negative or neutral. This data does not include any confidential information. However, negative comments may include offensive or insulting content.
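
For illustration only, a single annotated instance might look like the following sketch; the field names and the example comment are hypothetical, not the actual schema of the released dataset.

```python
# Hypothetical shape of one TUNIZI instance: a raw Arabizi comment plus its label.
example_instance = {
    "text": "el match kan behi barcha",  # illustrative Tunisian Arabizi comment
    "label": "positive",                 # one of: positive, negative, neutral
}
```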

The TUNIZI dataset is used in all iCompass products that target the Tunisian Dialect. It powers a sentiment analysis project dedicated to e-reputation monitoring, as well as Tunisian chatbots that understand Tunisian Arabizi and reply in it.

Team

The TUNIZI dataset was collected, preprocessed and annotated by the iCompass team, a Tunisian startup specialized in NLP/NLU. The team, composed of academics and engineers specialized in information technology, mathematics and linguistics, was fully dedicated to ensuring the success of the project. iCompass can be contacted by email or through the website: www.icompass.tn

Implementation

  1. Data Collection: TUNIZI is collected from comments on social media platforms. All data was directly observable and did not need to be inferred from other data. Our dataset is taken from people expressing themselves in their own Tunisian Dialect using Arabizi. It relates directly to Tunisians of different regions, ages and genders. The dataset was collected anonymously and contains no information about users’ identity.
  2. Data Preprocessing & Annotation: TUNIZI was preprocessed by removing links, emoji symbols and punctuation. Annotation was then performed by five Tunisian native speakers, three males and two females at a higher education level (Master/PhD).
  3. Distribution and Maintenance: The TUNIZI dataset is made public on GitHub for all upcoming research and development activities. TUNIZI is maintained by the iCompass team, which can be contacted by email or through the GitHub repository. Updates will be available at the same GitHub link.
  4. Conclusion: As interest in Natural Language Processing, particularly for African languages, is growing, a natural future step would be to build Arabizi datasets for other underrepresented North African dialects such as Algerian and Moroccan.
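
A minimal sketch of the preprocessing step described above, assuming the raw comments are plain strings; the regular expressions, the emoji range and the example comment are illustrative, not the exact rules applied by the iCompass team.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Coarse emoji/pictograph ranges; the exact set removed by the team may differ.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_comment(text: str) -> str:
    """Remove links, emoji symbols and punctuation, then normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(clean_comment("3ajbetni barcha!! http://t.co/abc"))  # -> "3ajbetni barcha"
```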

AI4D blog series: Arabic Speech-to-Moroccan Sign Language Translator: “Learning for Deaf”

Over 5% of the world’s population (466 million people) has disabling hearing loss, 34 million of whom are children [1]. They can be hard of hearing or deaf. Hard of hearing people usually communicate through spoken language and can benefit from assistive devices like cochlear implants. Deaf people mostly have profound hearing loss, which implies very little or no hearing.

The main impact of deafness is on the individual’s ability to communicate with others, in addition to feelings of loneliness and isolation in society. Consequently, deaf people cannot access public services equally, especially education and health, and do not have equal rights to participate in an active and democratic life. This has a negative impact on their lives and on the lives of the people around them.

Around the world, deaf people use sign language to interact within their community. Hand shapes, lip patterns and facial expressions are used to express emotions and to convey meaning. Sign languages are full-fledged natural languages with their own grammar and lexicon. However, they are not universal, although they have striking similarities. Sign language can be represented by a form of written annotation called gloss, where each sign is represented by a gloss.

In Morocco, deaf children receive very little educational assistance. For many years, they were taught a local variety of sign language derived from Arabic, French and American Sign Languages [2]. In April 2019, the government standardized Moroccan Sign Language (MSL) and initiated programs to support the education of deaf children [3]. However, the teachers involved are mostly hearing, have limited command of MSL and lack resources and tools to teach deaf children from written or spoken text. Schools recruit interpreters to help students understand what is being taught and said in class. Otherwise, teachers use graphics and captioned videos to teach the mappings to signs, but they lack tools that translate written or spoken words and concepts into signs.

Around the world, many countries have worked to create machine translation systems from their language into sign language. At the Laboratoire d’Informatique de Mathématique Appliquée d’Intelligence Artificielle et de Reconnaissance des Formes (LIMIARF, https://limiarf.github.io/www/) of the Faculty of Sciences of Mohammed V University in Rabat, the Deep Learning Team (DLT) proposed the development of an Arabic Speech-to-MSL translator. The translation can be divided into two main parts, the speech-to-text part and the text-to-MSL part. Our main focus in the current work is the text-to-MSL translation.

This project brings together young researchers, developers and designers. As a team, we reviewed many research papers about translation into glosses and sign languages in general, and for Modern Standard Arabic in particular. We collected Moroccan Sign Language data from governmental and non-governmental sources and from the web. The young researchers also investigated a new way to translate Arabic into a sign gloss. In parallel, the young developers were building the mobile application and the designers were designing and rigging the animation avatar. These tasks are detailed below.

Research reviews

  • [4] built ATLASLang, a machine translation system from Arabic text to Arabic sign language that can generate real-time statements via a signing avatar. It performs a morpho-syntactic analysis of the input text and converts it into a video sequence played by a human avatar. They animate the translated sentence using a database of 200 words in GIF format taken from a Moroccan dictionary. If the input sentence exists in the database, they apply the example-based approach (corresponding translation); otherwise, the rule-based approach is used, analyzing each word of the given sentence with the aim of generating the corresponding sentence.
  • [5] kept the same model as above but changed the technique used in the generation step. Instead of rules, they used a neural network with their own encoder-decoder model. They analyse the Arabic sentence and extract characteristics from each word such as stem, root, type and gender. These features are encapsulated with the word in an object and then transformed into a context vector Vc, which is the input to a feed-forward back-propagation neural network. The network generates a binary vector, which is decoded to produce the target sentence.
  • [6] This paper describes a sign translation system suitable for Arabic hearing-impaired users and any Arabic Sign Language (ArSL) users. The translation tasks were formulated to generate transformational scripts using a bilingual corpus/dictionary (text to sign). The architecture has three blocks. The first block recognizes the broadcast stream and translates it into a stream of written Arabic script, which is further converted into animation by the virtual signer; the proposed solution therefore covers the general communication aspects required for a normal conversation between an ArSL user and an Arabic-speaking non-signer. The second block converts the Arabic script into a stream of Arabic signs by using a rich semantic interpretation module, a language model and a supporting dictionary of signs. From the language model, word type, tense, number and gender, in addition to the semantic features for subject and object, are scripted to the signer (a 3D avatar). The third block reduces the semantic descriptors produced from the Arabic text stream into a simplified <Subject, Verb, Object> form, with the help of an ontological signer concept that generalizes some terminology. The proposed tasks employ two phases, a training phase and a generative phase, both supported by the bilingual dictionary/corpus BC = {(DS, DT)}; the generative phase produces a set of target words (WT) for each source word (WS).
  • [7] This paper presents DeepASL, a transformative deep learning-based sign language translation technology that enables non-intrusive ASL translation at both word and sentence levels. ASL is a complete and complex language that mainly employs signs made by moving the hands. Each individual sign is characterized by three key sources of information: hand shape, hand movement and the relative location of the two hands. They use Leap Motion as their sensing modality to capture ASL signs. DeepASL achieves an average 94.5% word-level translation accuracy and an average 8.2% word error rate when translating unseen ASL sentences.
  • [8] Achraf and Jemni introduced a statistical sign language machine translation approach from English written text to American Sign Language gloss. First, a parallel corpus is provided, which is a simple file containing pairs of sentences in English and their ASL gloss annotation. A word alignment phase is then performed using statistical models such as IBM Models 1, 2 and 3, improved with a string-matching algorithm, to map each English word to its corresponding word in the ASL gloss annotation (a toy sketch of this alignment step is given after this list). A statistical machine translation decoder is then used to determine the best translation with the highest probability using a phrase-based model. Given that the Arab deaf community represents around 25% of the deaf community worldwide, and that Arabic is a low-resource language, many ArSL translation systems have been introduced.
  • [9] Aouiti and Jemni proposed a translation system called ArabSTS (Arabic Sign Language Translation System) that aims to translate Arabic text into Arabic Sign Language. This system takes MSA or Egyptian Arabic text as input; a morphological analysis is then conducted using the MADAMIRA tool, and the output is passed to an SVM classifier to determine the correct analysis for each word. The result is written to an XML file and given to an Arabic gloss annotation system. The proposed gloss annotation system provides a global text representation that covers many features (such as grammatical and morphological rules, hand shape, sign location, facial expression and movement) in order to capture the maximum of relevant information for the translation step. This system is based on Qatari Sign Language rules, where each gloss is represented by an Arabic word that identifies one Arabic sign. The XML file contains all the information needed to create the final Arabic gloss representation for each word and is divided into two sections: in the first part, each word is assigned several fields (id, genre, num, function, indication), and the second part gives the final form of the sentence, ready to be translated. At the end of the pipeline, the translated sentence is animated into Arabic Sign Language by an avatar.
  • [10] Luqman and Mahmoud built a rule-based translation system from Arabic text into ArSL. The proposed work introduces a textual writing system and a gloss system for ArSL transcription. The approach is semantic and rule-based, and the system architecture contains three stages: morphological analysis, syntactic analysis and ArSL generation. The morphological analysis is done with the MADAMIRA tool, while the syntactic analysis is performed using the CamelParser tool, producing a syntax tree. To generate the ArSL gloss annotations, the phrases and words of the sentence are lexically transformed into their ArSL equivalents using an ArSL dictionary. After the lexical transformation, rule transformations are applied; these rules are built on the differences between Arabic and ArSL and map Arabic to ArSL at three levels: word, phrase and sentence. The final representation is given as ArSL gloss annotation and a sequence of GIF images.
  • [11] Automatic speech recognition is the area of research concerned with enabling machines to accept vocal input from humans and interpret it with the highest possible accuracy. Arabic is one of the most widely spoken languages yet one of the least studied in terms of speech recognition. The Arabic language has three types: classical, modern and dialectal. Classical Arabic is the language of the Quran. Modern Standard Arabic (MSA) is based on classical Arabic but drops some aspects such as diacritics; it is mainly used in modern books, education and news. Dialectal Arabic has multiple regional forms and is used for daily spoken communication in informal settings; with the advent of social media, dialectal Arabic is also written. These varieties show lexical, morphological and grammatical differences, which makes it hard to develop a single Arabic NLP application that processes data from all of them. There are also different types of recognition problems, but we focus on continuous speech. Continuous speech recognizers allow the user to speak almost naturally; because utterance boundaries are not explicit, special methods are needed to determine them, which is why such systems are considered among the most difficult to create.
  • [12] An AASR system was developed with a 1,200-hour speech corpus. The authors modeled different DNN topologies, including feed-forward, convolutional, time-delay, recurrent Long Short-Term Memory (LSTM), Highway LSTM (H-LSTM) and Grid LSTM (GLSTM) networks. The best performance came from a combination of the top two hypotheses of the sequence-trained GLSTM models, with 18.3% WER.
  • [13] A comparison of some state-of-the-art speech recognition techniques was presented. The authors applied these techniques to a limited Arabic broadcast news dataset: the different approaches were all trained on 50 hours of transcribed audio from the news channel “Al-jazirah”. The best performance was obtained by the hybrid DNN/HMM approach, with the Minimum Phone Error (MPE) criterion used to sequence-train the DNN, achieving 25.78% WER.
  • [14] Speech recognition using deep learning is a major undertaking whose success depends on the availability of a large training corpus. The availability of open-source deep-learning frameworks and Application Programming Interfaces (APIs) boosts the development and research of AASR. There are multiple services and frameworks that provide developers with powerful deep-learning capabilities for speech recognition. One notable application is Google’s Cloud Speech-to-Text service, which uses deep neural networks to convert Arabic speech or audio files to text; it allows a translator system to directly accept spoken words, convert them to text and then translate them. The service offers an API for developers with multiple recognition features.
  • [15] Another service is the Microsoft Speech API from Microsoft, which helps developers create speech recognition systems using deep neural networks. IBM Cloud provides the Watson Speech to Text service API, which supports Modern Standard Arabic.
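
To make the word-alignment step mentioned in [8] concrete, the following is a toy sketch of IBM Model 1 EM training over (English, gloss) sentence pairs. It illustrates the general technique only and is not the authors' implementation; the example pairs are made up.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(g|e)."""
    t = defaultdict(lambda: 1e-3)  # near-uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for english, gloss in pairs:            # E step: expected alignment counts
            for g in gloss:
                norm = sum(t[(g, e)] for e in english)
                for e in english:
                    delta = t[(g, e)] / norm
                    count[(g, e)] += delta
                    total[e] += delta
        for (g, e), c in count.items():         # M step: renormalize per English word
            t[(g, e)] = c / total[e]
    return t

# Made-up toy corpus of (English tokens, ASL gloss tokens).
pairs = [(["i", "love", "you"], ["I", "LOVE", "YOU"]),
         (["you", "go", "home"], ["YOU", "GO", "HOME"])]
t = ibm_model1(pairs)
print(max((p, e) for (g, e), p in t.items() if g == "LOVE"))  # strongest source word for "LOVE"
```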

Data collection

Because of the lack of data resources for Arabic sign language, we dedicated a lot of energy to collecting our own dataset. To this end, we relied on available data from official [16] and non-official sources [17, 18, 19] and have collected, so far, more than 100 signs. The dataset is composed of videos and a .json file describing metadata of each video and the corresponding word, such as the category and the length of the video.
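
A minimal sketch of how such a metadata file could be read alongside the videos; the field names used here (word, category, duration_s, file) are hypothetical, since the exact schema of the .json file is not given above.

```python
import json
from pathlib import Path

# Hypothetical metadata layout: one entry per sign video.
# [{"word": "...", "category": "...", "duration_s": 2.4, "file": "signs/salam.mp4"}, ...]
with open("msl_metadata.json", encoding="utf-8") as f:
    entries = json.load(f)

by_category: dict[str, list[str]] = {}
for entry in entries:
    by_category.setdefault(entry["category"], []).append(entry["word"])
    assert Path(entry["file"]).exists(), f"missing video for {entry['word']}"

for category, words in sorted(by_category.items()):
    print(category, len(words))  # quick inventory of collected signs per category
```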


Published Research

Our long abstract paper [20], entitled ‘Towards A Sign Language Gloss Representation Of Modern Standard Arabic’, was accepted for presentation at the AfricaNLP workshop of the 8th International Conference on Learning Representations (ICLR 2020) on April 26th in Addis Ababa, Ethiopia. In this paper we were interested in the first stage of the translation from Modern Standard Arabic to sign language animation, namely generating a sign gloss representation. We identified a set of rules mandatory for the sign language animation stage and performed the generation taking into account the pre-processing proven to have significant effects on translation systems. The presented results are promising but still far from satisfying all the mandatory rules.

Mobile Application

The application is developed with the Ionic framework, a free and open-source mobile UI toolkit for building cross-platform apps for native iOS, Android and the web, all from a single codebase. The application is composed of three main modules: the speech-to-text module, the text-to-gloss module and the gloss-to-sign animation module.

In the speech-to-text module, the user can choose between Modern Standard Arabic and French. The user can long-press the microphone and speak, or type a text message. The voice message is transcribed to a text message using the Google Cloud API services. In the text-to-gloss module, the transcribed or typed text message is converted to a gloss. This module is not implemented yet; the results from our published paper are currently under test to be adopted. Finally, in the gloss-to-sign animation module, at first we tried to use existing avatars like ‘Vincent’ [ref], a popular, high-quality rigged character freely available on Blender Cloud. We started animating the Vincent character using Blender before realizing that the generated animations were very large because of the character’s high resolution. Therefore, in order to be able to animate the character within our mobile application, 3D designers joined our team and created a smaller avatar named ‘Samia’. The designers recommended using Autodesk 3ds Max instead of the initially adopted Blender; 3ds Max is built on a modular architecture, compatible with multiple plugins and with scripts written in the proprietary MAXScript language. In future work, we will animate ‘Samia’ using the Unity engine, which is compatible with our mobile app.
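
The speech-to-text module relies on Google Cloud services. For illustration, a minimal transcription call with the google-cloud-speech Python client is sketched below; the mobile app itself calls the service from its Ionic code, and the language code and audio settings shown here are assumptions.

```python
from google.cloud import speech

def transcribe(path: str, language_code: str = "ar-MA") -> str:
    """Send a short recording to Cloud Speech-to-Text and return the transcript."""
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language_code,  # e.g. "fr-FR" when the French option is selected
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```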

References

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D

AI4D blog series: Text to speech WOLOF dataset

In this work, we propose to create a Wolof text-to-speech dataset. A Text-To-Speech (TTS) dataset is composed of pairs of text and audio, where the text is the transcription of the associated audio. But before we dive into the process of collecting the dataset, let's take a look at some interesting facts about the Wolof language and why it is important to build such a dataset.

Wolof /ˈwoʊlɒf/ is a language of Senegal, the Gambia and Mauritania, and the native language of the Wolof people. Like the neighbouring languages Serer and Fula, it belongs to the Senegambian branch of the Niger–Congo language family. Unlike most other languages of the Niger–Congo family, Wolof is not a tonal language.[1]

Wolof is spoken by more than 10 million people and about 40 percent (approximately 5 million people) of Senegal’s population speak Wolof as their native language. Increased mobility, and especially the growth of the capital Dakar, created the need for a common language.

Today, an additional 40 percent of the population speak Wolof as a second or acquired language. In the whole region from Dakar to Saint-Louis, and also west and southwest of Kaolack, Wolof is spoken by the vast majority of the people.

Typically when various ethnic groups in Senegal come together in cities and towns, they speak Wolof. It is therefore spoken in almost every regional and departmental capital in Senegal.[1]

Goal and benefits

Our goal here is to help researchers and companies to have a dataset that they can use to experiment with and build automatic systems that convert text to audio. This type of system can help people with reading difficulties (e.g. blind or illiterate people) to access information and interact with other people or with new technologies (e.g. web and mobile applications). There is also the fact that Wolof is not written correctly by most native and non-native speakers, so it is difficult for them to read Wolof text; with a TTS system they could more easily understand Wolof text and also learn how Wolof should be written in the process.

Text data collection and preparation

The text collection is the phase of creating clean and representative text that can be used to do the recordings.

Sources: Unlike popular languages such as English or French, Wolof texts are very scarce on the Internet and in digital form in general, so we had to make an extra effort just to get the raw text data.

The text used to build this dataset was collected from different sources, such as Wolof news websites (sabaal and dekufi), Wikipedia, and many texts provided by the Wolof expert on our team.

Cleaning: The cleaning of the text was the most challenging and time-consuming task of this work. We had to remove non-Wolof sentences, unused symbols or words, and overly long sentences or paragraphs. Part of the cleaning was also done manually.

With the help of our Wolof expert, we also developed an algorithm to convert Wolof text into Wolof phonemes. This part is crucial, because we needed to verify that the text corpus covers the Wolof phonemes correctly: if some phonemes are poorly covered, the resulting TTS system will have difficulty converting them into their corresponding sounds. After some iterations, we were able to choose sentences that cover all Wolof phonemes with a good distribution with respect to phoneme frequencies.
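
A minimal sketch of such a grapheme-to-phoneme conversion and coverage check, assuming a simple mapping table; the mapping shown is purely illustrative and much smaller than the one built with the Wolof expert.

```python
from collections import Counter

# Illustrative grapheme-to-phoneme mapping; the real table covers all Wolof graphemes,
# including digraphs and prenasalized consonants.
G2P = {"nd": "nd", "ng": "ŋ", "x": "x", "ë": "ə", "a": "a", "b": "b", "d": "d",
       "e": "e", "i": "i", "j": "ɟ", "k": "k", "l": "l", "m": "m", "n": "n",
       "o": "o", "p": "p", "r": "r", "s": "s", "t": "t", "u": "u", "w": "w", "y": "j"}

def to_phonemes(sentence: str) -> list[str]:
    """Greedy longest-match conversion of a Wolof sentence into phonemes."""
    phonemes, i, text = [], 0, sentence.lower()
    while i < len(text):
        for size in (2, 1):  # try digraphs first
            chunk = text[i:i + size]
            if chunk in G2P:
                phonemes.append(G2P[chunk])
                i += size
                break
        else:
            i += 1  # skip characters outside the mapping (spaces, punctuation)
    return phonemes

def coverage(sentences: list[str]) -> Counter:
    """Count how often each phoneme appears, to check coverage and balance."""
    counts = Counter()
    for s in sentences:
        counts.update(to_phonemes(s))
    return counts
```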

Where are we: the collection of the text is complete with more than 30 000 sentences cleaned and ready for recording.

Audio data collection

The audio collection is the phase of creating recordings that correspond to the already cleaned text.

The human part: The audio collection is done by two actors, a male and a female voice. Each one needs to record at least 20 000 of the 30 000 cleaned sentences. We had some issues starting the recording because of the time spent on text cleaning, and we also had some delay in obtaining the microphone and the other material resources needed for the recording. The other problem was building the web platform that the actors use to do the recording, which we discuss in the next section.

The platform for recording: We forked and modified the open-source Common Voice project [2] from Mozilla so that our actors can easily do the recording using just their web browser. Data is collected and automatically sent to an S3 bucket after every 5 recordings.
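
For illustration, a batched upload of this kind could look like the following sketch using boto3; the bucket name and key layout are assumptions, and only the batch size of 5 mirrors the description above (the actual platform is a modified Common Voice web app).

```python
import boto3

BATCH_SIZE = 5  # recordings are pushed after every 5 clips, as described above
s3 = boto3.client("s3")
pending: list[tuple[str, str]] = []  # (local_path, s3_key)

def add_recording(local_path: str, sentence_id: str, bucket: str = "wolof-tts-recordings"):
    """Queue a finished clip and flush the batch to S3 once it reaches BATCH_SIZE."""
    pending.append((local_path, f"raw/{sentence_id}.wav"))
    if len(pending) >= BATCH_SIZE:
        for path, key in pending:
            s3.upload_file(path, bucket, key)
        pending.clear()
```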

Where are we: We had a big delay in the collection of the recordings (the recording started just three weeks ago); as of writing this article (2020/10/20), we have collected 4000 of the 40 000 recordings. We hope that the recording rate will increase once the actors are more used to the process, and we expect to collect at least 1400 recordings per week from the two actors.

After the recording is done, we will verify and clean the audio dataset, for example by trimming silences and checking durations. We will also build a baseline model with this dataset and make it available to the community.
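
A minimal sketch of that audio clean-up pass, assuming WAV clips and using librosa/soundfile; the silence threshold and duration bounds are illustrative choices, not the project's final settings.

```python
import librosa
import soundfile as sf

def clean_clip(in_path: str, out_path: str,
               min_s: float = 1.0, max_s: float = 15.0, top_db: int = 30) -> bool:
    """Trim leading/trailing silence and keep the clip only if its duration is plausible."""
    audio, sr = librosa.load(in_path, sr=22050)
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
    duration = len(trimmed) / sr
    if not (min_s <= duration <= max_s):
        return False  # flag the clip for manual review instead of keeping it
    sf.write(out_path, trimmed, sr)
    return True
```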

Conclusion

We are really grateful to AI4D for giving us the opportunity and the means to build and collect a Wolof TTS dataset. We hope that this kind of initiative will become more frequent, helping to create more and more datasets for our local languages, so that new systems and models can be built with them, increasing accessibility to new technologies and helping more people access information in their own language.

References

[1] https://en.wikipedia.org/wiki/Wolof_language

[2] https://github.com/mozilla/common-voice

AI4D blog series: Preservation of Indigenous Languages

Context

In most African countries, perhaps more so than elsewhere, the majority of the population does not speak the official languages; instead, they speak traditional languages. In some countries, this proportion is as high as 80%. Because of this language barrier, this large part of the population is practically excluded from the life of society: they have no access to information or education and cannot really participate in the debates on the socio-economic development of their country.

From another point of view, our values, cultures, knowledge of all kinds and history are conveyed orally in these languages and thus remain inaccessible to the rest of the world.

Objectives

The main objective of the Preservation of Indigenous Languages project is to contribute to the preservation of local languages and the enhancement of local language content through (1) archiving, (2) promotion and (3) popularization of local language content. Archiving will make it possible to preserve content and knowledge in local languages. We will collect and use existing data in local languages for this purpose. The promotion will be done by exploiting the richness of this local language content. And popularisation will be made possible by making this content accessible in the official languages. In order to achieve these objectives, our project is divided into three parts, all of which have an important upstream data collection and pre-processing stage:

  • Transcription from local languages to text in local languages
  • Translation from local languages to official languages (French) and vice versa
  • Voice synthesis of texts in local languages into audio in local languages.

Team

To successfully carry out the project, we have set up a dedicated team of 10 people:

  • A research mentor with a background in AI,
  • Two practice mentors with a background in local languages. The first is a specialist in education in local languages, and the second has produced various translation works from French to Mooré, the main local language in Burkina Faso.
  • A research assistant with a background in linguistics. In this case, the assistant was a student whose responsibility was to help with the collection of content in local languages and the pre-processing of data.
  • Three computer programmers. In this case, the programmers were computer science students (Master’s and PhD students). Each of them is in charge of one of the three parts of the project plus some pre-processing tasks.

Implementation

For this project, we limited ourselves to one local language, Mooré. This language is the main language of Burkina Faso and is spoken by more than half of the population. There are also many sources of data in this language and important work has already been done on translations from French into this language, especially in the educational and religious fields.

(0) Data Collection: As announced, data collection is an important and necessary step for the different parts of the project. It is also one of the most difficult steps, since open data is not yet compulsory in our countries.

With the invaluable help of practice mentors, meetings were organised with the main institutions, both public and private, to explore existing data and the extent to which these data could be exploited.

Among the institutions that were contacted, the main ones are the following:

  • Fondation pour le Développement Communautaire/ Burkina Faso(FDC-BF);
  • the biblical alliance of Burkina Faso;
  • Fonds pour l’alphabétisation et l’éducation non formelle (FONAENF);
  • The Directorate of Research in Non-Formal Education (DRENF);
  • The DPDMT;
  • Ecole et langue nationale en Afrique (ELAN);
  • Savane Media.

We were thus able to access a certain amount of data, but it was not always in digital format or complete. This required an enormous amount of pre-processing work, either to digitize the data or to complete it with translations or transcriptions.

One of the first sources of data we had access to was the Mooré Bible in text and audio form. After pre-processing (cutting the audio sentence by sentence or verse by verse, and aligning the Mooré and French texts), this source was used for the first tests of the different parts of the project.
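
A minimal sketch of the verse-level alignment step, under the assumption that both the Mooré and French Bibles are available as text files keyed by book/chapter/verse identifiers; the file format and file names are hypothetical.

```python
import csv

def load_verses(path: str) -> dict[str, str]:
    """Read 'book|chapter|verse<TAB>text' lines into a dict keyed by the verse id."""
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ref, text = line.rstrip("\n").split("\t", 1)
            verses[ref] = text.strip()
    return verses

moore = load_verses("bible_moore.tsv")    # hypothetical file names
french = load_verses("bible_french.tsv")

# Keep only the verse ids present in both versions to obtain a parallel corpus.
with open("moore_french_parallel.tsv", "w", encoding="utf-8", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for ref in sorted(moore.keys() & french.keys()):
        writer.writerow([ref, moore[ref], french[ref]])
```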

The collection and pre-processing work is still in progress to enrich our data sources and improve our models.

(1) Transcription: Since writing is not yet very widespread in our local languages, a large amount of data in local languages exists in audio format. In addition, people who cannot write will always use oral communication to express themselves. Transcribing audio content in local languages into text is therefore essential, not only to collect existing information but also to gather what people have to say.

After a review of the state of the art and testing of existing transcription tools, the student in charge of this part implemented his transcription model based on the DeepSpeech tool, using data from the Bible for these tests. In addition to the pre-processing workload and the working conditions made difficult by the Covid-19 pandemic, we unfortunately had problems with computing capacity and are working with one of the partners to increase the capacity of the leased virtual machines.
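
As an illustration of where this is heading, a minimal inference sketch with Mozilla DeepSpeech is shown below, assuming a Mooré acoustic model has already been trained and exported (the file names are placeholders); training itself is done with DeepSpeech's own scripts on CSV manifests of audio/transcript pairs.

```python
import wave
import numpy as np
import deepspeech

# Placeholder paths: an exported Mooré acoustic model and an optional language-model scorer.
model = deepspeech.Model("moore_output_graph.pbmm")
model.enableExternalScorer("moore.scorer")

def transcribe(wav_path: str) -> str:
    """Transcribe a 16 kHz, 16-bit mono WAV file with the DeepSpeech model."""
    with wave.open(wav_path, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return model.stt(audio)

print(transcribe("verse_0001.wav"))
```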

(2)  Translation: Translation is at the heart of this project. It aims to make official language information accessible to people in rural areas but also to provide access to the wealth of local language content.

After a review of the state of the art of existing translation approaches, the student in charge of this component applied classical neural machine translation techniques to the Bible data using OpenNMT. The results were not very good, as one could expect given the lack of training data, so he is now implementing meta-learning using the Meta-NMT tool. Meta-learning has been reported in the literature to perform better than the classical approach when little data is available.
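
For reference, a minimal sketch of how such a classical baseline could be launched with OpenNMT-py (version 2.x) on the aligned Bible data; the file paths, vocabulary settings and step counts are placeholders, not the actual experimental configuration.

```python
import subprocess
from pathlib import Path

# Minimal OpenNMT-py 2.x configuration for a Mooré -> French baseline (placeholder paths).
CONFIG = """\
save_data: run/moore_fr
src_vocab: run/moore_fr.vocab.src
tgt_vocab: run/moore_fr.vocab.tgt
data:
  bible:
    path_src: data/train.moore
    path_tgt: data/train.fr
  valid:
    path_src: data/valid.moore
    path_tgt: data/valid.fr
save_model: run/moore_fr_model
train_steps: 10000
valid_steps: 1000
"""

Path("run").mkdir(exist_ok=True)
Path("moore_fr.yaml").write_text(CONFIG, encoding="utf-8")
# Build the vocabularies from the full corpus, then train the baseline model.
subprocess.run(["onmt_build_vocab", "-config", "moore_fr.yaml", "-n_sample", "-1"], check=True)
subprocess.run(["onmt_train", "-config", "moore_fr.yaml"], check=True)
```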

Here, too, in addition to the need for more data, we face a need for computing capacity that should also be resolved with the provision of VMs.

(3) Voice synthesis: Voice synthesis will make it possible, after translation from the official languages into local languages, to make the content available to populations who cannot read but who will be able to listen to it in audio format. The student in charge of this part also carried out a review of the state of the art of existing tools in this field. He is currently testing different tools and studying different models. He unfortunately started with a slight delay but will continue his work in order to adapt a model and run tests with the collected data, so as to carry out voice synthesis of Mooré text into Mooré audio.

Results

At this stage, having just crossed the mid-term of the project, we can report that a number of milestones have been achieved:

  • Data collection has been done and is still ongoing.
  • Pre-processing of audio and text content, audio-to-text mapping in Mooré, and alignment of Mooré text with its French correspondence have been performed.
  • A transcription model for Mooré based on DeepSpeech has been implemented.
  • The classical translation approach has been implemented and tested on the Bible dataset.

Main challenges

Access to Data

After approaching about ten organizations, we were confronted with the problem of resource availability. Indeed, apart from the Bible, some training materials and a few translated official documents, there were very few documents available in both Mooré and French.

The organizations that produce Mooré content most often do so for training or awareness-raising aimed at the illiterate population; as a result, they do not produce the same content in French. As for radio and television channels, they broadcast directly in Mooré, without written notes, even for the presentation of the television news.

However, we found a lot of printed material, without digital versions and only in Mooré. For this phase of the project, we collected the already existing data available in both languages in digital format and carried out the alignment. This allowed us to test the model, and although it did not lead to conclusive results, it did highlight the problem of data availability. For further work, we plan to have the existing documents translated so that we have both versions to continue the work. We are aware that this is long-term work, but it is the indispensable condition for having enough data to make the results of the algorithms interesting.

Copyright

A second problem we encountered was copyright. Indeed, we do not always have direct access to the authors, and the holders of the documents are reluctant to share them without the authors’ agreement. In other cases, the documents had been commissioned by international organizations, so our interlocutors needed the agreement of these institutions before giving us access to the data. This takes time and has delayed access to the working data.

In the long term, we plan to bring together a group of authors and raise their awareness of the project so that they can help advocate for it.

Computing capacity

We unfortunately do not have a laboratory equipped with servers powerful enough to run our models. Our partnership with Anptic was supposed to allow us to use VMs with greater capacity to speed up testing, but administrative burdens also delayed the availability of the VMs.

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D