AI4D blog series: Preservation of Indigenous Languages

Context

In most countries, and perhaps more so in Africa than elsewhere, the majority of the population does not speak the official language; instead, they speak traditional languages. In some countries, this proportion is as high as 80%. Because of this language barrier, a large part of the population is effectively excluded from the life of society: they have little access to information or education and cannot really participate in debates on the socio-economic development of their country.

From another point of view, our values, cultures, history and knowledge of all kinds are conveyed orally in these languages and thus remain inaccessible to the rest of the world.

Objectives

The main objective of the Preservation of Indigenous Languages project is to contribute to the preservation of local languages and the enhancement of local-language content through (1) archiving, (2) promotion and (3) popularization of local-language content. Archiving will make it possible to preserve content and knowledge in local languages; we will collect and use existing data in local languages for this purpose. Promotion will be achieved by exploiting the richness of this local-language content, and popularization by making this content accessible in the official languages. To achieve these objectives, the project is divided into three parts, all of which share an important upstream stage of data collection and pre-processing:

  • Transcription of audio in local languages into text in local languages;
  • Translation from local languages to official languages (French) and vice versa;
  • Voice synthesis of texts in local languages into audio in local languages.

Team

To successfully carry out the project, we have set up a dedicated team of 10 people:

  • A research mentor with a background in AI,
  • Two practice mentors with a background in local languages. The first is a specialist in education in local languages, and the second has produced various translations from French into Mooré, the main local language in Burkina Faso,
  • A research assistant with a background in linguistics. In this case, the assistant was a student whose responsibility was to help with the collection of content in local languages and the pre-processing of data,
  • Three computer programmers. In this case, the programmers were computer science students (master's and PhD students). Each of them is in charge of one of the three parts of the project, plus some pre-processing tasks.

Implementation

For this project, we limited ourselves to one local language, Mooré. This language is the main language of Burkina Faso and is spoken by more than half of the population. There are also many sources of data in this language and important work has already been done on translations from French into this language, especially in the educational and religious fields.

(0) Data collection: As noted above, data collection is an important and necessary step for the different parts of the project. It is also one of the most difficult steps, as open data is not yet a requirement in our countries.

With the invaluable help of practice mentors, meetings were organised with the main institutions, both public and private, to explore existing data and the extent to which these data could be exploited.

Among the institutions that were contacted, the main ones are the following:

  • Fondation pour le Développement Communautaire / Burkina Faso (FDC-BF);
  • The Biblical Alliance of Burkina Faso;
  • Fonds pour l’alphabétisation et l’éducation non formelle (FONAENF);
  • The Directorate of Research in Non-Formal Education (DRENF);
  • The DPDMT;
  • École et langues nationales en Afrique (ELAN);
  • Savane Media.

We were thus able to access a certain amount of data, but it was not always in digital format or complete. This required an enormous amount of pre-processing work, either to digitize the data or to complete it with translations or transcriptions.

One of the first sources of data we had access to was the Mooré Bible in text and audio form. After pre-processing (cutting the audio sentence by sentence or verse by verse, and aligning the Mooré and French texts), this source was used for the first tests of the different parts of the project.
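
As an illustration of the audio-cutting step, the short sketch below splits a long recording on silences with the pydub library; the file names and silence thresholds are assumptions for illustration, not the project's actual values.

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # Load a long recording (hypothetical file name; decoding mp3 needs ffmpeg).
    audio = AudioSegment.from_file("bible_moore_chapter.mp3")

    # Split wherever the signal stays quiet long enough to suggest a sentence
    # or verse boundary; these thresholds usually need manual tuning.
    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # ms of silence taken as a boundary
        silence_thresh=audio.dBFS - 16,  # "quiet" relative to average loudness
        keep_silence=200,                # keep some padding around each chunk
    )

    # Export one WAV file per candidate verse for later manual checking
    # and pairing with the corresponding Mooré and French text.
    for i, chunk in enumerate(chunks):
        chunk.export(f"verse_{i:04d}.wav", format="wav")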

The collection and pre-processing work is still in progress to enrich our data sources and improve our models.

(1) Transcription: Since writing is not yet widespread in our local languages, a large amount of the data in local languages exists in audio format. In addition, people who cannot write will always use oral communication to express themselves. Transcribing audio content in local languages is therefore an essential step, not only to collect existing information but also to gather what people have to say.

After a review of the state of the art and testing of existing transcription tools, the student in charge of this part implemented his transcription model based on the DeepSpeech tool, using data from the Bible for his tests. On top of the pre-processing workload and working conditions made difficult by the COVID-19 pandemic, we unfortunately ran into problems with computing capacity, and we are working with one of our partners to increase the capacity of the leased virtual machines.
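
For readers unfamiliar with DeepSpeech, the sketch below shows how a trained model is queried from Python (DeepSpeech 0.7-style API); the Mooré model and scorer files are hypothetical, since training is still in progress.

    import wave

    import numpy as np
    from deepspeech import Model

    # Load the acoustic model and an external language-model scorer
    # (hypothetical Mooré artifacts; DeepSpeech only ships English ones).
    ds = Model("moore.pbmm")
    ds.enableExternalScorer("moore.scorer")

    # DeepSpeech expects 16-bit mono PCM at the model's sample rate (16 kHz).
    with wave.open("verse_0001.wav", "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(ds.stt(audio))  # the recognized Mooré text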

(2) Translation: Translation is at the heart of this project. It aims to make official-language information accessible to people in rural areas, but also to provide access to the wealth of local-language content.

After a review of the state of the art in translation approaches, the student in charge of this component applied classical neural machine translation techniques to the Bible data using OpenNMT. But the results were not very good, as one might expect given the scarcity of training data. He is therefore now implementing meta-learning using the Meta-NMT tool; meta-learning has been reported in the literature to perform better than the classical approach when little data is available.
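
To make the classical baseline concrete, the sketch below drives an OpenNMT-py (2.x) training run from Python; all paths and step counts are illustrative, assuming sentence-aligned Mooré-French files ("mos" is the ISO 639 code for Mooré).

    import subprocess

    import yaml

    # Minimal OpenNMT-py configuration over aligned Mooré-French files
    # (hypothetical paths; a real run also needs tokenization and tuning).
    config = {
        "save_data": "run/example",
        "src_vocab": "run/vocab.mos",
        "tgt_vocab": "run/vocab.fr",
        "data": {
            "bible": {"path_src": "data/train.mos", "path_tgt": "data/train.fr"},
            "valid": {"path_src": "data/valid.mos", "path_tgt": "data/valid.fr"},
        },
        "save_model": "run/model",
        "train_steps": 20000,
        "valid_steps": 1000,
    }
    with open("config.yaml", "w") as f:
        yaml.safe_dump(config, f)

    # Build the vocabularies, then train, via OpenNMT-py's command-line tools.
    subprocess.run(["onmt_build_vocab", "-config", "config.yaml", "-n_sample", "-1"], check=True)
    subprocess.run(["onmt_train", "-config", "config.yaml"], check=True)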

Here, too, in addition to the need for more data, we face a need for computing capacity that should also be resolved with the provision of VMs.

(3) Voice synthesis: Voice synthesis will make content that has been translated from the official languages into local languages available to populations who cannot read, by providing it in audio format. The student in charge of this part also carried out a review of the state of the art of existing tools in this field. He is currently testing different tools and studying different models. He unfortunately started with some delay, but will continue his work to adapt a model and run tests on the collected data, in order to synthesize Mooré text into Mooré audio.
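
To give a sense of the final step, the sketch below runs text-to-speech inference with the open-source Coqui TTS library; the blog does not name the toolkit under evaluation, and the pretrained French voice is only a stand-in until a Mooré voice can be trained.

    from TTS.api import TTS

    # Load a pretrained French model as a placeholder; an actual deployment
    # would adapt or retrain a model on recorded Mooré speech.
    tts = TTS(model_name="tts_models/fr/mai/tacotron2-DDC")

    # Synthesize a sentence to a WAV file (placeholder text).
    tts.tts_to_file(text="Bonjour tout le monde", file_path="demo.wav")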

Results

At this stage, while we just crossed the mid-term of the project execution, we can report that a number of milestones have been achieved:

  • Data collection has been done and is still ongoing.
  • Pre-processing of audio and text content, mapping of audio to text in Mooré, and alignment of Mooré text with its French correspondence have been performed.
  • A transcription model for Mooré based on DeepSpeech has been implemented.
  • The classical translation model has been implemented and tested on the Bible dataset.

Main challenges

Access to Data

After approaching about ten organizations, we were confronted with the limited availability of resources. Indeed, apart from the Bible, some training materials and translated official documents, there were very few documents available in both Mooré and French.

The organizations that produce Mooré content most often do so for training or awareness-raising aimed at the illiterate population; as a result, they do not produce the same content in French. As for radio and television channels, they broadcast directly in Mooré, without written notes, even for the presentation of the television news.

However, we found a lot of printed material, without digital versions and only in Mooré. For this phase of the project, we collected and aligned the data that already existed in both languages in digital format. This allowed us to test the model and, although it did not lead to conclusive results, it clearly exposed the problem of data availability. For further work, we plan to translate the existing documents so that we have both language versions and can continue the work. We are aware that this is a long-term effort, but it is the indispensable condition for having enough data to make the results of the algorithms meaningful.

Copyright

A second problem we encountered was copyright. Indeed, we do not always have direct access to the authors, and the holders of the documents are reluctant to share them without the authors' agreement. In other cases, the documents had been commissioned by international organizations, so our interlocutors needed the agreement of these institutions before giving us access to the data. This takes time and has delayed access to the working data.

In the long term, we plan to bring together a group of authors to raise their awareness of the project so that they can help advocate for it.

Computing capacity

We unfortunately do not have a laboratory equipped with servers powerful enough to run our models. Our partnership with ANPTIC was supposed to give us access to virtual machines with greater capacity to speed up testing, but administrative burdens have also delayed the availability of these VMs.

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D

New call for AI4D innovation grants open now

Deep Learning Indaba 2019, Nairobi, Kenya

Knowledge 4 All Foundation partnered with the Deep Learning Indaba to fund research projects across Africa that are collaborative at heart and have a strong development focus.

This Call for Proposals invites individuals, grassroots organizations, initiatives, and academic and civil society institutions to apply for funding for mini-projects.

A mini-project could also be early-stage research around our Grand Challenge of curing leishmaniasis.

AI4D blog series: Building a Medicinal Plant Database for Facilitating the Exploitation of Local Ethnopharmacological Knowledge

Context

In many African countries such as Burkina Faso, people still rely quite often on traditional medicine for both common and uncommon diseases. This is particularly true in rural areas where 71% of the Burkinabe people live. While the research literature acknowledges the pharmacological virtues of some plants, the relevant knowledge is neither sufficiently organized nor widely shared.

Objectives

The ultimate goal of this project is to build an open, searchable database of medicinal plants. To that end, the project focuses on (1) collecting a variety of information on such plants from diverse sources, (2) implementing a platform to expose the constructed knowledge, and (3) developing context-specific tools to accelerate the accurate identification of plants in the wild.

Team

To successfully carry out the project, we have set up a dedicated team of 10 people:

  • A research mentor with a background in AI,
  • A practice mentor with a background in traditional medicine. In this case, the mentor happened to be the director of the promotion of traditional medicine at the Ministry of Health,
  • A research assistant with a background in Sociology. In this case, the assistant was a student whose responsibility was to help on the collection of ethnobotanical data,
  • Three computer programmers. In this case, the programmers were computer science students tasked with devising and implementing the database, the search engine and the plant identification tool,
  • Four investigators to collect data on the medicinal virtues of plants.

Implementation

(1) Data collection: Work sessions with the practice mentor allowed us to devise an adapted methodology and identify data sources.

The adopted methodology consists of drawing up a list of plants based on relevant research literature and online databases. The team can then conduct an ethnobotanical study with traditional medicine practitioners to gather information on the uses of plants for therapeutic purposes. For each plant, we agreed to record the following information: scientific name, species, family, name in three local languages (Mooré, Dioula, Fulfulde), spatial location, status (endangered or not), and medicinal use (virtues).
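
Rendered as a record type, the agreed fields might look like the sketch below; the field names are illustrative, not the project's actual schema.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class MedicinalPlant:
        """One entry in the medicinal plant database (illustrative schema)."""
        scientific_name: str
        species: str
        family: str
        local_names: Dict[str, str]   # keys: "moore", "dioula", "fulfulde"
        spatial_location: str         # where the plant is found
        endangered: bool              # conservation status
        medicinal_uses: List[str] = field(default_factory=list)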

The data collection is mainly performed in the two largest cities in the country, namely Ouagadougou and Bobo-Dioulasso. In carrying out these activities, we were surprised by the amount of research that has already been done on medicinal plants, although the data is neither sufficiently structured nor shared. In addition, we discovered that both traditional practitioners and the state are structuring actions to promote traditional medicine, so our project reinforces an existing mechanism. As the activities continue, we plan to create, in addition to the plant database, a database of traditional practitioners, in order to reference them more easily in future research work.

(2) Platform development: With respect to the platform, we leverage the Elasticsearch engine to build the backend database and search engine.
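
A minimal sketch of this setup with the official Python client (7.x-style API) is shown below; the index name, mapping and example document are assumptions for illustration.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Create an index whose mapping mirrors the agreed plant record.
    es.indices.create(index="plants", body={
        "mappings": {
            "properties": {
                "scientific_name": {"type": "text"},
                "family": {"type": "keyword"},
                "local_names": {"type": "object"},
                "medicinal_uses": {"type": "text"},
            }
        }
    })

    # Index one illustrative document.
    es.index(index="plants", body={
        "scientific_name": "Khaya senegalensis",
        "family": "Meliaceae",
        "local_names": {"moore": "(local name)"},
        "medicinal_uses": ["fever", "malaria"],
    })

    # Full-text search over the recorded virtues.
    results = es.search(index="plants", body={
        "query": {"match": {"medicinal_uses": "fever"}}
    })
    print(results["hits"]["hits"])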

(3) Plant detector implementation: We also devised a deep learning system that classifies plant leaf images for fast identification in the wild. This work required contextualization, as we assumed that users would carry mobile phones with little computing power and potentially no data network connectivity. We therefore implemented a neural network model compression algorithm that yielded a classifier with reasonable prediction accuracy that is nevertheless runnable on low-resource devices.
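
The post does not detail the compression algorithm, so the sketch below shows one common route to the same goal, post-training quantization with TensorFlow Lite, starting from a hypothetical trained Keras model.

    import tensorflow as tf

    # Load a trained leaf classifier (hypothetical file).
    model = tf.keras.models.load_model("leaf_classifier.h5")

    # Convert to TensorFlow Lite with default weight quantization, shrinking
    # the model so it can run on low-end phones, fully offline.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("leaf_classifier.tflite", "wb") as f:
        f.write(tflite_model)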

Results

At this stage, while we just crossed the mid-term of the project execution, we can report that a number of milestones have been achieved:

  • the plant detector has been implemented
  • the first batch of medicinal plant dataset has been collected
  • the platform backend architecture has been finalized

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D

Mini-documentary on Artificial Intelligence 4 Development

Mini-documentary on AI4D Artificial Intelligence 4 Development

We produced a mini-documentary describing the ideas, aspirations, and research potential of our African colleagues in the field of Artificial Intelligence.

The footage was taken at the kick-off of the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, organized by K4A, IDRC and SIDA in Nairobi, Kenya, April 2019. @IDRC_CRDI #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D

The workshop brought together an emerging network of machine learning and AI practitioners and researchers to undertake a collaborative roadmap for AI for Development in Africa. The three days zoomed in on three critical areas: 1) policy and regulations, 2) skills and capacity building, and 3) the application of AI in Africa.

 

K4A grant to solve access to Nigeria’s legislative bills with AI

AI4D mini-grants presentations, Nairobi 2019

K4A grant recipients Adewale Akinfaderin, Olamilekan Wahab and Olubayo Adekanmbi are successfully using Artificial Intelligence to digitize parliamentary bills in sub-Saharan Africa, and specifically in Nigeria. Read their recent interview in the Techpoint.Africa article.

Knowledge 4 All Foundation sponsors #AI4D Africa Innovation Awards @Indaba

In July 2019 we issued a call for proposals inviting individuals, grassroots organizations, initiatives, and academic and civil society institutions to submit proposals for mini-projects within the 2019 Artificial Intelligence for Development (AI4D) initiative funded by IDRC. We selected 10 winners and invited the awardees to present their solutions at the AI4D workshop during the Deep Learning Indaba 2019 conference.

Deep Learning Indaba is the most exciting AI event in Africa

Last week more than 700 machine learning researchers from over 30 African countries came together at Kenyatta University in Nairobi, Kenya for the third annual Deep Learning Indaba. Knowledge 4 All Foundation helped raise funds for the Indaba in support of its aim of strengthening African machine learning.

Knowledge 4 All Foundation runs #IJCAI presentations that show appetite for AI and SDGs

At this year's edition of IJCAI, we successfully organised a workshop on the topic of the Sustainable Development Goals. The talks presented machine learning solutions in practice, from poaching to building structures to poverty and wealth. The topics of the SDGs and AI in development seem to be gaining traction in practice and can make a valid contribution to science. The X5GON project, focusing on education and Open Educational Resources, was featured in a keynote presentation.

AI4D interview series: Fernando Perini, IDRC

Fernando Perini, IDRC (attribution by AI4D, CC-BY 2.0., https://ai4d.ai/blog-fernando/)

My blue-sky project is to make reality out of the discussions in Nairobi and the AI4D Programme

Recorded at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, organized by K4A, IDRC and SIDA, Nairobi, Kenya, April 2019. @IDRC_CRDI #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D