Motivation

In Africa, English, French, Portuguese and Arabic are the typical languages of instruction as well as official communication. On the other hand, there are approximately 2,000 indigenous languages.

Over time, indigenous languages are being replaced even among people of the same community of origin. The situation is exacerbated by the advent of digital platforms which have made communication easier in English but tedious in other languages.

Natural language processing tools such as autocorrection and autocompletion, which have enhanced the usability of electronic communication in only a few languages, present obstacles for indigenous languages. The absence of equivalent facilities for these languages causes frustration.

For example, it is common to type in an indigenous language and have the autocorrect program replace words with English ones that are similar in spelling but completely different in meaning.

This reduction in the usefulness of indigenous languages puts them at risk. It is therefore necessary to develop digital resources for these languages to make them relevant in the digital age and hence boost their use and preservation.

Objectives

To develop openly licensed, free-to-use African language corpora.
To set up a web-based platform for crowdsourcing stories in African languages.
To set up an African language short story competition on the platform and create awareness.
To collect a written corpus of African languages.
To provide openly accessible material for natural language processing research for African languages.
To develop digital resources for indigenous African languages, in particular spell checkers (Etoori, Chinnakotla, and Mamidi 2018; Monson et al. 2004) for desktop, mobile and web applications, for which computational resources from CHPC can be used to train deep learning models.
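The spell-checker objective can be illustrated with a minimal sketch. It is not the deep learning approach named in the objective: it swaps in a much simpler technique, a word list built from a corpus plus approximate string matching from the Python standard library. The function names `build_vocabulary` and `suggest`, the similarity cutoff and the three Kiswahili sentences are assumptions made for illustration only and do not come from the project.

```python
from collections import Counter
import difflib
import re


def build_vocabulary(corpus_text):
    """Count how often each lowercased word occurs in the corpus."""
    return Counter(re.findall(r"\w+", corpus_text.lower()))


def suggest(word, vocabulary, max_suggestions=3, cutoff=0.75):
    """Return the word itself if it is known, else similar corpus words.

    Candidates come from difflib's similarity ratio and are then
    ordered so that more frequent corpus words appear first.
    """
    word = word.lower()
    if word in vocabulary:
        return [word]
    candidates = difflib.get_close_matches(
        word, list(vocabulary), n=max_suggestions, cutoff=cutoff
    )
    return sorted(candidates, key=lambda w: -vocabulary[w])


# Toy corpus (illustrative sentences, not project data).
toy_corpus = (
    "Watoto wanasoma vitabu. "
    "Watoto wanacheza nje. "
    "Ninapenda kusoma vitabu."
)
vocabulary = build_vocabulary(toy_corpus)

print(suggest("vitabu", vocabulary))  # a word that occurs in the corpus
print(suggest("vitabo", vocabulary))  # a misspelling of a corpus word
```

Because suggestions are drawn only from text in the target language, a checker built this way cannot replace a word with an English look-alike. Its quality depends directly on the size and coverage of the collected corpus.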

Long term vision

It is hoped that the competition will be successful and provide a model that can be continued on an annual basis. It can provide a means to extend the available corpora, encourage literacy, create pride in indigenous cultures and improve cultural understanding between peoples in different language groups.

If the materials are appropriately licensed, they can also be used to create good text-to-speech systems. For example, the Common Voice project records people reading openly licensed material in different languages so that the recordings can be used for training deep learning models to provide realistic, human-like text-to-speech systems. Such systems can be used in a variety of commercial applications, such as car navigation systems. The project will pilot the recording of short story titles to determine whether this crowdsourcing strategy can also be used for collecting a speech corpus.

In addition, by providing written materials suitable for school pupils, the project should help in attaining the Sustainable Development Goal of quality education for all by increasing literacy. Another Sustainable Development Goal is peace and justice. Many wars in Africa occur between peoples who speak different African languages. Encouraging knowledge of more than one African language also creates better cultural understanding, which should result in fewer conflicts.

The corpora should enable the use of machine learning methods to identify the language a text is written in, provided it is one of the collected African languages, for example in order to route a query to the right language engine. In addition, by encouraging cooperation between computer scientists and people who study literature, it is hoped to collaboratively build spelling and grammar checkers, plagiarism checkers, stylometric analysis tools and deep learning enabled content generation software that are useful for African languages.
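The language identification task mentioned above can be sketched with character trigram profiles compared by cosine similarity. This is one simple, well-known approach and is offered only as an illustration under stated assumptions, not as the method the project will use. The function names, the language labels and the toy sentences are hypothetical; a real system would build its profiles from the collected corpora.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Split lowercased, whitespace-normalised text into character n-grams."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(sentences, n=3):
    """Count the character n-grams over all sentences of one language."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n))
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_language(text, profiles, n=3):
    """Return the label of the profile most similar to the text."""
    query = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training sentences (illustrative only, not project data).
samples = {
    "Kiswahili": [
        "Habari ya asubuhi",
        "Ninapenda kusoma vitabu",
        "Watoto wanacheza nje",
    ],
    "Afrikaans": [
        "Goeie môre",
        "Ek lees graag boeke",
        "Die kinders speel buite",
    ],
}
profiles = {lang: build_profile(sents) for lang, sents in samples.items()}

print(identify_language("Watoto wanasoma vitabu", profiles))
print(identify_language("Die kinders lees boeke", profiles))
```

The returned label could then be used as a key to select the matching spell checker or search engine. With only a few sentences per language the profiles are very sparse, so short or unusual inputs may be misclassified; larger corpora make the profiles more reliable.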

In the long term, it is hoped to stimulate the production of natural language processing tools and aids such as morphological analyzers, lemmatizers, tokenizers, parsers, part-of-speech taggers, and sentiment analysis tools for African languages. The Natural Language Toolkit (NLTK) contains many of the above tools and is a free and open source library that enables automated analysis of English text.

This allows many researchers to perform natural language processing tasks, and many businesses to become more efficient: computers can automate tasks such as answering common customer queries, freeing expensive staff for higher-value work. For a few languages, such as Kinyarwanda, Kiswahili and Afrikaans, development of these tools has begun, but much work remains to be done, and for many other African languages tool development has not yet started. Such tools enable more effective machine translation.

Machine translation can greatly reduce the cost of producing document translations, which is especially useful for government communications, and can make retail consumer experiences more pleasant. The current state of the art for machine translation of low-resource languages uses 60,000 sentences for a language pair (Fraser et al. 2020). This pilot project will not be able to collect such a dataset, but will identify places in Africa where such datasets can be collected. A Bible translation is approximately 60,000 sentences, but the Bible is not a typical text, and there are many domains where alternative and additional texts are required.

Personnel

Prof. Audrey Mbogho: Research interests include applications of machine learning to developing world problems, including processing and preservation of low-resource languages.
Dr. Lilian Wanzare: Research interest is Artificial Intelligence, in particular Natural Language Processing and building text processing tools for low-resource languages.
Dr. Benson Muite: Research interests include high performance computing and big data analysis. He will be the principal project coordinator.
Prof. Constantine Yuka: Research interests include African linguistics and literature.
Mr. Juan Steyn: Research interests include digital learning and digital humanities.

An African Short Story Language Corpus

Knowledge 4 All Foundation Ltd.