Abstract

Existing speech recognition services are available only in major languages. Currently, none of the main players in the global voice-assistant market (Amazon’s Alexa, Apple’s Siri, or Google Home) supports a single native African language. These services also tend to work better for men than for women and struggle to understand people with different accents, all of which results from biases in the data on which they are trained.

Project Description and Specifications

Currently, the Igbo language has no public open-source speech dataset, despite having 42 million speakers within and beyond the borders of Nigeria. The same holds for Yoruba, Hausa and other African languages. By leveraging the Common Voice platform, which was launched to help address biases and the resulting inequalities in voice data, and by incorporating community events and incentive mechanisms (much like Rwanda’s ‘Umuganda’), we plan to curate 1,000 hours of speech recordings, diverse in gender, age and dialect, for each of Igbo, Hausa and Yoruba. This, we believe, will be a step toward the inclusion of African languages in speech technologies. For Igbo and Yoruba, which are not fully localized on Common Voice, we will first complete the localization process before proceeding with the recording. Localization involves translating the project tools and material on the Common Voice platform so that contributors can understand them in their own language.

Anticipated Use Cases and Benefits

Existing speech recognition services are not available in many African languages, so the speakers of these languages are excluded from the benefits of voice-enabled technologies. This dataset will pave the way for speech technologies, such as speech-to-text, text-to-speech, speech translation and speech modelling, for these African languages, which hitherto had little or no public data. For one, it can serve as a training and/or evaluation dataset for speech processing tasks. For another, the availability of such a dataset will enable the easy creation of voice-enabled services targeted at indigenous grassroots communities. For example, during the COVID-19 pandemic in Rwanda, Digital Umuganda was able to use its large curated Kinyarwanda speech-text data on Common Voice to create a health chatbot that provided necessary health information to local Rwandan communities. Thus, this project will foster the inclusion of most of Africa’s grassroots population, who speak these languages, in the areas of health, education and information.

Personnel

  1. Chris Emezue is a Master’s student at the Technical University of Munich, studying Mathematics in Data Science. He has worked extensively, through Masakhane, on a number of AfricaNLP projects (such as MMTAfrica and OkwuGbe). He has worked as a natural language processing (NLP) researcher at the Siemens AI Lab, LMU, and HuggingFace (in March 2022).
  2. Adaeze Adigwe is a PhD student at the University of Helsinki, Finland, researching deep learning models for speech synthesis in conversational AI applications. She also works as a speech scientist at ReadSpeaker in the Netherlands. Her academic background includes a Master’s in Computer Science from Columbia University and a Bachelor’s in Electrical Engineering from Northeastern University. Her research interests include speech and language processing, with a focus on prosody, spoken dialogue systems and low-resource languages.
  3. David Adelani (NLP researcher, https://dadelani.github.io/ ) is a PhD student in computer science at Saarland University, Germany. He led the development of MasakhaNER (Adelani et al. 2021) – a named entity recognition dataset for 10 African languages. The dataset is being expanded to 20 languages supported by Lacuna Fund.
  4. Shamsuddeen Muhammad is a PhD candidate at the University of Porto, Portugal. He is a faculty member at the Faculty of Computer Science and Information Technology, Bayero University, Kano, Nigeria. He is also a researcher at Masakhane and the Laboratory of Artificial Intelligence and Decision Support, Portugal. His research focuses on NLP for low-resource African languages. His open-source community, HausaNLP, has connections to many local groups and seasoned researchers on the Hausa language.
  5. Professor Gloria Monica Tobechukwu Emezue, commonly known as G.M.T. Emezue, is a professor of English at the Alex Ekwueme Federal University, Nigeria (AE-FUNAI). As a literary critic and linguist, her major research interests include post-colonial studies and the interfaces between digital technologies, human languages and literature. She pioneered the Igbo Village project as well as the Igbo Day Cultural festival at AE-FUNAI. The part of her present research that connects with artificial intelligence is the Jidenka Machine Modelling project (accepted at the MLCD Workshop at NeurIPS 2021), which she and other scholars from around the world have undertaken in order to develop an ML model that can create African literature.

Abstract

Conversational AI and dialogue-system tools have become ubiquitous and are very useful for many practical applications, for example planning travel, communicating with medical chatbots, and basic household activities like setting an alarm or switching a light bulb on and off.

However, these tools are only available for high-resource languages like English or French, because the datasets needed to power these technologies are lacking for many low-resource languages, especially African languages.

Two important tasks needed to power conversational AI systems are intent detection and slot filling, which the dialogue manager requires in order to understand and respond to users’ requests.
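To make these two tasks concrete, the sketch below shows a single annotated training example; the utterance, intent label and slot names are hypothetical illustrations, not drawn from any existing dataset:

```python
# One conversational-AI training example annotated for both tasks.
# Intent detection assigns a label to the whole utterance; slot filling
# labels the spans the dialogue manager needs in order to act on the request.
example = {
    "utterance": "set an alarm for 7 am tomorrow",
    "intent": "set_alarm",                          # intent-detection label
    "slots": {"time": "7 am", "date": "tomorrow"},  # slot-filling labels
}

# Slot filling is commonly cast as BIO token tagging:
tokens = example["utterance"].split()
bio_tags = ["O", "O", "O", "O", "B-time", "I-time", "B-date"]
assert len(tokens) == len(bio_tags)  # one tag per token
```

An intent classifier predicts the `intent` field from the utterance, while a sequence labeller predicts one BIO tag per token; both outputs together let the dialogue manager decide what to do next.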

In this project, we intend to create conversational AI datasets for intent detection and slot-filling tasks needed by voice assistants like Amazon Alexa and Google Home.

In parallel, we intend to expand the benchmark datasets available for African languages to cover more linguistically oriented tasks like commonsense reasoning and natural language inference, since these are popular tasks (in multilingual NLU benchmark datasets) needed to develop multilingual pre-trained language models for African languages.

Personnel

  1. David Adelani (Principal Investigator) is a PhD student in computer science at Saarland University, Germany. He led the development of MasakhaNER (Adelani et al. 2021) – a named entity recognition dataset for 10 African languages. The dataset is being expanded to 20 languages with support from Lacuna Fund.
  2. Andiswa Bukula (Co-Investigator) is a Digital Humanities researcher at the South African Centre for Digital Language Resources, with a speciality in isiXhosa. She was also an assistant lecturer for isiXhosa at the Nelson Mandela Metropolitan University. Andiswa is a PhD candidate at Rhodes University and is focusing her research on the influence of language technologies on the effectiveness of multilingualism in Higher Education.
  3. Annie En-Shiun Lee (Collaborator) is an assistant professor (teaching stream) at the Computer Science Department at the University of Toronto. She received her PhD from the University of Waterloo and has been a visiting researcher at the Fields Institute and Chinese University of Hong Kong as well as a research scientist in industry. Her research focuses on finding patterns in society and in nature. More specifically, she is interested in exploring data for discovering patterns and their structures in order to uncover the underlying knowledge.

Abstract

The output quality of machine translation systems depends on in-domain training data, which is not always available for all language pairs.

For instance, several African languages lack not only in-domain parallel data but also generic parallel data of decent quality, and this hinders the development of domain-specific neural machine translation (NMT) models capable of generating high-quality translations.

This is despite the fact that 30% of the world’s languages are spoken in Africa. Hence, the goal of this project is to create a corpus of 10,000 parallel sentences in two different domains for machine translation (MT) for five African languages. We hope to extend this work to other African languages and domains in the future.

Personnel

  1. Jesujoba Alabi (Co-Investigator) is a research engineer at the ALMAnaCH project-team, Inria Paris, where he is working on domain adaptation in neural machine translation.
  2. David Ifeoluwa Adelani (Co-Investigator) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. He led the development of MasakhaNER (Adelani et al. 2021) – a named entity recognition dataset for 10 African languages. He is currently leading the expansion of the dataset to 20 languages through a grant supported by Lacuna Fund.
  3. Andrew Caines (Collaborator) is Senior Research Associate in the NLIP Group & ALTA Institute directed by Prof Paula Buttery, based in the Computer Laboratory at the University of Cambridge, U.K. He recently led a project on the machine translation of public health information from English into Swahili, Igbo and Yoruba.

Introduction

When it comes to scientific communication and education, language matters. The ability of science to be discussed in local indigenous languages not only has the ability to reach more people who do not speak English or French as a first language, but also has the ability to integrate the facts and methods of science into cultures that have been denied it in the past. As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.”

During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widespread languages in their country. ∀ et al. (2020) demonstrated that machine translation tools failed to translate COVID-19 surveys, since the only data available to train the models was religious data. Furthermore, they noted that scientific words did not exist in the respective African languages.

Thus, we propose to build a multilingual parallel corpus of African research, by translating African preprint research papers released on AfricArxiv into 6 diverse African languages.

Proposed Dataset and Use Cases

When it comes to scientific communication, language matters. Jantjies (2016) demonstrates how language matters in STEM education: students perform better when taught mathematics in their home language. Language matters, in scientific communication, in how it can dehumanise the people it chooses to study. Robyn Humphreys, at the #LanguageMatters seminar at UCT Heritage 2020, noted the following: “During the continent’s colonial past, language – including scientific language – was used to control and subjugate and justify marginalisation and invasive research practices”.

Science being discussed in local indigenous languages not only has the ability to reach more people who do not speak English as a first language; it also has the ability to integrate the facts and methods of science into cultures that have been denied it in the past.

As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.” (Prah, Kwesi Kwaa, 2007). When science becomes “foreign” or something non-African, when one has to assume another identity just to theorize and practice science, it is a subjugation of the mind: mental colonization.

There is a substantial amount of distrust in science, in particular by many black South Africans who can cite many examples of how it has been abused for oppression in the past. In addition, the communication and education of science was weaponized by the oppressive apartheid government in South Africa, and that has left many seeds of distrust in citizens who only experience science being discussed in English.

Through government-funded efforts, European-derived languages such as Afrikaans, English, French and Portuguese have been used as vessels of science, but African indigenous languages have not been given the same treatment. Modern digital tools like machine learning offer new, low-cost opportunities for scientific terms and ideas to be communicated in African indigenous languages.
During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widespread languages in their country. ∀ et al. (2020) demonstrated the difficulty of translating COVID-19 surveys, since the only data available to train the models was religious data. Furthermore, they noted that scientific words did not exist in the respective African languages.

Use cases:

  • A machine translation tool for AfricArxiv to aid translation of their research to and from African languages
  • Terminology developed will be submitted to respective boards for addition to official language glossaries for further improvements to scientific communication
  • A machine translation tool for African universities to ensure accessibility of their publications
  • A machine translation tool for scientific journalists to assist in widely distributing their work on the African continent
  • A machine translation tool to aid translation of impactful STEM university curricula into African languages

Personnel

Jade Abbott is a Staff Engineer at Retro Rabbit South Africa, working primarily in NLP, with an MSc in Computer Science from the University of Pretoria. She is a thought leader in the space of NLP in production and African NLP (especially machine translation) and has published and spoken at numerous conferences across the world, including the Deep Learning Indaba, ICLR 2020, and the UN World Data Forum. In 2019, she co-founded Masakhane – an initiative to spur NLP research in Africa, which she leads, whose members have collectively published over 15 works in the past year and are leading the conversation around geographic and language diversity in NLP in Africa.

Dr. Johanna Havemann is a trainer and consultant in [Open] Science Communication and [digital] Science Project Management, and works with AfricArxiv. Her work experience covers NGOs, a science startup and international institutions including the UN Environment Programme. With a focus on digital tools for science and her label Access 2 Perspectives, she aims at strengthening global science communication in general, and with a regional focus on Africa, through Open Science. For the past two years, she has placed an additional focus on language diversity in science, and the pan-African Open Access portal she helps coordinate provides information and accepts submissions in 12 official African languages.

Sibusiso Biyela has been a science communicator at ScienceLink since 2016, where he has worked with South African universities and international research institutions to produce science communication content for many audiences, including policymakers, the research community and the lay public. He is a thought leader on the decolonisation of science and science communication, has given talks on the topic at international conferences, and has contributed to discussions on platforms such as national radio and international podcasts. He is the author of the widely regarded article “Decolonizing Science Writing in South Africa”, in which he is vocal about creating scientific terms in the isiZulu language.

Introduction

Kenyan author Ngugi wa Thiong’o, in his book Decolonising the Mind, states: “The effect of a cultural bomb is to annihilate a people’s belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.” When a technology treats something as simple and fundamental as your name as an error, it robs you of your personhood and reinforces the colonial narrative that you are other.

Named entity recognition (NER) is a core NLP task in information extraction, and NER systems are a requirement for numerous products, from spell-checkers to localized voice and dialogue systems and conversational agents, that need to identify African names, places and people for information retrieval.

Currently, the majority of existing NER datasets for African languages come from WikiNER; they are automatically annotated and very noisy, since the text quality for African languages is not verified. Only a few African languages have human-annotated NER datasets. To our knowledge, the only open-source part-of-speech (POS) datasets that exist cover a small subset of South African languages, along with Yoruba, Naija, Wolof and Bambara (Universal Dependencies).

Pre-trained language models such as BERT and XLM-RoBERTa are producing state-of-the-art NLP results that would undoubtedly benefit African NLP. Beyond its direct uses, NER is also a popular benchmark for evaluating such language models. For the above reasons, we have chosen to develop a broad POS and NER corpus for 20 African languages based on news data.

Personnel

Peter Nabende is a Lecturer at the Department of Information Systems, School of Computing and Informatics Technology, College of Computing and Information Sciences, Makerere University. He has a PhD in Computational Linguistics from the University of Groningen, The Netherlands. He has conducted research on named entities across several writing systems and languages in the NLP subtasks of transliteration detection and generation. He has also conducted experimental research on machine translation between three low-resource indigenous Ugandan languages (Luganda, Acholi and Lumasaaba) and English, using statistical and neural machine translation methods and tools such as Moses and OpenNMT-py. He has supervised the creation of language technology resources involving another three Ugandan languages (a Lusoga–English parallel corpus and Grammatical Framework (GF)-based computational grammar resources for Runyankore-Rukiga and Runyoro-Rutooro).

Jonathan Mukiibi is a Master’s student in Computer Science at Makerere University. His current research focuses on topic classification of speech documents for crop disease surveillance using Luganda-language radio data. He coordinates natural language processing tasks at the Artificial Intelligence Lab, Department of Computer Science, Makerere University.

David Ifeoluwa Adelani (an NLP Researcher, https://dadelani.github.io/) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages. He was involved in the creation of the first NER dataset for Hausa [Hedderich et al., 2020] and Yoruba [Alabi et al., 2020] in the news domain.

Daniel D’souza has an MS in Computer Science (specialization in Natural Language Processing) from the University of Michigan, Ann Arbor. He currently works as a Data Scientist at ProQuest LLC.

Jade Abbott has an MSc in Computer Science from the University of Pretoria. She is a Machine Learning lead at Retro Rabbit South Africa, working primarily in NLP. Additionally, she co-founded Masakhane – an initiative to spur NLP research in Africa – and has published widely on African NLP tasks.

Olajide Ishola has an MA in Computational Linguistics. He is one of the pioneers of the first dependency treebank for the Yoruba language [Ishola et. al, 2020]. His interest lies in corpus development and NLP for indigenous Nigerian languages.

Constantine Lignos is an Assistant Professor in the Department of Computer Science at Brandeis University where he directs the Broadening Linguistic Technologies lab. He received his PhD from the University of Pennsylvania in 2013. His research focus is the construction of human language technology for previously-underserved languages. He has worked on named entity annotation and system creation for Tigrinya and Oromo, and additionally developed entity recognition systems for Amharic, Hausa, Somali, Swahili, and Yoruba. He has also worked on natural language processing tasks for other African languages, including cross-language information retrieval for Somali and information extraction for Nigerian English.