Introduction

Kenyan author Ngugi wa Thiong’o, in his book Decolonising the Mind, states: “The effect of a cultural bomb is to annihilate a people’s belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.” When a technology treats something as simple and fundamental as your name as an error, it in turn robs you of your personhood and reinforces the colonial narrative that you are other.

Named entity recognition (NER) is a core NLP task in information extraction, and NER systems are a requirement for numerous products, from spell-checkers to localized voice and dialogue systems and conversational agents, that need to identify African names, places and people for information retrieval.

Currently, the majority of existing NER datasets for African languages come from WikiNER. These are automatically annotated and very noisy, since the text quality for African languages is not verified. Only a few African languages have human-annotated NER datasets. To our knowledge, the only open-source part-of-speech (POS) datasets that exist cover a small subset of the languages of South Africa, plus Yoruba, Naija, Wolof and Bambara (Universal Dependencies).

Pre-trained language models such as BERT and XLM-RoBERTa are producing state-of-the-art NLP results, which would undoubtedly benefit African NLP. Beyond its direct uses, NER is also a popular benchmark for evaluating such language models. For these reasons, we have chosen to develop a POS and NER corpus covering 20 African languages, based on news data.

Personnel

Peter Nabende is a Lecturer at the Department of Information Systems, School of Computing and Informatics Technology, College of Computing and Information Sciences, Makerere University. He has a PhD in Computational Linguistics from the University of Groningen, The Netherlands. He has conducted research on named entities across several writing systems and languages in the NLP subtasks of transliteration detection and generation. He has also conducted experimental research on machine translation between three low-resource indigenous Ugandan languages (Luganda, Acholi, and Lumasaaba) and English, using statistical and neural machine translation methods and tools such as Moses and OpenNMT-py. He has supervised the creation of language technology resources involving another three Ugandan languages (a Lusoga-English parallel corpus and Grammatical Framework (GF)-based computational grammar resources for Runyankore-Rukiga and Runyoro-Rutooro).

Jonathan Mukiibi is a Master's student in Computer Science at Makerere University. His current research focuses on topic classification of speech documents for crop disease surveillance using Luganda-language radio data. He is the coordinator of natural language processing tasks at the Artificial Intelligence Lab, Department of Computer Science, Makerere University.

David Ifeoluwa Adelani (an NLP researcher, https://dadelani.github.io/) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages. He was involved in the creation of the first NER datasets for Hausa [Hedderich et al., 2020] and Yoruba [Alabi et al., 2020] in the news domain.

Daniel D’souza has an MS in Computer Science (Specialization in Natural Language Processing) from the University of Michigan, Ann Arbor. He currently works as a Data Scientist at ProQuest LLC.

Jade Abbott has an MSc in Computer Science from the University of Pretoria. She is a Machine Learning lead at Retro Rabbit South Africa, working primarily in NLP. Additionally, she co-founded Masakhane, an initiative to spur NLP research in Africa, and has published widely on African NLP tasks.

Olajide Ishola has an MA in Computational Linguistics. He is one of the pioneers of the first dependency treebank for the Yoruba language [Ishola et al., 2020]. His interests lie in corpus development and NLP for indigenous Nigerian languages.

Constantine Lignos is an Assistant Professor in the Department of Computer Science at Brandeis University, where he directs the Broadening Linguistic Technologies lab. He received his PhD from the University of Pennsylvania in 2013. His research focuses on the construction of human language technology for previously underserved languages. He has worked on named entity annotation and system creation for Tigrinya and Oromo, and has additionally developed entity recognition systems for Amharic, Hausa, Somali, Swahili, and Yoruba. He has also worked on natural language processing tasks for other African languages, including cross-language information retrieval for Somali and information extraction for Nigerian English.

Named Entity Recognition and Parts of Speech Datasets for African Languages

Knowledge 4 All Foundation Ltd.