Abstract
The quality of machine translation output depends on in-domain training data, which is not available for all language pairs.
For instance, several African languages lack not only in-domain parallel data but also generic parallel data of decent quality, and this hinders the development of domain-specific neural machine translation (NMT) models capable of generating high-quality translations.
This is despite the fact that 30% of the world’s languages are spoken in Africa. Hence, the goal of this project is to create a corpus of 10,000 parallel sentences in two different domains for machine translation (MT) for five African languages. We hope to extend this work to other African languages and domains in the future.
Personnel
- Jesujoba Alabi (Co-Investigator) is a research engineer at the ALMAnaCH project-team, Inria Paris, where he is working on domain adaptation in neural machine translation.
- David Ifeoluwa Adelani (Co-Investigator) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. He led the development of MasakhaNER (Adelani et al. 2021), a named entity recognition dataset for 10 African languages. He is currently leading the expansion of the dataset to 20 languages through a grant supported by Lacuna.
- Andrew Caines (Collaborator) is a Senior Research Associate in the NLIP Group & ALTA Institute directed by Prof Paula Buttery, based in the Computer Laboratory at the University of Cambridge, U.K. He recently led a project on the machine translation of public health information from English into Swahili, Igbo and Yoruba.