Abstract

The quality of machine translation output depends on in-domain training data, which does not exist for all language pairs.

For instance, several African languages lack not only in-domain parallel data but also generic parallel data of decent quality, and this hinders the development of domain-specific neural machine translation (NMT) models capable of generating high-quality translations.

This is despite the fact that 30% of the world’s languages are spoken in Africa. The goal of this project is therefore to create a corpus of 10,000 parallel sentences across two domains for machine translation (MT) in five African languages. We hope to extend this work to other African languages and domains in the future.

Personnel

  1. Jesujoba Alabi (Co-Investigator) is a research engineer at the ALMAnaCH project-team, Inria Paris, where he is working on domain adaptation in neural machine translation.
  2. David Ifeoluwa Adelani (Co-Investigator) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. He led the development of MasakhaNER (Adelani et al. 2021), a named entity recognition dataset for 10 African languages, and is currently leading the expansion of the dataset to 20 languages through a grant supported by Lacuna.
  3. Andrew Caines (Collaborator) is a Senior Research Associate in the NLIP Group & ALTA Institute, directed by Prof. Paula Buttery, based in the Computer Laboratory at the University of Cambridge, U.K. He recently led a project on the machine translation of public health information from English into Swahili, Igbo, and Yoruba.