Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular Machine Translation of Fongbe.

Language profile: Fongbe


Overview

Fon, or fɔ̀ngbè, is a low-resource language that is part of the Eastern Gbe language cluster and belongs to the Volta–Niger branch of the Niger–Congo languages. Fongbe is spoken in Nigeria and Togo, and mainly in Benin, by approximately 4.1 million speakers. Like the other Gbe languages, Fongbe is an analytic language with a basic SVO word order. It is also a tonal language written with diacritics, which makes it difficult to study. [1]

The standardized Fongbe language is part of the Fongbe cluster of languages inside the Eastern Gbe languages. That cluster includes other languages such as Goun, Maxi, Weme and Kpase, which share a lot of vocabulary with Fongbe. Standard Fongbe is the primary target of language planning efforts in Benin, although separate efforts exist for Goun, Gen and other languages of the country. To date, there are about 53 different dialects of the Fon language spoken throughout Benin.

Pertinence

Fongbe holds a special place in the socio-economic scene in Benin. It is the most used language in markets, health care centers, social gatherings, churches, banks, etc. Most of the ads and some programs on national television are in Fongbe. French used to be the only language of education in Benin, but in the second decade of the twenty-first century the government began experimenting with teaching some subjects in Benin's schools in the country's local languages, among them Fongbe.

Example of Fongbe Text:

Fongbe: Mǐ kplɔ́n bo xlɛ́ ɖɔ mǐ yí wǎn nú mɛ ɖevo lɛ

English: We have learned to show love to others [3]

Existing Work

Some previous work has been done on the language. There are doctoral theses, books, French–Fongbe and Fongbe–French dictionaries, blogs and other sources of written Fongbe.

Researcher Profile: Kevin Degila

Kevin is a Machine Learning Research Engineer at Konta, an AI startup based in Casablanca. He holds an engineering degree in Big Data and AI and is currently enrolled in a PhD program focused on business document understanding at Chouaib Doukkali University. In his day-to-day activities, Kevin trains, deploys and monitors machine learning models in production. With his friends, he leads TakwimuLab, an organisation working on training the next generation of young, French-speaking, West African talent in AI and on solving real-life problems with their AI skills. In his spare time, Kevin also creates programming and AI educational content on YouTube and plays video games.

Researcher Profile: Momboladji Balogoun

Momboladji BALOGOUN is the Data Analyst of Gozem, a company providing ride-hailing and other services in West and Central Africa. He is a former Data Scientist at Rintio, an IT startup based in Benin that uses data and AI to create business solutions for other enterprises. Momboladji holds an M.Sc. degree in Applied Statistics from the ICMPA UNESCO Chair, Cotonou, and migrated to the data science field after attending a regional Big Data Bootcamp in his home country, Benin. He aims to pursue a Ph.D. on speech-to-speech translation for low-resource languages. Bola created Takwimu LAB in August 2019 and currently leads it with 3 other friends to promote data science in their countries, as well as the creation and use of AI to solve real-life problems in their communities. His hobbies are reading, documentaries and tourism.

Researcher Profile: Godson Kalipe

Godson started in the IT field with software engineering, specializing in mobile applications. After his bachelor's degree in 2015, he worked for a year as a web and mobile application developer before joining a master's program in Big Data Analytics in India. His master's thesis was a comparative analysis of the impact of international news on economic indicators of African countries, using news data, Google Cloud storage and visualization assets. After his master's, in 2019, he gained his first experience as a Data Engineer, creating data ingestion pipelines for real-time sensor data at Activa Inc, India. In parallel, he has been working with Takwimu Lab on various projects aimed at bringing AI-powered solutions to common African problems and making the field more popular in the Francophone West African industry.

Researcher Profile: Jamiil Toure

Jamiil graduated as a design engineer in electrical engineering from École Polytechnique d'Abomey-Calavi (EPAC), Benin, in 2015, and holds a master's degree in mathematical sciences from the African Institute for Mathematical Sciences (AIMS) Senegal, obtained in 2018. Passionate about languages and Natural Language Processing (NLP), he contributes to the Masakhane project by working on the creation of a dataset for the Dendi language.

Meanwhile, he complements his education in NLP via online courses, events and conferences, in preparation for a future research career in NLP. With his friends at Takwimu Lab, he works to create active learning and working environments that foster the application and use of AI to tackle real-life problems. Currently, Jamiil is a consultant in Big Data at Cepei, a think tank based in Bogotá that promotes dialogue, debate, knowledge and multi-stakeholder participation in global agendas and sustainable development.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular Machine Translation of Yoruba.

Language profile: Yoruba


Overview

The Yorùbá language is the third most spoken language in Africa, and is native to south-western Nigeria and the Republic of Benin in West Africa. It is one of the national languages in Nigeria, Benin and Togo, and it is also spoken in other countries like Ghana, Côte d'Ivoire, Sierra Leone, Cuba and Brazil, and by a significant Yorùbá diaspora population in the US and United Kingdom, mostly of Nigerian ancestry. The language belongs to the Niger–Congo family, and is spoken by over 40 million native speakers [1].

Yorùbá has several dialects, but the written language was standardized by the 1974 Joint Consultative Committee on Education [2]. It has 25 letters, excluding the Latin characters c, q, v, x and z, and including the additional characters ẹ, gb, ṣ and ọ. There are 18 consonants (b, d, f, g, gb, j, k, l, m, n, p, r, s, ṣ, t, w, y) and 7 oral vowels (a, e, ẹ, i, o, ọ, u). Yorùbá is a tonal language with three tones: low, mid and high.

These tones are represented by the grave (“\”), optional macron (“–”) and acute (“/”) accents respectively. The tones are applied to vowels and syllabic nasals, but the mid tone is usually ignored in writing. The tones are represented in written text along with a modified Latin alphabet. A few letters have underdots (i.e. “ẹ”, “ọ”, and “ṣ”); we refer to the tonal marks and underdots as diacritics. It is important to note that tone information is needed for correct pronunciation and to recover the meaning of a word [2, 3].

As noted in [4], most of the Yorùbá texts found in websites or public domain repositories either use the correct Yorùbá orthography or replace diacriticized characters with un-diacriticized ones.

Oftentimes, articles written online, including news articles from outlets like the BBC and VON, ignore diacritics. Ignoring diacritics makes it difficult to identify or pronounce words unless they appear in context. For example, owó (money), ọwọ̀ (broom), òwò (business), ọ̀wọ̀ (honour), ọwọ́ (hand), and ọ̀wọ́ (group) will all be mapped to owo without diacritics.
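To make the collision concrete, here is a minimal Python sketch (using the six words from the example above) that strips combining marks, i.e. tones and underdots, and shows all six words collapsing to the same string:

import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove combining marks (tonal accents and underdots) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["owó", "ọwọ̀", "òwò", "ọ̀wọ̀", "ọwọ́", "ọ̀wọ́"]
for w in words:
    print(w, "->", strip_diacritics(w))  # every word prints as 'owo'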

Existing work

The diacritics problem has greatly reduced the amount of available parallel text in Yorùbá that can be used for many NLP tasks like machine translation. This has led to research on automatically applying diacritics to Yorùbá texts [5, 6], but the problem has not been completely solved. We will divide the existing work on the Yorùbá language into four categories:

Automatic Diacritics Application

The main idea of an automatic diacritic application (ADA) model is to predict the correct diacritics of a word based on the context in which it appears. We can use a sequence-to-sequence deep learning model such as long short-term memory (LSTM) networks [7] to achieve this task.

The task is similar to a machine translation task, where we need to translate from a source language to a target language: ADA takes a source text that is non-diacriticized (e.g. “bi o tile je pe egbeegberun ti pada sile”) and outputs target text with diacritics (e.g. “bí ó tilẹ̀ jẹ́ pé ẹgbẹẹgbẹ̀rún ti padà síléé”). The first attempt to apply deep learning models to Yorùbá ADA was by Iroro Orife [5].

He proposed a soft-attention seq2seq model to automatically apply diacritics to Yorùbá texts; the model was trained on the Yorùbá Bible, the Lagos-NWU speech corpus and some language blogs. However, the model does not generalize to other domains like dialog conversation and news, because the majority of the training texts are from the Bible. Orife et al. [6] recently addressed the issue of domain mismatch by gathering texts from various sources like conversation interviews, short stories and proverbs, books, and JW300 Yorùbá texts, and they evaluated the performance of the model on the news domain (i.e. Global Voices articles) to measure domain generalization.
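To illustrate the encoder-decoder framing of ADA, here is a minimal character-level seq2seq sketch in PyTorch. It is a toy model only: it omits the soft attention of [5], any real training loop and data handling, and all vocabulary sizes and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class DiacriticRestorer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the undiacritized character sequence.
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # Decode the diacritized sequence conditioned on the encoder state
        # (teacher forcing with the gold target at training time).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # per-character logits over the target vocab

# Example shapes: batch of 2 sequences, 20 source chars, 22 target chars.
model = DiacriticRestorer(src_vocab=60, tgt_vocab=90)
logits = model(torch.randint(0, 60, (2, 20)), torch.randint(0, 90, (2, 22)))
print(logits.shape)  # torch.Size([2, 22, 90])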

Word Embeddings

Word embeddings are the primary features used for many downstream NLP tasks. Facebook released FastText [8] word embeddings for over 294 languages, but the quality of the embeddings is not very good. Recently, Alabi et al. [9] showed that Facebook's FastText embeddings for Yorùbá give low performance in word similarity tasks, which indicates that they would not work well for many downstream NLP tasks. They released better-quality FastText embeddings and contextualized BERT [10] embeddings obtained by fine-tuning multilingual BERT embeddings.
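As a hedged sketch of how such embeddings can be probed, the snippet below loads pre-trained Yorùbá FastText vectors with gensim and queries nearest neighbours. It assumes Facebook's common-crawl model file for Yorùbá (cc.yo.300.bin) has been downloaded locally; the file name and the example words are assumptions, not part of the studies cited above.

from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("cc.yo.300.bin")  # local path is an assumption
print(wv.most_similar("owó", topn=5))  # nearest neighbours of 'money'
print(wv.similarity("owó", "ọwọ́"))    # cosine similarity of two words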

Datasets for Supervised Learning Tasks

Yorùbá, like many other low-resourced languages, does not have many datasets for supervised learning tasks such as named entity recognition (NER), text classification and machine translation (parallel sentences). Alabi et al. [9] created a small NER dataset with 26K tokens. Through the support of AI4D and Zindi Africa, we have created a parallel English–Yorùbá dataset for machine translation and a news title classification dataset for Yorùbá from articles crawled from BBC Yorùbá. A summary of the AI4D dataset creation competition is in [11].

Machine Translation

Commercial machine translation models like Google Translate exist between Yorùbá and other languages, but the quality is not very good because of the diacritics problem and the small amount of data available to train a good neural machine translation (NMT) model. JW300 [12], based on Jehovah's Witnesses publications, is another popular dataset for training NMT models for low-resource African languages; it has over 10 million tokens of Yorùbá text. However, NMT models trained on JW300 do not generalize to other, non-religious domains. There is a need to create more multi-domain parallel datasets for the Yorùbá language.
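As a hedged illustration of the current baseline, the snippet below translates one sentence with an off-the-shelf OPUS-MT checkpoint, which was trained largely on JW300 and therefore carries the domain caveats just described. The checkpoint name is an assumption about what is published on the Hugging Face hub, not a recommendation.

from transformers import pipeline

# Model name is an assumption; swap in whichever English-Yorùbá checkpoint is available.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-yo")
print(translator("We have learned to show love to others")[0]["translation_text"])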

Researcher Profile: David Adelani

David Ifeoluwa Adelani is a doctoral student in computer science at Spoken Language Systems Group, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialog systems and online social interactions.

He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with special focus on African languages. He has published a few papers in top Web technology, language and speech conferences including The Web Conference, LREC, and Interspeech.

During his graduate studies, he conducted research on social computing at the Max Planck Institute for Software Systems, Germany, and on fake review detection at the National Institute of Informatics, Tokyo, Japan. He holds an MSc in Computer Science from the African University of Science and Technology, Abuja, Nigeria, and a BSc in Computer Science from the University of Agriculture, Abeokuta, Nigeria.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular Document Classification datasets of Chichewa.

Language profile: Chichewa


What is Chichewa?

Chichewa is part of the Niger-Congo Bantu group and it is one of the most spoken indigenous languages of Africa. Chichewa is both an individual dialect and a language group as we shall discuss in this short article.

The language, Chichewa, also written as Cichewa, or, in Zambia, Cewa, is the native language of the Chewa. The word ‘chi’ or ‘ci’ is a Bantu prefix used for the tribal name, designating the language rather than the geographical region of the tribe. The word Chewa is the name of a group of people. Chichewa is called Chinyanja, for example in Zambia and Mozambique. Chinyanja was also the old name for the language in Malawi, before the country became a Republic. During that time, as a British Protectorate, Malawi was called Nyasaland.

Chichewa, with the code ‘ny’, is also one of the 13 African languages with Google automatic translation. The code ‘ny’ was most likely chosen because the language was first known as Chinyanja. This probably reflects the availability of written text in Chichewa compared to other African languages. However, as we will discuss in this article, there are several dialects of Chichewa which differ from each other in noticeable ways. I do not know whether this was taken into account for the text used in Google's machine translation models. But this is a whole new interesting topic in itself!

Who are the Chewa?

The Chewa are a Bantu-speaking people, traditionally described as the descendants of the Maravi, who in the 16th (some say the 14th) century migrated to present-day Malawi from the region now called Congo-Kinshasa. Most of what we know about the migrations of the Chewa comes from oral tradition. Samuel Nthara collected some of the oral traditions in his book Mbiri ya Achewa, published in 1944. The name Maravi first appeared in Portuguese documents in 1661.

Nowadays, some of the well-known districts in Malawi where the Chewa live are Mchinji, Lilongwe, Kasungu, Nkhotakota, Dowa and Dedza. The consensus is that the Chewa of the mainland kept their name as Chewa and lived mainly in the Central Region. The Manganja are the Chewa who settled in the Southern Region, and some Chewa groups who settled at the lake or around the Shire River in the south are called Nyanja. Man'ganja (or Maganja) is southern Chichewa, as opposed to the language spoken in the Central Region (which was also called Western Chichewa / Nyanja). There are phonetic, grammatical and vocabulary differences between these dialects.

Where is Chichewa spoken?

In Malawi, Chichewa is widely understood. It was declared the national language in 1968 and it is viewed as a symbol of national unity by diverse groups. In Mozambique it is spoken especially in the provinces of Tete and Niassa, where it is referred to as Chinyanja. In Zambia, it is spoken in Lusaka and in the Eastern Province (where the language is referred to as Nyanja). The language spoken in Lusaka is sometimes called town-Nyanja, as opposed to the Nyanja spoken in rural areas in other parts of Zambia, where it is referred to as deep-Nyanja. Nyanja is the language of the police and the army. In Zimbabwe, according to some estimates, Chichewa is the third most widely used language after Shona and Ndebele. There is a sizable community of descendants of those who migrated to this area from Nyasaland during colonial times to work in the mines.

Chichewa is also spoken in South Africa, where there are a significant number of migrants from Malawi who work in mining, as domestic workers or in other industries. There are radio services in Chichewa in Malawi, Zambia, South Africa and even in Ethiopia.

How many people speak the language?

According to sources quoted in Wikipedia, there are 12 million native speakers of Chichewa. A similar number is mentioned on the Joshua Project website and includes Chichewa speakers from 8 countries of the world. This number seems to refer to all the people who identify themselves as Chewa, Nyanja and Manganja, as these, according to the Malawi Population Census of 2018, make up about 40% of the population in Malawi. However, in Malawi, the large ethnic groups of Lomwe, Yao and Ngoni have over the course of time adopted Chichewa as their native language.

The number of people who understand and use Chichewa is thus much higher than the 12 million native speakers. Like Swahili, Chichewa is considered by some a universal language, a common skill enabling people of varying tribes and those living in Malawi, Zambia and Mozambique to communicate without following the strict grammar of specific local languages. In Zambia, many of those whose mother tongue is now Chinyanja have come to consider themselves Ngoni; Nyanja is a lingua franca, being spoken by the police and the administration.

The Need for Datasets in Chichewa

As discussed, seven important facts provide impetus to the initiative to develop datasets for Chichewa: (1) Chichewa is an important African language, (2) it is representative of the Niger-Congo Bantu group of languages, (3) it is widely spoken, (4) it has a considerable literature, more than other local African languages, (5) there are several methodological grammar and phonetics studies, (6) there are several translations from languages such as English, and (7) it is spoken by old and young alike.

There has been an interest in developing digital tools for language documentation and natural language processing. Such initiatives have come from researchers involved in linguistics, such as those belonging to linguistics departments at universities in Malawi and Zambia. For example, in Malawi, we found a Chichewa monolingual dictionary corpus containing about 13,000 nouns, as well as a phonetically annotated short corpus.

The comparative online Bantu dictionary at Berkeley includes a dataset for Chichewa; however, the project seems to have stalled in 1997. More recently, there has been an interest in creating datasets used in NLP tools and machine translation and, according to Professor Kishindo, there is a PhD candidate at the University of Malawi interested in working on machine translation for Chichewa.

From our investigation, we observe that these datasets or tools tend to be kept in the private domain, are not regularly maintained, or are used only once, and are not well documented. However, their existence is important and it shows that there is a desire and need for such tools.

Conclusions

Chichewa is an important African language. There are differences between the main dialects of Chichewa and the language is undergoing continuous change. Improved methods for discovering online content and digitizing text can open new opportunities for organising Chichewa text into useful corpora. These can then be useful in linguistic work, in building tools for manipulating and comparing text, for finding and visualising connections between texts and for improving machine translation.

Chichewa continues to change as new terms are added to the vocabulary arising from technological needs for example. Its use by the younger generation creates new idioms and meaning, and the creative expressions through poetry and literature find venues online. Looking at language in new and novel ways using technology, can also help engage with the new generation in how they use, view and develop their language.

In this short article, we looked at the use of Chichewa and why we think it is important to build data sets for this language. We hope that this will be motivating and inspiring to others who are interested in this language or other African languages. This article was written as the author embarked on an AI4D Language Dataset Fellowship for putting together a Chichewa dataset. This is a small but important initiative aimed at engaging with the Machine Learning generation on the African continent. I am honoured to be a small part in the building of such datasets.

Researcher Profile: Amelia Taylor

Amelia graduated with a PhD in Mathematical Logic from Heriot-Watt University in 2006, where she was part of the ULTRA group. After that she worked as a research assistant on a project with Heriot-Watt University and the Royal Observatory in Edinburgh, aiming to develop an intelligent query language for astronomical data. From 2006 to 2013, Amelia also worked in finance in the City of London and Edinburgh, where she built risk models for asset allocation and liability-driven investments.

For the last 5 years, Amelia has been teaching programming and AI courses at the University of Malawi in the CIT and engineering departments. Amelia also teaches research methodology and supervises MSc and PhD students. While her first interest in AI as an undergraduate was in the field of Natural Language Processing and intelligent query systems, she is now also interested in the use of technology and AI for solving real-world problems.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular a Text-to-Speech dataset of Wolof.

Language profile: Wolof


Overview

Wolof /ˈwoʊlɒf/[4] is a language of Senegal, the Gambia and Mauritania, and the native language of the Wolof people. Like the neighbouring languages Serer and Fula, it belongs to the Senegambian branch of the Niger–Congo language family. Unlike most other languages of the Niger-Congo family, Wolof is not a tonal language.[1]

Pertinence

Wolof is spoken by more than 10 million people and about 40 percent (approximately 5 million people) of Senegal’s population speak Wolof as their native language. Increased mobility, and especially the growth of the capital Dakar, created the need for a common language. Today, an additional 40 percent of the population speak Wolof as a second or acquired language. In the whole region from Dakar to Saint-Louis, and also west and southwest of Kaolack, Wolof is spoken by the vast majority of the people. Typically when various ethnic groups in Senegal come together in cities and towns, they speak Wolof. It is therefore spoken in almost every regional and departmental capital in Senegal.[1]

Nevertheless, in Senegal, while communication in schools and formally registered companies takes place in French, Wolof remains the most used language in critical settings such as:

  • market places
  • medical centers
  • apprenticeships for a wide array of occupations such as hairdressing, tailoring, engine repair, carpentry and agriculture, among other manual jobs
  • police stations
  • banking and telecommunications agencies
  • shops and restaurants

Existing work

The Senegalese government has created a linguistics department for Wolof and other local languages, to promote the use of Wolof in environments like schools and to support the translation of books into Wolof, but a lot of work remains before Wolof is used in official documents and schools. There also exist some French–Wolof dictionaries. In the academic world, some work has been done to better understand Wolof phonemes [2] and parts of speech [3], on automatic translation of Wolof to French [4], and on automatic speech recognition by a startup called BAAMTU.

Researcher Profile: Thierno Diop

Thierno Ibrahima Diop is a computer science engineer. He is lead data scientist at Baamtu and passionate about NLP and everything that revolves around machine learning. He has been mentoring data science students and apprentices for two years.

Before getting into data science, he did a lot of freelancing in the development of web and mobile applications for local and international clients. He is co-founder of GalsenAI, an artificial intelligence community in Senegal; he is also a Zindi ambassador in Senegal and co-organizer of GDG Dakar.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular Document Classification datasets of Kiswahili.

Language profile: Kiswahili


Overview

Swahili (also known as Kiswahili) is one of the most spoken languages in Africa. It is spoken by 100–150 million people across East Africa. Swahili is spoken in countries such as Tanzania, Kenya, Uganda, Rwanda and Burundi, as well as in parts of Malawi, Somalia, Zambia, Mozambique and the Democratic Republic of the Congo (DRC).

Pertinence

In Tanzania, Swahili is the official language and the main communication medium for economic, social and government activities across the country, and it is the official language of instruction in all schools [1].

Swahili is popularly used as a second language by people across the African continent and is taught in schools and universities, given its presence within the continent and outside. Swahili has been influenced by Arabic and was even written in an Arabic script during its early years.

Swahili is also one of the working languages of the African Union and is officially recognized as a lingua franca of the East African Community. In 2018, South Africa legalized the teaching of Swahili in South African schools as an optional subject beginning in 2020. The Southern African Development Community (SADC) has also recognized Swahili as an official language.

Existing work

In Tanzania, Baraza la Kiswahili la Taifa (National Swahili Council, abbreviated as BAKITA) [2] is a Tanzanian institution responsible for regulating and promoting the Kiswahili language. Key activities mandated for the organization include creating a healthy atmosphere for the development of Kiswahili, encouraging use of the language in government and business functions, coordinating the activities of other organizations involved with Kiswahili, and standardizing the language.

BAKITA cooperates with organizations like TATAKI [3] in the creation, standardization and dissemination of specialized terminologies. Other institutions can propose new vocabulary to respond to emerging needs, but only BAKITA can approve usage. BAKITA also coordinates its activities with similar bodies in Kenya and Uganda to aid in the development of Kiswahili.

There exist different English–Swahili dictionaries online from the elimuyetu website [4], Swahili–English dictionaries online from the africanlanguages website [5], and the mobile Swahili Dictionary [6] on the Android Play Store.

Researcher profile: Davis David

Davis graduated with a Bachelor's Degree in Computer Science from the University of Dodoma in 2017, where he was a co-organizer of the Python community during his time at university. After that, he worked as a Software Developer at TYD Innovation Incubator, developing different innovative systems to solve educational and economic challenges in Tanzania. Davis also worked as a Data Scientist at ParrotAI, developing different AI solutions focused on agriculture, health and finance.

He built computer vision models for classifying banana diseases from leaf images. For the last 4 years, Davis has been teaching machine learning and data science across different universities, tech communities and events, with a passion to build a community of data scientists in Tanzania to solve local problems.

He is also working with Zindi Africa as a Zindi ambassador and mentor in Tanzania. He organizes different machine learning hackathons across different cities in Tanzania and has mentored students and junior data scientists across Africa.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular a Sentiment Analysis dataset of Arabizi.

Language profile: Tunisian Arabizi


Overview

On social media, users tend to express themselves in their own local dialect. To do so, Tunisians use Tunisian Arabizi, which supplements the Latin script with numerals rather than using the Arabic alphabet. [7] mentioned that 81% of the Tunisian comments on Facebook used the Romanized alphabet.

In [8], a study conducted on 1.2M Tunisian social media comments (16M words and 1M unique words) showed that 53% of the comments used the Romanized alphabet, while 34% used the Arabic alphabet and 13% used script-switching.

The study also mentioned that 87% of the comments based on the Romanized alphabet are TUNIZI, while the rest are French and English. TUNIZI, our dataset, includes 100% Tunisian Arabizi sentences collected from people expressing themselves in their own local dialect using Latin characters and numerals. TUNIZI is a sentiment analysis Tunisian Arabizi dataset, collected, preprocessed and annotated.

Previous projects on Tunisian Dialect

In [1], a lexicon-based sentiment analysis system was used to classify the sentiment of Tunisian tweets. The author developed a Tunisian morphological analyzer to produce linguistic features and achieved an accuracy of 72.1% using the small-sized TAC dataset (800 Arabic-script tweets). [2] presented a supervised sentiment analysis system for Tunisian Arabic-script tweets.

With different bag-of-words schemes used as features, binary and multiclass classifications were conducted on a Tunisian Election dataset (TEC) of 3,043 positive/negative tweets combining MSA and Tunisian dialect.

The support vector machine achieved the best results for binary classification, with an accuracy of 71.09% and an F-measure of 63%. In [3], the doc2vec algorithm was used to produce document embeddings of Tunisian Arabic and Tunisian Romanized alphabet comments.

The generated embeddings were fed to train a Multi-Layer Perceptron (MLP) classifier, where both the achieved accuracy and F-measure values were 78% on the TSAC (Tunisian Sentiment Analysis Corpus) dataset.

This dataset combines 7,366 positive/negative Tunisian Arabic and Tunisian Romanized alphabet Facebook comments. The same dataset was used to evaluate Tunisian code-switching sentiment analysis in [5] using an LSTM-based RNN model, reaching an accuracy of 90%.
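As a minimal sketch of the doc2vec-plus-MLP pipeline of [3], the snippet below learns document embeddings with gensim and feeds them to a scikit-learn MLP classifier. The two-comment corpus and its labels are illustrative placeholders, not TSAC data, and the hyperparameters are arbitrary.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.neural_network import MLPClassifier

corpus = [("3asslema behi barsha", 1),   # illustrative positive Arabizi comment
          ("mauvais barsha", 0)]         # illustrative negative comment
docs = [TaggedDocument(text.split(), [i]) for i, (text, _) in enumerate(corpus)]

d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)  # document embeddings
X = [d2v.dv[i] for i in range(len(corpus))]
y = [label for _, label in corpus]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
print(clf.predict([d2v.infer_vector("behi barsha".split())]))  # expect [1]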

In [4], the authors conducted a study of the impact on Tunisian sentiment classification performance of combining it with other Arabic-based preprocessing tasks (named entity tagging, stopword removal, common emoji recognition, etc.).

A lexicon-based approach and a support vector machine model were used to evaluate performance on the above-mentioned datasets (TEC and TSAC).

In order to avoid the labor-intensive task of hand-crafting features, a syntax-ignorant n-gram embedding representation, composed and learned using an unordered composition function and a shallow neural model, was proposed in [6]. The proposed model, called Tw-StAR, was evaluated on predicting sentiment on five Arabic dialect datasets, including the TSAC dataset [3].

We observe that none of the existing Tunisian sentiment analysis studies focused on the Tunisian Romanized alphabet, which is the aim of this work.

Tunisian Arabizi vs Arabic Arabizi

The Tunisian dialect, also known as “Tounsi” or “Derja”, is different from Modern Standard Arabic. In fact, the Tunisian dialect features Arabic vocabulary spiced with words and phrases from Tamazight, French, Turkish, Italian and other languages [9]. Tunisia is recognized as a high-contact culture where online social networks play a key role in facilitating social communication [10]. To illustrate, some examples of Tunisian Arabizi words translated to MSA and English are presented in Table 1.

TUNIZI | MSA translation | English translation
3asslema | مرحبا | Hello
Chna7welek | كيف حالك | How are you
Sou2el | سؤال | Question
5dhit | أخذت | I took

Table 1: Examples of TUNIZI common words translated to MSA and English

Since some Arabic characters do not exist in the Latin alphabet, Tunisians use numerals and multigraphs, rather than diacritics on letters, when they write on social media. For instance, “ch” is used to represent the character ش.

An example is the word شرير (wicked) represented as “cherrir” in TUNIZI characters. After a few observations of the collected datasets, we noticed that the Arabizi used by Tunisians is slightly different from other informal Arabic dialects such as Egyptian Arabizi. This may be due to the linguistic situation specific to each country. In fact, Tunisians generally draw on a French background when writing in Arabizi, whereas Egyptians use English.

For example, the word مشيت would be written as “misheet” in Egyptian Arabizi, the second language being English. However, because Tunisians' second language is French, the same word would be written as “mchit”. Table 2 shows the numerals and multigraphs used to transcribe TUNIZI characters, which compensate for the absence of equivalent Latin characters for exclusively Arabic sounds.

They are presented with their corresponding Arabic characters and with the Arabizi characters used in other countries. For instance, the number 5 is used to represent the character خ in the same way as the multigraph “kh”.

For example, the word “5dhit” is the representation of the word أخذت, as shown in Table 1. The numerals and multigraphs used to represent TUNIZI differ from those used in other countries' Arabizi. As an example, the word غالية (expensive), written as “ghalia” or “8alia” in TUNIZI, corresponds to “4'alia” in Arabizi.

Arabic | Arabizi | TUNIZI
ح | 7 | 7
خ | 5 or 7’ | 5 or kh
ذ | d’ or dh | dh
ش | $ or sh | ch
ث | t’ or th or 4 | th
غ | 4’ | gh or 8
ع | 3 | 3
ق | 8 | 9

Table 2: Special Tunizi characters and their corresponding Arabic and Arabizi characters
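As a minimal sketch of how the correspondences in Table 2 can be applied, the rule-based transliterator below maps TUNIZI numerals and multigraphs to Arabic characters. Real Arabizi is ambiguous (vowels are often dropped, and one symbol can stand for several sounds), so a lookup table like this is only a first pass, not a full transliteration system.

# Correspondences taken from Table 2; Latin letters without an Arabic
# equivalent in the table are passed through unchanged.
TUNIZI_TO_ARABIC = {
    "7": "ح", "5": "خ", "kh": "خ", "dh": "ذ",
    "ch": "ش", "th": "ث", "gh": "غ", "8": "غ",
    "3": "ع", "9": "ق",
}

def transliterate(word: str) -> str:
    out, i = "", 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in TUNIZI_TO_ARABIC:   # prefer multigraphs like 'ch', 'kh'
            out += TUNIZI_TO_ARABIC[pair]
            i += 2
        else:
            out += TUNIZI_TO_ARABIC.get(word[i], word[i])
            i += 1
    return out

print(transliterate("5dhit"))  # '5' -> خ and 'dh' -> ذ; 'i' and 't' pass through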

Tunizi Uses

The TUNIZI dataset can be used for sentiment analysis projects dedicated to other underrepresented Maghrebi dialects, such as Libyan, Moroccan or Algerian, because of the similarities between the dialects. This dataset can also be used for other NLP projects, such as chatbots.

Tunizi in the industry

The TUNIZI dataset is used in all iCompass products that deal with the Tunisian dialect. TUNIZI is used in a sentiment analysis project dedicated to e-reputation, and also for Tunisian chatbots that are able to understand Tunisian Arabizi and reply using it.

Researcher Profile: Chayma Fourati

Chayma Fourati is an AI R&D Engineer at iCompass. She is a graduate in Software Engineering (June 2020) from the Mediterranean Institute of Technology in Tunisia. She did her final-year project at iCompass, where she participated in most of the R&D projects. She was invited as a speaker at a webinar during the Covid-19 crisis in March 2020 to talk about African IT solutions in fighting Covid-19 through the latest AI technologies.

During her last academic years, in both internships and university classes, she developed her skills in the AI field, and at iCompass, in the NLP field. During her final-year internship at iCompass, she published a paper with two teammates at iCompass in an ICLR 2020 workshop. Her current research interests include Natural Language Processing, Neural Networks and Deep Learning.

Researcher Profile: Hatem Haddad

Hatem Haddad is Co-Founder, CTO and R&D director of iCompass. He received a doctorate in Computer Science (2002) from University Grenoble Alpes, France. He held assistant professor positions at Grenoble Alpes University (France), NTNU (Norway), UAEU (UAE), Sousse University (Tunisia), Mevlana University (Turkey) and ULB (Belgium). He worked in industrial R&D at the VTT Technical Research Centre of Finland and at the Image Processing and Applications Lab of the Institute for Infocomm Research, Singapore.

He was an invited researcher at Leibniz-Fachhochschule School of Business (Germany) and the Polytechnic Institute of Coimbra (Portugal). His current research interests include Natural Language Processing, Machine Learning and Deep Learning. He is author or co-author of more than 50 papers published in peer-reviewed international journals and conferences, and a frequent reviewer for international journals, conferences and R&D projects.

Researcher Profile: Malek Naski

Malek Naski is currently a summer intern at iCompass. She will graduate in June 2021 as a software engineer from the National School of Engineering of Tunis (ENIT). Previously, she did her academic end-of-year project for 2019/2020 at iCompass, working on sentiment analysis and classification for the Tunisian dialect using state-of-the-art NLP methods and technology. She is now focusing on natural language processing and natural language understanding, and her current research interests include sentiment analysis and conversational agents.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. The final status of the Abyei area is not yet determined.

Motivation

In Africa, English, French, Portuguese and Arabic are the typical languages of instruction as well as official communication. On the other hand, there are approximately 2,000 indigenous languages.

Over time, indigenous languages are being replaced even among people of the same community of origin. The situation is exacerbated by the advent of digital platforms which have made communication easier in English but tedious in other languages.

Natural language processing tools such as autocorrection and autocompletion, which have enhanced the usability of electronic communication in only a few languages, present obstacles for indigenous languages. The absence of these facilities causes frustration.

For example, the experience of typing in an indigenous language and having the autocorrect program replace words with English ones that are similar in spelling but completely different in meaning is common.

This reduction in the usefulness of indigenous languages puts them at risk. It is therefore necessary to develop digital resources for these languages to make them relevant in the digital age, and hence boost their use and preservation.

Objectives

  • To develop openly licensed, free-to-use African language corpora.
  • To set up a web-based platform for crowdsourcing stories in African languages.
  • To set up an African language short story competition on the platform and create awareness.
  • To collect a written corpus of African languages.
  • To provide openly accessible material for natural language processing research on African languages.
  • To develop digital resources for indigenous African languages, in particular spell checkers (Etoori, Chinnakotla, and Mamidi 2018; Monson et al. 2004) for desktop, mobile and web applications, for which computational resources from CHPC can be used for training deep learning models; a toy spell-checker sketch follows this list.
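As a minimal sketch of the corpus-driven spell checking the last objective describes, the snippet below generates candidate corrections by single-character edits and ranks them by corpus frequency, in the style of Norvig's well-known corrector. The word counts are illustrative Chichewa-flavoured placeholders, not a real corpus.

from collections import Counter

CORPUS = Counter("moni muli bwanji zikomo kwambiri moni moni".split())
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete/replace/insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    replaces = {L + c + R[1:] for L, R in splits if R for c in LETTERS}
    inserts = {L + c + R for L, R in splits for c in LETTERS}
    return deletes | replaces | inserts

def correct(word):
    # Keep only candidates attested in the corpus; fall back to the input.
    candidates = [w for w in edits1(word) if w in CORPUS] or [word]
    return max(candidates, key=CORPUS.get)

print(correct("mony"))  # -> 'moni'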

Long term vision

It is hoped that the competition will be successful and will provide a model that can be continued on an annual basis. It can provide a means to extend the available corpora, encourage literacy, create pride in indigenous cultures and improve cultural understanding between peoples in different language groups.

By appropriately licensing materials, they can also be used in creating good text-to-speech systems. For example, the Common Voice project records people reading openly licensed material in different languages so that the recordings can be used for training deep learning models to provide realistic, human-like text-to-speech systems. Such systems can be used in a variety of commercial applications, such as car navigation systems. The project will pilot the recording of short story titles to determine whether this crowdsourcing strategy can also be used for collecting a speech corpus.

In addition, by providing written materials suitable for school pupils, it should help in attaining the sustainable development goal of quality education for all by increasing literacy. Another sustainable development goal is peace and justice. Many wars in Africa occur between peoples who speak different African languages. By encouraging knowledge of more than one African language, one also creates better cultural understanding which should result in fewer conflicts.

The corpora should enable the use of machine learning methods to identify the language a text is written in, if it is one of the collected African languages, for example in order to route a query to the right language engine. In addition, by encouraging cooperation between computer scientists and people who study literature, it is hoped to collaboratively build spelling and grammar checkers, plagiarism checkers, stylometric analysis tools and deep-learning-enabled content generation software that are useful for African languages.
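As a hedged sketch of the language-identification use case, the snippet below trains a character n-gram classifier on labelled sentences. The four training sentences and labels are toy placeholders; a real system would be trained on the collected corpora.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = ["Habari za asubuhi", "Molo, unjani namhlanje",
             "Sannu da zuwa", "Habari ya jioni"]
labels = ["swahili", "xhosa", "hausa", "swahili"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-grams
    MultinomialNB(),
)
model.fit(sentences, labels)
print(model.predict(["Habari gani"]))  # expected: ['swahili']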

In the long term, it is hoped to stimulate the production of natural language processing tools and aids such as morphological analyzers, lemmatizers, tokenizers, parsers, parts-of-speech taggers, and sentiment analysis tools for African languages. The Natural Language Toolkit (NLTK) contains many of the above tools and is a free and open source library that enables automated analysis of English text.

This allows many researchers to perform natural language processing tasks, and many businesses to become more efficient by freeing expensive staff for high-value-added tasks, as computers can automate tasks such as answering common customer queries. For a few languages, such as Kinyarwanda, Kiswahili and Afrikaans, the development of these tools has begun, but much work remains to be done, and for many other African languages tool development has not yet begun. Such tools enable more effective machine translation.

This can greatly reduce the cost of producing document translations, which is especially useful for government communications, and make for more pleasant retail consumer experiences. The current state of the art for machine translation of low-resource languages uses 60,000 sentences for a language pair (Fraser et al. 2020). This pilot project will not be able to collect such a dataset, but will identify places in Africa where such datasets can be collected. A Bible translation is approximately 60,000 sentences, but the Bible is not a typical text, and there are many domain areas for which alternative and additional texts are required.

Personnel

  1. Prof. Audrey Mbogho: Research interests include applications of machine learning to developing world problems, including processing and preservation of low-resource languages.
  2. Dr. Lilian Wanzare: Research interest is Artificial Intelligence, in particular Natural Language Processing and building text processing tools for low-resource languages.
  3. Dr. Benson Muite: Research interests include high performance computing and big data analysis. He will be the principal project coordinator.
  4. Prof. Constantine Yuka: Research interests include African linguistics and literature.
  5. Mr. Juan Steyn: Research interests include digital learning and digital humanities.

Motivation

In recent years, Artificial Intelligence (AI) has made tremendous advances in identifying diseases from radiology images. Convolutional Neural Networks (CNNs), a class of deep learning algorithm trained on large volumes of labelled radiological images, have led these advances. Various results have shown that CNNs improve the speed, accuracy and consistency of diagnosis [Liu et al., 2017].

However, the adoption of deep learning diagnostic systems by healthcare practitioners is held back by two major challenges: 1) interpreting the prediction outputs of a deep learning network is not trivial, and 2) the privacy of patient data is not guaranteed when using online services that provide deep learning models.

These two challenges may be part of the reason why healthcare practitioners remain wary of AI-driven diagnostics [Ribeiro et al., 2016].

A medical practitioner cannot fully trust a CNN unless it can explain its logic, semantically or visually. Earlier machine learning methods are transparent in how they compute their predictions, but deep learning models are not.

Deep learning models automate hand-crafted feature engineering, so there is no direct knowledge of how the predictions are computed. Diagnosing with a CNN involves studying the image regions that contribute most to the prediction outputs at the pixel level. For interpretability, we expect the CNN to explain its logic at the object-part level.

Given an interpretable CNN, previous work reveals the distribution of object parts that are memorized by the CNN for object classification [Wang et al., 2020].
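As a hedged illustration of pixel-level attribution (not the object-part method of Wang et al. [2020], but the widely used Grad-CAM technique of Selvaraju et al., 2017), the PyTorch sketch below computes a class activation heat map; the untrained backbone and all sizes are illustrative assumptions.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # illustrative, untrained backbone
acts, grads = {}, {}
layer = model.layer4  # last convolutional block

layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)    # stand-in for a medical image
score = model(x)[0].max()          # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # channel importances
cam = F.relu((weights * acts["v"]).sum(dim=1))        # class activation map
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0]
print(cam.shape)  # heat map to overlay on the input image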

Typically, a CNN model is deployed through a server-client architecture, which requires the data to be sent online to the model for prediction. Deep learning models are demanding in memory and computation; hence they need large computing power, such as GPUs, which usually sits on remote servers.

To get predictions, doctors have to upload the patient's radiological scan over the internet, putting data privacy at risk. What we do in this work instead is use solutions that make such models run locally in the web browser, thereby solving the privacy issue. This technique also addresses a challenge faced in developing countries, where internet access can be expensive.

Goal

In addressing the two challenges discussed earlier, the goal of this project is to build a locally run, web-based application for interpreting deep learning models for breast cancer diagnosis. The application neither interacts with the internet nor collects data from the user; it runs locally in the end-user's web browser.

Proposal

In order to have a CNN model with some level of certainty in its predictions, we will use an out-of-distribution detection method as the first step in evaluating our CNN model. This checks whether an image belongs to the training data distribution or not. Leveraging the baseline methods discussed in Cao et al. [2020], we aim to experiment with these methods using a dataset for breast cancer diagnosis (Araújo et al.).

This is a classification problem in which the model labels a sample as one of: normal, benign, in situ carcinoma or invasive carcinoma. We aim to extend this decision by explaining why the model makes it. We will achieve this by leveraging work done on model interpretability and explainability in Wang et al. [2020]. The system shows a graphical visualization of the model outputs and their interpretation.
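As a minimal sketch of the simplest out-of-distribution check, the snippet below applies the maximum-softmax-probability baseline (Hendrycks & Gimpel, 2017) on top of the four-class output; the threshold and the example logits are illustrative assumptions, and Cao et al. [2020] discuss stronger baselines.

import torch

CLASSES = ["normal", "benign", "in situ carcinoma", "invasive carcinoma"]
THRESHOLD = 0.7  # would be tuned on held-out in-distribution data in practice

def classify_with_ood(logits: torch.Tensor) -> str:
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    if conf.item() < THRESHOLD:
        return "out-of-distribution: refer to a specialist"
    return f"{CLASSES[idx.item()]} (confidence {conf.item():.2f})"

print(classify_with_ood(torch.tensor([0.2, 0.1, 3.5, 0.3])))  # confident: in situ
print(classify_with_ood(torch.tensor([0.5, 0.4, 0.6, 0.5])))  # near-uniform -> OOD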

The novelty of our proposal is that we are building this AI system with privacy concerns in mind, while also overcoming challenges faced in developing countries, such as limited internet access. Previous work on medical diagnosis using CNNs relies on large models trained on very large datasets, which consequently take up huge memory and computation.

This makes them extremely difficult to use on edge devices. With the long-term goal of deploying in a local web browser using tools like tensorflow.js, we will also consider post-training model quantization [Jacob et al., 2017], which shrinks the model parameters, making the model easily deployable on a low-resource device.
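As a hedged sketch of the post-training quantization step, the snippet below applies TensorFlow Lite's default dynamic-range quantization to a stand-in Keras model. The tiny architecture is an illustrative assumption, and the same size-reduction idea applies when preparing a model for tensorflow.js deployment.

import tensorflow as tf

# Stand-in for the trained diagnosis CNN; four output classes as above.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weight quantization
tflite_model = converter.convert()
print(f"quantized model size: {len(tflite_model) / 1024:.1f} KiB")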

Long-term vision

The long-term vision of this project is that our software system will help radiologists to effectively and efficiently diagnose breast cancer in hospitals and clinics across Ghana, Cameroon, and ultimately across Africa. Importantly, our interpretable models will help radiologists to better understand the predictions given by the model, and ultimately provide safer medical care to patients.

To scale our diagnosis system across Africa, we plan to open-source our research code. This will allow machine learning researchers in other African countries to train our machine learning models, in partnership with their own local hospitals and data, and deliver reliable diagnoses to broad populations of African people. Adding X-Rays, MRIs and other medical image modalities to our diagnostic suite of software is also part of the long term vision.

Personnel

  1. Jeremiah Fadugba, Core team member, project lead. M.Sc in Mathematical Sciences, African Institute for Mathematical Sciences (AIMS), Ghana. Has 3 years experience as Machine Learning Engineer.
  2. Moshood Olawale, Core team member. M.Sc in Mathematical Sciences, African Institute for Mathematical Sciences (AIMS), Cameroon. Has 1 year experience as a Machine learning Engineer.
  3. Conrad Tankou, Medical domain expert, external advisor. Doctor of Medicine, University of Yaoundé, Cameroon. He is a medical professional and radiologist. He works on cancer diagnosis.
  4. Oluwayetunde Sanni, Team member, Software Engineer, MSc AI and Robotics, Sapienza University of Rome, Italy. More than 5 years experience in software engineering.
  5. Two Research Interns. We have budgeted to hire two research interns who will be responsible for evaluating machine learning models and developing our system.

Description

Visual Question Answering on Medical Images: our system takes as input a medical image and a clinically relevant question and outputs the answer based on the visual content.

Motivation

With the increasing interest in artificial intelligence (AI) to support clinical decision making and improve patient engagement, opportunities to generate and leverage algorithms for automated medical image interpretation are becoming increasingly more important.

Since patients in Africa may now access structured and unstructured data related to their health via patient portals, access to medical AI assistants will likely improve their understanding of their condition based on their medical data.

The need for medical AI models is more profound when hospitals are not staffed with medical specialists, resulting in inaccurate diagnoses. For instance, in Cameroon not all hospitals have a radiologist, a gynecologist or, even worse, a cardiologist. On several occasions during hospital visits, patients meet with general practitioners who have no specialist training.

At best, the practitioners refer patients to specialized doctors. Even then, scheduling an appointment with a specialized doctor is sometimes impossible, as they tend to cater to a large number of patients. This inevitably decreases the chances of a correct diagnosis, which can have fatal consequences.

Further, the clinician's confidence in interpreting complex medical images can be significantly enhanced by a “second opinion” provided by an automated system. In addition, patients may be interested in the morphology/physiology and disease status of anatomical structures around a lesion that has been well characterized by their healthcare providers, and they may not necessarily be willing to meet a specialist they are not sure to see, or to pay significant amounts for a separate office or hospital visit, just to address such questions.

Although some patients in Africa often turn to search engines (e.g. Google) to disambiguate complex terms or obtain answers to confusing aspects of a medical image, search results may be nonspecific, erroneous or misleading, or overwhelming in both the sheer volume of information and the amount of misinformation.

In this project we aim to build a medical AI assistant with the potential to complement clinicians’ diagnoses. We focus on radiology images and tackle four main categories of questions:

  1. Modality, used in radiology to refer to the form of imaging, e.g. CT scan, mammography;
  2. Plane, a radiographic positioning term used routinely to describe the position of the patient when various radiographs are taken, e.g. longitudinal, coronal;
  3. Organ system, referring to the different body organs, e.g. the lung for its link to COVID-19;
  4. Abnormality, e.g. ectopic pregnancy, fat embolism.

These categories are designed with different degrees of difficulty, leveraging both classification and text generation approaches; the sketch after this list illustrates the question/answer format we have in mind for each category.
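As a purely hypothetical illustration (only the category names and example answers come from the list above; the question phrasings are our own assumptions), the expected question/answer format might look like this:

```python
# Hypothetical question/answer pairs for the four question categories.
# Only the category names and the example answers come from the project
# description; the question phrasings are illustrative assumptions.
EXAMPLES = [
    {"category": "modality",     "question": "What imaging modality was used?",       "answer": "CT scan"},
    {"category": "plane",        "question": "In which plane is the image acquired?", "answer": "coronal"},
    {"category": "organ system", "question": "Which organ system is shown?",          "answer": "lung"},
    {"category": "abnormality",  "question": "What abnormality is present?",          "answer": "fat embolism"},
]
```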

Goals

The goal of our project is to build a Visual Question Answering (VQA) model on medical images: the system takes as input a medical image and a clinically relevant question, and outputs the answer based on the visual content.
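To make the architecture concrete, here is a minimal sketch of one way such a VQA model could be assembled, assuming PyTorch and a recent torchvision; the encoder choices (a ResNet-18 image encoder, a single-layer LSTM question encoder) and all dimensions are illustrative assumptions, not the project’s confirmed design:

```python
# A minimal sketch of a medical VQA model: encode the image, encode the
# question, fuse the two, and classify over a fixed answer vocabulary.
# All architecture choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class MedicalVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Image encoder: a pretrained CNN with its classifier head removed.
        cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(cnn.children())[:-1])  # -> (B, 512, 1, 1)
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + answer classifier over a fixed answer set.
        self.classifier = nn.Sequential(
            nn.Linear(512 + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        img_feat = self.image_encoder(image).flatten(1)       # (B, 512)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))  # h_n: (1, B, hidden)
        fused = torch.cat([img_feat, h_n.squeeze(0)], dim=1)  # (B, 512 + hidden)
        return self.classifier(fused)                         # answer logits
```

In this classification formulation the model selects an answer from a fixed set, which fits the modality, plane and organ system categories; the abnormality category may instead require a text generation head, as noted above.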

This project supports several of the Sustainable Development Goals (SDGs):

  1. Reduce inequality within and among countries (SDG 10): it will no longer matter that a patient is in a place with no specialist; he or she can still obtain a solution to the problem;
  2. Ensure healthy lives and promote well-being for all at all ages (SDG 3): young people can use the system directly, and for older patients even generalist medical doctors could find a solution to the more specific problems they face;
  3. Build resilient infrastructure, promote inclusive and sustainable industrialization and foster innovation (SDG 9): since the code of this project will be open source, it will strengthen scientific research in the domain of medicine in Africa.

Long-term vision

We are planning to open source this project at the end of development, so that researchers in Africa and beyond can use it as a baseline. This could also encourage and open the path to more specialized data collection and the start of more in-depth research in the field of health. We believe our work could help improve the difficult health situation Africa currently faces.

The code will be well structured and made available to everybody on GitHub at the end of the project (6 months after the beginning of the project). It will be easy to run (a single command) and to evaluate. We are committed to maintaining our GitHub repository and addressing any issues that users raise.

This project will be presented at the Deep Learning IndabaX to be held in Cameroon; the team lead will be organising this event in 2021. We will also submit a paper on this project to the Information Technology, Data Science and Digital Health Summit and Expo conference and to other workshops, so that the work becomes known to researchers in Africa and beyond.

By 2022, this system will be deployed for use by medical experts, such as radiologists, to make accurate and confident predictions for their patients, and will also allow patients who have received their medical results to better understand their health situation.

Personnel

  1. Volviane Saphir MFOGO, project lead, is a student at the African Masters in Machine Intelligence (AMMI). She is a computer vision enthusiast with a background in computer science and mathematics, and graduated from the African Institute for Mathematical Sciences in Cameroon.
  2. Dr. Georgia Gkioxari, coordinator, was a lecturer at AMMI; she is a research scientist at FAIR. She received her PhD from UC Berkeley, working mainly on computer vision.
  3. Dr. Xinlei Chen, mentor, is a research scientist at Facebook AI Research; he was a PhD student at the Language Technologies Institute, Carnegie Mellon University, working mainly on computer vision, computational linguistics and the combination of both.
  4. Jeremiah Fadugba, Core team member. M.Sc in Mathematical Sciences, African Institute for Mathematical Sciences (AIMS), Rwanda. Has 3 years of experience as a Machine Learning Engineer.

Description

Making online educational content accessible through the reformulation of such content in local accents.

Motivation

It has become increasingly desirable to learn via the internet in developing countries. Most students seeking to learn online naturally flock towards Massive Open Online Courses (MOOCs). These are largely offered in English, a norm reflected in the use of English in scientific literature.

While there may be positive effects of listening to a foreign accent on listening comprehension in specific younger age groups [9], this is not more generally the case. Limited prior exposure to a given accent often results in reduced listening comprehension.

This challenge naturally recurs in the domain of internet-based learning, where listening comprehension is inherently tied to the utility of a spoken lecture or course. Our project aims to provide a greater variety of options for internet-based education by creating algorithmic solutions that allow students to learn online in accents that are more familiar to them, if they so desire.

Same content, different accent. We expect to achieve this by engineering a generative model that can convert audio between various accents. We conceptualise this as a challenge in accent transfer.
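As a minimal sketch of this idea, accent transfer can be framed as mapping the speech features of a source-accent recording to those of a target accent. The sketch below assumes log-mel spectrogram features computed with librosa and a simple PyTorch encoder-decoder; both choices are our own assumptions, not the project’s confirmed model:

```python
# A minimal sketch of accent transfer over speech features: extract a
# log-mel spectrogram, then map source-accent features to target-accent
# features with an encoder-decoder. The architecture is an illustrative
# assumption, not the project's actual generative model.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_mel(path, sr=16000, n_mels=80):
    """Load an audio file and compute a log-mel spectrogram."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6)  # shape: (n_mels, frames)

class AccentConverter(nn.Module):
    """1-D convolutional encoder-decoder: source-accent mel in, target-accent mel out."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        return self.decoder(self.encoder(mel))
```

A separate vocoder (e.g. Griffin-Lim or a neural vocoder) would still be needed to turn the predicted mel-spectrogram back into a waveform, and training such a model presupposes paired recordings of the same content in both accents, which is exactly the mirrored dataset described under Outcomes below.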

Outcomes

  1. Formalise the problem of swapping various standard English accents with particular accents local to developing countries, using accent transfer
  2. Curate a dataset that is consistent with other publicly available audio datasets, which contains mirror counterparts of standard English accents in said local accents
  3. Demonstrate a methodology by which short phrases can be converted from standard English accents to local accents, and extend this methodology to handle longer audio frames

Larger Vision

We find agreement with our motivations for this project in the 4th and 10th Sustainable Development Goals (SDGs). The 4th SDG concerns Quality Education, while the 10th SDG concerns Reducing Inequality. If students do indeed choose to use our tool, they will have enhanced access to educational materials, due to better comprehension, which improves the quality of their education. Improved educational quality in turn contributes to the more understated aim of improved equality of opportunity, which aids in reducing inequality of outcome.

Personnel

Tejumade Afonja is a graduate student at Saarland University studying Computer Science. Previously, she worked as an AI Software Engineer at InstaDeep Nigeria. She holds a B.Tech in Mechanical Engineering from Ladoke Akintola University of Technology (2015) and worked on the fabrication and design of a robot vacuum cleaner for her undergraduate thesis, which was published in the Alexandria Engineering Journal, hosted by Elsevier (2018). She is currently a remote research intern at the Vector Institute, where she conducts research in the areas of privacy, security and machine learning under the supervision of Prof. Nicolas Papernot from the University of Toronto. Tejumade is the co-founder of AI Saturdays Lagos, an AI community in Lagos, Nigeria, focused on conducting research and teaching machine learning related subjects to Nigerian youths. She is also an Intel Software Innovator for Machine Learning in Nigeria and a 2020 Google EMEA Women Techmakers Scholar.

Munachiso Nwadike: Munachiso is a researcher with the Clinical Artificial Intelligence Lab at New York University, Abu Dhabi. He trained in Computer Science and Mathematics at New York University for his undergraduate degree. While at NYU, his undergraduate thesis was on semantic segmentation of satellite images, and he built many interesting projects, such as a mobile application that interprets sign language with just a smartphone camera. His current work focuses on the robustness of deep learning disease classifiers for chest X-rays. Munachiso will begin his Master’s degree in January 2021 at the Mohammed Bin Zayed University of Artificial Intelligence.

Olumide Okubadejo is a research scientist at Spotify, Paris. His research is centered on automatic and conditional music generation. He holds a B.Eng in Electrical and Computer Engineering from FUT Minna, an M.Sc with Distinction in Artificial Intelligence from the University of Southampton, and a PhD from Université Grenoble Alpes. He has authored and co-authored several papers and was a visiting researcher at GeorgiaTech, Atlanta. He is also the recipient of several awards and grants, including the Northumbria grant and the GEORAMP grant for two consecutive years.

Clinton Mbataku: Clinton is a clinical laboratory scientist interested in solving Africa’s wide range of health problems using technology. His current research is focused on disease diagnosis using natural language processing algorithms. He is a volunteer assistant tutor with AI Saturdays Lagos, where he teaches data science and machine learning.

Lawrence Francis: Lawrence is a programmer with a zeal for understanding and building software solutions to problems. He is currently a machine learning research engineer at InstaDeep. His current research focuses on improving the sample efficiency and generalization of reinforcement learning algorithms, and on the robustness of visual recognition models. He is also a co-organiser at AI Saturdays Lagos, where he enjoys understanding, implementing and clearly explaining AI algorithms, and was the lead instructor for the Deep Learning and Computer Vision tracks.

Oluwafemi Azeez: Femi currently works as a research engineer at InstaDeep with a focus on reinforcement learning projects. He is a recent Master’s graduate of Carnegie Mellon University; he spent time at the African campus in Kigali and at the Pittsburgh campus, where he studied Electrical and Computer Engineering. His research focused on unsupervised domain adaptation in image segmentation, which he did with Yang Zou under the supervision of VijayaKumar Bhagavatula, and on speech separation with Yuichiro Koyama under the supervision of Bhiksha Raj. He also co-founded AI Saturdays Lagos and Kigali with the focus of helping others learn AI through community study groups, free online resources and peer motivation.