This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks. It is a document classification dataset for Kiswahili.

Language profile: Kiswahili

Language profile for Kiswahili


Swahili (also known as Kiswahili) is one of the most spoken languages in Africa, with 100–150 million speakers across East Africa. It is spoken in countries such as Tanzania, Kenya, Uganda, Rwanda, and Burundi, and in some parts of Malawi, Somalia, Zambia, Mozambique and the Democratic Republic of the Congo (DRC).


In Tanzania, [1] Swahili is the official language and the main medium of communication for economic, social, and government activities across the country, and it is the official language of instruction in all schools.

Given its presence within the continent and outside, Swahili is popularly used as a second language by people across the African continent and is taught in schools and universities. Swahili has been influenced by Arabic and was even written in an Arabic script during its early years.

Swahili is also one of the working languages of the African Union and is officially recognized as a lingua franca of the East African Community. In 2018, South Africa legalized the teaching of Swahili in South African schools as an optional subject, to begin in 2020. The Southern African Development Community (SADC) has also officially recognized Swahili as one of its official languages.

Existing work

[2] Baraza la Kiswahili la Taifa (National Swahili Council, abbreviated as BAKITA) is a Tanzanian institution responsible for regulating and promoting the Kiswahili language. Key activities mandated for the organization include creating a healthy atmosphere for the development of Kiswahili, encouraging the use of the language in government and business functions, coordinating the activities of other organizations involved with Kiswahili, and standardizing the language.

BAKITA cooperates with organizations like [3] TATAKI in the creation, standardization, and dissemination of specialized terminologies. Other institutions can propose new vocabulary to respond to emerging needs, but only BAKITA can approve usage. BAKITA also coordinates its activities with similar bodies in Kenya and Uganda to aid in the development of Kiswahili.

Several dictionaries are available: English to Swahili dictionaries online from the [4] elimuyetu website, Swahili to English dictionaries online from the [5] africanlanguages website, and the mobile Swahili Dictionary [6] on the Android Play Store.

Researcher profile: Davis David

Davis graduated with a Bachelor’s Degree in Computer Science from the University of Dodoma in 2017, where he was a co-organizer of the Python Community during his time at university. After that, he worked as a Software Developer at TYD Innovation Incubator, developing different innovative systems to solve educational and economic challenges in Tanzania. Davis also worked as a Data Scientist at ParrotAI, developing different AI solutions focused on agriculture, health, and finance.

He built computer vision models for classifying banana diseases from leaf images. For the last 4 years, Davis has been teaching machine learning and data science across different universities, tech communities, and events, with a passion to build a community of data scientists in Tanzania to solve local problems.

He is also working with Zindi Africa as a Zindi Ambassador and a mentor in Tanzania. He organizes machine learning hackathons across different cities in Tanzania and has mentored students and junior data scientists across Africa.


Partners in Cracking the Language Barrier for a Multilingual Africa

