AI4D blog series: Collecting and Organizing News Articles in the Swahili Language

Context

Swahili (also known as Kiswahili) is one of the most widely spoken languages in Africa. It is spoken by 100–150 million people across East Africa. Swahili is widely used as a second language by people across the African continent and is taught in schools and universities. Given its presence within and beyond the continent, learning Swahili is a popular choice for many language enthusiasts. In Tanzania, it is one of two national languages (the other is English).

News in Swahili is an important part of the media sphere in Tanzania. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries. In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.

Objective

Open-source Swahili text datasets are rarely available in Tanzania, which leaves the language behind in the creation of NLP technologies to solve African challenges.

The goal of this project is to build an open-source text dataset in the Swahili language focused on news articles. I mainly focus on collecting news in different categories such as local, international, business/financial, health, sports, and entertainment. The dataset will be open source, and NLP practitioners will be able to access it and learn from it.

Implementation

I implemented the following phases of the project in order to achieve its objective.

  1. Collect websites with Swahili news: The first phase of the project was to find and collect different websites that provide news in the Swahili language. I found some websites that publish news in Swahili only, and others that publish in several languages, including Swahili.
  2. Understand policies and copyright: In this phase, I focused on understanding each website's policies and copyright terms, i.e. what I can and cannot do with the content. AI4D helped me understand this by providing Data Protection Guidelines to consider for data collection and data mining.
  3. Understand the structure of each news website: The news websites were built with different web technologies such as PHP, Python, WordPress, Django, JavaScript, etc. The main task was to analyze the website source code using a web browser tool (view page source). I looked at different HTML tags to find news titles, categories, and the links to the full content of each title.
  4. Data collection: News articles were collected using the following tools: the Python programming language, Jupyter notebooks, and open-source Python packages (NumPy, pandas, and BeautifulSoup). The collected articles were saved in a CSV file containing the content and the category of each news item (e.g. sports); a scraping sketch is shown after this list.
  5. Analyze and clean: The collected news articles were analyzed and cleaned to remove irrelevant information such as HTML tags and symbols picked up during the scraping process (see the cleaning sketch below).
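
As an illustration of steps 3 and 4, the minimal sketch below shows how one category page could be scraped with requests and BeautifulSoup and saved to a CSV file with pandas. The URL, tag names, and CSS classes are hypothetical placeholders, not the actual sites used in the project; the real selectors differ for every site and must be read off the page source.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical category page URL; each real site has its own structure.
BASE_URL = "https://example-swahili-news.co.tz/category/michezo"
CATEGORY = "sports"

def scrape_category(url, category):
    """Collect article texts from one category page of a news site."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    # The tag and class names below are assumptions; they are found by
    # inspecting the page with the browser's "view page source" tool.
    for link in soup.select("h2.entry-title a"):
        article_page = requests.get(link["href"], timeout=30)
        article_soup = BeautifulSoup(article_page.text, "html.parser")
        paragraphs = article_soup.select("div.entry-content p")
        content = " ".join(p.get_text(strip=True) for p in paragraphs)
        articles.append({"content": content, "category": category})
    return articles

if __name__ == "__main__":
    rows = scrape_category(BASE_URL, CATEGORY)
    # Save content and category columns, matching the dataset layout.
    pd.DataFrame(rows).to_csv("swahili_news.csv", index=False)
```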
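And a minimal cleaning sketch for step 5, assuming the CSV layout and file name from the previous sketch: it strips leftover HTML tags, URLs, and stray symbols with regular expressions, collapses whitespace, and drops empty or duplicate rows.

```python
import re
import pandas as pd

def clean_text(text):
    """Strip leftover HTML tags, URLs, and stray symbols from scraped text."""
    text = re.sub(r"<[^>]+>", " ", text)        # leftover HTML tags
    text = re.sub(r"http\S+", " ", text)        # URLs
    text = re.sub(r"[^\w\s.,!?']", " ", text)   # stray symbols
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return text.strip()

df = pd.read_csv("swahili_news.csv")
df["content"] = df["content"].astype(str).map(clean_text)
df = df[df["content"] != ""].drop_duplicates(subset=["content"])
df.to_csv("swahili_news_clean.csv", index=False)
```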

Results

At the end of this project, I was able to achieve the following milestones:

  • Collecting and organizing a total of 40,331 news articles (with a total of 12,488,239 words).
  • Collecting news from six different categories: local, international, business, health, sports, and entertainment.

The main challenge is the imbalance of the collected news across categories. For example, we have few articles in the international, business, and health categories.
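
A short pandas sketch, assuming the hypothetical cleaned CSV from the previous section, shows how the article count, total word count, and per-category distribution can be checked; the per-category counts make the imbalance visible.

```python
import pandas as pd

df = pd.read_csv("swahili_news_clean.csv")

# Overall dataset size.
total_articles = len(df)
total_words = df["content"].str.split().str.len().sum()
print(f"Articles: {total_articles}, words: {total_words}")

# Per-category counts reveal the class imbalance mentioned above.
print(df["category"].value_counts())
```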

I would like to extend my thanks to the AI4D (Artificial Intelligence for Development in Africa) team and other partners in this AI4D language dataset fellowship for their support and guidance throughout the project. I have also learned a lot from my fellow researchers across Africa who participated in this program to develop datasets in our African languages.