Abstract

The monitoring of work towards the SDGs is essential to assess progress and obstacles to realise our shared agenda.

A large number of SDG documents created by governments, universities, and private and public entities are assessed by the UN to measure progress, usually requiring expert labelling. Annual SDG progress reports are also becoming more common beyond the UN (for example in academia, to evaluate the contribution of research and teaching to this agenda), with the aim of identifying challenges and achievements.

In this project we propose to create an automatic tool for SDG labelling based on Artificial Intelligence (AI), which can save time in expert querying and so facilitate this labelling. Additionally, we propose to leverage the power of cutting-edge AI-based language models. These models are usually pre-trained on very large collections of web text before being fine-tuned on a task (such as SDG tagging). As such, they bring broad background knowledge that could reduce the bias in expert labels, as well as represent the interconnectedness of our SDGs.

Our final objective is to build an online tool (web app and API) for querying the model, which has a wide range of use cases in research and education.

Personnel

Dr Perez-Ortiz is an Assistant Professor at the Centre for Artificial Intelligence at UCL. She is co-founder and Deputy Director of a new MSc program on AI for Sustainable Development, which engages the new generations of engineers in developing responsible and innovative AI technologies for people and the planet. She teaches two modules on the intersection of AI and the UN’s SDG agenda, as well as on how to build responsible and ethical AI systems. Her research is fully interdisciplinary: she actively collaborates with psychologists, medical doctors, social scientists, educators, agronomists and climate scientists. Every summer, Perez-Ortiz leads a group of MSc students completing their dissertations in the technology for sustainable development domain, creating new technologies for identifying illegal deforestation and fishing, enabling the energy transition, designing tools to understand the impact of policies, etc. Perez-Ortiz has more than 12 years of experience in theoretical and applied AI research (h-index 21), with a focus on environmental AI and educational recommender systems. She has collaborated on research with the European Space Agency, the HumaneAI network, the Knowledge 4 All Foundation, Apple, Google’s DeepMind, Spotify and multiple European and American universities.

Sahan Bulathwela is a Research Assistant contributing to multiple large projects on the topic of “AI for Education”. His contributions to the area, published in esteemed research venues, span multiple topics connected to this grant, namely text-tagging, recommender systems and natural language processing. Before joining UCL, he worked in several research roles in industry, where he gained experience in creating data products in a big data landscape. He has experience managing engineering teams to build APIs and web services.

John Shawe-Taylor is Professor of AI at UCL, Director of the UCL MSc on AI for Sustainable Development, Director of the International Research Center on Artificial Intelligence under the auspices of UNESCO and UNESCO Chair in AI. His foundational work in AI has attracted around 85,000 citations, making him one of the most cited and prolific researchers in the field.

Dr Wayne Holmes is a learning sciences and innovation researcher at the UCL Institute of Education, as well as a consultant researcher on AI and education for UNESCO. Wayne brings a critical studies perspective to the connections between AI and education, and their ethical, human and social justice implications.

 

 

International studies identify a lack of preparation and training in OE usage. However, the problem is to build the capacity to use OE as a tool to solve social problems. The OE4BW educational program allows its mentees/project developers to develop an advanced understanding, while addressing specific challenges in the areas of capacity and community building in OE.

The OE4BW mentoring programme is at the forefront of combining OER and SDGs and helping create a more personal approach towards building OER that can inform, educate and present value in new ways. The OERs have to address at least one of the 17 Sustainable Development Goals (SDGs), from ending poverty to a range of social needs including education, health, equality and job opportunities, while tackling climate change and preserving our environment.

It is a half-year-long programme, organised in a sustainable way: it takes place fully online and is open to students from all backgrounds, regions and continents who have the potential and desire to employ Open Educational Resources to solve large-scale problems that matter in today’s global landscape.

New project developers and new communities will require technical and media knowledge, educational content, pedagogical and didactical principles, social and psychological aspects, new organization and value-added models, strategies, and the potential paths for the organizational change, relevant policies, and legal aspects.

In addition, OE projects occur in a social context, requiring a social justice component. Many formal programs are inaccessible for students from the global South, underdeveloped countries, and underrepresented communities. Furthermore, leaders and their projects may not be properly connected to others. A critical mass of leaders in open education is fundamentally important to start making global changes.

The OE4BW addresses educational pathways, network development, and improved outcomes for open education in meaningful ways. From new participants to next-generation leadership, the program accelerates personal, professional and educational development. It also creates new networks of first-time participants, mentors, coordinators and advisors. In particular, by creating networks of new participants, OE4BW strives to build inclusivity in the OE movement.

Outcome 1: Improve the communications infrastructure of OE4BW and staff capabilities

Enhance communication and collaboration among the developers, mentors, hub coordinators, and alumni of the OE4BW mentorship program through the customized MiTeam platform.

Outcome 2: Strengthened networks and new topical hubs

Support participants to physically join the OE4BW yearly final event EDUSCOPE in 2022, projected as a live event.

Outcome 3: Research the assumptions, practices, and results of OE4BW

Investigate the impacts and results of OE4BW. Provide research results as a basis for further improvement. Use qualitative and quantitative analysis of surveys and questionnaires completed by OE4BW participants to determine the programme’s current impact on society and its connection to the SDGs.

 

Description

Namibia is home to 2.5 million people with a rich cultural and colonial history spanning over 100 years.

The stories of the Namibian people, including their cultural practices, knowledge and history, have not been told from the perspectives of the Namibian people themselves. As Göring said at the Nuremberg trials, “The victor will always be the judge, and the vanquished the accused.”

As such, this project aims to capture this knowledge in its historical and cultural context for Khoekhoegowab, one of the most critically endangered languages, and for Oshiwambo, Namibia’s most widely spoken language, and in doing so provide data for NLP tasks.

This project builds on prior efforts to create cultural and historical texts in the Khoekhoegowab language by crowdsourcing a speech dataset from 300 of a potential 10,000 Namibian war veterans, most of them Oshiwambo speakers, and from a community of Khoekhoegowab elders whose traditional methods are still used in wildlife conservation for monitoring and tracking.

The project will consider various data gathering methods such as interviews, focus groups and web apps to capture the data. The speech data will be annotated and translated into English.

Introduction

When it comes to scientific communication and education, language matters. Discussing science in local indigenous languages not only reaches more people who do not speak English or French as a first language, but also integrates the facts and methods of science into cultures that have been denied it in the past. As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.”

During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widely spoken languages of their countries. ∀ et al (2020) demonstrated that machine translation tools failed to translate COVID-19 surveys because the only data available to train the models was religious text. Furthermore, they noted that many scientific terms did not exist in the respective African languages.

Thus, we propose to build a multilingual parallel corpus of African research, by translating African preprint research papers released on AfricArxiv into 6 diverse African languages.

Proposed Dataset and Use Cases

When it comes to scientific communication, language matters. Jantjies (2016) demonstrates how language matters in STEM education: students perform better when taught mathematics in their home language. Language also matters in scientific communication because it can dehumanise the people it chooses to study. As Robyn Humphreys noted at the #LanguageMatters seminar at UCT Heritage 2020: “During the continent’s colonial past, language – including scientific language – was used to control and subjugate and justify marginalisation and invasive research practices”.

Discussing science in local indigenous languages not only reaches more people who do not speak English as a first language; it also integrates the facts and methods of science into cultures that have been denied it in the past.

As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.” (Prah, Kwesi Kwaa, 2007). When science becomes “foreign” or something non-African, when one has to assume another identity just to theorize and practice science, it’s a subjugation of the mind – mental colonization.

There is a substantial amount of distrust in science, in particular by many black South Africans who can cite many examples of how it has been abused for oppression in the past. In addition, the communication and education of science was weaponized by the oppressive apartheid government in South Africa, and that has left many seeds of distrust in citizens who only experience science being discussed in English.

Through government-funded efforts, European-derived languages such as Afrikaans, English, French, and Portuguese have been used as vessels of science, but African indigenous languages have not been given the same treatment. Modern digital tools like machine learning offer new, low-cost opportunities for scientific terms and ideas to be communicated in African indigenous languages.

During the COVID-19 pandemic, many African governments did not communicate about COVID-19 in the most widely spoken languages of their countries. ∀ et al (2020) demonstrated the difficulty of translating COVID-19 surveys, since the only data available to train the models was religious text. Furthermore, they noted that many scientific terms did not exist in the respective African languages.

Use cases:

  • A machine translation tool for AfricArxiv to aid translation of their research to and from African languages
  • Terminology developed will be submitted to respective boards for addition to official language glossaries for further improvements to scientific communication
  • A machine translation tool for African universities to ensure accessibility of their publications
  • A machine translation tool for scientific journalists to assist in widely distributing their work on the African continent
  • A machine translation tool to aid translation of impactful STEM university curricula into African languages

Personnel

Jade Abbott is the co-founder of Masakhane and Staff Engineer at Retro Rabbit South Africa, working primarily in NLP, with an MSc in Computer Science from the University of Pretoria. She is a thought leader in the space of NLP in production and African NLP (especially machine translation), and has published and spoken at numerous conferences across the world, including the Deep Learning Indaba, ICLR 2020, and the UN World Data Forum. In 2019 she co-founded Masakhane, which she still leads: an initiative to spur NLP research in Africa whose members have collectively published over 15 works in the past year and are leading the conversation around geographic and language diversity in NLP in Africa.

Dr. Johanna Havemann is a trainer and consultant in [Open] Science Communication and [digital] Science Project Management and AfricArxiv. Her work experience covers NGOs, a science startup and international institutions including the UN Environment Programme. With a focus on digital tools for science and her label Access 2 Perspectives, she aims to strengthen global science communication in general, with a regional focus on Africa, through Open Science. For the past two years she has added a focus on language diversity in science, and the pan-African Open Access portal she coordinates provides information and accepts submissions in 12 official African languages.

Sibusiso Biyela has been a science communicator at ScienceLink since 2016, where he has worked with South African universities and international research institutions to produce science communication content for many audiences, including policymakers, the research community, and the lay public. He has experience as a thought leader on the decolonisation of science and science communication. He has given talks on the topic at international conferences and contributed to discussions on platforms such as national radio and international podcasts. He is the author of a widely regarded article, “Decolonizing Science Writing in South Africa”, in which he has been vocal about creating scientific terms in the isiZulu language.

Introduction

Kenyan author Ngugi Wa Thiong’o states in his book Decolonising the Mind: “The effect of a cultural bomb is to annihilate a people’s belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.” When a technology treats something as simple and fundamental as your name as an error, it in turn robs you of your personhood and reinforces the colonial narrative that you are other.

Named entity recognition (NER) is a core NLP task in information extraction. NER systems are a requirement for numerous products, from spell-checkers to the localization of voice and dialogue systems and conversational agents, all of which need to identify African names, places and people for information retrieval.

Currently, the majority of existing NER datasets for African languages come from WikiNER, which is automatically annotated and very noisy, since the text quality for African languages is not verified. Only a few African languages have human-annotated NER datasets. To our knowledge, the only open-source part-of-speech (POS) datasets that exist cover a small subset of the languages of South Africa, plus Yoruba, Naija, Wolof and Bambara (Universal Dependencies).

Pre-trained language models such as BERT and XLM-RoBERTa are producing state-of-the-art NLP results which would undoubtedly benefit African NLP. Beyond its direct uses, NER is also a popular benchmark for evaluating such language models. For the above reasons, we have chosen to develop a broad-coverage POS and NER corpus for 20 African languages based on news data.
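
To make the shape of such an annotated corpus concrete, the sketch below groups token-level tags into entity spans. It assumes the widely used BIO tagging convention (B- for the first token of an entity, I- for a continuation, O for everything else); the proposal does not state which annotation scheme will be used, and the function name, the PER/ORG labels and the English example sentence are illustrative only.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Illustrative sentence; a real corpus would hold news text in each target language.
tokens = ["Jonathan", "Mukiibi", "studies", "at", "Makerere", "University", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"]
entities = extract_entities(tokens, tags)
```

An I- tag that does not continue the current entity type is treated as the start of a new entity, which is a lenient choice that tolerates small annotation slips.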

Personnel

Peter Nabende is a Lecturer at the Department of Information Systems, School of Computing and Informatics Technology, College of Computing and Information Sciences, Makerere University. He has a PhD in Computational Linguistics from the University of Groningen, The Netherlands. He has conducted research on named entities across several writing systems and languages in the NLP subtasks of transliteration detection and generation. He has also conducted experimental research on machine translation between three low-resourced indigenous Ugandan languages (Luganda, Acholi, and Lumasaaba) and English, using statistical and neural machine translation methods and tools such as Moses and OpenNMT-py. He has supervised the creation of language technology resources for another three Ugandan languages (a Lusoga-English parallel corpus and Grammatical Framework (GF)-based computational grammar resources for Runyankore-Rukiga and Runyoro-Rutooro).

Jonathan Mukiibi is a Master’s student in Computer Science at Makerere University. His current research focuses on topic classification of speech documents for crop disease surveillance using Luganda-language radio data. He is the coordinator of natural language processing tasks at the Artificial Intelligence Lab, Department of Computer Science, Makerere University.

David Ifeoluwa Adelani (an NLP Researcher, https://dadelani.github.io/) is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages. He was involved in the creation of the first NER dataset for Hausa [Hedderich et al., 2020] and Yoruba [Alabi et al., 2020] in the news domain.

Daniel D’souza has an MS in Computer Science (specialization in Natural Language Processing) from the University of Michigan, Ann Arbor. He currently works as a Data Scientist at ProQuest LLC.

Jade Abbott has an MSc in Computer Science from the University of Pretoria. She is a Machine Learning lead at Retro Rabbit South Africa, working primarily in NLP. Additionally, she co-founded Masakhane, an initiative to spur NLP research in Africa, and has published widely on African NLP tasks.

Olajide Ishola has an MA in Computational Linguistics. He is one of the pioneers of the first dependency treebank for the Yoruba language [Ishola et al., 2020]. His interest lies in corpus development and NLP for indigenous Nigerian languages.

Constantine Lignos is an Assistant Professor in the Department of Computer Science at Brandeis University where he directs the Broadening Linguistic Technologies lab. He received his PhD from the University of Pennsylvania in 2013. His research focus is the construction of human language technology for previously-underserved languages. He has worked on named entity annotation and system creation for Tigrinya and Oromo, and additionally developed entity recognition systems for Amharic, Hausa, Somali, Swahili, and Yoruba. He has also worked on natural language processing tasks for other African languages, including cross-language information retrieval for Somali and information extraction for Nigerian English.

The objective of this project is to build a Wolof text-to-speech system. Three people will be involved: Thierno Ibrahima DIOP, Senior Data Scientist at Baamtu SARL; El Hadj Mamadou Nguer, Assistant Professor at Universite Virtuelle du Senegal; and Sileye BA, Senior Machine Learning Researcher at L’Oreal Innovation Center in Paris. Thierno Ibrahima DIOP and Mamadou Nguer will be the project’s principal investigators.

The project will exploit a dataset of 40000 Wolof phrases uttered by two actors. This open-source dataset is a deliverable of a previous project.

The project will be conducted in four phases:
1. Evaluation of the quality of the dataset
2. Implementation of a machine learning model mapping Wolof texts to their corresponding utterances
3. Quantitative and qualitative evaluation of the implemented model’s performance
4. Development of an API exposing the implemented text-to-speech model

Dataset quality will be assessed on a randomly sampled portion of about a thousand uttered phrases. These phrases will be qualitatively validated in terms of comprehensibility by fluent Wolof speakers.
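
The sampling step described above can be sketched as follows. This is a minimal illustration, assuming the phrases can be referred to by integer ids and that each reviewer judgement is a simple accept/reject flag; the function names, the fixed seed and the judgement format are assumptions, not part of the project plan.

```python
import random


def sample_for_review(phrase_ids, sample_size=1000, seed=0):
    """Draw a reproducible simple random sample of phrase ids, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same batch can be re-drawn later
    sample_size = min(sample_size, len(phrase_ids))
    return rng.sample(phrase_ids, sample_size)


def acceptance_rate(judgements):
    """Share of reviewed phrases that fluent speakers marked as acceptable."""
    if not judgements:
        raise ValueError("no judgements recorded")
    return sum(1 for accepted in judgements.values() if accepted) / len(judgements)


phrase_ids = list(range(40000))  # placeholder ids standing in for the 40000 phrases
review_batch = sample_for_review(phrase_ids)
```

Fixing the seed lets several reviewers work on the same batch and makes the quality estimate repeatable.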

A state-of-the-art neural network speech synthesizer will be implemented and evaluated using the dataset. Neural network models have been selected because they can be trained end to end without requiring segmentation at the phoneme level, as competing statistical models do. We will investigate text-to-spectrogram models such as Tacotron, Glow-TTS, and Speedy-Speech, as well as vocoder models such as MelGAN.

The trained model will be evaluated quantitatively and qualitatively. The quantitative evaluation will be done using metrics provided in standard text to speech evaluation libraries. The qualitative evaluation will be based on fluent Wolof speakers’ comprehension of synthesized Wolof utterances.

The model will be exposed via an API which takes a language token and an input text, and returns the synthesized speech as an audio file. This API will be plugged into a web platform based on the Masakhane MT web platform.
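
A possible shape for this request handling is sketched below, using only the Python standard library. Everything here is hypothetical: the payload keys "language" and "text", the status codes, the 22050 Hz sample rate and the `synthesize` stub (which returns a plain tone in place of the trained model) are assumptions made for illustration, not the project's actual interface.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # assumed output rate; the real model may use another
SUPPORTED_LANGUAGES = {"wo"}  # "wo" is the ISO 639-1 code for Wolof


def synthesize(text, language):
    """Stand-in for the trained model: a tone whose length grows with the text."""
    n_samples = int(SAMPLE_RATE * 0.05 * max(len(text), 1))
    return [0.3 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE) for i in range(n_samples)]


def to_wav_bytes(samples):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
    return buffer.getvalue()


def handle_request(payload):
    """Validate a request dict and return (status_code, body_bytes)."""
    language = payload.get("language")
    text = (payload.get("text") or "").strip()
    if language not in SUPPORTED_LANGUAGES:
        return 400, b"unsupported language"
    if not text:
        return 400, b"empty text"
    return 200, to_wav_bytes(synthesize(text, language))


status, audio = handle_request({"language": "wo", "text": "Salaam aleekum"})
```

Keeping validation separate from synthesis means the stub can later be replaced by the trained model without touching the request handling.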

For deployment, a Kubernetes cluster will be used to provide horizontal scaling: at the beginning we can run a single instance, and the number of instances will then be adjusted automatically depending on the load. The cost of an instance (8 cores, 32GB of RAM) will be about $83.95 per month, subject to a yearly reservation.

An objective of this project is to publish the work done on the dataset and the developed speech synthesis model at a natural language processing venue such as the African NLP Workshop or the Deep Learning Indaba. This will give more visibility to this work and, at the same time, advance machine-learning-based African language processing.

 

Introduction

Wildlife tourism is a significant and growing contributor to the economic and social development in the African region through revenue generation, infrastructure development and job creation. According to a recent press release by the World Travel and Tourism Council [1], travel and tourism contributed $194.2 billion (8.5% of GDP) to the African region in 2018 and supported 24.3 million jobs (6.7% of total employment). Globally, travel and tourism is a $7.6 trillion industry, and is responsible for an estimated 292 million jobs [2]. Tourism is also one of the few sectors in which female labor participation is already above parity, with women accounting for up to 70% of the workforce [2].

However, the wildlife tourism industry in Africa is increasingly threatened by rising human population and wildlife crime. As poaching becomes more organised and livestock incursions become frequent occurrences, shortages in ranger workforce and shortcomings in technological developments in this space have put thousands of species at risk of endangerment, and threaten to collapse the wildlife tourism industry and ecosystem.

Tourism in Kenya contributed revenue of $1.5 billion in 2018 [3]. The National Wildlife Conservation Status Report, 2015 – 2017 [4], presented by the Ministry of Tourism and Wildlife of Kenya, states that wildlife conservancies in Kenya supported over 700,000 community livelihoods. A recession in the wildlife tourism industry could therefore have major adverse economic and social impacts on the country. It is thus critical that sustainable solutions are reached to save the wildlife tourism industry, and that further research in this area is encouraged.

Problem definition

According to The National Wildlife Conservation Status Report, 2015 – 2017 [4] presented by the Ministry of Tourism and Wildlife of Kenya, there is currently a shortage of 1038 rangers against the 2484 required in Kenyan national parks and reserves, a deficit of over 40%. To address shortages in ranger workforce, carry out monitoring activities more effectively, and detect criminal or endangering activities with greater precision, we propose the deployment of Unmanned Ground Vehicles (UGVs) for intelligent patrol and wildlife monitoring across the national parks and reserves in Kenya.

The UGVs would be fitted with a suite of cameras and sensors that would enable them to navigate autonomously within the parks and to run multiple deep learning and computer vision algorithms for monitoring activities such as detection of poaching, livestock incursions, human-wildlife conflict and distressed wildlife, as well as species identification.

The UGVs could be monitored from a central surveillance system, where alerts can be generated on detection of any alarming activity, and rangers dispatched to respond. Ethical considerations can be made to facilitate the deployment of these UGVs in a manner that aids the ranger workforce in their routine surveillance tasks throughout the national parks and reserves that often span thousands of square kilometers, rather than replace them. Sustainable and ethical automation could help create more jobs in the automotive and technology sectors without replacing current jobs.

The deployment of a project of this scale, however, would require significant investments in building the UGV, and require feasibility studies from the government and international wildlife conservation bodies. Furthermore, without reasonable computer vision and autonomous navigation accuracies, investments towards building the unmanned vehicle would be futile. It is thus crucial that efforts are first made towards solving the computer vision and autonomous navigation challenges posed by the rough terrains prevalent in national parks and reserves.

This project therefore serves as a stepping-stone towards adopting autonomous vehicle technology in Africa and pioneering further research in the field and its applications to broader areas beyond just transportation. Additionally, its adaptation in national park environments would allow it to be tested in unstructured environments lacking road infrastructure and free of traffic and pedestrians, thus allowing the systems to be tested safely and get quicker policy approvals. The scope of this research is hence limited to developing an end-to-end deep learning model that can autonomously navigate a vehicle over dirt roads and challenging terrain that is present in national parks and reserves.

The model will be trained on trail video as well as driving data such as steering wheel angle, speed, acceleration, and Inertial Measurement Unit (IMU) data. The accuracy of the model will be measured by calculating the error rate between the model’s prediction and the driver’s actual inputs over a given distance. We also look to publish the dataset of annotated driving data from national parks and reserves, the first of its kind, to encourage further research in this space. Additionally, we shall collect metadata such as number of patrol vehicles per square kilometer, average distance travelled per vehicle per day, distance of traversable road in the park per square kilometer, that can be used to give a preliminary analysis on the feasibility of the project results towards automated wildlife patrol.
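
As a minimal sketch of this evaluation, the snippet below compares predicted and recorded steering angles over a stretch of trail. The proposal does not fix a particular error measure, so mean absolute error and root mean squared error are used here as assumptions; the sample layout (distance travelled in metres, predicted angle, driver angle) and the function name are likewise illustrative.

```python
import math


def steering_errors(samples, start_m, end_m):
    """Return (MAE, RMSE) between predicted and driver steering angles
    for samples whose distance reading falls in [start_m, end_m)."""
    diffs = [
        predicted - actual
        for distance_m, predicted, actual in samples
        if start_m <= distance_m < end_m
    ]
    if not diffs:
        raise ValueError("no samples in the requested distance window")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# (distance in metres, predicted angle in degrees, driver angle in degrees)
samples = [(0.0, 1.0, 1.0), (5.0, 2.0, 4.0), (10.0, 3.0, 1.0), (15.0, 0.0, 9.0)]
mae, rmse = steering_errors(samples, 0.0, 12.0)  # last sample lies outside the window
```

Reporting both figures is a design choice: RMSE penalises occasional large steering mistakes more heavily than MAE, which matters on rough terrain.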

References

[1] “African tourism sector booming – second-fastest growth rate in the world”, WTTC press release, Mar. 13, 2019. Accessed on Jul. 11, 2019. [Online]. Available:
https://www.wttc.org/about/media-centre/press-releases/press-releases/2019/african-tourism-sector-booming-second-fastest-growth-rate-in-the-world/
[2] “Supporting Sustainable Livelihoods through Wildlife Tourism”, World Bank Group, 2018.
[3] “Tourism Sector Performance Report – 2018”, Hon. Najib Balala, 2018.
[4] “The National Wildlife Conservation Status Report, 2015 – 2017”, Ministry of Tourism and Wildlife, Kenya, 2017, pp. 74, 75, 131.

 

Abstract

According to the Open Data Barometer by the World Wide Web Foundation, countries in sub-Saharan Africa are ranked poorly, with an average score of about 20 out of a maximum of 100 on open data initiatives based on readiness, implementation, and impact [1]. To make the process of creating, introducing, and passing parliamentary bills a force for public accountability, the information needs to be easier for the average citizen to analyze and process.

This is not the case for most of the bills introduced and passed by parliaments in Sub-Saharan Africa. In this work, we present a method to overcome this implementation barrier. For the Nigerian parliament, we used a pre-trained optical character recognition (OCR) tool, natural language processing techniques and machine learning algorithms to categorize parliamentary bills. We propose to improve the work on the Nigerian parliamentary bills by using text detection models to build a custom OCR tool. We also propose to extend our method to three other African countries: South Africa, Kenya, and Ghana.

Introduction

Given the challenges and precariousness facing developing and underdeveloped countries, the quality of policymaking and legislation is of enormous importance. Legislation can affect the success of several of the United Nations Sustainable Development Goals (SDGs), such as poverty alleviation, a good public health system, quality education, economic growth, and sustainability. Targets 16.6 and 16.7 of the UN SDGs are to “develop effective, accountable, and transparent institutions at all levels” and to “ensure responsive, inclusive, participatory and representative decision making at all levels” [2]. For countries in Sub-Saharan Africa to meet these targets, an open data revolution needs to happen at all levels of government and, more importantly, at the parliamentary level.

Objectives and Expectations

To achieve the goal of meeting the UN SDG targets 16.6 & 16.7, making effective use of data is key. However, does such data currently exist? If so, how should it be organized in a framework that is amenable to the decision-making process? Here, we propose expanding our work on categorizing parliamentary bills in Nigeria using Optical Character Recognition (OCR), document embedding and recurrent neural networks to three other countries in Africa: Kenya, Ghana, and South Africa.

We also plan to improve our text extraction process by training a custom OCR model. The objective of this project is to generate semantic and structured data from the bills and, in turn, categorize them under socio-economic labels. We plan to recruit three interns to work on this project for five months: two machine learning interns and one software engineering intern.

Conclusion and Long Term Vision

Our initial experimental results show that our model is effective for categorizing the bills, which will aid our large-scale digitization efforts. However, we identified a key remaining challenge based on our results: the output from the pre-trained OCR tool is often not an accurate representation of the text in the bills, especially for low-quality PDFs. A promising way to solve this is to train the custom OCR tool we have proposed. The rapid progress of text detection research with novel deep learning methods can help us in this area.

Methods such as region-based or single-shot detectors can be employed. In addition, we plan to use image augmentation to alter the size, background noise or color of the bills. A large-scale annotation effort on the texts can provide the labels needed to train our custom OCR for text identification and named entity recognition. We are also extending our methodology to other countries in Sub-Saharan Africa. Results that lead to accurate categorization of parliamentary bills are well-positioned to have a substantial impact on governmental policies and on the quest for governments in low-resource countries to meet the open data charter principles and the United Nations’ Sustainable Development Goals on open government.
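
The augmentation step mentioned above could look roughly like the following. This is a minimal sketch, not our actual pipeline: it assumes a scanned page is already available as a grayscale image stored as a list of rows of 0-255 integers, and the function name `augment` and all of its parameters are hypothetical.

```python
import random


def augment(image, rng, noise_std=10.0, max_brightness_shift=30, scale_choices=(1, 2)):
    """Return an augmented copy of a grayscale image (list of rows of 0-255 ints).

    Illustrative only: the parameter names and default values are assumptions.
    """
    # 1) Size: nearest-neighbour upscaling by a randomly chosen integer factor.
    scale = rng.choice(scale_choices)
    scaled = [
        [pixel for pixel in row for _ in range(scale)]
        for row in image
        for _ in range(scale)
    ]

    # 2) Color: one brightness shift applied to the whole page.
    shift = rng.randint(-max_brightness_shift, max_brightness_shift)

    def clamp(value):
        # Keep every pixel inside the valid 0-255 range.
        return max(0, min(255, int(round(value))))

    # 3) Background noise: independent Gaussian noise added to each pixel.
    return [
        [clamp(pixel + shift + rng.gauss(0.0, noise_std)) for pixel in row]
        for row in scaled
    ]
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which helps when comparing OCR models trained on the same augmented pages.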

Accurate categorization can also empower policymakers, stakeholders and governmental institutions to identify and monitor bills introduced to the National Assembly for research purposes, and can improve the efficiency of bill creation and open data initiatives. We plan to design an intercontinental tool that combines information from all bills and categories and makes it easily accessible to everyone. For our long-term vision, we plan to analyze documents on parliamentary votes and proceedings to gain more insight into legislative debates and patterns.

Description

Algorithms for text classification still face some open problems, for example dealing with long pieces of text and with texts in under-resourced languages.

This challenge gives participants the opportunity to improve on text classification techniques and algorithms for text in Chichewa. The texts are of varying length; some are quite long and will pose challenges in chunking and classification. The texts are made up of news articles.
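
One common way to handle long articles is to split them into overlapping word windows, classify each window, and combine the per-window predictions. The sketch below illustrates that idea only; the function names, window size and overlap are assumptions, and the classifier itself is left out.

```python
from collections import Counter


def chunk_words(text, max_words=200, overlap=50):
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def majority_vote(chunk_labels):
    """Combine per-chunk predictions into one article-level label."""
    return Counter(chunk_labels).most_common(1)[0][0]
```

Majority voting is only one aggregation choice; averaging per-chunk class probabilities is another option when the classifier provides them.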

The objective of this challenge is to classify news articles.

We hope that your solutions will illustrate some challenges and offer solutions.

Algorithms for text classification have come a long way, but classifying long texts and working with under-resourced languages can still pose difficulties. This challenge gives participants the opportunity to improve on text classification techniques and algorithms for text in Chichewa. The texts are made up of news articles of varying lengths. The objective of this challenge is to classify these articles by topic. We hope that your solutions will illustrate some challenges and offer solutions.

Chichewa is a Bantu language spoken in much of Southern, Southeast and East Africa, namely the countries of Malawi and Zambia, where it is an official language, and Mozambique and Zimbabwe where it is a recognised minority language.

tNyasa Ltd Data Science Lab

We are a company based in Malawi offering intelligent technological solutions for the travel, technology, trade, cultural and education sectors in Malawi. As part of the Data Science Lab, we work on language tools for Chichewa, such as the construction and curation of data sets, speech to text and information processing.

AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policymakers.

Datasets

The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab used three main sources: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The articles presented in the dataset are full articles and span many different genres, from social issues, family and relationships to political or economic issues.

The articles were cleaned by removing special characters and HTML tags.

Your task is to classify the news articles into one of the 20 classes listed below. The classes are mutually exclusive.

List of classes: [‘SOCIAL ISSUES’, ‘EDUCATION’, ‘RELATIONSHIPS’, ‘ECONOMY’, ‘RELIGION’, ‘POLITICS’, ‘LAW/ORDER’, ‘SOCIAL’, ‘HEALTH’, ‘ARTS AND CRAFTS’, ‘FARMING’, ‘CULTURE’, ‘FLOODING’, ‘WITCHCRAFT’, ‘MUSIC’, ‘TRANSPORT’, ‘WILDLIFE/ENVIRONMENT’, ‘LOCALCHIEFS’, ‘SPORTS’, ‘OPINION/ESSAY’]

Files available for download:

  • Train.csv – contains the target. This is the dataset that you will use to train your model.
  • Test.csv – resembles Train.csv but without the target-related columns. This is the dataset on which you will apply your model.
  • SampleSubmission.csv – shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the names of the IDs must be correct.
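
As a starting point, a majority-class baseline shows how the three files fit together. This is a sketch under stated assumptions: the target column name `Label` is hypothetical (check the header of Train.csv and SampleSubmission.csv), and the helper names are illustrative.

```python
import csv
import io
from collections import Counter


def majority_label(train_rows, label_column="Label"):
    """Return the most frequent class in the training rows.

    `label_column` is an assumed name; use the actual target column of Train.csv.
    """
    counts = Counter(row[label_column] for row in train_rows)
    return counts.most_common(1)[0][0]


def submission_csv(test_rows, label, id_column="ID", label_column="Label"):
    """Build submission text that assigns the same label to every test ID."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[id_column, label_column])
    writer.writeheader()
    for row in test_rows:
        # IDs are copied unchanged from Test.csv so they match the expected names.
        writer.writerow({id_column: row[id_column], label_column: label})
    return buffer.getvalue()
```

The rows can be loaded with `csv.DictReader` from Train.csv and Test.csv, and the returned text written to a file. Any real model should beat this baseline; if it does not, the preprocessing or label handling is a likely place to look.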

Partners

AI4D-Africa; Artificial Intelligence for Development-Africa Network

Description

Ewe and Fongbe are Niger–Congo languages, part of a cluster of related languages commonly called Gbe. Fongbe is the major Gbe language of Benin (with approximately 4.1 million speakers), while Ewe is spoken in Togo and southeastern Ghana by approximately 4.5 million people as a first language and by a million others as a second language. They are closely related tonal languages, and both contain diacritics that can make them difficult to study, understand, and translate.
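
Diacritics also matter at the byte level: the same accented letter can be stored either as one precomposed code point or as a base letter followed by a combining mark, and the two forms do not compare equal. A minimal sketch of normalizing text before tokenization, using Python's standard `unicodedata` module (the helper names are illustrative):

```python
import unicodedata


def normalize_text(text):
    """Put text into Unicode NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Remove combining marks, e.g. for fuzzy matching or deduplication.

    In tonal languages this discards meaningful information, so it should
    not be applied to the text a translation model is trained to produce.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying `normalize_text` consistently to the training data, the test data and the model output avoids treating visually identical words as different tokens.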

Although those languages are at the core of the economic and social life of at least 3 major West African capital cities (namely Cotonou, Lome and Accra), they are today mostly spoken and very rarely written. Due to that fact (among other reasons), there is very little official or formal communication in those languages, leaving non-French/English speakers often unable to access critical facilities like education, banking, and healthcare. This challenge is part of an initiative that wishes to bring down the barriers between African local language speakers and modern society.

The objective of this challenge is to create a machine translation system capable of converting text from French into Fongbe or Ewe. You may train one model per language or create a single model for both. You may not use any external data, so a key component of this competition is finding a way to work with the available data efficiently.
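
If you choose a single model for both languages, one widely used approach is to prepend a target-language token to each French source sentence so the model knows which language to produce. The sketch below only prepares the training pairs; the tag format `<2...>`, the helper names and the language column name passed as `language_column` are assumptions, so check the column spelling in Train.csv.

```python
def add_target_tag(french, target_language):
    """Prefix a French sentence with a token naming the target language."""
    tag = f"<2{target_language.strip().lower()}>"
    return f"{tag} {french.strip()}"


def make_pairs(rows, language_column="Target_Language"):
    """Turn dataset rows (dicts) into (tagged source, target) training pairs.

    `language_column` is an assumed column name and should be set to
    whatever the language column is called in the downloaded file.
    """
    return [
        (add_target_tag(row["French"], row[language_column]), row["Target"])
        for row in rows
    ]
```

The tag must also be added at inference time, and it should be registered as a single token in the tokenizer vocabulary so it is not split into pieces.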

This is a pioneering competition as far as low-resourced West African languages are concerned. A good solution would be a model that can be improved upon or used by researchers across the world to create APIs that can be integrated into day-to-day tools such as ATMs and delivery applications, and help bridge the gap between rural West Africa and modernized services.

This competition is one of five NLP challenges we will be hosting on Zindi as part of AI4D’s ongoing African language NLP project, and is a continuation of the African language dataset challenges we hosted earlier this year. You can read more about the work here.

About Takwimu Lab (takwimulab.gitlab.io)

Takwimu Lab is an association of francophone West Africans who are professionals in, and enthusiasts of, AI technologies. Our goal is to spread awareness about the challenges AI can help solve in our communities, disseminate knowledge and build solutions that can resolve real issues in our countries. Takwimu Lab is based in Benin.

Data

This is a parallel corpus dataset for machine translation from French to Ewe and French to Fongbe, languages from Togo and Benin respectively. It contains roughly 23,000 French-Ewe and 53,000 French-Fongbe parallel sentences, collected from blogs, tales, newspapers, daily conversations and webpages, and annotated for neural machine translation. The collected sentences were preprocessed and aligned manually.

Variable definitions

  • ID : Unique identifier of the text
  • French : Text in French
  • Target_Language : The target language
  • Target : Text in Fongbe or Ewe

Files available for download:

  • Train.csv – contains parallel sentences for training your model or models. There are 77,177 rows, of which 53,366 are French-Fongbe and 23,811 are French-Ewe
  • Test.csv – resembles Train.csv but without the Target column. This is the dataset on which you will apply your model(s).
  • SampleSubmission.csv – shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the ‘Target’ column containing your translation in Ewe or Fongbe. The order of the rows does not matter, but the IDs must be correct.
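
Because no external data is allowed and the two languages are unevenly represented in Train.csv, it can help to hold out a validation set separately for each language. The following is a sketch, not a prescribed procedure: the function name, the default validation fraction and the language column name are assumptions.

```python
import random
from collections import defaultdict


def split_by_language(rows, language_column="Target_Language", valid_fraction=0.1, seed=0):
    """Split rows into train and validation sets, sampling each language separately.

    `language_column` is an assumed column name; pass the one used in Train.csv.
    """
    rng = random.Random(seed)

    # Group the rows by target language.
    groups = defaultdict(list)
    for row in rows:
        groups[row[language_column]].append(row)

    train, valid = [], []
    for language in sorted(groups):
        items = list(groups[language])
        rng.shuffle(items)
        # Hold out the same fraction of every language.
        n_valid = int(len(items) * valid_fraction)
        valid.extend(items[:n_valid])
        train.extend(items[n_valid:])
    return train, valid
```

Sampling per language keeps the smaller Ewe portion from being under-represented in validation, so translation quality can be tracked for each language on its own.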

Partners

AI4D-Africa; Artificial Intelligence for Development-Africa Network