Knowledge 4 All Foundation Completes NLP Projects with Lacuna Fund, Transitions Efforts to Deep Learning Indaba Charity

The Knowledge 4 All Foundation is pleased to announce the successful completion of its Natural Language Processing (NLP) projects under the Lacuna Fund initiative. These projects aimed to develop open and accessible datasets for machine learning applications, focusing on low-resource languages and cultures in Africa and Latin America.

The portfolio includes impactful initiatives such as NaijaVoice, which focuses on creating datasets for Nigerian languages; Masakhane Natural Language Understanding, which advances NLU capabilities for African languages; and Masakhane Domain Adaptation in Machine Translation, which targets improved domain-specific machine translation systems. Through these efforts, the Foundation has helped African researchers and research institutions create inclusive datasets that address critical needs in these regions.

As part of a strategic transition, the Foundation has entrusted the continuation and expansion of these initiatives to the Deep Learning Indaba charity. The Deep Learning Indaba, dedicated to strengthening machine learning and artificial intelligence across Africa, is well-positioned to build upon the groundwork laid by Knowledge 4 All. The Foundation extends its gratitude to the Deep Learning Indaba for taking over these projects and is confident that its expertise will further the mission of fostering inclusive and representative AI development.

AI4D blog series: The First Tunisian Arabizi Sentiment Analysis Dataset

Motivation

On social media, Arabic speakers tend to express themselves in their own local dialect. Tunisians write in "Tunisian Arabizi": the Tunisian dialect transliterated into Latin script, with numerals standing in for Arabic letters that have no direct Latin equivalent.
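
To make the convention concrete, here is a minimal sketch of a few numeral-to-letter substitutions commonly seen in Arabizi. The mapping and the example word are illustrative assumptions, not taken from the TUNIZI documentation, and actual usage varies by writer.

```python
# Illustrative only: common Arabizi numeral-to-letter conventions.
# This mapping is an assumption for demonstration, not iCompass's spec;
# usage varies by writer and region.
ARABIZI_DIGITS = {
    "2": "ء",  # hamza (glottal stop)
    "3": "ع",  # ayn
    "5": "خ",  # kha
    "7": "ح",  # ha
    "9": "ق",  # qaf (a common Maghrebi convention)
}

# Example: in "3asslema" (a Tunisian greeting), the "3" stands for the
# initial ayn sound, which has no direct Latin-letter equivalent.
for digit, letter in ARABIZI_DIGITS.items():
    print(f"{digit} -> {letter}")
```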

On the African continent, analytical studies based on Deep Learning are data-hungry, yet, to the best of our knowledge, no annotated Tunisian Arabizi dataset exists.

Twitter, Facebook and other micro-blogging systems are becoming a rich source of feedback information in several vital sectors, such as politics, economics, sports and other matters of general interest. Our dataset is taken from people expressing themselves in their own Tunisian Dialect using Tunisian Arabizi.

TUNIZI is composed of instances presented as text comments collected from social media, each annotated as positive, negative or neutral. The data does not include any confidential information; however, negative comments may include offensive or insulting content.
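
As a quick sketch of how such a dataset might be loaded and inspected, the snippet below assumes a CSV file with "text" and "label" columns; the file name and schema are hypothetical, not the repository's actual layout.

```python
import pandas as pd

# Hypothetical file name and column names ("text", "label"); adjust to
# match the actual TUNIZI release on GitHub.
df = pd.read_csv("tunizi.csv")

# Class distribution over positive / negative / neutral labels.
print(df["label"].value_counts())

# A few raw Arabizi comments.
print(df["text"].sample(3))
```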

The TUNIZI dataset is used in all iCompass products that handle the Tunisian dialect: a sentiment analysis project dedicated to e-reputation monitoring, as well as Tunisian chatbots that understand Tunisian Arabizi and reply in it.

Team

The TUNIZI dataset was collected, preprocessed and annotated by the iCompass team, a Tunisian startup specialized in NLP/NLU. The team, composed of academics and engineers specialized in information technology, mathematics and linguistics, was dedicated to ensuring the success of the project. iCompass can be contacted by email or through its website: www.icompass.tn

Implementation

  1. Data Collection: TUNIZI was collected from comments on social media platforms. All data was directly observable and did not need to be inferred from other data. The dataset captures people expressing themselves in their own Tunisian dialect using Arabizi, and covers Tunisians of different regions, ages and genders. It was collected anonymously and contains no information about users' identities.
  2. Data Preprocessing & Annotation: TUNIZI was preprocessed by removing links, emoji symbols and punctuation; a sketch of this step is given after this list. Annotation was then performed by five Tunisian native speakers, three male and two female, all at a higher education level (Master/PhD).
  3. Distribution and Maintenance: The TUNIZI dataset is made public on GitHub for all upcoming research and development activities. It is maintained by the iCompass team, which can be contacted by email or through the GitHub repository; updates will be available at the same GitHub link.
  4. Conclusion: As interest in Natural Language Processing, particularly for African languages, is growing, a natural next step would be building Arabizi datasets for other underrepresented North African dialects such as Algerian and Moroccan.
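
Below is a minimal sketch of the preprocessing described in step 2: stripping links, emoji symbols and punctuation. The regular expressions are assumptions chosen for illustration, not iCompass's actual pipeline.

```python
import re
import string

# Matches http(s) links and bare "www." links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Covers the main emoji and symbol blocks; an approximation, not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F000-\U0001F02F"
    "\U00002600-\U000026FF\U00002700-\U000027BF]+"
)
# Deletes ASCII punctuation while keeping Arabizi numerals intact.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(comment: str) -> str:
    comment = URL_RE.sub(" ", comment)
    comment = EMOJI_RE.sub(" ", comment)
    comment = comment.translate(PUNCT_TABLE)
    return " ".join(comment.split())  # collapse leftover whitespace

print(preprocess("ya3tik sa7a :) http://example.com !!"))  # -> "ya3tik sa7a"
```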