Description

This dataset is part of a 3-4 month Fellowship Program within the AI4D – African Language Program, which was conceptualized as part of a roadmap to work towards better integration of African languages on digital platforms, in aid of lowering the barrier of entry for African participation in the digital economy.

This particular dataset is being developed through a process covering a variety of languages and NLP tasks, in particular a Sentiment Analysis dataset of Arabizi.

Language profile: Tunisian Arabizi


Overview

On social media, users tend to express themselves in their own local dialect. To do so, Tunisians use Tunisian Arabizi, which consists of writing in the Latin script supplemented with numerals rather than using the Arabic alphabet. According to [7], 81% of Tunisian comments on Facebook used the Romanized alphabet.

In [8], a study of 1.2M Tunisian social media comments (16M words and 1M unique words) showed that 53% of the comments used the Romanized alphabet, 34% used the Arabic alphabet, and 13% used script-switching.

The study also reported that 87% of the comments written in the Romanized alphabet are TUNIZI, while the rest are French and English. Our dataset, TUNIZI, consists entirely of Tunisian Arabizi sentences collected from people expressing themselves in their own local dialect using Latin characters and numerals. TUNIZI is a Tunisian Arabizi sentiment analysis dataset that has been collected, preprocessed, and annotated.
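The script categories reported in [8] (Romanized, Arabic, script-switching) can be approximated with a simple character test. The following is a minimal sketch, not the method used in [8]: the function names, the label names, and the choice of Unicode ranges are assumptions made for illustration.

```python
import re

# Assumption: the basic Arabic Unicode block is enough to detect Arabic script.
ARABIC = re.compile(r"[\u0600-\u06FF]")
# Assumption: only ASCII letters count as Latin; accented letters are ignored.
LATIN = re.compile(r"[A-Za-z]")


def classify_script(comment):
    """Label a comment as 'romanized', 'arabic', 'mixed', or 'other'."""
    has_arabic = bool(ARABIC.search(comment))
    has_latin = bool(LATIN.search(comment))
    if has_arabic and has_latin:
        return "mixed"  # a rough proxy for script-switching
    if has_arabic:
        return "arabic"
    if has_latin:
        return "romanized"
    return "other"  # e.g. digits, punctuation, or emoji only


def script_shares(comments):
    """Return the fraction of comments that fall under each script label."""
    counts = {}
    for comment in comments:
        label = classify_script(comment)
        counts[label] = counts.get(label, 0) + 1
    total = len(comments)
    return {label: n / total for label, n in counts.items()} if total else {}
```

Note that a "romanized" label only identifies the script. Separating TUNIZI from French or English comments written in the same script would need an additional language identification step, which this sketch does not attempt.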

Previous projects on Tunisian Dialect

In [1], a lexicon-based sentiment analysis system was used to classify the sentiment of Tunisian tweets. The author developed a Tunisian morphological analyzer to produce linguistic features and achieved an accuracy of 72.1% on the small TAC dataset (800 Arabic-script tweets).

[2] presented a supervised sentiment analysis system for Tunisian Arabic-script tweets. With different bag-of-words schemes used as features, binary and multiclass classifications were conducted on a Tunisian Election dataset (TEC) of 3,043 positive/negative tweets combining MSA and Tunisian dialect. The support vector machine produced the best results for binary classification, with an accuracy of 71.09% and an F-measure of 63%.

In [3], the doc2vec algorithm was used to produce document embeddings of Tunisian Arabic and Tunisian Romanized alphabet comments. The generated embeddings were used to train a Multi-Layer Perceptron (MLP) classifier, which achieved both an accuracy and an F-measure of 78% on the TSAC (Tunisian Sentiment Analysis Corpus) dataset. This dataset combines 7,366 positive/negative Tunisian Arabic and Tunisian Romanized alphabet Facebook comments. The same dataset was used in [5] to evaluate Tunisian code-switching sentiment analysis using an LSTM-based RNN model, reaching an accuracy of 90%.

In [4], the authors studied how Tunisian sentiment classification performance changes when it is combined with other Arabic-based preprocessing tasks (named entity tagging, stopword removal, common emoji recognition, etc.). A lexicon-based approach and a support vector machine model were used to evaluate performance on the above-mentioned TEC and TSAC datasets.

To avoid the labor-intensive task of hand-crafting features, [6] proposed a syntax-ignorant n-gram embedding representation, composed and learned using an unordered composition function and a shallow neural model. The proposed model, called Tw-StAR, was evaluated on sentiment prediction across five Arabic dialect datasets, including the TSAC dataset [3].

We observe that none of the existing Tunisian sentiment analysis studies focused on the Tunisian Romanized alphabet, which is the aim of this work.

Tunisian Arabizi vs Arabic Arabizi

Tunisian dialect, also known as “Tounsi” or “Derja”, is different from Modern Standard Arabic (MSA). In fact, Tunisian dialect features Arabic vocabulary spiced with words and phrases from Tamazight, French, Turkish, Italian and other languages [9]. Tunisia is recognized as a high-contact culture where online social networks play a key role in facilitating social communication [10].

To illustrate, some examples of Tunisian Arabizi words translated to MSA and English are presented in Table 1.

 

TUNIZI | MSA translation | English translation
3asslema | مرحبا | Hello
Chna7welek | كيف حالك | How are you
Sou2el | سؤال | Question
5dhit | أخذت | I took

Table 1: Examples of TUNIZI common words translated to MSA and English

Since some Arabic characters do not exist in the Latin alphabet, Tunisians use numerals and multigraphs, rather than letters with diacritics, when they write on social media. For instance, “ch” is used to represent the character ش. An example is the word شرير (wicked), written as “cherrir” in TUNIZI.

From our observations of the collected datasets, we noticed that the Arabizi used by Tunisians is slightly different from that of other informal Arabic dialects such as Egyptian Arabizi. This may be due to the linguistic situation specific to each country. In fact, Tunisians generally draw on their French background when writing in Arabizi, whereas Egyptians would use English. For example, the word مشيت would be written as “misheet” in Egyptian Arabizi, the second language being English. However, because the second language of Tunisians is French, the same word would be written as “mchit”.

In Table 2, numerals and multigraphs are used to transcribe TUNIZI characters that compensate for the absence of equivalent Latin characters for exclusively Arabic sounds. They are listed with their corresponding Arabic characters and the Arabizi characters used in other countries. For instance, the number 5 is used to represent the character خ in the same way as the multigraph “kh”. For example, the word “5dhit” is the representation of the word أخذت as shown in Table 1.

The numerals and multigraphs used in TUNIZI differ from those used in Arabizi elsewhere. As an example, the word غالية (expensive), written as “ghalia” or “8alia” in TUNIZI, corresponds to “4’alia” in Arabizi.

 

Arabic | Arabizi | TUNIZI
ح | 7 | 7
خ | 5 or 7’ | 5 or kh
ذ | d’ or dh | dh
ش | $ or sh | ch
ث | t’ or th or 4 | th
غ | 4’ | gh or 8
ع | 3 | 3
ق | 8 | 9

Table 2: Special TUNIZI characters and their corresponding Arabic and Arabizi characters
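The TUNIZI column of Table 2 can be expressed as a lookup table. The sketch below is illustrative only: it flags the numerals and multigraphs of Table 2 in a word and reports the Arabic character each one stands for. It does not transliterate whole words, and the names `TUNIZI_TO_ARABIC` and `find_special_graphemes` are hypothetical. It also assumes that a two-letter sequence such as “th” always acts as a multigraph, which will not hold for every word.

```python
# TUNIZI column of Table 2: numeral or multigraph -> Arabic character.
TUNIZI_TO_ARABIC = {
    "7": "ح",
    "5": "خ",
    "kh": "خ",
    "dh": "ذ",
    "ch": "ش",
    "th": "ث",
    "gh": "غ",
    "8": "غ",
    "3": "ع",
    "9": "ق",
}


def find_special_graphemes(word):
    """Return (grapheme, Arabic character) pairs found in a TUNIZI word."""
    text = word.lower()
    found = []
    i = 0
    while i < len(text):
        # Try two-character multigraphs before single characters.
        for size in (2, 1):
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in TUNIZI_TO_ARABIC:
                found.append((chunk, TUNIZI_TO_ARABIC[chunk]))
                i += size
                break
        else:
            # No entry of Table 2 starts here; move to the next character.
            i += 1
    return found
```

For the Table 1 word “5dhit”, this returns the pairs for “5” (خ) and “dh” (ذ), matching the two corresponding letters of أخذت.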

TUNIZI Uses

The TUNIZI dataset can be used for sentiment analysis projects dedicated to other underrepresented Maghrebi dialects, such as Libyan, Moroccan or Algerian, because of the similarities between these dialects. It can also be used for other NLP projects, such as chatbots.

TUNIZI in the industry

The TUNIZI dataset is used in all iCompass products that use the Tunisian dialect. TUNIZI is used in a sentiment analysis project dedicated to e-reputation, and in all Tunisian chatbots that are able to understand Tunisian Arabizi and reply in it.

Researcher Profile: Chayma Fourati

Chayma Fourati is an AI R&D Engineer at iCompass. She graduated in Software Engineering (June 2020) from the Mediterranean Institute of Technology in Tunisia. She completed her final year project at iCompass, where she participated in most of the R&D projects. In March 2020, during the Covid-19 crisis, she was invited to speak at a webinar about African IT solutions for fighting Covid-19 through the latest AI technologies.

During her last academic years, in both internships and university classes, she developed her skills in the AI field and, at iCompass, in the NLP field. During her final year internship at iCompass, she published a paper with two teammates in an ICLR 2020 workshop. Her current research interests include Natural Language Processing, Neural Networks and Deep Learning.

Researcher Profile: Hatem Haddad

Hatem Haddad is Co-Founder, CTO and R&D director of iCompass. He received a doctorate in Computer Science (2002) from University Grenoble Alpes, France. He has held assistant professor positions at Grenoble Alpes University (France), NTNU (Norway), UAEU (UAE), Sousse University (Tunisia), Mevlana University (Turkey) and ULB (Belgium). He worked in industrial R&D at VTT Technical Research Centre of Finland and at the Institute for Infocomm Research, Image Processing and Applications Lab, in Singapore.

He was an invited researcher at Leibniz-Fachhochschule School of Business (Germany) and Polytechnic Institute of Coimbra (Portugal). His current research interests include Natural Language Processing, Machine Learning and Deep Learning. He is author or co-author of more than 50 papers published in peer-reviewed international journals and conferences, and a frequent reviewer for international journals, conferences and R&D projects.

Researcher Profile: Malek Naski

Malek Naski is currently a summer intern at iCompass. She will graduate in June 2021 as a software engineer from the National School of Engineering of Tunis (ENIT). Previously, she did her academic end-of-year project for 2019/2020 at iCompass, working on sentiment analysis and classification for the Tunisian dialect using state-of-the-art NLP methods and technology. She is now focusing on natural language processing and natural language understanding, and her current research interests include sentiment analysis and conversational agents.

Partners

Partners in Cracking the Language Barrier for a Multilingual Africa

Disclaimer

The designations employed and the presentation of material on these maps do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or any area or of its authorities, or concerning the delimitation of its frontiers or boundaries. Final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. Final status of the Abyei area is not yet determined.