Swahili(also known as Kiswahili) is one of the most spoken languages in Africa. It is spoken by 100–150 million people across East Africa. Swahili is popularly used as a second language by people across the African continent and taught in schools and universities. Given its presence within the continent and outside, learning Swahili is a popular choice for many language enthusiasts. In Tanzania, it is one of two national languages (the other is English).
News in Swahili is an important part of the media sphere in Tanzania. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many Africa countries. In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.
Objective
Swahili open-source text datasets are not often available in Tanzania that results in being left behind in the creation of NLP technologies to solve African challenges.
The goal of this project is to build an open-source text dataset in the Swahili language focused on News articles. I mainly focus on collecting news at different categories such as Local, International, Business or Financial, health, sports, and entertainment. The dataset will be open-source, and NLP practitioners will be able to access the dataset and learn from it.
Implementation
I was able to implement the following phases of the project in order to achieve the objective of the project.
Collect website with Swahili news: The first phase of the project is to find and collect different websites that provide news in the Swahili language. I was able to find some websites provide news in Swahili only and others in different languages including Swahili.
Understand policy and copyright: In this phase of the project, I mainly focus on understanding their policies and copyrights for each website on what I can do and what I can not do.AI4D helped me to understand this by providing a Data Protection Guidelines to consider for data collection and data mining.
Understand the structure of the news website: Each news website was developed by different web technologies such as PHP, Python, WordPress, Django, javascript e.t.c. The main task is to analyze website source code by using a web browser tool (view page source). I looked at different HTML tags to find news titles, categories, and links to access the content of the particular title.
Data Collection: news articles were collected by using different tools and programming languages. These tools are as follow: Python programming language, Jupyter notebook, Python open-source packages (NumPy, pandas, and BeautifulSoup). The collected news articles were saved in a CSV file (contains the content and the category of particular news e.g sports)
Analyze and Cleanin: The collected news articles were analyzed and cleaned to remove irrelevant information such as HTML tags and symbols that were collected during the scrapping process.
Results
At the end of this project, I was able to achieve the following milestones
Collecting and organizing a total of 40,331 news (with a total number of words = 12,488,239).
I have collected news from different six categories which are local,International,business,health,sports and entertainment
The main challenge is the imbalance of collected news from different categories. For example we have few news focus on international, business and health news.
I would like to extend my thanks to the AI4D(Artificial Intelligence for Development Africa) team and other partners in this AI4D-language dataset fellowship for their support and guidance throughout the project. Also, I have learned a lot from my fellow researchers across Africa who participated in this program to develop datasets in our Africa languages.
In this work, we propose to create a Wolof text to speech dataset. Text To Speech(TTS) dataset is composed of pairs of text and audio, where it’s text is the transcription of the associated audio. But before we deep dive in the process of collecting the dataset, let’s take a look at some interesting facts about Wolof language and why it is important to build such dataset.
Wolof /ˈwoʊlɒf/[4] is a language of Senegal, the Gambia and Mauritania, and the native language of the Wolof people. Like the neighbouring languages Serer and Fula, it belongs to the Senegambian branch of the Niger–Congo language family. Unlike most other languages of the Niger-Congo family, Wolof is not a tonal language.[1]
Wolof is spoken by more than 10 million people and about 40 percent (approximately 5 million people) of Senegal’s population speak Wolof as their native language. Increased mobility, and especially the growth of the capital Dakar, created the need for a common language.
Today, an additional 40 percent of the population speak Wolof as a second or acquired language. In the whole region from Dakar to Saint-Louis, and also west and southwest of Kaolack, Wolof is spoken by the vast majority of the people.
Typically when various ethnic groups in Senegal come together in cities and towns, they speak Wolof. It is therefore spoken in almost every regional and departmental capital in Senegal.[1]
Goal and benefits
Our goal here is to help researchers and companies to have a dataset that they can use to experiment and build automatic systems that can convert text to audio. This type of system can help people with reading problems (e.g blind or illiterate people) to get information and interact with other people or even new technologies(e.g web and mobile applications). There is also the fact that Wolof is not written correctly by most Wolof natifs and non natif, so it becomes difficult for them to read Wolof text, but with a TTS system they could easily be able to understand Wolof text and also learn how Wolof should be written in the process.
Text data collection and preparation
The text collection is the phase of creating clean and representative text that can be used to do the recordings.
Sources: Unlike popular languages, such as english or french, Wolof texts are very scarce on the Internet and in general in digital form, so we had to make more effort just to get the raw text data.
The text used to build this dataset is collected from different sources, such as Wolof website news(sabaal and dekufi), wikipedia and many many texts provided by the Wolof expert in our team.
Cleaning: The cleaning of the text was the most challenging and time consuming task of this work. We had to remove some non Wolof sentences, non used symbols or words, long sentences or paragraphs. There was also manual cleaning of the data.
We also developed with the help of our Wolof expert an algorithm to convert Wolof text to Wolof phonemes. This part is crucial here, because we needed to verify if the text corpus covers correctly the Wolof phonemes, because if phonemes are not covered correctly, the resulting TTS system will have difficulties converting some phonemes to their corresponding sound. After some iteration, we were able to choose sentences that cover all Wolof phonemes with a good distribution with respect to phonemes frequencies .
Where are we: the collection of the text is complete with more than 30 000 sentences cleaned and ready for recording.
Audio data collection
The audio collection is the phase of creating recordings that correspond to the already cleaned text.
The human part: The audio collection is done by two actors, a male and a female voice. Each one needs to record at least 20 000 out of the 30 000 cleaned sentences. We have had some issues on starting the recording due to the time of text cleaning but, we also had some delay with respect to having the microphone and the material resources needed to do the recording. The other problem we had here was building the Web platform that actors use to do the recording, which we will talk about in the next section.
The platform for recording: We forked and modified open source project Common Voice[2], which is a project from Mozilla to help our actors to easily due the recording using just their internet browser. Data is collected and automatically sent to an S3 bucket after each 5 recordings.
Where are we: We had a big delay on the collection of the recordings (the recording started just three weeks ago), as writing this article(2020/10/20), we have collected 4000 out of the 40 000 recordings. But we hope that the recording rate will be higher after the actors are more used to the process. We are expecting to collect at least 1400 recordings per week from the two actors.
Also after the recording is done, we will verify and clean the audio data set, for example, trim silences, duration check, and so on. We will also build à baseline model with this dataset and make it available to the community.
Conclusion
We are really grateful to AI4D for giving us the possibility and the means to build and collect a Wolof TTS dataset and we hope that this kind of initiative will be more frequent to help create more and more dataset for our local languages so that new system and models can be built with them and so increase accessibility to new technologies but also help more people to have access to information in their own language.
The aim of our project is to investigate the technological feasibility of deploying Unmanned Ground Vehicles for automated wildlife patrol, as well as performing a preliminary analysis of other metadata collected from officials at a national park in Kenya. To this end, we seek to collect and publish a dataset of driving data across national park trails in Kenya, the first of its kind, and use deep learning to predict steering wheel angle when driving on these trails.
Khushal Brahmbhatt, Deep Learning and Computer Vision researcher, Autonomous Vehicles
Ronald Ojino Co-researcher (Autonomous Driving Research)/ PhD student (University of Dar es Salaam)/Lecturer – Cooperative University of Kenya
Setting up the data acquisition system
The data collection required a vehicle mounted with a camera to be driven across national park trails while recording the trail video as well as key driving signals such as steering wheel angle, speed and brake and accelerator pedal positions. We began design, installation and configuration of the data collection system in November and December 2019.
The first idea was to procure and attach sensors to the vehicle to obtain these driving signals. But upon further research, it was discovered that most of these driving signals can be read from the CAN bus which is exposed on the OBD-II (On-Board Diagnostics) port on most vehicles manufactured after 2008.
This information however is grouped and encoded within different parameter ids, and it requires reverse engineering to identify each of these driving parameters which is significantly time consuming, an activity that would take months by itself.
Furthermore, not all of the driving signals would be exposed on the CAN bus. The parameters exposed on the bus vary between vehicle manufacturers and models, and so does the encoding. After failing to understand the data read from the CAN bus of our personal vehicles, we decided to find a vehicle model which had already been reverse-engineered.
We were able to identify [1] and procure a Toyota Prius 2012 for the data collection, from which we could read the steering wheel angle, steering wheel torque, vehicle speed, individual wheel speeds and brake and accelerator pedal positions. We used a Raspberry Pi 3 microcomputer with the PiCan hat to read and log the driving signals.
In order to create the dataset for training and testing the learning algorithm, each data sample would have to contain a video frame matched to the corresponding driving signals at that instance. That means all the video frames, as well as the driving signals, have to be timestamped.
The driving signals are automatically timestamped during logging on the Raspberry Pi, but most cameras don’t timestamp the individual frames. Further, the internal clock of the camera would not be in sync with that of the RPi’s, and would cause the video frames and driving signals to also be out of sync when creating the data samples.
That means a camera that could interface to the computer as a webcam would be needed, so each frame can be read and timestamped before being written to the video file. Driving on rough national park trails would also induce a lot of vibrations and require a camera with good stabilization. These were some of the challenges in selecting a camera for recording the driving video.
We settled on the Apeman A80 action camera which has gyro stabilization, HD video recording and can also function as a webcam. OpenCV was used to read and record timestamped video to the computer.
Initially, we tried to connect the camera to the Raspberry Pi itself. But the RPi is a low-powered microcomputer. There was significant lag in recording and could not write the video higher than a frame rate of 8fps. We therefore decided to use a laptop which could comfortably record HD
video at 30fps to connect to the camera, and the RPi for only logging the driving signals from the vehicle’s CAN bus.
This however presented a different challenge of being limited by the laptop battery. While the RPi can be charged using a portable power bank or directly from the car’s charging port, the laptop cannot. That meant significantly shorter data collection runs. We could only drive around continuously for 2 hours before we had to return to charge the laptop which took another 2 hours.
This forced revising down our overall data collection projections from 50 hours to 20 hours, of which 25 hours which was to be on the national park trails was revised down to 10 hours, and the other 10 hours on a mixture of tarmac roads and other rural dirt roads.
There was also extensive testing of different video encoding methods to determine the best filesize versus quality tradeoff, as well as data collection code optimization to ensure minimum lag during the data logging.
Data collection
We began the data collection in January 2020 on tarmac and rural dirt roads. The idea behind this was to train the algorithm on a simpler dataset and then use transfer learning for better faster results on the national park trails. The data was collected at various times of the day: early in the morning, noon and late in the evening in order to get a varied dataset in different lighting conditions.
While we were able to smoothly collect the data on tarmac roads, driving over the rural dirt roads proved impossible as they were marked with potholes. Not only was it challenging to drive a low-body vehicle over the rough terrain, but the constant maneuvers made to go around the potholes meant that most of that data would be unusable as it would present a different challenge altogether in training.
The challenge of driving a low-body vehicle on dirt roads also limited our choices of national parks, as we had to carefully select ones with smooth driving trails. Our plan to collect data from the Maasai Mara National Reserve had to be abandoned due to the bad road conditions there, and we opted to collect data from Nairobi National Park (8 hrs) and Ruma National Park (2.5 hrs) instead. Even these however were not without their setbacks involving a flat tire and bumper damage.
Sample video frames from Nairobi National Park (left, center) and Ruma National Park (right)
Sample video frames from Nairobi National Park (left, center) and Ruma National Park (right)
Sample video frames from Nairobi National Park (left, center) and Ruma National Park (right)
Another challenge faced in the parks was internet connectivity. While a stable internet connection was not needed for the data collection which was done offline, a connection to the internet was needed when starting up the Raspberry Pi to allow it to initialize the correct datetime value.
This is because the RPi microcomputer does not have an internal clock. That means unless it has a connection to the internet, it will resume the clock from the last saved time before it was shut down, hence ending up showing the wrong time. That resulted in incorrect timestamps on the logged driving data that could not be matched to the video timestamps.
This was observed while analyzing the driving data logs from one of the runs at Ruma National Park. Luckily, internet connectivity was regained towards the end of the run and the rest of the timestamps could be calculated correctly using the message baud rates.
Other minor issues faced in obtaining good quality data involved keeping the windshield clean while driving on dusty park trails where one is not allowed to alight from the vehicle, and securely mounting the camera inside the vehicle while driving over rough terrain.
Dataset preparation and Training
A significant portion of the data collected included driving around potholes, overtaking, stopping, U-turns etc. which would not be useful for predicting the steering wheel angle within the scope of this study. All these segments had to be visually identified and removed before
preparing the dataset.
Initially, we proposed to use a simple Convolutional Neural Network (CNN) model for training as in [2], where the steering wheel angle is predicted independently on each video frame as the input. However, the steering angle is also largely dependent on the speed of the vehicle. Driving
is also a stateful process, where the current steering wheel angle is also dependent on the previous wheel position.
We therefore investigated the use of a more sophisticated temporal CNN model as in [3] using recurrent units such as LSTM and Conv-LSTM that could give more promising results. The above model however is very computationally expensive and would require a cluster of very expensive GPUs and still take days to train.
Using this model proved impossible to achieve within the given timeline and budget. We therefore decided to continue with our initial proposal using a static CNN model [2].
Currently we are in the process of building the dataset and learning model for the project. We are also working on preparing a preliminary analysis on the feasibility of automated wildlife patrol [4] based on other metadata collected from park officials.
We are grateful for the immense support that we always get from our mentor Billy Okal who in spite of his busy schedule, gets the time to set up calls whenever we need to consult and always comes up with great ideas that address most of our concerns.
References
[1] C. Miller and C. Valasek, Adventures in Automotive Networks and Control Units, IOActive
Inc., 2014, pp. 92-97.
[2] M. Bojarski et al., End to end learning for self-driving cars, 2016, arXiv:1604.07316.
[3] L. Chi and Y. Mu, Deep steering: Learning end-to-end driving model from spatial and
temporal visual cues, 2017, arXiv:1708.03798.
[4] L. Aksoy et al., Operational Feasibility Study of Autonomous Vehicles, Turkey International
Logistics and Supply Chain Congress, 2016.
Here is an example of a case for which a PDF is available on MalawiLii. Here is an example of a case for which only a scanned image of a pdf is available. We used OCR for more than 90% of data to extract the text for our corpus (see below a description of our corpus).
Please open these files to familiarise yourself with the content of a court criminal judgment. What kind of information we want to extract? For each case we wanted:
Name of the Case
Number of the Case
Year in which the case was filled
Year in which the judgment was given, Court which issued the judgment
Names of Judges
Names of parties involved (appellants and respondents, but you can take this further and extract names of principal witnesses, and names of victims)
References to other Cases
Referencesto Laws/Statues and Codes, and,
Legal keywords which can help us classify the cases according to the ICCS classification.
This project has taught us so much about working with text, preparing data for a corpus, exchange formats for the corpus data, analysing the corpus using lexical tools, and machine learning algorithms for annotating and extracting information from legal text.
Along the way we experimented also with batch OCR processing and different annotation formats such as IOB tagging[1], and the XML TEI[2] standard for sharing and storing the corpus data, but also with the view of using these annotations in sequence-labelling algorithms.
Each has advantages and disadvantages, the IOB tagging does not allow nesting (or multiple labelling for the same element), while an XML notation would allow this but it is more challenging to use in algorithms. We also learned how to build a corpus, and experimented with existing lexical tools for analysing this corpus and comparing it to other legal corpora.
We learned how to use POS annotations and contextual regular expressions to extract some of our annotations for laws and case citations and we generated more than 3000 different annotations. Another interesting thing we learned is that preparing annotated training data is not easy, for example, most algorithms require training examples to be of the same size and the training set needs to be a good representation of the data.
We also experimented with the classification algorithms and topics detection using skitlearn, spacy, weka and mathlab. The hardest task was to prepare the data in the right format and to anticipate how this data will lead to the outputs we saw. We felt that time spent in organising and annotating well is not lost but will result in gains in the second stage of the project when we focus on algorithms.
Most algorithms split the text into tokens, and for us, multi-word tokens (or sequences) are those we want to find and annotate. This means a focus on sequence-labelling algorithms. The added complications which are peculiar to legal text is that most of our key terms belong logically to more than one label, and the context of a term can span multiple chunks (e.g., sentences).
When using LDA (Latent Dirichlet Association) to detect topics in our judgments, it became clear to us that one needs to use a somehow ‘sumarised’ version in which we collapse sequences of words into their annotations (this is because LDA uses term frequency-based measure of keyword relevance, whereas in our text the most relevant words may appear much less frequently than others).
Our work has highlighted to us the benefits and importance of multi-disciplinary cooperation. Legal text has its peculiarities and complexities so having an expert lawyer in the team really helped!
Finding references to laws and cases is made slightly more complicated because of the variety in which these references may appear or because of the use of “hereinafter”. Legal text makes use of “hereinafter”[3], e.g., Mwase Banda (“hereinafter” referred to as the deceased). But this can also happen for references to laws or cases as the following example shows:
Section 346 (3) of the Criminal Procedure and Evidence Code Cap 8:01 (hereinafter called “the Code”) which Wesbon J was faced with in the case of DPP V Shire Trading CO. Ltd (supra) is different from the wording of Section 346 (3) of the Code as it stands now.
Compare extracting the reference to law from “Section 151(1) of the Criminal Procedure and Evidence Code” to extracting from “Our own Criminal Procedure and Evidence Code lends support to this practice in Sections 128(d) and (f)”. We have identified a reasonably large number of different references to laws and cases used in our text! The situation is very similar for case citations. Consider the following variants:
Republic v Shautti , Confirmation case No. 175 of 1975 (unreported)
Republic v Phiri [ 1997] 2 MLR 68
Republic v Francis Kotamu , High Court PR Confirmation case no. 180 of 2012 ( unreported )
Woolmington v DPP [1935] A.C. 462
Chiwaya v Republic 4 ALR Mal. 64
Republic v Hara 16 (2) MLR 725
Republic v Bitoni Allan and Latifi Faiti
Something for you to Do Practically! To play with some annotations and appreciate the diversity in formats, and at the same time the huge savings that a semi-automatic annotation can bring, we have set up a doccano platform for you: you log in here using the user guest and password Gu3st#20.
Annotating with keywords for the purposes of the ICCS classification proved to be even harder. The International Classification of Crime for Statistical Purposes (ICCS)[4] and it is a classification of crimes as defined in the national legislations and comes on several levels each with varying degrees of the specification. We considered mainly the Level 1 and we wanted to classify our judgments according to the 11 types in Level 1 as shown in the Table.
We discovered that this task of classification according to Level 1 requires a lot of work and it is of a significant complexity (and the complexities only grow if we would consider the sublevels of the ICCS). First, the legal expert of our team manually classified all criminal cases of 2019 according to Level 1 ICCS and worked on a correspondence between the Penal Code and the ICCS classification. This is excellent.
We are in the process of extending this to mapping other Malawi laws, codes and statutes that are relevant to criminal cases into the ICCS. This in itself is a whole project on its own for the legal profession and requires processing a lot of text and making ‘parallel correspondences’! Such national correspondence tables are still work in progress in most countries and to our knowledge, our work is the first of such work for Malawi.
Looking at Level 1 of the ICCS meant we were kept very busy. Our research centred on hard and important questions. How to represent our text so that it can be processed efficiently? What kind of data labels are most useful for the ICCS classification? What type of annotations to use (IOB or an xml-based)? What algorithms to employ (Hidden Markov Models or Recurrent Neural Networks or Long Short Term Memory)? But most importantly, we focussed on how to prepare our annotated data to be used with these algorithms?
We need to be mindful that this is a fine classification because we have to distinguish between texts that are quite similar. For example, if we wanted to classify whether a judgment by the type of law it falls under, say whether it is either civil or criminal case, this would have been slightly easier because the keywords/vocabulary used in civil cases would be quite different than that used in criminal cases.
We want to distinguish between types of crimes, and the language used in our judgments is very similar. Within our data set there is the level of difficulty, e.g., theft and murder cases may be easier to differentiate, that is Type 1 and 7 from the table above, than, say, to differentiate between types 1 and 2.
We have the added complication that most text representation models which define the relevance of a keyword as given by its frequency (whether that is TF or TF-IDF) but in our text, a word may appear only once and still be the most significant word for the purpose of our classification. For example, a keyword that distinguishes between type 1 and type 2 murders is “malice aforethought” and this may only occur once in the text of the judgment.
To help with this situation, one can extract first the structure of the judgment and focus only on the part that deals with the sentence of the judge. Indeed, there is research that focuses only on extracting various segments of a judgment.
This may work in many cases because usually the sentence is summarised in one paragraph. But it does not work for all cases. This is so especially when the case history is long, the crime committed has several facets, or the case has several counts, e.g., the murder victim is an albino or a disabled person.
In such situations one needs a combined strategy which uses: (1) An good set of annotated text with meta-data described above; (2) the mapping of the Penal Code/ Laws/Statues relevant to the ICCS; (3) collocations of words/ or a thesaurus and (4) concordances to help us detect clusters and extract relevant portions of the judgments; (5) employing sequence modelling algorithms, e.g., HMM, recurrent neural networks, for annotation and classification.
In the first part of the project, we focussed on the tasks (1) – (4) and experimented to some extent with (5). What we wanted is to find a representation of our text based on all the information at (1) – (4) and attempt to use that in the algorithms we employ.
We have created a training set of over 2500 annotations for references to sections of the law and over 1000 annotations for references to other cases. We are still preparing these so that they are representative of the corpus and are good examples.
And finally but most importantly, while working on this AI4D project, it has brought me in contact with very clever people, whom I would have not otherwise met. We appreciate the support and guidance of the AI4D team!
[3] Hereinafter is a term that is used to refer to the subject already mentioned in the remaining part of a legal document. Hereinafter can also mean from this point on in the document.
[4] United Nations Economic Commission for Europe. Conference of European Statisticians. Report of the UNODC/UNECE Task Force on Crime Classification to the Conference of European Statisticians. 2011. Available: www.unodc.org/documents/data-andanalysis/statistics/crime/Report_crime_classification_2012.pdf>
This research focuses on enhancing Pharmacovigilance Systems using Natural Language Processing on Electronic Medical Records (EMR). Our major task was to develop an NLP model for extracting Adverse Drug Reaction(ADR) cases from EMR. The team was required to collect data from two hospitals, which are using EMR systems (i.e. University of Dodoma (UDOM) Hospital and Benjamin Mkapa (BM) Hospital). During data collection and analysis, we worked with health professionals from the two mentioned hospitals in Dodoma. We also used the public dataset from the MIMIC-III database. These datasets were presented in different formats, CSV for UDOM hospital and MIMIC III and PDF for BM hospital as shown on the attached file.
In most cases, pharmacovigilance practices depend on analyzing clinical trials, biomedical writing, observational examinations, Electronic Health Records (EHRs), Spontaneous Reporting (SR) and social media (Harpaz et al., 2014). As to our context, we considered EMR to be more informative compared to other practices, as suggested by (Luo et al., 2017). We studied schemas of EMRs from the two hospitals. We collected inpatients’ data since outpatients’ would have given the incomplete patient history. Also, our health information systems are not integrated, which makes it difficult to track patients’ full history unless patients were admitted to a particular hospital for a while. From all the data sources used there was a pattern of information that we were looking for, and this included clinical history, prior patient history, symptoms developed, allergies/ ADRs discovered during medication and patient’s discharge summary.
Gloriana Monko
Steven Edward
Zephania Reuben
Waziri Shebogholo
Ibrahimu Mtandu
Much as we worked on UDOM and BM hospitals’ data, we encountered several challenges that made the team focus on MIMIC-III dataset while searching for an alternative way to our local data. Here were the challenges noted:
The reports had no clear identification of ADR cases.
In most cases, the doctor did not mention the reasons for changing a medicine on a particular patient which made it hard to understand whether the medication didn’t work well for a specific patient or any other reasons like adverse reaction.
The justification for ADR cases was vague
Mismatch of information between patients and doctors
The patients talk in a way that doctor can’t understand
There is a considerable gap between the health workers and regulatory authorities (They don’t know if they have to report for ADR cases)
The issue of ADR is so complex since there is a lot to take into account like Drug to Drug, Drug to food and Drug to herbal interactions.
There was no common/consistent reporting style among doctors
The language used to report is hard for a non-specialist to understand.
Some fields were left empty with no single information which led to incomplete medical history
The annotation process prolonged since we had one pharmacologist for the work.
After noticing all these challenges, the team carefully studied the MIMIC-III database to assess the availability of the data with ADR cases which would help to come up with the baseline model to the problem. We discovered that the NoteEvent table has enough information about the patient history with all clear indications of ADR cases and with no ADR see the text.
To start with, we were able to query 100,000 records from the database with many attributes, but we used a text column found in the NoteEvent table with the entire patient’s history including (patient’s prior history, medication, dosage, examination, changes noted during medications, symptoms etc.). We started the annotation of the first group by filtering the records to remain with the rows of interest. We used the following keywords in the search; adverse, reaction, adverse events, adverse reaction and reactions. We discovered that only 3446 rows contain words that guided the team in the labelling process. The records were then annotated with the labels 1 and 0 for ADR and non-ADR cases respectively, as indicated in the filtration notebook.
In analysing the data, we found that there were more non-ADR cases than ADR cases, in which non-ADR cases were 3199 and 228 ADR cases and 19 data rows not annotated. Due to high data imbalance, we reduced Non-ADR cases to 1000, and we applied sampling techniques (i.e upsampling ADR cases to 800) to at least balance the classes to minimize bias during modelling.
After annotation and simple analysis we used NLTK to apply the basic preprocessing techniques for text corpus as follows:-
Converting the corpus-raw sentences to lower cases which helps in other processing techniques like parsing.
Sentence tokenization, due to the text being in paragraphs, we applied sentence boundary detection to segment text to sentence level by identifying sentence starting point and endpoint.
Then we worked with regular contextual expressions to extract information of interest from the documents by removing some of the unnecessary characters and replacing some with easily understandable statements or characters as for professional guidelines.
We removed affixes in tokens which put words/tokens into their root form. Also, we removed common words(stopwords) and applied lemmatization to identify the correct part of speech(s) in the raw text. After data preprocessing, we used Term Frequency Inverse Document Frequency (TF-IDF) from scikit-learn to vectorize the corpus, which also gives the best exact keywords in the corpus.
In modelling to create a baseline model, we worked with classification algorithms using scikit-learn. We trained six different models which are Support Vector Machines, eXtreme Gradient Boosting, Adaptive Gradient Boosting , Decision Trees, Multilayer Perceptron and Random Forest and then we selected three (Support Vector Machine, Multilayer Perceptron and Random Forest )models which performed better on validation compared to other models for further model evaluation. We’ll also use the deep learning approach in the next phase of the project to produce more promising results for the model to be deployed and kept in practice. Here is the link to colab for data pre-processing and modelling.
From the UDOM database, we collected a total of 41,847 patient records in chunks of 16185, 18400, and 7262 from 2017 to 2019 respectively. The dataset has following attributes (Date, Admission number, Patient Age, Sex, Height(Kg), Allergy status, Examination, Registration ID, Patient History, Diagnosis, and Medication ), we downsized it to 12,708 records by removing missing columns and uninformative rows. We used regular contextual expressions to extract information of interest from the documents as for professional guidelines. The data cleaned and exchanged data formatting, analyzing and preparing data for machine learning was elaborated in this Colab link.
On the BM hospital, the PDF files extracted from EMS have patient records with the following information.
Discharge reports
Medical notes
Patients history
Lab notes
Health professionals on the respective hospitals manually annotated the labels for each document, and this task took most of our time in this phase of the project. We’re still collecting and interpreting more data from these hospitals.
The team organizes and extracts information from BM hospital PDF files by exchanging data formatting, analyzing and preparing data for machine learning. We experimented with OCR processing for PDF files to extract data, but we didn’t generate promising results as more information appeared to be missing. We therefore hard to programmatically remove content from individual files and align them to the corresponding professional provided labels.
The big lesson that we have learned up to now is that most of the data stored in our local systems are not informative. Policymakers must set standards to guide system developers during development and health practitioners when using the system.
Lastly but not least, we want to thank our stakeholders, mentors and funders for your involvement in our research activities. It is because of such a partnership we can be able to achieve our main goal.
So why have we decided to collect malaria datasets to assist in developing a solution in its diagnosis? First, Malaria remains one of the significant threats to public health and economic development in Africa. Globally, it is estimated that 216 million cases of malaria occurred in 2017, with Africa bearing the brunt of this burden [5*]. In Tanzania, malaria is the leading cause of morbidity and mortality, especially in children under 5 years and pregnant women. Malaria kills one child every 30 seconds, about 3000 children every day [4*]. Malaria is also the leading cause of outpatients, inpatients, and admissions of children less than five years of age at health facilities [5*].
Martha Shaka
Frederick Apina
Said H Said
Halidi Maneno
Nyamos Waigama
Imani Sulutya
Said Mmaka
Emilian Ngatunga
Simon Chaula
Second, the most common methods to test for malaria are microscopy and Rapid Diagnostic Tests (RDT) [1, 2]. RDTs are widely used, but their chief drawback is that they cannot count the number of parasites. The gold standard for the diagnosis of malaria is, therefore, microscopy. Evaluation of Giemsa-stained thick blood smears, when performed by expert microscopists, provides an accurate diagnosis of malaria [3].
Nonetheless, there are challenges to this method, it consumes a lot of time to perform one diagnosis, requires experienced technologists who are very few in developing countries, and manually looking at the sample via a microscope is a tedious and eye-straining process. We learned that although a microscopic diagnostic is a golden standard for malaria diagnosis, it is still not used in most of the private and public health centers. We realized that some of the lab technologists in health care are not competent in preparing staining reagents used in the diagnosis process. We had to create our own reagents and supply to them for the purpose of this research.
Artificial intelligence is transforming how health care is delivered across the world. This has been evident in pathology detection, surgery assistance and early detection of diseases such as breast cancer. However, these technologies often require significant amounts of quality data and in many developing countries, there is a shortage of this.
Taking images of stained blood smear
A sample of the annotated image( green box represent a plasmodium)
To address this deficiency, my team, composed of 6 computer scientists and 3 lab technologists, collected and annotated 10,000 images of a stained blood smear and developed an open-source annotation tool for the creation of a malaria dataset. We strongly believe the availability of more datasets and the annotation tool (for automating the labeling of the parasites in an image of stained blood smear) will improve the existing algorithms in malaria diagnosis and create a new benchmark.
In the collection of this dataset, we first sought and were granted ethical clearance from the University of Dodoma and Benjamin Mkapa Hospital’s research center. We have collected 50 blood smear samples for patients confirmed with malaria and 50 samples for negative confirmed cases. Each sample was stained by the lab technologist and 100 images were taken using iPhone 6S attached to a microscope. This led to having a total of 5000 images for the positive confirmed patients and 5000 imaged for the negative confirmed patient.
Through this work, we have had several opportunities including attending academic conferences and forming connections with other researchers such as Dr. Tom Neumark, a postdoctoral social anthropologist at the University of Oslo. Through our work, we also met Prof Delmiro Fernandes-Reyes, a professor of biomedical engineering. In a joint venture with Prof Delmiro Fernandes-Reyes, we submitted a proposal for the DIDA Stage 1 African Digital Pathology Artificial Intelligence Innovation Network (AfroDiPAI) at the end of November 2019.
We are also disseminating the results of our research. We have submitted an abstract (on the ongoing project) to two workshops (Practical Machine Learning in Developing Countries and Artificial Intelligence for Affordable Health) for the 2020 ICLR conference in Ethiopia, and it has been accepted to be presented as a poster. We were also delighted to get very constructive feedback from reviewers of the conference and look forward to incorporating them as we continue with the projects and final publication.
The next stage will be to start using our data and train deep learning models in the development of the open-source annotation tool. At the same time, together with the AI4D team, we are looking for the best approach to follow when releasing our open-source dataset in the medical field.
But our overall aim is to develop a final product of our mobile application that will assist lab technologist in Tanzania and beyond in the onerous work of diagnosis malaria. We have already met many of these technologists who are not only excited and eagerly awaiting this tool, but generously helped us as we have gone about developing it.
Links
[1] B.B. Andrade, A. Reis-Filho, A.M. Barros, S.M. Souza-Neto, L.L. Nogueira, K.F. Fukutani, E.P. Camargo, L.M.A. Camargo, A. Barral, A. Duarte, and M. Barral-Netto. Towards a precise test for malaria diagnosis in the Brazilian Amazon: comparison among field microscopy, a rapid diagnostic test, nested PCR, and a computational expert system based on artificial neural networks. Malaria Journal, 9:117, 2010.
[2]Maysa Mohamed Kamel, Samar Sayed Attia, Gomaa Desoky Emam, and Naglaa Abd El Khalek Al Sherbiny, “The Validity of Rapid Malaria Test and Microscopy in Detecting Malaria in a Preelimination Region of Egypt,” Scientifica, vol. 2016, Article ID 4048032, 5 pages, 2016. https://doi.org/10.1155/2016/4048032.
[3]Philip J. Rosenthal*, “How Do We Best Diagnose Malaria in Africa?”: https://doi.org/10.4269/ajtmh.2012.11-0619
A high yielding crop such as tomato with high economic returns can greatly increase smallholder farmers income when well managed. however, it is apparently constrained by the recent invasion of tomato pest Tuta absoluta that is devastating tomato yield. Look at tomato field situation in highly affected areas of arush [Arusha- mp4 video] and Morogoro regions.
To tackle this challenge, our work focus on an early detection and control measure initiatives in order to strengthen phytosanitary capacity and systems to help solve Tuta absoluta devastation using computer vision technique. It should be noted that Tuta absoluta control still rely on low-speed inefficient manual identification and a few on the support of limited number of agriculture extension officers.
Our initial works involved field work and in-house experiments to collect data in areas that are mostly affected by Tuta absoluta. We collected image data in Arusha and Morogoro regions of Tanzania.
As for any computer vision task, getting the right images for the task at hand is sometimes challenging. Regarding our use case, we had to generate our own image data. To accumulate enough data for model training, we have been collecting data since June 2018 and have had four (4) in-house experiments in the target areas. The whole data collection process is shown in this link.
The data collection process involved taking images of tomato inoculated with Tuta absoluta larvae for the first two (2) weeks of tomato growth since transplanting date. Images were taken for each plant on a daily basis. These images are RGB (Red, Green, Blue) photos of high and low resolutions. In order to acquire high resolution images, we used Canon EOS KISS X7 camera with a resolution of 5184 x 3456 pixels and we used mobile phone camera (set to low resolution).
For our previous first in-house experiment, we had encountered some challenge with the data collection process. The inoculated tomatoes were tagged with a red ribbon. Tagging species or target organisms is a common practice in fields such as entomology. We came to realize, that these tagged images couldn’t be included in the dataset for training our models and therefore had to exclude them from our model.
To meet our objectives, we worked on Convolution Neural Network (CNN) based model for a binary classification that could be able to identify tomatoes affected and not affected by Tuta Absoluta using the state-of-art of CNN architectures (VGG16, VGG19, ResNet50, InceptionV3). The results of this task were promising. Primary preprocessing tasks were limited to selecting the suitable images for training CNN model.
We are certain that the images we collected represented real images of small scale farmers’ fields. The images collected had more images with healthy tomato leaves than those inoculated with Tutaabsoluta which implies data imbalance. To reduce the bias our CNN model may encounter towards images with no Tuta absoluta samples, the number of samples per class were selected to create balanced classes during model training.
The main aim of the image data collection process was expected to cover the main tomato growing regions in Tanzania affected mostly by Tuta absoluta, though we ended up obtaining data from only two main areas. Our team is certain that the collected data can be a representative case covering Tanzania situation. Also we had to adopt to local agronomic practices of the two areas.
For instance, we collected data of the commonly grown tomato varieties. The in-house experiment was also carried out following the cropping calendar of the respected two regions. To cover the main two growing season in Arusha, we had to carry out three experiments and one experiment in Morogoro.
During CNN model training, following a typical early detection of pest or disease model approach, we managed to focus on identification of affected and none affected plants. We have successfully been able to develop this type of binary classification model to identify tomato affected by tuta and not affected by tuta.
We further, developed another multiclass classification, that would be used to classify tomato affected at mainly three levels of damage i.e. low, high and no damage. This approach gave us a much better sense of the original idea we had. The model results showed us that to meet an early detection system in determining damage at early stage, a typical quantification based model is much better than a binary classification model.
For instance, results of the multiclass model showed us that tomatoes that are highly damaged are easily identified compared to lowly damage tomato. In such case, it would be best to identify tomato damage at early stage i.e at low damage level in order to enhance early control measures for Tuta absoluta.
And this point, we are to redefine the model classification approach. Since the objective is early identification and if a simple classification model cannot perform such a task, this puts us at risk. With that in mind, we are further working on models that can identify Tuta absoluta mine density, a quantification method based on instance segmentation.
The projectproposes expanding a framework on categorizing parliamentary bills in Nigeria using Optical Character Recognition (OCR), document embedding and recurrent neural networks to three other countries in Africa: Kenya, Ghana, and South Africa.
His work was accepted at 4th Widening NLP Workshop, Annual Meeting of the Association for Computational Linguistics, ACL 2020.
In most African countries, perhaps more so in Africa than elsewhere, the majority of the populations do not speak the official languages; instead, they speak traditional languages. In some countries, this proportion is as high as 80%. Because of this language barrier, this large part of the population is practically excluded from the march of society: they have no access to information or education and cannot really participate in the debates on the socio-economic development of their country.
From another point of view, our values, cultures, knowledge of all kinds and history are conveyed orally in these languages and thus remain inaccessible to the rest of the world.
Dr Aminata SABANE, Université Joseph Ki-Zerbo
Teg-wende Idriss Tinto
Objectives
The main objective of the Preservation of Indigenous Languages project is to contribute to the preservation of local languages and the enhancement of local language content through (1) archiving, (2) promotion and (3) popularization of local language content. Archiving will make it possible to preserve content and knowledge in local languages. We will collect and use existing data in local languages for this purpose. The promotion will be done by exploiting the richness of this local language content. And popularisation will be made possible by making this content accessible in the official languages. In order to achieve these objectives, our project is divided into three parts, all of which have an important upstream data collection and pre-processing stage:
Transcription from local languages to text in local languages
Translation from local languages to official languages (French) and vice versa
Voice synthesis of texts in local languages into audios in local languages.
Team
To successfully carry out the project, we have set up a dedicated team of 10 people:
A research mentor with a background in AI,
Two practice mentors with a background in local languages. The first one is a specialist of education in local languages and the second one is with various works in translation from French to Moore, the main local language in Burkina Faso.
A research assistant with a background in linguistic. In this case, the assistant was a student whose responsibility was to help on the collection of content in languages, pre-treatement of data,
Three computer programmers. In this case, the programmers were computer science students (master and PhD students). Each of them has in charge one of the three parts of the project plus some pretreatment tasks.
Implementation
For this project, we limited ourselves to one local language, Mooré. This language is the main language of Burkina Faso and is spoken by more than half of the population. There are also many sources of data in this language and important work has already been done on translations from French into this language, especially in the educational and religious fields.
(0)Data Collection: As announced, data collection is an important and necessary step for the different parts of the project. It is also one of the most difficult steps. The opening of data is not yet compulsory in our countries.
With the invaluable help of practice mentors, meetings were organised with the main institutions, both public and private, to explore existing data and the extent to which these data could be exploited.
Among the institutions that were contacted, the main ones are the following:
Fondation pour le Développement Communautaire/ Burkina Faso(FDC-BF);
the biblical alliance of Burkina Faso;
Fonds pour l’alphabétisation et l’éducation non formelle (FONAENF);
The Directorate of Research in Non-Formal Education (DRENF);
The DPDMT;
Ecole et langue nationale en Afrique (ELAN);
Savane Media.
We were thus able to access a certain amount of data but not always in digital format or not always complete. This required an enormous amount of pre-processing work either to put the data in digital format or to complete it either with translations or transcriptions.
One of the first sources of data we had access to was the Moore Bible in text and audio. It is this source that was also used after pre-processing (audio cutting sentence by sentence or verse by verse, alignment of Moore and French texts) for the first tests for the different parts of the project.
The collection and pre-processing work is still in progress to enrich our data sources and improve our models.
(1)Transcription: Since writing is not yet very popular in our local languages, we have a large amount of data in local languages in audio format. In addition, people who cannot write will always use oral communication to express themselves. The step of transcribing the audio content into local languages is an essential step to not only collect existing information but also to gather what people have to say.
After a state of the art and testing of existing transcription tools, the student in charge of this part implemented his transcription model based on the DeepSeepch tool. He uses data from the bible for these tests. In addition to the workload for pre-processing and the working conditions made a bit difficult because of the Covid19 pandemic, we unfortunately had problems with computing capacity and are working with one of the partners to increase the capacities of the leased Virtual Machines.
(2)Translation: Translation is at the heart of this project. It aims to make official language information accessible to people in rural areas but also to provide access to the wealth of local language content.
The student in charge of this component has, after a state of the art of existing translation approaches, applied classical neural machine translation techniques on bible data using OpenMT. But the results were not very good as one could expect given the lack of training data. So he is now implementing meta-learning using the Meta-NMT tool. Meta-learning has been described in the literature as performing better than the classical approach when there is little data.
Here, too, in addition to the need for more data, we face a need for computing capacity that should also be resolved with the provision of VMs.
(3) Voice synthesis: Voice synthesis will make it possible, after translation from the official languages into local languages, to make the content available to populations who cannot read but who will be able to have it in audio format. The student in charge of this part also carried out a state of the art of existing tools in this field. He is currently testing different tools and studying different models. He, unfortunately, started with a little delay but will continue his work in order to be able to adapt a model and to make tests with the collected data in order to be able to carry out the vocal synthesis of the text in mooré audio.
Results
At this stage, while we just crossed the mid-term of the project execution, we can report that a number of milestones have been achieved:
Data collection has been done and is still ongoing.
Pre-processing of audio and text content as well as audio and text mapping in Mooré and alignment of text in Mooré et al correspondence in French have been performed.
A transcription model for Mooré to French based on deepSpeech has been implemented.
The classical translation has been implemented and tested on the Bible dataset
Main challenges
Access to Data
After going through about ten structures, we were confronted with the availability of resources. Indeed, apart from the Bible, some training materials and official documents translated, there were very few documents available in Moore and French.
The structures that produce Moore content, most often do so for training or awareness-raising for the illiterate population. As a result, they do not produce the same content in French. As for radio and television channels, they have interventions directly in Moore, without written notes, even for the presentation of the television news.
However, we found a lot of printed material, without digital versions and only in Moore. For this phase of the project, we collected and carried out the alignment for the already existing data in both languages in digital format. This allowed us to test the model, and although it did not lead to conclusive results, we did identify the problem of data availability. For further work, we plan to translate the existing documents into Moore so that we have both versions to continue the work. We are aware that this is a long term work, but it is the indispensable condition to have enough data to make the results of the algorithms interesting.
Copyright
A second problem we encountered was copyright. Indeed, we do not always have direct access to the authors, and the holders of the documents are reserved to share them without their agreement. In other cases, the documents had been commissioned by international organizations. It was therefore necessary for our interlocutors here to have the agreement of these institutions before giving us access to the data. This takes time and has delayed access to the working data.
In the long term, we plan to bring together a group of authors to raise their awareness of the project so that they can facilitate advocacy for the project.
Computing capacity
We unfortunately do not have a laboratory equipped with servers powerful enough to run our models. Our partnership with Anptic was supposed to allow us to use VMs with greater capacity to go faster in testing, but the administrative burden also delayed the availability of VMs.
Knowledge 4 All Foundation partnered with the Deep Learning Indaba to fund research projects across Africa that are collaborative at heart and have a strong development focus.
This Call for Proposals invites individuals, grassroots organizations, initiatives, academic, and civil society institutions to apply for funding for mini-projects.
A mini-project could also be early-stage research around our Grand Challenge of curing leishmaniasis.