Knowledge 4 All Foundation is pleased to announce the successful completion of its participation in the European Learning and Intelligent Systems Excellence (ELISE) project, a prominent European Network of Artificial Intelligence Excellence Centres. ELISE, part of the EU Horizon 2020 ICT-48 portfolio, originated from the European Laboratory for Learning and Intelligent Systems (ELLIS) and concluded in August 2024.
Throughout the project, Knowledge 4 All Foundation collaborated with leading AI research hubs and associated fellows to advance high-level research and disseminate knowledge across academia, industry, and society. The Foundation contributed to various initiatives, including mobility programs, research workshops, and policy development, aligning with ELISE’s mission to promote explainable and trustworthy AI outcomes.
The Foundation’s involvement in ELISE has reinforced its commitment to fostering innovation and excellence in artificial intelligence research. By engaging in this collaborative network, Knowledge 4 All Foundation has played a role in positioning Europe at the forefront of AI advancements, ensuring that AI research continues to thrive within open societies.
Knowledge 4 All Foundation (K4A) is also pleased to announce the successful completion of its engagement in the HumanE AI Network, another prominent European Network of Artificial Intelligence (AI) Excellence Centres. This initiative has been instrumental in advancing human-centric AI research and fostering collaboration across Europe.
The HumanE AI Network, comprising leading European research centres, universities, and industrial enterprises, has focused on developing AI technologies that align with European ethical values and societal norms. K4A’s participation in this network has contributed to shaping AI research directions, methods, and results, ensuring that AI advancements are beneficial to individuals and society as a whole.
K4A remains committed to advancing AI research and development, building upon the foundations established through these collaborations. The foundation looks forward to future opportunities to contribute to the global AI community and to promote the responsible and ethical development of AI technologies.
The Knowledge 4 All Foundation is pleased to announce the successful completion of its Natural Language Processing (NLP) projects under the Lacuna Fund initiative. These projects aimed to develop open and accessible datasets for machine learning applications, focusing on low-resource languages and cultures in Africa and Latin America.
The portfolio includes impactful initiatives such as NaijaVoice, which focuses on creating datasets for Nigerian languages, Masakhane Natural Language Understanding, which advances NLU capabilities for African languages, and Masakhane Domain Adaptation in Machine Translation, targeting improved domain-specific machine translation systems. The Foundation’s efforts have significantly contributed to assisting African researchers and research institutions in creating inclusive datasets that address critical needs in these regions.
As part of a strategic transition, the Foundation has entrusted the continuation and expansion of these initiatives to the Deep Learning Indaba charity. The Deep Learning Indaba, dedicated to strengthening machine learning and artificial intelligence across Africa, is well-positioned to build upon the groundwork laid by Knowledge 4 All. The Foundation extends its gratitude to the Deep Learning Indaba charity for taking over these projects and is confident that their expertise will further the mission of fostering inclusive and representative AI development in the future.
On social media, Arabic speakers tend to express themselves in their own local dialect. To do so, Tunisians use “Tunisian Arabizi”, which consists of writing the dialect in the Latin script, supplemented with numerals in place of sounds the Latin alphabet lacks, rather than in the Arabic alphabet.
On the African continent, analytical studies based on Deep Learning are data-hungry. To the best of our knowledge, no annotated Tunisian Arabizi dataset exists.
Twitter, Facebook and other micro-blogging systems are becoming a rich source of feedback information in several vital sectors, such as politics, economics, sports and other matters of general interest. Our dataset is taken from people expressing themselves in their own Tunisian Dialect using Tunisian Arabizi.
TUNIZI is composed of text comments collected from social media, each annotated as positive, negative or neutral. The data does not include any confidential information; however, negative comments may include offensive or insulting content.
The TUNIZI dataset is used in all iCompass products that handle the Tunisian dialect: a sentiment analysis project dedicated to e-reputation monitoring, and the Tunisian chatbots that understand Tunisian Arabizi and reply in it.
Team
The TUNIZI dataset was collected, preprocessed and annotated by the iCompass team, a Tunisian startup specialized in NLP/NLU. The team, composed of academics and engineers specialized in information technology, mathematics and linguistics, was fully dedicated to ensuring the success of the project. iCompass can be contacted by email or through its website: www.icompass.tn
Implementation
Data Collection: TUNIZI was collected from comments on social media platforms. All data was directly observable and did not need to be inferred from other data. Our dataset comes from people expressing themselves in their own Tunisian dialect using Arabizi, and it relates directly to Tunisians of different regions, ages and genders. The dataset was collected anonymously and contains no information about users’ identities.
Data Preprocessing & Annotation: TUNIZI was preprocessed by removing links, emoji symbols and punctuation. Annotation was then performed by five Tunisian native speakers, three males and two females at a higher education level (Master/PhD).
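As an illustration, the preprocessing described above (removing links, emoji symbols and punctuation) could be sketched in Python with regular expressions. The function and sample comment below are hypothetical, not iCompass’s actual pipeline; note that the numerals Arabizi relies on are deliberately kept:

```python
import re

def preprocess(comment: str) -> str:
    """Strip links, emoji symbols and punctuation from a raw comment."""
    # remove links first, so their punctuation does not leave fragments behind
    comment = re.sub(r"https?://\S+|www\.\S+", " ", comment)
    # drop emoji and punctuation: keep only word characters (letters,
    # digits, underscore) and whitespace, preserving Arabizi numerals
    comment = re.sub(r"[^\w\s]", " ", comment)
    # collapse the gaps left by the removals
    return re.sub(r"\s+", " ", comment).strip()

sample = "ya3tik essa7a!! 🙂 chouf www.example.com"
print(preprocess(sample))  # -> "ya3tik essa7a chouf"
```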
Distribution and Maintenance: The TUNIZI dataset is made public on GitHub for all upcoming research and development activities. TUNIZI is maintained by the iCompass team, which can be contacted by email or through the GitHub repository. Updates will be available at the same GitHub link.
Conclusion: As interest in Natural Language Processing, particularly for African languages, is growing, a natural next step would be to build Arabizi datasets for other underrepresented North African dialects such as Algerian and Moroccan.
We set out with a novel idea: to develop an application that would (i) collect an individual’s Blood Pressure (BP) and activity data, and (ii) make future BP predictions for the individual from this data.
The key requirements for this study therefore were:
The ability to get the BP data from an individual.
The ability to get a corresponding record of their activities for the BP readings.
The identification of a suitable Machine Learning (ML) Algorithm for predicting future BP.
Dr. Moses Thiga, Kabarak University, School of Science, Engineering and Technology
Ms. Daisy Kiptoo, Kabarak University, School of Science, Engineering and Technology
Dr. Pamela Kimeto, Kabarak University, School of Medicine and Health Sciences
Pre-test the idea – Pre-testing the idea was a critical first step in our process before we could proceed to collect the actual data. The data collection process would require the procurement of suitable smart watches and the development of a mobile application, both of which are time-consuming and costly activities. At this point we learnt our first lessons: (i) there was no precedent for what we were attempting and, consequently, (ii) there were no publicly available BP datasets for pre-testing our ideas.
Simulate the test data – The implication was that we had to simulate data based on the variables identified for our study: the systolic and diastolic BP readings, activity and a timestamp. This was done using a spreadsheet, and the data was saved as a comma-separated values (CSV) file, a common format for storing data in ML.
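A minimal sketch of this simulation step in Python, using the four variables named above. The file name, value ranges and activity labels are illustrative, not those used in the study:

```python
import csv
import random
from datetime import datetime, timedelta

# hypothetical activity labels and reading cadence (one reading per 30 min)
activities = ["sleeping", "sitting", "walking", "running"]
start = datetime(2020, 1, 1, 6, 0)

rows = []
for i in range(48):
    rows.append([
        (start + timedelta(minutes=30 * i)).isoformat(),
        random.choice(activities),
        random.randint(105, 140),   # systolic, plausible illustrative range
        random.randint(65, 90),     # diastolic
    ])

with open("bp_simulated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "activity", "systolic", "diastolic"])
    writer.writerows(rows)
```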
Identify a suitable ML model – Both the simulated data and the data in the final study would be time series data. The need to predict both systolic and diastolic BP using previous readings, activity and timestamps meant that we were handling multivariate time series data. We therefore tested and settled on an LSTM model for multivariate time series forecasting, based on a guide by Dr Jason Brownlee (https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/).
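The first step in the referenced guide is to reframe the multivariate series as supervised-learning windows before fitting the LSTM. A minimal sketch of that windowing, with hypothetical BP/activity values (the real feature encoding and window length may differ):

```python
import numpy as np

def split_sequences(data: np.ndarray, n_steps: int):
    """Slice a multivariate series into (X, y) windows for an LSTM.
    Each X sample holds n_steps consecutive rows of all features;
    y is the systolic/diastolic pair at the next time step."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps, :])   # past readings + activity code
        y.append(data[i + n_steps, :2])    # next systolic/diastolic
    return np.array(X), np.array(y)

# columns: systolic, diastolic, activity (encoded as an integer)
series = np.array([[120, 80, 0], [125, 82, 1], [118, 79, 0],
                   [130, 85, 2], [122, 81, 1], [119, 78, 0]], dtype=float)
X, y = split_sequences(series, n_steps=3)
print(X.shape, y.shape)  # (3, 3, 3) (3, 2)
```

The resulting `X` feeds an LSTM with `input_shape=(n_steps, n_features)`, and the model predicts the two-value `y`.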
Develop the data collection infrastructure – Since no pre-existing data was available, we had to collect our own. The unique nature of our study, collecting BP and activity data from individuals, called for an innovative approach to the process.
BP data collection – For this aspect of the study, we established that the best approach would be smart watches with BP data collection and transmission capabilities. Beyond BP data collection, another key consideration for device selection was affordability. This was driven both by the circumstances of the study (limited resources) and, more importantly, by the context of use of a probable final solution: the watch would have to be affordable to allow for wide adoption of the solution.
The watch identified was the F1 Wristband Heart and Heart Rate Monitor.
Activity data collection – for this aspect of the study a mobile application was identified as the method of choice. The application was developed to be able to receive BP readings from the smart watch and to also collect activity data from the user.
Test the data collection – The smart watch – mobile app data collection was tested and a number of key observations were made.
Smart watch challenges – Much as the watch identified is affordable, it does not work well for dark-skinned persons. This is a major challenge given that the majority of people in Kenya, the location of the study and of eventual system use, are dark-skinned. As a result, we are examining other options that may work universally.
Mobile app connectivity challenges – The app initially would not connect to the smart watch but this was resolved and the data collection is now possible.
Next Steps
Pilot the data collection – We are now working on piloting the solution with at least 10 people over a period of 2–3 weeks. This will give us an idea of how the final study will be carried out with respect to:
How the respondents use the solution,
The kind of data we will actually be able to get from the respondents,
The suitability of the data for the machine learning exercise.
Develop and Deploy the LSTM Model – We shall then develop the LSTM model and deploy it on the mobile device to examine the practicality of our proposed approach to BP prediction.
This research focuses on enhancing pharmacovigilance systems using Natural Language Processing on Electronic Medical Records (EMRs). Our major task was to develop an NLP model for extracting Adverse Drug Reaction (ADR) cases from EMRs. The team was required to collect data from two hospitals using EMR systems: the University of Dodoma (UDOM) Hospital and the Benjamin Mkapa (BM) Hospital. During data collection and analysis, we worked with health professionals from the two hospitals in Dodoma. We also used the public dataset from the MIMIC-III database. These datasets came in different formats: CSV for UDOM Hospital and MIMIC-III, and PDF for BM Hospital, as shown in the attached file.
In most cases, pharmacovigilance practices depend on analyzing clinical trials, biomedical writing, observational examinations, Electronic Health Records (EHRs), Spontaneous Reporting (SR) and social media (Harpaz et al., 2014). In our context, we considered EMRs more informative than the other sources, as suggested by Luo et al. (2017). We studied the EMR schemas of the two hospitals. We collected inpatient data, since outpatient data would have given incomplete patient histories. Moreover, our health information systems are not integrated, which makes it difficult to track a patient’s full history unless the patient was admitted to a particular hospital for a while. Across all the data sources used, we looked for a common pattern of information: clinical history, prior patient history, symptoms developed, allergies/ADRs discovered during medication, and the patient’s discharge summary.
Gloriana Monko
Steven Edward
Zephania Reuben
Waziri Shebogholo
Ibrahimu Mtandu
Although we worked on the UDOM and BM hospitals’ data, we encountered several challenges that made the team focus on the MIMIC-III dataset while searching for an alternative way to use our local data. The challenges noted were:
The reports had no clear identification of ADR cases.
In most cases, the doctor did not state the reason for changing a patient’s medicine, which made it hard to tell whether the medication had not worked well for that patient or whether there was another reason, such as an adverse reaction.
The justification for ADR cases was vague.
There was a mismatch of information between patients and doctors.
Patients describe their conditions in ways that doctors cannot easily interpret.
There is a considerable gap between health workers and regulatory authorities (many do not know that they have to report ADR cases).
The issue of ADRs is complex, since there is a lot to take into account, such as drug-drug, drug-food and drug-herbal interactions.
There was no common or consistent reporting style among doctors.
The language used in reports is hard for a non-specialist to understand.
Some fields were left completely empty, leading to incomplete medical histories.
The annotation process was prolonged, since we had only one pharmacologist for the work.
After noting all these challenges, the team carefully studied the MIMIC-III database to assess the availability of data with ADR cases that would help us build a baseline model for the problem. We discovered that the NoteEvents table has enough information about each patient’s history, with clear indications in the text of both ADR and non-ADR cases.
To start with, we queried 100,000 records from the database with many attributes, but we used the text column found in the NoteEvents table, which contains the entire patient history (prior history, medication, dosage, examinations, changes noted during medication, symptoms, etc.). We started the annotation of the first group by filtering the records down to the rows of interest, using the following search keywords: adverse, reaction, adverse events, adverse reaction and reactions. We discovered that only 3,446 rows contained words that guided the team in the labelling process. The records were then annotated with labels 1 and 0 for ADR and non-ADR cases respectively, as indicated in the filtration notebook.
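The keyword filtering can be sketched in pandas. The three-note corpus below is invented, standing in for the MIMIC-III NoteEvents text column; in the actual project the 1/0 labels came from manual annotation, not from the filter itself:

```python
import pandas as pd

# hypothetical mini-corpus in place of the NoteEvents text column
notes = pd.DataFrame({"text": [
    "Patient developed an adverse reaction to penicillin; drug stopped.",
    "Routine follow-up, no complaints, vitals stable.",
    "Rash noted, suspected adverse event after new medication.",
]})

# the search keywords used to narrow the records down to rows of interest
keywords = ["adverse", "reaction", "adverse events", "adverse reaction", "reactions"]
pattern = "|".join(keywords)

# keep only rows mentioning a keyword; these become candidates for annotation
candidates = notes[notes["text"].str.contains(pattern, case=False)]
candidates = candidates.assign(label=1)  # placeholder: real labels were manual
print(len(candidates))  # 2
```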
In analysing the data, we found many more non-ADR cases than ADR cases: 3,199 non-ADR cases, 228 ADR cases, and 19 rows left unannotated. Due to this high class imbalance, we reduced the non-ADR cases to 1,000 and upsampled the ADR cases to 800, balancing the classes to minimize bias during modelling.
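The rebalancing step can be sketched with pandas sampling, using a frame that mirrors the reported class sizes (the random seeds are arbitrary):

```python
import pandas as pd

# mirror the reported class sizes: 3,199 non-ADR (0) vs 228 ADR (1)
df = pd.DataFrame({"label": [0] * 3199 + [1] * 228})

# downsample the majority class to 1,000 rows
non_adr = df[df.label == 0].sample(n=1000, random_state=42)

# upsample the minority class to 800 rows by sampling with replacement
adr = df[df.label == 1].sample(n=800, replace=True, random_state=42)

balanced = pd.concat([non_adr, adr])
print(balanced.label.value_counts().to_dict())  # {0: 1000, 1: 800}
```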
After annotation and simple analysis, we used NLTK to apply basic preprocessing techniques to the text corpus as follows:
Converting the raw sentences of the corpus to lower case, which helps in later processing steps such as parsing.
Sentence tokenization: since the text was in paragraphs, we applied sentence boundary detection to segment it to sentence level by identifying where each sentence starts and ends.
We then used regular expressions to extract information of interest from the documents, removing unnecessary characters and replacing some with more easily understandable statements or characters, as per professional guidelines.
We removed affixes from tokens to reduce them to their root form, removed common words (stopwords), and applied lemmatization to map each token to its correct part of speech in the raw text. After preprocessing, we used Term Frequency-Inverse Document Frequency (TF-IDF) from scikit-learn to vectorize the corpus, which also surfaces the most characteristic keywords in the corpus.
To create a baseline model, we worked with classification algorithms in scikit-learn. We trained six different models: Support Vector Machines, eXtreme Gradient Boosting, Adaptive Boosting (AdaBoost), Decision Trees, Multilayer Perceptron and Random Forest. We then selected the three models that performed best on validation (Support Vector Machine, Multilayer Perceptron and Random Forest) for further evaluation. In the next phase of the project, we will also use a deep learning approach, aiming at a model strong enough to be deployed and kept in practice. Here is the link to the Colab for data pre-processing and modelling.
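The comparison of the three retained classifiers can be sketched as follows, with synthetic features in place of the TF-IDF matrix; the hyperparameters shown are scikit-learn defaults, not the project’s tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the vectorized ADR/non-ADR corpus
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# the three models retained after the six-model comparison
models = {
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# fit each model and record its validation accuracy
scores = {name: m.fit(X_tr, y_tr).score(X_val, y_val)
          for name, m in models.items()}
print(scores)
```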
From the UDOM database, we collected a total of 41,847 patient records in chunks of 16,185, 18,400, and 7,262 from 2017 to 2019 respectively. The dataset has the following attributes: Date, Admission number, Patient Age, Sex, Height (kg), Allergy status, Examination, Registration ID, Patient History, Diagnosis, and Medication. We downsized it to 12,708 records by removing missing columns and uninformative rows, and we used regular expressions to extract information of interest from the documents as per professional guidelines. Cleaning the data, converting formats, and analyzing and preparing the data for machine learning are elaborated in this Colab link.
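The row-dropping part of this downsizing can be sketched in pandas; the records below are invented and only a few of the schema’s columns are shown:

```python
import pandas as pd

# invented rows mirroring a subset of the UDOM schema
records = pd.DataFrame({
    "Admission number": ["A1", "A2", "A3", "A4"],
    "Patient History": ["fever, cough", None, "headache", "rash after drug"],
    "Diagnosis": ["malaria", "typhoid", None, "drug allergy"],
    "Medication": ["ALu", "cipro", "panadol", None],
})

# drop rows missing any of the clinical fields needed for modelling
clean = records.dropna(subset=["Patient History", "Diagnosis", "Medication"])
print(len(clean))  # 1
```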
For BM Hospital, the PDF files extracted from the EMR system contain patient records with the following information:
Discharge reports
Medical notes
Patients history
Lab notes
Health professionals at the respective hospitals manually annotated the labels for each document, and this task took most of our time in this phase of the project. We are still collecting and interpreting more data from these hospitals.
The team organized and extracted information from the BM Hospital PDF files by converting data formats and then analyzing and preparing the data for machine learning. We experimented with OCR processing to extract data from the PDF files, but it did not generate promising results, as much of the information appeared to be missing. We therefore had to programmatically extract content from individual files and align it with the corresponding labels provided by the professionals.
The big lesson that we have learned up to now is that most of the data stored in our local systems are not informative. Policymakers must set standards to guide system developers during development and health practitioners when using the system.
Last but not least, we want to thank our stakeholders, mentors and funders for their involvement in our research activities. It is because of such partnerships that we are able to achieve our main goal.
So why have we decided to collect malaria datasets to assist in developing a solution for its diagnosis? First, malaria remains one of the most significant threats to public health and economic development in Africa. Globally, an estimated 216 million cases of malaria occurred in 2017, with Africa bearing the brunt of this burden [5]. In Tanzania, malaria is the leading cause of morbidity and mortality, especially in children under 5 years and pregnant women. Malaria kills one child every 30 seconds, about 3,000 children every day [4]. Malaria is also the leading cause of outpatient visits, inpatient stays, and admissions of children under five years of age at health facilities [5].
Martha Shaka
Frederick Apina
Said H Said
Halidi Maneno
Nyamos Waigama
Imani Sulutya
Said Mmaka
Emilian Ngatunga
Simon Chaula
Second, the most common methods to test for malaria are microscopy and Rapid Diagnostic Tests (RDT) [1, 2]. RDTs are widely used, but their chief drawback is that they cannot count the number of parasites. The gold standard for the diagnosis of malaria is, therefore, microscopy. Evaluation of Giemsa-stained thick blood smears, when performed by expert microscopists, provides an accurate diagnosis of malaria [3].
Nonetheless, there are challenges to this method: one diagnosis takes a long time to perform, it requires experienced technologists, who are very few in developing countries, and manually examining a sample through a microscope is a tedious and eye-straining process. We learned that although microscopic diagnosis is the gold standard for malaria, it is still not used in most private and public health centers. We also realized that some lab technologists in health care are not competent in preparing the staining reagents used in the diagnosis process; we had to create our own reagents and supply them to the technologists for the purposes of this research.
Artificial intelligence is transforming how health care is delivered across the world. This has been evident in pathology detection, surgery assistance and early detection of diseases such as breast cancer. However, these technologies often require significant amounts of quality data and in many developing countries, there is a shortage of this.
Taking images of a stained blood smear
A sample of an annotated image (the green box represents a plasmodium)
To address this deficiency, my team, composed of 6 computer scientists and 3 lab technologists, collected and annotated 10,000 images of a stained blood smear and developed an open-source annotation tool for the creation of a malaria dataset. We strongly believe the availability of more datasets and the annotation tool (for automating the labeling of the parasites in an image of stained blood smear) will improve the existing algorithms in malaria diagnosis and create a new benchmark.
In the collection of this dataset, we first sought and were granted ethical clearance from the University of Dodoma and Benjamin Mkapa Hospital’s research center. We collected 50 blood smear samples from patients confirmed to have malaria and 50 samples from confirmed negative cases. Each sample was stained by a lab technologist, and 100 images were taken using an iPhone 6S attached to a microscope. This gave a total of 5,000 images for the confirmed positive patients and 5,000 images for the confirmed negative patients.
Through this work, we have had several opportunities including attending academic conferences and forming connections with other researchers such as Dr. Tom Neumark, a postdoctoral social anthropologist at the University of Oslo. Through our work, we also met Prof Delmiro Fernandes-Reyes, a professor of biomedical engineering. In a joint venture with Prof Delmiro Fernandes-Reyes, we submitted a proposal for the DIDA Stage 1 African Digital Pathology Artificial Intelligence Innovation Network (AfroDiPAI) at the end of November 2019.
We are also disseminating the results of our research. We have submitted an abstract (on the ongoing project) to two workshops (Practical Machine Learning in Developing Countries and Artificial Intelligence for Affordable Health) for the 2020 ICLR conference in Ethiopia, and it has been accepted to be presented as a poster. We were also delighted to get very constructive feedback from reviewers of the conference and look forward to incorporating them as we continue with the projects and final publication.
The next stage will be to start using our data and train deep learning models in the development of the open-source annotation tool. At the same time, together with the AI4D team, we are looking for the best approach to follow when releasing our open-source dataset in the medical field.
But our overall aim is to develop a final product: a mobile application that will assist lab technologists in Tanzania and beyond in the onerous work of diagnosing malaria. We have already met many of these technologists, who are not only excited and eagerly awaiting this tool, but who have also generously helped us as we have gone about developing it.
Links
[1] B.B. Andrade, A. Reis-Filho, A.M. Barros, S.M. Souza-Neto, L.L. Nogueira, K.F. Fukutani, E.P. Camargo, L.M.A. Camargo, A. Barral, A. Duarte, and M. Barral-Netto. Towards a precise test for malaria diagnosis in the Brazilian Amazon: comparison among field microscopy, a rapid diagnostic test, nested PCR, and a computational expert system based on artificial neural networks. Malaria Journal, 9:117, 2010.
[2] Maysa Mohamed Kamel, Samar Sayed Attia, Gomaa Desoky Emam, and Naglaa Abd El Khalek Al Sherbiny, “The Validity of Rapid Malaria Test and Microscopy in Detecting Malaria in a Preelimination Region of Egypt,” Scientifica, vol. 2016, Article ID 4048032, 5 pages, 2016. https://doi.org/10.1155/2016/4048032.
[3] Philip J. Rosenthal, “How Do We Best Diagnose Malaria in Africa?” https://doi.org/10.4269/ajtmh.2012.11-0619.
We are proud of our final hackathon at the British Embassy in Paris, using the X5GON platform to make OER-based education more accessible everywhere. We would like to thank the great colleagues without whom this hackathon could not have happened: Sahan Bulathwela, John Shawe-Taylor, Colin de la Higuera, Davor Orlic, Kristijan Perčič, Sheena Visram from UCLIC, Dr Dean Mohamedally from UCL Computer Science, and of course all the Osnabruck, Nantes, UCL and JSI students competing.