Conference of the UK UNITWIN/UNESCO Chairs Programme

On 30-31 May 2023, the UK National Commission for UNESCO hosted the UK UNESCO Chairs Conference to mark the anniversary of the UNITWIN/UNESCO Chairs Programme. The event brought together over 20 participants representing some 22 UNESCO Chairs and UNITWIN Networks in the UK. The Programme is a global network that encourages inter-university cooperation, collaboration and information sharing. Today, it involves over 700 institutions in 126 countries.

The two days of knowledge sharing inspired new ideas, partnerships, and opportunities that highlighted the value of intellectual collaboration across the network and beyond. The value of transdisciplinarity, future-oriented approaches and the need for increased North-South-South and South-South cooperation were emphasized throughout the event.

Presenting the new science of Artificial Intelligence that can put Europe on the world stage in the European Parliament

K4A is very happy to have helped co-organize an awesome half-day event at the European Parliament, titled “Beyond ChatGPT: How can Europe get in front of the pack on Generative AI Models?”, with Humane AI Net, IRCAI – International Research Center on Artificial Intelligence under the auspices of UNESCO, CLAIRE – Confederation of Laboratories for Artificial Intelligence Research in Europe, TAILOR, AI4Media, and VISION.
A big thank you to Paul Lukowicz, Cees Snoek, Fredrik Heintz, Ioannis Kompatsiaris, Virginia Dignum, Ieva Martinkenaite, Francesca Rossi, Holger Hoos, Marko Grobelnik, Catelijne Muller, Clara Neppel, Dino Pedreschi, and Cécile Huet.

Contributing to Lillehammer’s (Norway) Lifelong Learning ICDE Conference with the workshop: “Your place in the Open Ecosystem”

K4A, as a partner institution of the ENCORE+ Network, explored how Open Technologies can support the uptake of Open Educational Resources (OERs) by initiatives, projects, and businesses at the Lifelong Learning Conference 2023 (15-17 February) in Lillehammer, Norway, which gathered 350 participants from 32 countries.

The workshop was intended to give participants an opportunity to imagine and recreate their work and business as Open, reflecting on the applicability and benefit of OERs to business, innovation, and technology in lifelong learning. There was exchange, debate, and genuine interest in the possibilities openness offers to different stakeholders.

Some of K4A’s ongoing research for the ENCORE+ Network was also presented as background for engaging participants with the proposed activities. An overview was provided of how businesses envisioned the use of OERs and potential strategies for adopting them, in terms of services provided to learners and the technologies supporting these processes. Additionally, some of the latest AI-based solutions for OER repositories were showcased as efficient tools catering for lifelong learners’ needs.

K4A workshop at the third International Lillehammer Lifelong Learning ICDE Conference 15-17 February 2023

Funding available for human-centered AI projects

K4A is a partner in the HumanE AI network of excellence, which has been running a program of micro-projects. There is potential to link this program with the Network for AI and Knowledge for Sustainable Development (NAiXUS), established jointly by the International Research Centre on AI under the Auspices of UNESCO, the DataPop Alliance, Knowledge 4 All Foundation, ELLIS Alicante UNIT and the Regional Center for Studies on the Development of the Information Society (Cetic.br). HumanE AI has funds reserved to finance the involvement of external partners, and this call is concerned with micro-projects that would like to leverage these funds to include NAiXUS partners. Check here for the opportunity.

Human AI net Micro-Projects Collaboration Network

The periodic technical report for the HumaneAI Network successfully submitted to EU reviewers

After the HumaneAI project setup phase, which initiated the internal and external collaboration mechanisms, the first 18 months focused on engaging with the research questions posed in the proposal within WPs 1-5 and on conducting a series of concrete high-impact activities to connect to the community. Nearly 70 micro-projects involving the large majority of the project partners have been initiated, resulting in 82 project publications, including papers in Nature, PNAS, Phys. Rev. and Artificial Intelligence.

A major result of this work has been the updated research agenda, which includes a novel conceptual framework for human-AI collaboration, a notion of shared representations centered around narratives, and an expanded definition of AI trustworthiness and explainability in terms of human-computer interaction: systems that humans, both individually and as a society, feel they understand and are comfortable trusting, rather than systems that “only” fulfill certain hard technical specifications.

Lacuna Fund 2022 grantees convening in Tunis

As of August 2021, Lacuna Fund has selected 29 projects for funding in the Agriculture, Natural Language Processing, and Equity & Health domains. Project teams from the first rounds of funding in Agriculture and Natural Language Processing have either completed or are nearly finished with their datasets. Those project teams were invited to attend the 2022 Lacuna Fund Grantee Convening.

K4A was granted two projects in the first rounds of Lacuna funding: “Masakhane MT: Decolonizing Scientific Writing for Africa”, led by Jade Abbott, and Peter Nabende’s project “Named Entity Recognition and Parts of Speech datasets for African languages”. The results of both projects are available now, with links to the datasets listed on the Lacuna Fund website.

We were extremely happy to meet representatives from many of the project teams in person!

African languages are richer for 20 more language datasets

We are very happy to announce that one of our Lacuna-funded projects, “Named Entity Recognition and Parts of Speech Datasets for African Languages”, has been successfully completed. At the start of our work, none of the languages associated with this project had a manually prepared NER dataset. Also, only a very small subset of languages in South Africa, plus Yoruba, Naija, Wolof, and Bambara, had part-of-speech (POS) datasets. This project has therefore provided the first carefully prepared NER and POS datasets for 20 African languages. The project initially produced new parallel texts (up to 8000 parallel sentences) for at least 8 low-resourced languages. The parallel texts are a very valuable resource for bilingual NLP applications. The results will be uploaded to the Masakhane GitHub repository.

David Adelani presenting the project results

Making artificial intelligence human-centric at the first post-pandemic HumaneAI-Net consortium meeting in person

This was a meeting of the EU-funded HumanE-AI-Net project, which brings together leading European research centres, universities and industrial enterprises into a network of centres of excellence. Leading global artificial intelligence (AI) laboratories collaborate with key players in areas such as human-computer interaction and the cognitive, social and complexity sciences. The project aims to draw researchers out of their narrowly focused fields and connect them with people exploring AI on a much wider scale. The challenge is to develop robust, trustworthy AI systems that can ‘understand’ humans, adapt to complex real-world environments and interact appropriately in complex social settings. HumanE-AI-Net will lay the foundations for designing the principles for a new science that will make AI based on European values and closer to Europeans.

AI4D blog series: A Study Towards Automated Wildlife Patrol

The aim of our project is to investigate the technological feasibility of deploying Unmanned Ground Vehicles for automated wildlife patrol, as well as performing a preliminary analysis of other metadata collected from officials at a national park in Kenya. To this end, we seek to collect and publish a dataset of driving data across national park trails in Kenya, the first of its kind, and use deep learning to predict steering wheel angle when driving on these trails.

Setting up the data acquisition system

The data collection required a vehicle mounted with a camera to be driven across national park trails while recording the trail video as well as key driving signals such as steering wheel angle, speed and brake and accelerator pedal positions. We began design, installation and configuration of the data collection system in November and December 2019.

The first idea was to procure and attach sensors to the vehicle to obtain these driving signals. But upon further research, it was discovered that most of these driving signals can be read from the CAN bus which is exposed on the OBD-II (On-Board Diagnostics) port on most vehicles manufactured after 2008.

This information, however, is grouped and encoded under different parameter IDs. Identifying each driving parameter requires reverse engineering, a very time-consuming activity that would take months by itself.
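To illustrate what decoding one such frame involves, here is a minimal sketch. The frame ID, byte layout and scale used below are hypothetical placeholders chosen for illustration; they are not the encoding of any real vehicle, which is exactly what the reverse engineering has to discover.

```python
import struct

# Hypothetical layout, for illustration only: a steering frame whose first
# two bytes hold a signed big-endian angle in tenths of a degree.
STEERING_FRAME_ID = 0x025      # assumed parameter (arbitration) ID
UNITS_PER_DEGREE = 10          # assumed scale: raw value 10 == 1 degree


def decode_steering_angle(frame_id: int, payload: bytes):
    """Return the steering angle in degrees, or None for other frames."""
    if frame_id != STEERING_FRAME_ID or len(payload) < 2:
        return None
    (raw,) = struct.unpack_from(">h", payload, 0)  # signed 16-bit, big-endian
    return raw / UNITS_PER_DEGREE


# Under the assumed layout, the bytes FF 6A encode a raw value of -150.
angle = decode_steering_angle(0x025, bytes([0xFF, 0x6A, 0x00, 0x00]))
```

Without knowing the ID, byte order, signedness and scale, the same four bytes are just opaque numbers, which is why an already documented vehicle model saves so much effort.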

Furthermore, not all of the driving signals would be exposed on the CAN bus. The parameters exposed on the bus vary between vehicle manufacturers and models, and so does the encoding. After failing to understand the data read from the CAN bus of our personal vehicles, we decided to find a vehicle model which had already been reverse-engineered.

We were able to identify [1] and procure a Toyota Prius 2012 for the data collection, from which we could read the steering wheel angle, steering wheel torque, vehicle speed, individual wheel speeds and brake and accelerator pedal positions. We used a Raspberry Pi 3 microcomputer with the PiCan hat to read and log the driving signals.

Encoded driving data seen on the vehicle’s CAN bus

In order to create the dataset for training and testing the learning algorithm, each data sample has to contain a video frame matched to the corresponding driving signals at that instant. That means all the video frames, as well as the driving signals, have to be timestamped.
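The matching step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the nearest-timestamp rule and the `max_gap` threshold are all assumptions.

```python
from bisect import bisect_left


def nearest_signal(signal_times, frame_time):
    """Index of the signal timestamp closest to frame_time.

    signal_times must be non-empty and sorted in ascending order.
    """
    pos = bisect_left(signal_times, frame_time)
    if pos == 0:
        return 0
    if pos == len(signal_times):
        return len(signal_times) - 1
    before, after = signal_times[pos - 1], signal_times[pos]
    # Ties go to the earlier reading.
    return pos if (after - frame_time) < (frame_time - before) else pos - 1


def build_samples(frame_times, signal_times, signal_values, max_gap=0.05):
    """Pair each frame index with its nearest signal reading.

    Frames with no reading within max_gap seconds are dropped.
    """
    samples = []
    for frame_index, t in enumerate(frame_times):
        i = nearest_signal(signal_times, t)
        if abs(signal_times[i] - t) <= max_gap:
            samples.append((frame_index, signal_values[i]))
    return samples
```

Dropping frames that have no nearby reading is one possible design choice; interpolating between the two neighbouring readings would be another.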

The driving signals are automatically timestamped during logging on the Raspberry Pi, but most cameras don’t timestamp the individual frames. Further, the internal clock of the camera would not be in sync with that of the RPi, which would cause the video frames and driving signals to be out of sync when creating the data samples.

That means a camera that could interface to the computer as a webcam would be needed, so each frame can be read and timestamped before being written to the video file. Driving on rough national park trails would also induce a lot of vibrations and require a camera with good stabilization. These were some of the challenges in selecting a camera for recording the driving video.

Check the project documentation on Github

We settled on the Apeman A80 action camera which has gyro stabilization, HD video recording and can also function as a webcam. OpenCV was used to read and record timestamped video to the computer.

Initially, we tried to connect the camera to the Raspberry Pi itself. But the RPi is a low-powered microcomputer: there was significant lag in recording, and it could not write the video at a frame rate higher than 8fps. We therefore decided to connect the camera to a laptop, which could comfortably record HD video at 30fps, and to use the RPi only for logging the driving signals from the vehicle’s CAN bus.

This however presented a different challenge of being limited by the laptop battery. While the RPi can be charged using a portable power bank or directly from the car’s charging port, the laptop cannot. That meant significantly shorter data collection runs. We could only drive around continuously for 2 hours before we had to return to charge the laptop which took another 2 hours.

This forced us to revise our overall data collection target down from 50 hours to 20 hours. The share to be collected on national park trails was revised down from 25 hours to 10 hours, with the other 10 hours on a mixture of tarmac roads and rural dirt roads.

There was also extensive testing of different video encoding methods to determine the best filesize versus quality tradeoff, as well as data collection code optimization to ensure minimum lag during the data logging.

Data collection

We began the data collection in January 2020 on tarmac and rural dirt roads. The idea behind this was to train the algorithm on a simpler dataset and then use transfer learning to get better results faster on the national park trails. The data was collected at various times of the day (early in the morning, at noon and late in the evening) in order to get a varied dataset in different lighting conditions.

While we were able to smoothly collect the data on tarmac roads, driving over the rural dirt roads proved impossible as they were marked with potholes. Not only was it challenging to drive a low-body vehicle over the rough terrain, but the constant maneuvers made to go around the potholes meant that most of that data would be unusable as it would present a different challenge altogether in training.

The challenge of driving a low-body vehicle on dirt roads also limited our choices of national parks, as we had to carefully select ones with smooth driving trails. Our plan to collect data from the Maasai Mara National Reserve had to be abandoned due to the bad road conditions there, and we opted to collect data from Nairobi National Park (8 hrs) and Ruma National Park (2.5 hrs) instead. Even these however were not without their setbacks involving a flat tire and bumper damage.

Another challenge faced in the parks was internet connectivity. While a stable internet connection was not needed for the data collection which was done offline, a connection to the internet was needed when starting up the Raspberry Pi to allow it to initialize the correct datetime value.

This is because the RPi microcomputer does not have a battery-backed real-time clock. Unless it has a connection to the internet, it resumes the clock from the last time saved before it was shut down, and so ends up showing the wrong time. That resulted in incorrect timestamps on the logged driving data, which could not be matched to the video timestamps.

This was observed while analyzing the driving data logs from one of the runs at Ruma National Park. Luckily, internet connectivity was regained towards the end of the run and the rest of the timestamps could be calculated correctly using the message baud rates.
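One way such a recovery could work is sketched below. It assumes that the messages in question are broadcast at a fixed, known period and that none were dropped; both are assumptions made for illustration, and the function name is hypothetical.

```python
def backfill_timestamps(n_unsynced, first_good_time, period):
    """Reconstruct timestamps for messages logged before the clock was set.

    n_unsynced      -- number of messages logged with a wrong clock
    first_good_time -- timestamp (seconds) of the first correctly stamped message
    period          -- assumed fixed interval between messages, in seconds
    """
    # Count backwards from the first trustworthy timestamp, one period per message.
    return [first_good_time - (n_unsynced - i) * period for i in range(n_unsynced)]
```

For example, with three mis-stamped messages, a first good timestamp of 100.0 s and a period of 0.5 s, the reconstructed timestamps are 98.5, 99.0 and 99.5 s.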

Other minor issues faced in obtaining good quality data involved keeping the windshield clean while driving on dusty park trails where one is not allowed to alight from the vehicle, and securely mounting the camera inside the vehicle while driving over rough terrain.

Dataset preparation and Training

A significant portion of the data collected included driving around potholes, overtaking, stopping, U-turns etc., which would not be useful for predicting the steering wheel angle within the scope of this study. All these segments had to be visually identified and removed before preparing the dataset.

Initially, we proposed to use a simple Convolutional Neural Network (CNN) model for training as in [2], where the steering wheel angle is predicted independently from each video frame. However, the steering angle is also largely dependent on the speed of the vehicle. Driving is also a stateful process, where the current steering wheel angle depends on the previous wheel position.

We therefore investigated the use of a more sophisticated temporal CNN model as in [3] using recurrent units such as LSTM and Conv-LSTM that could give more promising results. The above model however is very computationally expensive and would require a cluster of very expensive GPUs and still take days to train.

Using this model proved impossible to achieve within the given timeline and budget. We therefore decided to continue with our initial proposal using a static CNN model [2].

Currently we are in the process of building the dataset and learning model for the project. We are also working on preparing a preliminary analysis on the feasibility of automated wildlife patrol [4] based on other metadata collected from park officials.

We are grateful for the immense support that we always get from our mentor Billy Okal who, in spite of his busy schedule, finds the time to set up calls whenever we need to consult and always comes up with great ideas that address most of our concerns.

References

[1] C. Miller and C. Valasek, Adventures in Automotive Networks and Control Units, IOActive Inc., 2014, pp. 92-97.
[2] M. Bojarski et al., End to end learning for self-driving cars, 2016, arXiv:1604.07316.
[3] L. Chi and Y. Mu, Deep steering: Learning end-to-end driving model from spatial and temporal visual cues, 2017, arXiv:1708.03798.
[4] L. Aksoy et al., Operational Feasibility Study of Autonomous Vehicles, Turkey International Logistics and Supply Chain Congress, 2016.

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D

AI4D blog series: Extracting meta-data from Malawi Court Judgments

We set ourselves the task of developing semi-automatic methods for extracting key information from criminal cases issued by courts in Malawi. Our body of court judgments came partly from the MalawiLii platform and partly from the High Court Library in Blantyre, Malawi. We focussed our first analysis on cases from 2010 to 2019.

Amelia Taylor, University of Malawi | UNIMA · Information Technology and Computing

Here is an example of a case for which a PDF is available on MalawiLii. Here is an example of a case for which only a scanned image of a PDF is available. We used OCR on more than 90% of the data to extract the text for our corpus (see below for a description of our corpus).

Please open these files to familiarise yourself with the content of a criminal court judgment. What kind of information did we want to extract? For each case we wanted:

  1. Name of the Case
  2. Number of the Case
  3. Year in which the case was filed
  4. Year in which the judgment was given, and the Court which issued the judgment
  5. Names of Judges
  6. Names of parties involved (appellants and respondents, but you can take this further and extract names of principal witnesses, and names of victims)
  7. References to other Cases
  8. References to Laws/Statutes and Codes, and,
  9. Legal keywords which can help us classify the cases according to the ICCS classification.

This project has taught us so much about working with text, preparing data for a corpus, exchange formats for the corpus data, analysing the corpus using lexical tools, and machine learning algorithms for annotating and extracting information from legal text.

Along the way we experimented also with batch OCR processing and different annotation formats such as IOB tagging[1], and the XML TEI[2] standard for sharing and storing the corpus data, but also with the view of using these annotations in sequence-labelling algorithms.

Each has advantages and disadvantages: IOB tagging does not allow nesting (or multiple labels for the same element), while an XML notation allows this but is more challenging to use in algorithms. We also learned how to build a corpus, and experimented with existing lexical tools for analysing this corpus and comparing it to other legal corpora.
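The IOB limitation can be seen in a minimal sketch. The helper name and the `LAW` label are hypothetical, chosen only for illustration; each token receives exactly one tag, so a span nested inside another has nowhere to go.

```python
def spans_to_iob(tokens, spans):
    """Convert labelled token spans to IOB tags.

    spans is a list of (start, end, label) in token indices, end exclusive.
    Overlapping spans are rejected, since a token carries only one IOB tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("IOB cannot represent overlapping or nested spans")
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the span
    return tags


tokens = "Section 151(1) of the Criminal Procedure and Evidence Code".split()
tags = spans_to_iob(tokens, [(0, 9, "LAW")])
```

Here `tags` starts with `B-LAW` and continues with `I-LAW` for the remaining eight tokens. Trying to add a second span inside the first, for instance one marking only the section number, raises the error, whereas nested XML elements would express it directly.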

We learned how to use POS annotations and contextual regular expressions to extract some of our annotations for laws and case citations and we generated more than 3000 different annotations. Another interesting thing we learned is that preparing annotated training data is not easy, for example, most algorithms require training examples to be of the same size and the training set needs to be a good representation of the data.
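As a hedged illustration of the regular-expression part, the sketch below matches one common shape of reference to a section of a law. The pattern and the function name are assumptions written for this example, not the expressions used in the project, and they cover only a fraction of the forms found in real judgments.

```python
import re

SECTION_REF = re.compile(
    r"\bSections?\s+"                      # "Section" or "Sections"
    r"(?P<number>\d+[A-Z]?)"               # section number, e.g. 346
    r"(?P<subs>(?:\s*\(\w+\))*)"           # optional subsections, e.g. (3) or (d)
    r"(?:\s+of\s+the\s+"                   # optional "of the <Law Name>"
    r"(?P<law>[A-Z]\w*(?:\s+(?:and\s+|of\s+)?[A-Z]\w*)*))?"
)


def find_section_refs(text):
    """Return (section number, subsections, law name or None) for each match."""
    refs = []
    for m in SECTION_REF.finditer(text):
        subs = re.findall(r"\((\w+)\)", m.group("subs"))
        refs.append((m.group("number"), subs, m.group("law")))
    return refs
```

On "Section 151(1) of the Criminal Procedure and Evidence Code applies." this yields the section number `151`, the subsection `1` and the law name. On "this practice in Sections 128(d) and (f)" it finds only `128` and `(d)`, with no law name, which shows how quickly a single pattern runs out.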

We also experimented with classification algorithms and topic detection using scikit-learn, spaCy, Weka and MATLAB. The hardest task was to prepare the data in the right format and to anticipate how this data would lead to the outputs we saw. We felt that time spent on organising and annotating well is not lost but will result in gains in the second stage of the project, when we focus on algorithms.

Most algorithms split the text into tokens, and for us, multi-word tokens (or sequences) are those we want to find and annotate. This means a focus on sequence-labelling algorithms. The added complications peculiar to legal text are that most of our key terms belong logically to more than one label, and that the context of a term can span multiple chunks (e.g., sentences).

When using LDA (Latent Dirichlet Allocation) to detect topics in our judgments, it became clear to us that one needs to use a somewhat ‘summarised’ version of the text in which sequences of words are collapsed into their annotations. This is because LDA uses a term-frequency-based measure of keyword relevance, whereas in our text the most relevant words may appear much less frequently than others.
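One possible reading of this collapsing step is sketched below. The function name and the `LAW_REF` and `CASE_REF` labels are hypothetical; the idea is simply that every annotated multi-word sequence becomes a single token before the text is handed to a frequency-based model.

```python
def collapse_annotations(tokens, spans):
    """Replace each annotated token span with a single label token.

    spans is a list of (start, end, label), end exclusive, non-overlapping.
    """
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.extend(tokens[pos:start])   # keep unannotated tokens as they are
        out.append(label)               # the whole span becomes one token
        pos = end
    out.extend(tokens[pos:])
    return out


tokens = "convicted under Section 151(1) of the Code in Republic v Phiri".split()
summary = collapse_annotations(tokens, [(2, 7, "LAW_REF"), (8, 11, "CASE_REF")])
```

Here `summary` is `["convicted", "under", "LAW_REF", "in", "CASE_REF"]`, so every reference to a law contributes to the count of one shared token regardless of how it was worded.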

Our work has highlighted to us the benefits and importance of multi-disciplinary cooperation. Legal text has its peculiarities and complexities so having an expert lawyer in the team really helped!

Finding references to laws and cases is made slightly more complicated because of the variety in which these references may appear or because of the use of “hereinafter”. Legal text makes use of “hereinafter”[3], e.g., Mwase Banda (“hereinafter” referred to as the deceased). But this can also happen for references to laws or cases as the following example shows:

Section 346 (3) of the Criminal Procedure and Evidence Code Cap 8:01 (hereinafter called “the Code”) which Wesbon J  was faced with in the case of  DPP V Shire Trading CO. Ltd (supra) is different from the wording of Section 346 (3) of the Code  as it stands now.

Compare extracting the reference to law from “Section 151(1) of the Criminal Procedure and Evidence Code” to extracting from “Our own Criminal Procedure and Evidence Code lends support to this practice in Sections 128(d) and (f)”. We have identified a reasonably large number of different references to laws and cases used in our text!  The situation is very similar for case citations. Consider the following variants:

  • Republic v Shautti , Confirmation case No. 175 of 1975 (unreported)
  • Republic v Phiri [ 1997] 2 MLR 68
  • Republic v Francis Kotamu , High Court PR Confirmation case no. 180 of 2012 ( unreported )
  • Woolmington v DPP [1935] A.C. 462
  • Chiwaya v Republic  4 ALR Mal. 64
  • Republic v Hara 16 (2) MLR 725
  • Republic v Bitoni Allan and Latifi Faiti
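A small sketch shows how far a simple pattern gets with these variants. The two expressions and the function name are assumptions written for illustration; they pick up the party names and a bracketed report year, and nothing else.

```python
import re

PARTIES = re.compile(
    r"(?P<first>[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)"    # capitalised party name
    r"\s+v\.?\s+"                                    # "v" or "v."
    r"(?P<second>[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)"   # capitalised party name
)
REPORT_YEAR = re.compile(r"\[\s*(\d{4})\s*\]")       # tolerates "[ 1997]"


def parse_citation(text):
    """Extract the two parties and the bracketed year, where present."""
    parties = PARTIES.search(text)
    year = REPORT_YEAR.search(text)
    return {
        "first_party": parties.group("first") if parties else None,
        "second_party": parties.group("second") if parties else None,
        "year": int(year.group(1)) if year else None,
    }
```

This handles "Republic v Phiri [ 1997] 2 MLR 68" and "Woolmington v DPP [1935] A.C. 462", but for "Republic v Bitoni Allan and Latifi Faiti" it returns only "Bitoni Allan" as the second party and no year, and it ignores years written as "of 1975". Each additional variant needs its own rule, which is where semi-automatic annotation starts to pay off.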

Something for you to Do Practically! To play with some annotations and appreciate the diversity in formats, and at the same time the huge savings that a semi-automatic annotation can bring, we have set up a doccano platform for you: you log in here using the user guest and password Gu3st#20.

Annotating with keywords for the purposes of the ICCS classification proved to be even harder. The International Classification of Crime for Statistical Purposes (ICCS)[4] is a classification of crimes as defined in national legislation, and it comes in several levels, each with a different degree of specificity. We considered mainly Level 1, and we wanted to classify our judgments according to the 11 types in Level 1, as shown in Table 1.

Table 1: Level 1 sections of the ICCS

We discovered that classification according to Level 1 requires a lot of work and is of significant complexity (and the complexity only grows if we consider the sublevels of the ICCS). First, the legal expert on our team manually classified all criminal cases of 2019 according to Level 1 of the ICCS and worked on a correspondence between the Penal Code and the ICCS classification. This is an excellent starting point.

We are in the process of extending this to map other Malawi laws, codes and statutes that are relevant to criminal cases into the ICCS. This is a whole project in itself for the legal profession and requires processing a lot of text and making ‘parallel correspondences’! Such national correspondence tables are still a work in progress in most countries and, to our knowledge, ours is the first such work for Malawi.

Looking at Level 1 of the ICCS kept us very busy. Our research centred on hard and important questions. How should we represent our text so that it can be processed efficiently? What kind of data labels are most useful for the ICCS classification? What type of annotations should we use (IOB or XML-based)? What algorithms should we employ (Hidden Markov Models, Recurrent Neural Networks or Long Short-Term Memory networks)? But most importantly, we focussed on how to prepare our annotated data for use with these algorithms.

We need to be mindful that this is a fine-grained classification because we have to distinguish between texts that are quite similar. For example, if we wanted to classify a judgment by the type of law it falls under, say whether it is a civil or a criminal case, this would have been slightly easier because the vocabulary used in civil cases would be quite different from that used in criminal cases.

We want to distinguish between types of crimes, and the language used in our judgments is very similar. Within our data set there are varying levels of difficulty, e.g., theft and murder cases (Types 1 and 7 from the table above) may be easier to differentiate than, say, Types 1 and 2.

We have the added complication that most text representation models define the relevance of a keyword by its frequency (whether that is TF or TF-IDF), but in our text a word may appear only once and still be the most significant word for the purpose of our classification. For example, a keyword that distinguishes between type 1 and type 2 murders is “malice aforethought”, and this may occur only once in the text of the judgment.
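A toy calculation makes the point concrete. The sentence below is invented for illustration and is not taken from any judgment in the corpus; the function computes plain relative term frequency.

```python
from collections import Counter
import re


def term_frequencies(text):
    """Relative frequency of each lowercased word in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: n / total for word, n in counts.items()}


# Invented toy text, for illustration only.
toy = (
    "The accused struck the deceased. The court finds that the accused "
    "acted with malice aforethought and the accused is convicted."
)
tf = term_frequencies(toy)
ranked = sorted(tf, key=tf.get, reverse=True)
```

In this toy text "the" and "accused" lead the ranking, while "malice" and "aforethought" each account for one word in twenty, the same as "struck" or "finds". A frequency-based measure therefore gives the decisive phrase no more weight than any other word that appears once.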

To help with this situation, one can extract first the structure of the judgment and focus only on the part that deals with the sentence of the judge. Indeed, there is research that focuses only on extracting various segments of a judgment.

This may work in many cases because usually the sentence is summarised in one paragraph. But it does not work for all cases. This is so especially when the case history is long, the crime committed has several facets, or the case has several counts, e.g., the murder victim is an albino or a disabled person.

In such situations one needs a combined strategy which uses: (1) a good set of annotated text with the meta-data described above; (2) the mapping of the Penal Code/Laws/Statutes relevant to the ICCS; (3) collocations of words or a thesaurus; (4) concordances to help us detect clusters and extract relevant portions of the judgments; and (5) sequence modelling algorithms, e.g., HMM or recurrent neural networks, for annotation and classification.

In the first part of the project, we focussed on the tasks (1) – (4) and experimented to some extent with (5).  What we wanted is to find a representation of our text based on all the information at (1) – (4) and attempt to use that in the algorithms we employ.

We have created a training set of over 2500 annotations for references to sections of the law and over 1000 annotations for references to other cases. We are still preparing these so that they are representative of the corpus and are good examples.

And finally, but most importantly, working on this AI4D project has brought me into contact with very clever people whom I would not otherwise have met. We appreciate the support and guidance of the AI4D team!

[1] https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

[2] http://fedora.clarin-d.uni-saarland.de/teaching/Corpus_Linguistics/Tutorial_XML.html

[3] Hereinafter is a term that is used to refer to the subject already mentioned in the remaining part of a legal document. Hereinafter can also mean from this point on in the document.

[4] United Nations Economic Commission for Europe. Conference of European Statisticians. Report of the UNODC/UNECE Task Force on Crime Classification to the Conference of European Statisticians. 2011. Available: www.unodc.org/documents/data-andanalysis/statistics/crime/Report_crime_classification_2012.pdf

Reposted within the project “Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa” #UnitedNations #artificialintelligence #SDG #UNESCO #videolectures #AI4DNetwork #AI4Dev #AI4D