A key ambition of AI is to enable computers to operate in and interact with the real world. This is possible only if the machine can correctly interpret its available modalities (image, audio, text, …) and build on that interpretation to reason and take appropriate actions. Computational linguists use the term “semantics” to refer to the possible interpretations (concepts) of natural language expressions, and have shown interest in “learning semantics”, that is, finding these interpretations automatically. However, semantics is not restricted to the natural language modality; it is equally pertinent to speech and vision. Knowing visual concepts and the common relationships between them would bring a leap forward in scene analysis and image parsing, much as interpreting natural language phrases would improve data mining, information extraction, or automatic translation, to name a few.

Progress in learning semantics has been slow, mainly because it involves sophisticated models that are hard to train, especially since they seem to require large quantities of precisely annotated training data. However, recent advances in learning with weak and limited supervision have led to the emergence of a new body of research in semantics based on multi-task and transfer learning, on learning with semi-supervision or ambiguous supervision, or even with no supervision at all. The goal of this workshop is to explore these new directions and, in particular, to investigate the following questions:

  • How should meaning representations be structured to be easily interpretable by a computer and still express rich and complex knowledge?
  • What is a realistic supervision setting for learning semantics? How can we learn sophisticated representations with limited supervision?
  • How can we jointly infer semantics from several modalities?


Program Committee