Master's/PhD position at INRIA Grenoble / ETH Zurich

The LEAR research group at INRIA Grenoble and the CALVIN research group at the Computer Vision Laboratory of ETH Zurich are looking for a Master's and/or PhD student. The candidate will be jointly supervised and will spend time in both institutions.

Topic: Exploiting associations between text and images will become more and more important over the next few years to reduce the amount of manual annotation necessary to learn visual concepts. Existing work has mainly focused on associating either nouns to image regions [1], or names to faces [2,3]. While techniques for associating nouns to regions require annotated image-noun pairs, work on names and faces uses uncontrolled news captions collected from the internet. However, the success of the latter depends heavily on the availability of a pre-trained face detector. In the case of general object classes, such detectors are a central component of what the system should learn automatically. The main goal of this project is to generalize existing approaches so that generic object classes can be learned from image-caption pairs mined from the internet. A possible research avenue is to devise techniques for bootstrapping background knowledge from supervised data, and then automatically move to progressively less supervision. Another important direction is to go beyond individual nouns and explore relations between multiple words, especially words of different types, such as noun-adjective and name-verb pairs. The visual counterparts of adjectives and verbs are attributes [5,6,7] and poses/actions [8,9] respectively. Relational words such as prepositions and comparators [4] could also be incorporated, as well as larger structures composed of more than two words. The multi-entity nature of the project also opens the door to the exciting possibility of automatically learning context models. The project is part of a larger research endeavor to model the parallel between the structure of visual scenes and the structure of natural sentences.

Your profile:
* Bachelor's/Master's degree (preferably in Computer Science or Applied Mathematics; Electrical Engineering will also be considered)
* Solid programming skills; the project involves programming in Matlab and C++
* Solid mathematics knowledge (especially linear algebra and statistics)
* Creative and highly motivated
* Fluent in English, both written and spoken
* Prior knowledge in the areas of computer vision, machine learning or data mining is a plus (ideally a Bachelor's/Master's thesis in a related field)

Duration: 6 to 9 months (Master's) or 3 years (PhD)

Start date: As soon as possible

Location: This is a joint project between INRIA Grenoble and ETH Zurich. The candidate will be required to spend time in both institutions.

Res. Dir. Cordelia Schmid, schmid (at)
Prof. Vittorio Ferrari, ferrari (at)

Please send applications via email, including:
* a complete CV
* graduation marks
* topic of your Bachelor's/Master's thesis
* the names and email addresses of two references (including your Bachelor's/Master's thesis supervisor)
* if you already have research experience, please include a publication list and references

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan, Matching Words and Pictures, JMLR 2003
[2] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, D. Forsyth, Names and Faces in the News, CVPR 2004
[3] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, Automatic Face Naming with Caption-based Supervision, CVPR 2008
[4] A. Gupta and L. Davis, Beyond Nouns: Exploiting Prepositions and Comparators for Learning Visual Classifiers, ECCV 2008
[5] V. Ferrari and A. Zisserman, Learning Visual Attributes, NIPS 2007
[6] K. Yanai and K. Barnard, Image Region Entropy: A Measure of “Visualness” of Web Images Associated with One Concept, ACM Multimedia 2005
[7] J. van de Weijer, C. Schmid, and J. Verbeek, Learning Color Names from Real-World Images, CVPR 2007
[8] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008
[9] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning Realistic Human Actions from Movies, CVPR 2008