The Visual Object Classes Challenge was organised for eight years in a row, with increasing success. For example, the VOC 2011 workshop took place at ICCV and attracted approximately 200 attendees. Participation was strong: 19 entries for the classification task, 13 for detection, and 6 for segmentation. There was also a successful collaboration with ImageNet (www.image-net.org), organized by a team in the US, which held a second competition on its own dataset of 1000 categories (but with only one labelled object per image). It is safe to say that the PASCAL VOC challenges have become a major point of reference for the computer vision community for object category detection and segmentation. Over 800 publications (according to Google Scholar) refer to the data sets and the corresponding challenges. The best student paper at CVPR 2011 made use of the VOC 2008 detection data; the prize-winning paper at ECCV 2010 made use of the VOC 2009 segmentation data; and prize-winning papers at ICCV 2009, CVPR 2008 and ECCV 2008 were based on the VOC detection challenge (using our performance measure in their loss functions).
The basic challenge is the recognition of objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects); there can be multiple objects in each image. There are typically three main competitions, each covering 20 object classes and around 10,000 images: classification (is an object of class X present?), detection (where is the object and what is its size?), and segmentation (pixel-wise labelling); a sketch of the detection criterion is given at the end of this section. There were also "taster" competitions on subjects such as layout (predicting the bounding box and label of each part of a person) and human action recognition (e.g. riding a bike, taking a photo, reading). The goal of the layout and action recognition tasters was to provide a richer description of people in images than bounding box or segmentation information alone.

Our experiences in 2010 and 2011 with Mechanical Turk annotation for the classification and detection challenges showed that it was hard to achieve the level of quality we require from this pipeline. The focus for VOC 2012 annotation was therefore on increasing the labelled data for the segmentation and action recognition challenges. The segmentation data is of a very high quality not available elsewhere, and it is very valuable to provide more data of this nature.

The legacy of the VOC challenges is the freely-available VOC 2007 data, which was ported to mldata.org. We also extended our evaluation server to display the top k submissions for each of the challenges (a leaderboard), so that performance increases after 2012 can be tracked on the web (similar to the leaderboards available for the Middlebury evaluations, see http://vision.middlebury.edu/stereo/eval/).
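To make the detection evaluation concrete: in the VOC challenges a predicted bounding box is scored as correct when its area of overlap with a ground-truth box of the same class, divided by the area of their union (intersection-over-union), exceeds 0.5. The following minimal Python sketch illustrates that criterion; the function names and box representation are purely illustrative and are not part of the official VOC development kit, which implements the same measure in MATLAB.

    def iou(box_a, box_b):
        """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
        # Width and height of the intersection rectangle (zero if the boxes are disjoint).
        iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        inter = iw * ih
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # VOC detection criterion: a detection matches a ground-truth object
    # when the overlap exceeds 50% (each ground truth may be matched at most once).
    def is_correct_detection(predicted, ground_truth, threshold=0.5):
        return iou(predicted, ground_truth) > threshold

Detections are then ranked by confidence, and the per-class average precision computed from the resulting precision/recall curve is the measure reported for the detection competition.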