Listeners outperform automatic speech recognition systems at every level, including the very basic level of consonant identification. What is not clear is where the human advantage originates. Does the fault lie in the acoustic representations of speech or in the recognizer architecture, or in a lack of compatibility between the two? Many insights can be gained by carrying out a detailed human-machine comparison. The purpose of the Interspeech 2008 Consonant Challenge is to promote focused comparisons on a task involving intervocalic consonant identification in noise, with all participants using the same training and test data. This paper describes the Challenge, listener results and baseline ASR performance.

Index Terms: consonant perception, VCV, human-machine performance comparisons.
In most comparisons of human and machine performance on speech tasks, listeners win [1][2][3] (but see [4]; for an overview, see [5]). While some of the benefit comes from the use of high-level linguistic information and world knowledge, listeners are also capable of better performance on low-level tasks such as consonant identification, which do not benefit from lexical, syntactic, semantic and pragmatic knowledge. This is especially the case when noise is present. For this reason, understanding consonant perception in quiet and noisy conditions is an important scientific goal with immediate applications in speech perception (e.g. for the design of hearing prostheses) and spoken language processing [6]. A detailed examination of the confusion patterns produced by humans and computers can point towards potential problems at the level of speech signal representations or recognition architectures. For example, one compelling finding from a number of recent studies has been that much of the benefit enjoyed by listeners comes from better perception of voicing distinctions [7][8][9].

A number of corpora suitable for speech perception testing exist [7][10], although few contain sufficient data to allow training of automatic speech recognizers. However, the main motivation for the Interspeech 2008 Consonant Challenge was not solely to make available a corpus large enough for human-machine comparisons, but also to define a number of varied and challenging test conditions designed to exercise listeners and algorithmic approaches. In addition, the aim was to support a wide range of comparisons, for both native and non-native listeners, by providing software for perceptual testing and scoring.

This paper describes the design, collection and post-processing of the Consonant Challenge corpus and specifies the test conditions as well as the training and development material. It provides results for native listeners and for two baseline automatic speech recognition systems.
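The confusion-pattern analysis referred to above is, in essence, a tally over (target, response) pairs. The following Python fragment is a minimal illustrative sketch, not the Challenge scoring software: it builds a sparse consonant confusion matrix and counts how many of the errors cross the voicing boundary. The consonant set, response pairs and voicing labels below are invented for illustration only.

# Illustrative sketch: sparse consonant confusion counts and a voicing-error
# tally of the kind used when comparing listener and ASR responses.
# The data below are invented, not Challenge material.
from collections import Counter

# (target, response) pairs, e.g. from a VCV identification test
responses = [("p", "p"), ("p", "b"), ("b", "b"), ("t", "d"),
             ("d", "d"), ("k", "g"), ("g", "g"), ("s", "z")]

# Hypothetical voicing labels for the consonants used above
voiced = {"b", "d", "g", "z"}

confusions = Counter(responses)          # confusion matrix as sparse counts
correct = sum(n for (t, r), n in confusions.items() if t == r)
voicing_errors = sum(n for (t, r), n in confusions.items()
                     if t != r and (t in voiced) != (r in voiced))

total = sum(confusions.values())
print(f"correct: {correct}/{total}")
print(f"voicing errors: {voicing_errors}/{total - correct} of all errors")

Tallies of this kind, computed separately for listeners and recognizers, are what allow the comparisons discussed in the remainder of the paper, e.g. checking whether voicing confusions dominate the machine errors.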