The Challenge

Listeners outperform automatic speech recognition (ASR) systems at every level of speech recognition, down to the very basic level of consonant identification. What is not clear is where the human advantage originates. Does the fault lie in the acoustic representations of speech, in the recogniser architecture, or in a lack of compatibility between the two? There have been relatively few studies comparing human and automatic speech recognition on the same task, and in these, overall identification performance has been the dominant metric. However, many insights might be gained from a far more detailed comparison.

The purpose of this Special Session is to promote focused human-computer comparisons on a consonant identification task in noise, with all participants using the same training and test data. The organisers will provide the training and test data, along with native listener and baseline recogniser results; participants are also encouraged to contribute their own listener responses.

Contributions are sought in (but not limited to) the following areas:

  • Psychological models of human consonant recognition
  • Comparisons of front-end ASR representations
  • Comparisons of back-end recognisers
  • Exemplar vs statistical recognition strategies
  • Native/Non-native listener/model comparisons

The results of the Challenge will be presented at a Special Session of Interspeech’08 in Brisbane, Australia.

Speakers and VCV material

Twelve female and 12 male native English talkers aged between 18 and 49 contributed to the corpus. Speakers produced each of the 24 consonants (/b/, /d/, /g/, /p/, /t/, /k/, /s/, /ʃ/, /f/, /v/, /θ/, /ð/, /tʃ/, /z/, /ʒ/, /h/, /dʒ/, /m/, /n/, /ŋ/, /w/, /r/, /y/, /l/) in nine vowel contexts consisting of all possible combinations of the three vowels /i:/ (as in “beat”), /u:/ (as in “boot”), and /ae/ (as in “bat”). Each VCV was produced with both front and back stress (e.g. ‘aba vs ab’a), giving a total of 24 (speakers) * 24 (consonants) * 2 (stress types) * 9 (vowel contexts) = 10368 tokens. Pilot listening tests were carried out to identify tokens that were unusable due to poor pronunciation or other problems.
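
As a rough illustration of the corpus structure, the nominal token inventory can be enumerated as in the Python sketch below. The ASCII consonant labels and speaker identifiers are assumptions for illustration only and do not reflect the corpus's actual naming scheme.

    # Enumerate the nominal token inventory described above.
    consonants = ["b", "d", "g", "p", "t", "k", "s", "sh", "f", "v", "th", "dh",
                  "ch", "z", "zh", "h", "jh", "m", "n", "ng", "w", "r", "y", "l"]
    vowels = ["iy", "uw", "ae"]                      # /i:/ "beat", /u:/ "boot", /ae/ "bat"
    stresses = ["front", "back"]                     # 'aba vs ab'a
    speakers = ["s%02d" % i for i in range(1, 25)]   # 12 female + 12 male talkers (labels assumed)

    tokens = [(spk, v1, c, v2, st)
              for spk in speakers
              for v1 in vowels
              for v2 in vowels
              for c in consonants
              for st in stresses]
    print(len(tokens))   # 24 * 9 * 24 * 2 = 10368 tokens before pruning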

See the Technical details page for further details of the collection and postprocessing procedure.

Training, development and test sets

Training material comes from 8 male and 8 female speakers, while tokens from the remaining 8 speakers are used in the independent test set. A development set will be released shortly. After removing unusable tokens identified during post-processing, the training set consists of 6664 clean tokens.

Seven test sets, corresponding to a quiet condition and 6 noise conditions, are available. Each test set contains 16 instances of each of the 24 consonants, for a total of 384 tokens. Listeners will identify consonants in each of the test conditions. At a minimum, each contribution to the special session should report results on one or more of the test sets. Scoring software will be released in February 2008.
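
For contributions that generate their own recognition output, the basic score is simply the percentage of the 384 tokens in a test set identified correctly. The sketch below assumes a hypothetical results file with one "token_id true_consonant recognised_consonant" triple per line; reported results should of course use the official scoring software.

    # Percentage-correct scoring over one test set (hypothetical file format).
    def percent_correct(results_path):
        correct = total = 0
        with open(results_path) as f:
            for line in f:
                _, truth, response = line.split()
                total += 1
                correct += (truth == response)
        return 100.0 * correct / total

    print(percent_correct("testset1_results.txt"))   # hypothetical file name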

Noise

The table below shows the 7 test conditions:

  Test set   Noise type                      SNR (dB)
  1          clean
  2          competing talker                -6
  3          8-talker babble                 -2
  4          speech-shaped noise             -6
  5          factory noise                    0
  6          modulated speech-shaped noise   -6
  7          3-speaker babble                -3

These noise types provide a challenging and varied range of conditions. Signal-to-noise ratios were determined in pilot tests with listeners, with the goal of producing similar identification scores (approximately 65-70%) in each noise condition.

VCV tokens are additively embedded in noise samples of duration 1.2s. The SNR is computed token-wise and refers to the SNR in the section where the speech and noise overlap. The time of onset of each VCV takes on one of 8 values ranging from 0 to 400 ms.
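
The mixing procedure can be sketched as follows, assuming the speech is scaled to reach the target SNR over the overlap region (the organisers' actual mixing code may instead scale the noise; this is only a reconstruction of the description above, not the released tools).

    import numpy as np

    # Embed a VCV token in a 1.2 s noise sample at a target SNR, where the SNR
    # is computed over the section in which speech and noise overlap.
    def embed(vcv, noise, snr_db, onset_samples):
        mixture = noise.copy()
        overlap = mixture[onset_samples:onset_samples + len(vcv)]
        speech_power = np.mean(vcv ** 2)
        noise_power = np.mean(overlap ** 2)
        # Choose a gain so that 10*log10(gain^2 * speech_power / noise_power) == snr_db.
        gain = np.sqrt(noise_power * 10 ** (snr_db / 10.0) / speech_power)
        mixture[onset_samples:onset_samples + len(vcv)] += gain * vcv
        return mixture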

In addition, the test materials are also available as “stereo” sound files, identical to the test sets except that the noise and VCV tokens are placed in separate channels. We have made the test material available in this form to support computational models of human consonant perception which may wish to make assumptions about, for example, ideal noise processing, and also to allow idealised engineering systems to be built (e.g. to determine performance ceilings). Of course, contributors should clearly distinguish which of their results are based on the single-channel and which on the dual-channel noise sets.
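
As an example of how the dual-channel material might be used, the sketch below reads a stereo file, reconstructs the single-channel mixture, and computes a frame-level local SNR that an idealised system could exploit. The file name, the assignment of speech and noise to the two channels, and the frame size are all assumptions.

    import numpy as np
    from scipy.io import wavfile

    rate, data = wavfile.read("testset2/example_stereo.wav")   # hypothetical file name
    speech = data[:, 0].astype(float)                          # channel assignment assumed
    noise = data[:, 1].astype(float)
    mixture = speech + noise                                   # the single-channel stimulus

    # Frame-level local SNR in dB (assuming 10 ms frames).
    frame = int(0.01 * rate)
    n_frames = len(speech) // frame
    local_snr = [10 * np.log10((np.sum(speech[i*frame:(i+1)*frame] ** 2) + 1e-10) /
                               (np.sum(noise[i*frame:(i+1)*frame] ** 2) + 1e-10))
                 for i in range(n_frames)]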

Download training, test, and development material.

Experimental Set-up

Twenty-seven native English listeners aged between 18 and 48 who reported no hearing problems identified the 384 VCVs of each test set. Listeners were drawn from the staff and students of the University of Sheffield and were paid for their participation. Perception tests ran under computer control in an IAC booth. Listeners were presented with a screen layout, shown in Figure 1, on which the 24 consonants were represented using both ASCII symbols and an example word containing the sound. Listeners were phonetically naive and were given instructions as to the meaning of each symbol. They underwent a short practice session prior to the main test. Two listeners failed to reach a criterion level of 85% in a practice session using clean tokens; another failed to complete all conditions, and a fourth was an outlier on most of the test conditions. Results are reported for the remaining 23 listeners. For the main test, listeners started with the clean condition; the order of the noisy conditions was randomised.

We welcome further contributions of listener results from native British English, other native English and non-native populations. We can make available MATLAB software for running listening tests if needed. Note that we anticipate that listening to the full range of tests will take 90-120 minutes in addition to the time taken for hearing tests and practice. Please contact the organisers for further information and to ensure that potential contributions are as useful as possible.

Results

The native listener results, averaged over all consonants and all listeners, are shown in the table below for each of the test conditions separately:

  Test set        1      2      3      4      5      6      7
  Rec. rate (%)   93.8   79.5   76.5   72.2   66.7   79.2   71.4
  Std. err.       0.57   0.78   0.79   0.75   0.77   0.61   0.74

Confusion matrices and Transmitted Information have been calculated for each of the test conditions separately.

The diagonal of each confusion matrix shows the percentage of correct responses. Rows (vertical axis) give the phoneme that was produced; columns (horizontal axis) give the phoneme that was recognised.

The table used to calculate the Transmitted Information for manner, place, and voicing can be downloaded here.
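
For reference, transmitted information can be computed from a confusion matrix in the usual Miller-and-Nicely manner; the sketch below (in bits, for a NumPy array of counts) is an illustration rather than the organisers' code. Pooling rows and columns by the manner, place and voicing classes in the downloadable table before calling the function gives the feature-level measures.

    import numpy as np

    # Transmitted information from a confusion matrix
    # (rows: produced phoneme, columns: recognised phoneme).
    def transmitted_information(confusions):
        joint = confusions / confusions.sum()           # joint distribution p(x, y)
        px = joint.sum(axis=1, keepdims=True)           # stimulus marginal
        py = joint.sum(axis=0, keepdims=True)           # response marginal
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))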

The baseline recognition system

The performance of various acoustic features (MFCC, FBANK, MELSPEC, Ratemaps) and recogniser architectures (monophone, triphone, gender-dependent/independent) was investigated. Two representative combinations were chosen as baselines for the Consonant Challenge: one system based on MFCCs, the other on Ratemaps.

For both systems, 30 models were trained: one for each of the 24 consonants, plus two models for each of the three vowels – one modelling the initial vowel context and one modelling the final vowel context of the VCV. The maximum number of training items per consonant is 9 (vowel contexts) * 2 (stress conditions) * 16 (speakers) = 288 items.
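
The resulting model inventory can be written out as below; the label names are assumptions for illustration, since the actual names are defined in the baseline scripts.

    # 30-model inventory: 24 consonants plus initial/final context models for each vowel.
    consonant_models = ["b", "d", "g", "p", "t", "k", "s", "sh", "f", "v", "th", "dh",
                        "ch", "z", "zh", "h", "jh", "m", "n", "ng", "w", "r", "y", "l"]
    vowel_models = [v + pos for v in ("iy", "uw", "ae") for pos in ("_initial", "_final")]
    models = consonant_models + vowel_models
    assert len(models) == 30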

MFCC-based recognition system

The speech is parameterised as 12 MFCC coefficients plus log energy, augmented with first and second temporal derivatives, resulting in a 39-dimensional feature vector. Each monophone consists of 3 emitting states with a 24-component Gaussian mixture output distribution. No silence or short-pause models are employed, as the features are end-pointed. The HMMs were trained from a flat start using HTK.
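
A roughly comparable 39-dimensional parameterisation can be produced outside HTK as sketched below, assuming 16 kHz audio with 25 ms windows and a 10 ms shift; the exact values will differ from HTK's output, and the file name is hypothetical.

    import numpy as np
    import librosa

    y, sr = librosa.load("example_vcv.wav", sr=None)                # hypothetical file name
    n_fft, hop = 400, 160                                           # 25 ms / 10 ms at 16 kHz (assumed)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft, hop_length=hop)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-10)[np.newaxis, :]
    n = min(mfcc.shape[1], log_e.shape[1])                          # padding conventions differ
    static = np.vstack([mfcc[:, :n], log_e[:, :n]])                 # 13 static features per frame
    features = np.vstack([static,
                          librosa.feature.delta(static),            # first derivatives
                          librosa.feature.delta(static, order=2)])  # second derivatives
    print(features.shape[0])                                        # 39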

Ratemap-based recognition system

Ratemaps are an auditory filterbank representation based on excitation patterns. The feature vectors are 64-dimensional, and the same model architecture as for the MFCC-based system was used.

A .zip file containing the MFCC-based baseline model, training and testing scripts, and the evaluation script can be downloaded here. The ratemap generation scripts are available upon request. A short explanation of the scripts can be downloaded here. For any remaining questions please contact Ning Ma (University of Sheffield, UK).

Results

The overall consonant recognition accuracy on Test set 1 (clean) is 88.5% for the MFCC-based recognition system and 84.4% for the Ratemap-based system.

The confusion matrices of the two baseline systems can be found here. The diagonal of the confusion matrices shows the number of correct responses (16 is the maximum). Vertically: the phoneme that was produced; horizontally: the phoneme that was recognised.

The speech material

A description of the material can be found on the Description of the materials page.

The .zip files with both sets of test material and with the development material each contain seven directories, named testsetX and devsetX respectively, where X is the number associated with the noise type in the table on the Description of the materials page. In addition, test.zip contains seven directories named testsetXp. These hold a small number of practice stimuli, which can be used in the perceptual experiments to give listeners a short practice session. If you are not running perceptual experiments, you can ignore these directories. Finally, test.zip and dev.zip also contain six files named testsetX_offsets.dat and devsetX_offsets.dat. These are (MATLAB) files giving the offset of each VCV into the noise. People building ASR systems are not allowed to use these data.
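
A quick sanity check over the unpacked test material might look like the sketch below; the audio file extension is an assumption.

    import glob
    import os

    # Count the tokens in each unpacked test set directory.
    for x in range(1, 8):
        wavs = sorted(glob.glob(os.path.join("testset%d" % x, "*.wav")))
        print("testset%d: %d files" % (x, len(wavs)))   # 384 tokens expected per test set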

Phoneme segmentation data

The models of the MFCC-based baseline recognition system (click here for a description of the model set) were used to create phoneme segmentations of the clean test data using forced alignment.

Warning: everyone is invited to use these segmentations, but be aware that they are not ‘perfect’ segmentations. If you obtain better segmentations and would like to share them, please send them to the organisers via e-mail and we will post them on this website.

A number of people are helping us with the realisation of this Challenge. Many thanks go out to them!

  • The PASCAL Network: for financial support.
  • Maria Luisa Garcia Lecumberri (University of the Basque Country, Spain): for help with the design of the speech material, analysis of the production material, and the design of listening experiments.
  • Ning Ma (University of Sheffield, UK): for on-going work on the baseline recogniser.
  • Youyi Lu (University of Sheffield, UK): for helping run the listening experiments.
  • Matt Gibson (University of Sheffield, UK): for discussions on the baseline speech recogniser.