Autoregressive models for text independent speaker identification in noisy environments
The closed-set speaker identification problem is defined as the search, within a known set of persons, for the speaker of a given utterance. The Gaussian mixture model (GMM) classifier is reported to achieve very high classification accuracies (in the range 95% to 100%) when both the training and testing utterances are recorded in a soundproof studio, i.e., when the speech signals suffer neither additive noise nor spectral distortion. In real-life applications, however, speech is usually corrupted by noise and band-limitation. Moreover, the recording conditions of the training and testing environments are mismatched. As a result, the classification accuracy of GMM-based systems deteriorates significantly. In this thesis, we propose a two-step procedure for improving speaker identification performance in noisy environments. In the first step, we introduce a new classifier: the vector autoregressive Gaussian mixture (VARGM) model. Unlike the GMM, the new classifier models correlations between successive feature vectors. We also integrate the proposed method into the framework of the universal background model (UBM). In addition, we develop the learning procedure according to the maximum likelihood (ML) criterion. Based on a thorough experimental evaluation, the proposed method achieves an improvement of 3% to 5% in identification accuracy. In the second step, we propose a new compensation technique based on the generalized maximum likelihood (GML) decision rule. In particular, we assume a general form for the distribution of the noise-corrupted utterances, which contains two types of parameters: clean-speech-related parameters and noise-related parameters. While the clean-speech-related parameters are estimated during the training phase, the noise-related parameters are estimated from the corrupted speech in the testing phase.
We applied the proposed method to utterances of 50 speakers selected from the TIMIT database, artificially corrupted by convolutive and additive noise. The signal-to-noise ratio (SNR) varies from 0 to 20 dB. Simulation results reveal that the proposed method achieves good robustness against variation in the SNR. For utterances corrupted by convolutive noise, the improvement in classification accuracy ranges from 70% for SNR = 0 dB to around 4% for SNR = 10 dB, compared to the standard ML decision rule. For utterances corrupted by additive noise, the improvement in classification accuracy ranges from 1% to 10% for SNRs ranging from 0 to 20 dB. The proposed VARGM classifier is also applied to the speech emotion classification problem. In particular, we use the Berlin emotional speech database to validate the classification performance of the proposed VARGM classifier. The proposed technique provides a classification accuracy of 76%, versus 71% for the hidden Markov model (HMM), 67% for k-nearest neighbors, and 55% for feed-forward neural networks. The model also discriminates better than the HMM between high-arousal emotions (joy, anger, fear), low-arousal emotions (sadness, boredom), and neutral emotions. Another interesting application of the VARGM model is the blind equalization of multiple-input multiple-output (MIMO) communication channels. Based on VARGM modeling of MIMO channels, we propose a four-step equalization procedure. First, the received data vectors are fitted to a VARGM model using the expectation maximization (EM) algorithm. The constructed VARGM model is then used to filter the received data. A Bayesian decision rule is then applied to identify the transmitted symbols up to permutation and phase ambiguities, which are finally resolved using a small training sequence. Moreover, we propose a fast and easily implementable model order selection technique.
The new equalization algorithm is compared to the whitening method and found to provide a lower symbol error probability. The proposed technique is also applied to frequency-flat slow fading channels and found to provide a more accurate estimate of the channel response than the blind de-convolution exploiting channel encoding (BDCC) method, and at a higher information rate.
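As an illustration of the baseline that the abstract starts from, the following is a minimal sketch of closed-set speaker identification with GMM speaker models and the standard ML decision rule: the utterance is assigned to the speaker whose model gives the highest total log-likelihood. It is not the thesis's VARGM or GML method. The function names, the diagonal-covariance assumption, and the toy two-speaker models are all hypothetical and chosen only for the example.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of frames (T x D) under a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D). Frames are treated as
    independent, which is the GMM assumption the abstract contrasts with VARGM.
    """
    frames = np.asarray(frames, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_weighted = log_comp + np.log(weights)[None, :]               # (T, K)
    # Log-sum-exp over mixture components, computed stably.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(frame_ll.sum())


def identify_speaker(frames, speaker_models):
    """ML decision rule: return the best-scoring speaker and all scores."""
    scores = {
        name: gmm_log_likelihood(frames, *params)
        for name, params in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


# Toy example (hypothetical models, not trained on real speech features).
rng = np.random.default_rng(0)
speaker_models = {
    "spk_a": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "spk_b": (np.array([0.5, 0.5]), np.array([[3.0, 3.0], [4.0, 4.0]]), np.ones((2, 2))),
}
test_frames = rng.normal(loc=3.5, scale=1.0, size=(200, 2))
best, scores = identify_speaker(test_frames, speaker_models)
print(best, scores)
```

In a real system the frames would be feature vectors extracted from speech and the model parameters would come from ML training; the sketch only shows the scoring and decision step.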
Cite this version of the work
Moataz El Ayadi (2008). Autoregressive models for text independent speaker identification in noisy environments. UWSpace. http://hdl.handle.net/10012/4036