|Title: ||Autoregressive models for text independent speaker identification in noisy environments|
|Authors: ||El Ayadi, Moataz|
|Approved Date: ||25-Sep-2008 |
|Date Submitted: ||2008 |
|Abstract: ||The closed-set speaker identification problem is defined as the search within a set of persons for the speaker of a certain
utterance. It is reported that the Gaussian mixture model (GMM) classifier achieves very high classification accuracies (in the
range 95% to 100%) when both the training and testing utterances are recorded in a soundproof studio, i.e., there is neither additive
noise nor spectral distortion to the speech signals.
However, in real-life applications, speech is usually corrupted by noise and band-limitation. Moreover, there is a mismatch between the recording conditions of the training and testing environments. As a
result, the classification accuracy of GMM-based systems deteriorates significantly. In this thesis, we propose a two-step procedure for improving speaker identification performance in noisy environments. In the first step, we introduce a new classifier: the vector autoregressive Gaussian mixture (VARGM) model. Unlike the
GMM, the new classifier models correlations between successive feature vectors. We also integrate the proposed method into the
framework of the universal background model (UBM). In addition, we develop the learning procedure according to the maximum likelihood
(ML) criterion. Based on a thorough experimental evaluation, the proposed method achieves an improvement of 3 to 5% in the
In the second step, we propose a new compensation technique based on the generalized maximum likelihood (GML) decision rule. In particular, we assume a general form for the distribution of the
noise-corrupted utterances, which contains two types of parameters: clean-speech-related parameters and noise-related parameters. While the clean-speech-related parameters are estimated during the
training phase, the noise-related parameters are estimated from the corrupted speech in the testing phase. We applied the proposed
method to utterances of 50 speakers selected from the TIMIT database, artificially corrupted by convolutive and additive noise.
The signal-to-noise ratio (SNR) varies from 0 to 20 dB. Simulation results reveal that the proposed method achieves good robustness
against variation in the SNR. For utterances corrupted by convolutive noise, the improvement in the classification accuracy ranges from 70% for SNR = 0 dB to around 4% for SNR = 10 dB, compared to the standard ML decision rule. For utterances corrupted by additive noise, the improvement in the classification accuracy ranges from 1% to 10% for SNRs ranging from 0 to 20 dB.
The proposed VARGM classifier is also applied to the speech emotion classification problem. In particular, we use the Berlin emotional speech database to validate the classification performance of the proposed VARGM classifier. The proposed technique provides a classification accuracy of 76% versus 71% for the hidden Markov model (HMM), 67% for k-nearest neighbors, and 55% for feed-forward neural networks. The model also gives better discrimination between
high-arousal emotions (joy, anger, fear), low-arousal emotions (sadness, boredom), and neutral emotions than the HMM.
Another interesting application of the VARGM model is the blind equalization of multiple-input multiple-output (MIMO) communication
channels. Based on VARGM modeling of MIMO channels, we propose a four-step equalization procedure. First, the received data vectors are fitted to a VARGM model using the expectation maximization (EM) algorithm. The constructed VARGM model is then used to filter the received data. A Bayesian decision rule is then applied to
identify the transmitted symbols up to permutation and phase ambiguities, which are finally resolved using a small training
sequence. Moreover, we propose a fast and easily implementable model order selection technique. The new equalization algorithm is
compared to the whitening method and found to provide a lower symbol error probability. The proposed technique is also applied to
frequency-flat slow fading channels and found to provide a more accurate estimate of the channel response than that provided by the blind de-convolution exploiting channel encoding (BDCC) method and at a higher information rate.|
|Program: ||Electrical and Computer Engineering|
|Department: ||Electrical and Computer Engineering|
|Degree: ||Doctor of Philosophy|
|Appears in Collections:||Faculty of Engineering Theses and Dissertations |
Electronic Theses and Dissertations (UW)
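
The abstract describes the VARGM classifier only at a high level: unlike a GMM, it models correlations between successive feature vectors, and speakers are identified with a maximum likelihood decision. The following is a minimal sketch of that idea, not the thesis's actual formulation. It assumes a first-order model in which each mixture component predicts the current frame from the previous one as `A_k @ x_prev + b_k` with Gaussian prediction error. The function names (`vargm_log_likelihood`, `identify`) and the parameter layout are hypothetical and chosen for illustration only.

```python
import numpy as np

def vargm_log_likelihood(X, weights, A, b, covs):
    """Conditional log-likelihood of frames X[1:] given their predecessors.

    Assumed (hypothetical) first-order form:
        p(x_t | x_{t-1}) = sum_k w_k * N(x_t; A_k x_{t-1} + b_k, S_k)
    """
    X = np.asarray(X, dtype=float)
    prev, curr = X[:-1], X[1:]
    d = X.shape[1]
    log_terms = []
    for w, A_k, b_k, S_k in zip(weights, A, b, covs):
        resid = curr - (prev @ A_k.T + b_k)            # per-frame prediction error
        _, logdet = np.linalg.slogdet(S_k)
        maha = np.einsum("ti,ti->t", resid @ np.linalg.inv(S_k), resid)
        log_terms.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.stack(log_terms)                     # shape (K, T-1)
    m = log_terms.max(axis=0)                           # log-sum-exp over components
    return float(np.sum(m + np.log(np.exp(log_terms - m).sum(axis=0))))

def identify(X, models):
    """Closed-set ML decision: return the speaker whose model scores X highest."""
    scores = {name: vargm_log_likelihood(X, **p) for name, p in models.items()}
    return max(scores, key=scores.get)
```

Setting every `A_k` to zero removes the dependence on the previous frame, so the sketch reduces to an ordinary GMM score over frames `X[1:]`. This is one way to see what the autoregressive term adds. Training (the EM procedure mentioned in the abstract) is not shown here.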
All items in UWSpace are protected by copyright, with all rights reserved.