Speaker Diarization Using Improved SincNet Models to Extract Speaker Embeddings
Speaker diarization is the process of identifying who spoke when in an audio stream, and it is applied in many fields, such as information retrieval and psychotherapy. Speaker embedding extraction is a crucial step in any diarization system, where the goal is to extract highly discriminative speaker embeddings (d-vectors). Most existing methods are based on deep neural networks (DNNs) and rely on engineered features, which may not guarantee optimal performance in all cases. This led to the development of the SincNet model, which can process raw input audio signals effectively and efficiently. The SincNet model was successfully used for embedding extraction in a speaker diarization system, where it achieved high diarization performance. Its successor, the AM-SincNet model, which combines SincNet with an improved loss function, outperformed the standard SincNet on the speaker diarization task. This shows the importance of enhancing the loss function of SincNet to achieve better diarization performance. The goal of this thesis is therefore to improve the ability of the SincNet model to extract discriminative embeddings, and thereby its diarization performance, by experimenting with different architectures and state-of-the-art loss functions. In this thesis, 16 SincNet-based models are proposed: four models that combine the SincNet architecture with four different loss functions, six models that combine the Res-SincNet architecture (a recently proposed architecture) with six different loss functions, and six models that combine the Res-SincNet-FC architecture (proposed in this thesis) with six different loss functions. The results show that the proposed MV-AM-SincNet model gives the best diarization performance of all the models discussed in this thesis. This shows the high capability of the MV-Softmax loss at extracting highly discriminative embeddings compared to the other losses.
Additionally, the speaker recognition performance is reported, since all the models were trained for speaker recognition before being applied to speaker diarization. The proposed Res-SincNet-FC architecture yielded the lowest frame error rate (FER) when combined with the different loss functions, with the D-Res-SincNet-FC and Arc-Res-SincNet-FC models achieving the lowest FER. Visualization of the extracted embeddings and of the diarization output of the MV-AM-SincNet model showed its ability to extract highly discriminative embeddings. However, the visualization also showed that a large number of overlapping segments and/or short speaker segments negatively impacts diarization performance. This thesis makes significant improvements to the SincNet model that help achieve higher speaker recognition and diarization performance while processing raw audio signals efficiently and effectively, without the need for feature engineering.
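The abstract does not define the margin-based loss functions it compares. As a rough illustration only, the following sketch assumes the commonly published additive-margin softmax formulation, in which a fixed margin is subtracted from the cosine similarity of the target speaker before scaling. The function name `am_softmax_loss` and the default values `scale=30.0` and `margin=0.35` are assumptions chosen for illustration, not values taken from the thesis.

```python
import math


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive-margin softmax loss for a single embedding (illustrative sketch).

    embedding: list of floats, the speaker embedding (hypothetical input).
    class_weights: list of weight vectors, one per speaker class.
    target: index of the true speaker class.
    scale, margin: assumed hyperparameters, not taken from the thesis.
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    unit_embedding = normalize(embedding)
    # Cosine similarity between the embedding and each class weight vector.
    cosines = [
        sum(a * b for a, b in zip(unit_embedding, normalize(weights)))
        for weights in class_weights
    ]
    # Subtract the margin from the target class only, then scale every logit.
    logits = [
        scale * (cosine - margin if index == target else cosine)
        for index, cosine in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp, giving the negative log-probability
    # of the target class.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

Because the margin lowers only the target logit, an embedding must sit closer to its own class direction to reach the same loss, which is the usual argument for why such losses encourage more discriminative embeddings.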
Cite this version of the work
Mohammad Dib (2021). Speaker Diarization Using Improved SincNet Models to Extract Speaker Embeddings. UWSpace. http://hdl.handle.net/10012/17666