Speaker Diarization Using Improved SincNet Models to Extract Speaker Embeddings

Dib, Mohammad

Speaker Diarization Using Improved SincNet Models to Extract Speaker Embeddings

Files

Dib_Mohammad.pdf (653.41 KB)

Date

2021-10-25

Authors

Dib, Mohammad

Advisor

Basir, Otman

Publisher

University of Waterloo

Abstract

Speaker diarization is the process of identifying who spoke when in an audio stream, and it is applied in many fields, such as information retrieval and psychotherapy. Speaker embeddings extraction is a crucial step in any diarization system, where the goal is to extract highly discriminative speaker embeddings (d-vectors). Most of the existing methods are based on deep neural networks (DNNs) and they rely on engineered features, which may not guarantee optimal performance for all cases. This led to the development of the SincNet model, which can effectively and efficiently process raw input audio signals. The SincNet model was successfully used to perform embeddings extraction in a speaker diarization system, where it resulted in a high diarization performance. Its successor, the AM-SincNet model, which combines SincNet with an improved loss function, outperformed the standard SincNet on the speaker diarization task. This shows the importance of enhancing the loss function of SincNet to achieve better diarization performance. Thus, the goal of this thesis is to improve the ability of the SincNet model to extract discriminative embeddings such that it results in a better diarization performance by experimenting with different architectures and state-of-the-art loss functions. In this thesis, 16 different SincNet based models were proposed as follows: four models that combine the SincNet architecture with four different loss functions, six models that combine the Res-SincNet architecture (a recently proposed architecture) with six different loss functions, and six models that combine the Res-SincNet-FC architecture (proposed in this thesis) with six different loss functions. The results show that the proposed MV-AM-SincNet model gives the best diarization performance compared to all the models discussed in this thesis. This shows the high capability of the MV-Softmax loss at extracting highly discriminative embeddings compared to the other losses. Additionally, the speaker recognition performance was reported, since all the models were trained for speaker recognition before being applied in speaker diarization. It was found that the proposed Res-SincNet-FC architecture resulted in the lowest frame error rate (FER) when combined with the different loss functions, where the D-Res-SincNet-FC and Arc-Res-SincNet-FC achieved the lowest FER. The Visualization of the extracted embeddings and the diarization output of the MV-AM-SincNet model showed its ability to extract highly discriminative embeddings. However, the visualization showed that having a large number of overlapping segments and/or small speaker segments impacts the diarization performance negatively. In this thesis, significant improvements on the SincNet model were made, which assists in achieving higher speaker recognition and diarization performance, where the raw audio signals were processed efficiently and effectively, without the need for feature engineering.