Audio-Visual Feature Fusion through Transformers for Automated Depression Screening in Social Media Content

Haque, Md Rezwanul

dc.contributor.author	Haque, Md Rezwanul
dc.date.accessioned	2026-04-28T12:46:55Z
dc.date.available	2026-04-28T12:46:55Z
dc.date.issued	2026-04-28
dc.date.submitted	2026-04-27
dc.description.abstract	Depression has become a critical public health concern: the World Health Organization reports that over 280 million people worldwide are affected. The rapid growth of social media, particularly video blogs, has drawn research attention toward analyzing user-generated audiovisual content for signs of depression. These videos capture natural facial expressions, voice characteristics, and speech patterns that may reveal more about a person's emotional state than verbal self-reports alone. However, extracting useful features from such noisy, unstructured data, and combining audio and visual information in a way that preserves their complementary nature, remain open problems in this domain. This thesis makes two main contributions. In the first part, we propose MDD-Net, a multimodal depression detection network that uses a mutual transformer to fuse acoustic and visual features. The acoustic branch employs a global self-attention network to process 25 low-level descriptors, including loudness, Mel-Frequency Cepstral Coefficients, and spectral flux, capturing both content-based and positional relationships. The visual branch applies hierarchical multi-head self-attention to 68 facial landmarks extracted from each video frame. The mutual transformer then operates bidirectionally: audio queries attend to visual keys and values, and visual queries attend to audio keys and values. We also design a composite loss function that combines binary cross-entropy, focal loss, and L2 regularization to handle the noisy labels and class imbalance that are common in social media datasets. In the second part, we introduce MMFformer, a multimodal fusion transformer network that takes a different approach to the same problem. For video, a pre-trained vision transformer augmented with residual connections extracts high-level spatial patterns from facial data. For audio, a transformer encoder built on the audio spectrogram transformer paradigm models temporal dynamics in speech signals through patch and positional embeddings. For fusion, we propose and compare three strategies: late transformer fusion, intermediate transformer fusion, and intermediate attention fusion, each operating at a different level of the processing pipeline. We evaluate both architectures on the D-Vlog dataset, a publicly available collection of 961 YouTube vlogs from 816 individuals annotated for depression. MMFformer is additionally tested on the LMVD dataset, a larger corpus of 1,823 vlogs collected from four different social media platforms. MDD-Net reaches an F1-Score of 77.07% on D-Vlog, an improvement ranging from 1.82% to 17.37% over previously reported methods. MMFformer achieves 90.92% on D-Vlog and 90.48% on LMVD, surpassing the best existing results by 13.92% and 7.74%, respectively. Cross-corpus validation between D-Vlog and LMVD further confirms that the developed architectures generalize across different platforms and populations.
dc.identifier.uri	https://hdl.handle.net/10012/23069
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.title	Audio-Visual Feature Fusion through Transformers for Automated Depression Screening in Social Media Content
dc.type	Master Thesis
uws-etd.degree	Master of Applied Science
uws-etd.degree.department	Electrical and Computer Engineering
uws-etd.degree.discipline	Electrical and Computer Engineering
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Karray, Fakhri
uws.contributor.advisor	Ho, Pin-Han
uws.contributor.affiliation1	Faculty of Engineering
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
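
The bidirectional attention and the composite loss described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names (`cross_attention`, `mutual_attention`, `composite_loss`) are hypothetical, the attention omits the learned query/key/value projections and multiple heads that a real transformer would use, and the focal-loss parameters (`gamma`, `alpha`) and the weighting terms (`lam_focal`, `lam_l2`) are placeholder values, since the abstract does not state how the three loss terms are combined.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows.

    Simplification (assumption): no learned projections, single head.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mutual_attention(audio, visual):
    """Bidirectional fusion: audio queries over visual keys/values, and vice versa."""
    audio_to_visual = cross_attention(audio, visual, visual)
    visual_to_audio = cross_attention(visual, audio, audio)
    return audio_to_visual, visual_to_audio


def composite_loss(probs, labels, weights,
                   gamma=2.0, alpha=0.25, lam_focal=1.0, lam_l2=1e-4):
    """Binary cross-entropy + focal loss + L2 penalty on the weights.

    All default hyperparameters here are assumptions for illustration.
    """
    eps = 1e-12
    bce = 0.0
    focal = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha      # class-balancing factor
        log_pt = math.log(max(p_t, eps))
        bce += -log_pt
        focal += -a_t * (1.0 - p_t) ** gamma * log_pt  # down-weights easy examples
    n = len(probs)
    l2 = sum(w * w for w in weights)
    return bce / n + lam_focal * focal / n + lam_l2 * l2
```

In this sketch, `mutual_attention(audio, visual)` returns one fused sequence per direction, each with the same number of rows as its query sequence; how the thesis merges the two directions afterwards is not specified in the abstract.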

Files

Original bundle

Name: Haque_Md Rezwanul.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses
Electrical and Computer Engineering