Audio-Visual Feature Fusion through Transformers for Automated Depression Screening in Social Media Content

dc.contributor.author: Haque, Md Rezwanul
dc.date.accessioned: 2026-04-28T12:46:55Z
dc.date.available: 2026-04-28T12:46:55Z
dc.date.issued: 2026-04-28
dc.date.submitted: 2026-04-27
dc.description.abstract: Depression has become a critical public health concern, with the World Health Organization reporting that over 280 million people worldwide are affected by it. The rapid growth of social media, particularly video blogs, has drawn research attention toward analyzing user-generated audiovisual content for signs of depression. These videos capture natural facial expressions, voice characteristics, and speech patterns that may reveal more about a person's emotional state than verbal self-reports alone. However, extracting useful features from such noisy, unstructured data and combining audio and visual information in a way that preserves their complementary nature remain open problems in this domain. The thesis is organized into two main contributions. In the first part, we propose MDD-Net, a multimodal depression detection network that uses a mutual transformer to fuse acoustic and visual features. The acoustic branch employs a global self-attention network to process 25 low-level descriptors including loudness, Mel-Frequency Cepstral Coefficients, and spectral flux, capturing both content-based and positional relationships. The visual branch applies hierarchical multi-head self-attention on 68 facial landmarks extracted from each video frame. The mutual transformer then operates bidirectionally: audio queries attend to visual keys and values, and visual queries attend to audio keys and values. We also design a composite loss function that combines binary cross-entropy, focal loss, and L2 regularization to handle the noisy labels and class imbalance that are common in social media datasets. In the second part, we introduce MMFformer, a multimodal fusion transformer network that takes a different approach to the same problem. For video, a pre-trained vision transformer augmented with residual connections extracts high-level spatial patterns from facial data. For audio, a transformer encoder built on the audio spectrogram transformer paradigm models temporal dynamics in speech signals through patch and positional embeddings. On the fusion side, we propose and compare three distinct strategies: late transformer fusion, intermediate transformer fusion, and intermediate attention fusion, each operating at a different level of the processing pipeline. We evaluate both architectures on the D-Vlog dataset, a publicly available collection of 961 YouTube vlogs from 816 individuals annotated for depression. MMFformer is additionally tested on the LMVD dataset, a larger corpus of 1,823 vlogs collected from four different social media platforms. MDD-Net reaches an F1-Score of 77.07% on D-Vlog, which is an improvement ranging from 1.82% to 17.37% over previously reported methods. MMFformer achieves 90.92% on D-Vlog and 90.48% on LMVD, surpassing the best existing results by 13.92% and 7.74%, respectively. Cross-corpus validation between D-Vlog and LMVD further confirms that the developed architectures generalize across different platforms and populations.
dc.identifier.uri: https://hdl.handle.net/10012/23069
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo (en)
dc.title: Audio-Visual Feature Fusion through Transformers for Automated Depression Screening in Social Media Content
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo (en)
uws-etd.embargo.terms: 0
uws.contributor.advisor: Karray, Fakhri
uws.contributor.advisor: Ho, Pin-Han
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed (en)
uws.published.city: Waterloo (en)
uws.published.country: Canada (en)
uws.published.province: Ontario (en)
uws.scholarLevel: Graduate (en)
uws.typeOfResource: Text (en)
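
The abstract describes two mechanisms in MDD-Net that are easier to follow with a small example: the bidirectional (mutual) attention between modalities, and the composite loss. The following is a minimal sketch in plain Python, not the thesis implementation. It assumes single-head scaled dot-product attention with no learned projection matrices, and it assumes that audio and visual features share the same dimension. The function names and the parameters alpha, gamma, lam_focal, and lam_l2 are hypothetical and do not come from the record.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mutual_attention(audio, visual):
    """Attention in both directions (assumption: no learned projections)."""
    audio_to_visual = cross_attention(audio, visual, visual)  # audio queries
    visual_to_audio = cross_attention(visual, audio, audio)   # visual queries
    return audio_to_visual, visual_to_audio

def composite_loss(probs, labels, weights,
                   alpha=0.25, gamma=2.0, lam_focal=1.0, lam_l2=1e-4):
    """Binary cross-entropy + focal loss + L2 penalty (hypothetical weighting)."""
    eps = 1e-7
    bce = 0.0
    focal = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha   # class-balancing factor
        bce += -math.log(p_t)
        # The focal term down-weights examples that are already well classified.
        focal += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    n = len(probs)
    l2 = sum(w * w for w in weights)             # L2 penalty on model weights
    return bce / n + lam_focal * focal / n + lam_l2 * l2
```

Calling mutual_attention(audio, visual) with an audio sequence of length Ta and a visual sequence of length Tv returns one sequence of length Ta (audio attending to visual) and one of length Tv (visual attending to audio). A full model would typically add learned query, key, and value projections and multiple heads before fusing the two outputs.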

Files

Original bundle

Name: Haque_Md Rezwanul.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission
