Addressing Domain Shifts for Computer Vision Applications via Language
Date
2025-05-23
Authors
Advisor
Rambhatla, Sirisha
Wong, Alexander
Publisher
University of Waterloo
Abstract
Semantic segmentation is used in safety-critical applications such as autonomous driving and cancer diagnosis, where accurately identifying small and rare objects is essential. However, pixel-level annotations are expensive and time-consuming, and distribution shifts between datasets (e.g., daytime to snowy weather in self-driving, or color variations in tumor scans across hospitals) further degrade a model's ability to generalize. Unsupervised domain adaptation for semantic segmentation (DASS) addresses this challenge by training models on a labeled source distribution and adapting them to an unlabeled target domain.
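As a rough illustration of this setup (a minimal sketch with assumed names and a placeholder confidence threshold, not the thesis implementation), a DASS training step typically combines a supervised loss on labeled source images with a self-training loss on pseudo-labeled target images:

import torch
import torch.nn.functional as F

def dass_step(model, src_img, src_label, tgt_img, ignore_index=255, threshold=0.9):
    # Supervised cross-entropy on the labeled source domain.
    src_loss = F.cross_entropy(model(src_img), src_label, ignore_index=ignore_index)
    # Self-training on the unlabeled target domain: the model's own confident
    # predictions serve as pseudo-labels (these can be noisy and source-biased).
    with torch.no_grad():
        probs = torch.softmax(model(tgt_img), dim=1)
        conf, pseudo = probs.max(dim=1)
        pseudo[conf < threshold] = ignore_index  # discard low-confidence pixels
    tgt_loss = F.cross_entropy(model(tgt_img), pseudo, ignore_index=ignore_index)
    return src_loss + tgt_loss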
Existing DASS methods rely on either vision-only or language-based techniques. Vision-only frameworks, which use strategies such as masking and multi-resolution crops, implicitly learn spatial relationships between image patches but often suffer from noisy pseudo-labels biased toward the source domain. To mitigate noisy predictions, language-based DASS methods leverage the generalized representations learned through large-scale language pre-training. However, these approaches use generic class-level prompts (e.g., "a photo of a {class}") and fail to capture complex spatial relationships between objects, which are key for dense prediction tasks like semantic segmentation.
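To make the class-level prompting concrete, the sketch below uses the open-source CLIP package; the class list and model variant are assumptions chosen for illustration, and only the prompt template comes from the description above.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["road", "sidewalk", "building", "pedestrian", "car"]  # illustrative subset
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    # One embedding per class name; nothing here encodes how the objects
    # relate spatially within a scene, which is the limitation noted above.
    class_text_embeddings = model.encode_text(prompts)  # (num_classes, embed_dim)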
To address these limitations, we propose LangDA, a language-guided DASS framework that enhances spatial context-awareness by leveraging vision-language models (VLMs). LangDA generates scene-level descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings") to encode relationships between objects. At the image level, LangDA aligns each image's feature representation with the corresponding scene-level text embedding, improving the model's ability to generalize across domains. LangDA eliminates the need for cumbersome manual prompt tuning and expensive human feedback, ensuring consistency and reproducibility.
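One way to picture this image-level alignment is the sketch below: pooled encoder features are projected into the text space and pulled toward the embedding of the VLM-generated scene caption via a cosine-similarity loss. The module name, projection design, and shapes are assumptions for illustration, not the thesis implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextAlignment(nn.Module):
    """Pull a pooled image representation toward a scene-level caption embedding."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        # Learned projection from dense segmentation features into the text space.
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, feats: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) encoder features; caption_emb: (B, D) embeddings of
        # VLM-generated captions such as "a pedestrian is on the sidewalk ...".
        pooled = F.normalize(self.proj(feats.mean(dim=(2, 3))), dim=-1)  # (B, D)
        caption_emb = F.normalize(caption_emb, dim=-1)
        # Cosine alignment: maximize similarity, i.e. minimize 1 - cos(pooled, text).
        return (1.0 - (pooled * caption_emb).sum(dim=-1)).mean()

In training, a term of this kind would be weighted and added to the usual segmentation and self-training objectives.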
LangDA achieves state-of-the-art performance on three self-driving DASS benchmarks: Synthia to Cityscapes, Cityscapes to ACDC, and Cityscapes to DarkZurich, surpassing existing methods by 2.6%, 1.4%, and 3.9%, respectively. Ablation studies confirm the effectiveness of context-aware image-level alignment over pixel-level alignment. These results demonstrate LangDA's capability to leverage spatial relationships encoded in language to accurately segment objects under domain shift.
Keywords
Semantic Segmentation, Machine Learning, Unsupervised Domain Adaptation, Deep Learning, Image Segmentation, Self-driving, Domain Shift, Distribution Shift, Computer Vision, Vision Language Models, Large Language Models, Cross-modal, Multi-modal, Artificial Intelligence