Addressing Domain Shifts for Computer Vision Applications via Language
Date
2025-05-23
Authors
Advisor
Rambhatla, Sirisha
Wong, Alexander
Publisher
University of Waterloo
Abstract
Semantic segmentation is used in safety-critical applications such as autonomous driving and cancer diagnosis, where accurately identifying small and rare objects is essential. However, pixel-level annotations are expensive and time-consuming, and distribution shifts between datasets (e.g., daytime to snowy weather in self-driving, or color variations in tumor scans across hospitals) further degrade a model's ability to generalize. Unsupervised domain adaptation for semantic segmentation (DASS) addresses this challenge by training models on a labeled source distribution and adapting them to an unlabeled target domain.
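As a rough illustration of this setup (a minimal sketch with assumed names and a placeholder confidence threshold, not the thesis implementation), a DASS training step typically combines a supervised loss on labeled source images with a self-training loss on pseudo-labeled target images:

import torch
import torch.nn.functional as F

def dass_step(model, src_img, src_label, tgt_img, ignore_index=255, threshold=0.9):
    # Supervised cross-entropy on the labeled source domain.
    src_loss = F.cross_entropy(model(src_img), src_label, ignore_index=ignore_index)
    # Self-training on the unlabeled target domain: the model's own confident
    # predictions serve as pseudo-labels (these can be noisy and source-biased).
    with torch.no_grad():
        probs = torch.softmax(model(tgt_img), dim=1)
        conf, pseudo = probs.max(dim=1)
        pseudo[conf < threshold] = ignore_index  # discard low-confidence pixels
    tgt_loss = F.cross_entropy(model(tgt_img), pseudo, ignore_index=ignore_index)
    return src_loss + tgt_loss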
Existing DASS methods rely on either vision-only or language-based techniques. Vision-only frameworks, which use strategies such as masking and multi-resolution crops, implicitly learn spatial relationships between image patches but often suffer from noisy pseudo-labels biased toward the source domain. To mitigate noisy predictions, language-based DASS methods leverage the generalized representations learned through large-scale language pre-training. However, these approaches use generic class-level prompts (e.g., "a photo of a {class}") and fail to capture complex spatial relationships between objects, which are key for dense prediction tasks like semantic segmentation.
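To make the class-level prompting concrete, the sketch below uses the open-source CLIP package; the class list and model variant are assumptions chosen for illustration, and only the prompt template comes from the description above.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["road", "sidewalk", "building", "pedestrian", "car"]  # illustrative subset
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    # One embedding per class name; nothing here encodes how the objects
    # relate spatially within a scene, which is the limitation noted above.
    class_text_embeddings = model.encode_text(prompts)  # (num_classes, embed_dim)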
To address these limitations, we propose LangDA, a language-guided DASS framework that enhances spatial context-awareness by leveraging vision-language models (VLMs). LangDA generates scene-level descriptions (e.g., "a pedestrian is on the sidewalk, and the street is lined with buildings") to encode relationships between objects. At the image level, LangDA aligns each image's feature representation with the corresponding scene-level text embedding, improving the model's ability to generalize across domains. LangDA eliminates the need for cumbersome manual prompt tuning and expensive human feedback, ensuring consistency and reproducibility.
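One way to picture this image-level alignment is the sketch below: pooled encoder features are projected into the text space and pulled toward the embedding of the VLM-generated scene caption via a cosine-similarity loss. The module name, projection design, and shapes are assumptions for illustration, not the thesis implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextAlignment(nn.Module):
    """Pull a pooled image representation toward a scene-level caption embedding."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        # Learned projection from dense segmentation features into the text space.
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, feats: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) encoder features; caption_emb: (B, D) embeddings of
        # VLM-generated captions such as "a pedestrian is on the sidewalk ...".
        pooled = F.normalize(self.proj(feats.mean(dim=(2, 3))), dim=-1)  # (B, D)
        caption_emb = F.normalize(caption_emb, dim=-1)
        # Cosine alignment: maximize similarity, i.e. minimize 1 - cos(pooled, text).
        return (1.0 - (pooled * caption_emb).sum(dim=-1)).mean()

In training, a term of this kind would be weighted and added to the usual segmentation and self-training objectives.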
LangDA achieves state-of-the-art performance on three self-driving DASS benchmarks: Synthia to Cityscapes, Cityscapes to ACDC, and Cityscapes to DarkZurich, surpassing existing methods by 2.6%, 1.4%, and 3.9%, respectively. Ablation studies confirm the effectiveness of context-aware image-level alignment over pixel-level alignment. These results demonstrate LangDA's capability to leverage spatial relationships encoded in language to accurately segment objects under domain shift.
Keywords
Semantic Segmentation, Machine Learning, Unsupervised Domain Adaptation, Deep Learning, Image Segmentation, Self-driving, Domain Shift, Distribution Shift, Computer Vision, Vision Language Models, Large Language Models, Cross-modal, Multi-modal, Artificial Intelligence