Park, Jinman
2025-08-28, 2025-08-28, 2025-08-28, 2025-07-23
https://hdl.handle.net/10012/22314

Salient object detection (SOD) is a core problem in computer vision that involves identifying and segmenting the most visually prominent regions in an image. Its relevance spans a wide range of applications, including image understanding, object recognition, scene parsing, and human-computer interaction, as well as safety-critical domains such as autonomous driving, robotics, and medical imaging. Despite substantial progress, modern SOD models often rely on dense, pixel-level computation that imposes high computational and memory costs, limiting their deployment in resource-constrained environments. This thesis investigates an alternative paradigm for salient object detection based on superpixel representations: compact, perceptually homogeneous regions that reduce spatial redundancy while preserving boundary structure. Although superpixels offer significant efficiency advantages, their irregular and heterogeneous nature presents unique challenges for integration into modern deep learning frameworks. Furthermore, traditional augmentation strategies and transfer learning pipelines are not readily compatible with superpixel-based models, complicating training and generalization. To address these challenges, we propose SuperFormer, a lightweight vision transformer architecture tailored for superpixel-based saliency detection.
Our contributions include: (1) formulating SOD as a superpixel-to-superpixel learning task to reduce computational overhead, (2) introducing a heterogeneity-aware feature representation that fuses color, texture, and shape cues, (3) adapting vision transformer architectures to operate on irregular superpixel inputs via novel positional encodings and mix-attention decoding, (4) designing superpixel-specific augmentation strategies and demonstrating the effectiveness of ImageNet pre-training in this context, and (5) conducting extensive evaluations on seven benchmark datasets, where our models (SF-S, SF-XS, and SF-XXS) achieve state-of-the-art results among lightweight SOD methods in terms of accuracy, FLOPs, and parameter efficiency. Overall, this thesis demonstrates that with the proper architectural adaptations and training strategies, superpixel representations can enable efficient, interpretable, and high-performing salient object detection, laying the groundwork for broader adoption of structured visual abstractions in deep learning.

en
superpixel; salient object detection; vision transformers; heterogeneity; efficiency; feature augmentation
Superpixel Salient Object Detection
Doctoral Thesis