SMaT-HSI: Structure-aware Mamba-Transformer Hybrid Model for Hyperspectral Image Classification

Date

2025-09-11

Advisor

Clausi, David
Xu, Linlin

Publisher

University of Waterloo

Abstract

Hyperspectral image (HSI) classification is a crucial task in remote sensing, playing a fundamental role in environmental monitoring, precision agriculture, urban planning, and mineral exploration. By leveraging the rich spectral information across hundreds of contiguous bands, HSI classification enables precise identification of materials and land-cover types, facilitating accurate mapping of vegetation, soil, water bodies, and built environments. Traditional convolutional neural network (CNN)-based methods effectively extract local spatial features, while transformer-based models excel at capturing global contextual dependencies; however, both approaches face challenges in fully exploiting the spectral and spatial dependencies inherent in hyperspectral data. Recently, Mamba, a state-space model (SSM)-based architecture, has shown promise in sequence modeling by efficiently capturing long-range dependencies with linear computational complexity. A comprehensive comparison of CNN-based, transformer-based, and Mamba-based models for HSI classification reveals that Mamba-based models achieve performance comparable to transformer-based models, highlighting their potential in this domain. Current Mamba-based methods often convert images into one-dimensional sequences and rely on scanning strategies to capture local spatial and spectral dependencies; however, these approaches struggle to fully represent the intricate spectral-spatial structures in HSIs and introduce computational redundancy. To address this, a structure-aware state fusion mechanism is proposed that explicitly models the spatial and spectral relationships of neighboring features in the latent state space, enabling more efficient and accurate representation learning. To further improve the capture of global context and long-range spatial dependencies, a hybrid Mamba-transformer architecture is explored. Different integration strategies are investigated, including inserting transformer blocks at the early, middle, and final layers, as well as at regular intervals. Analysis indicates that incorporating a self-attention block in the final layer achieves the highest average overall accuracy of 97.58% across the five datasets. The proposed approach is evaluated on five publicly available benchmark datasets (Indian Pines, Pavia University, Houston 2013, WHU-Hi-HanChuan, and WHU-Hi-HongHu), demonstrating an average overall accuracy improvement of 0.87% over the baseline model and competitive results against existing transformer-based and Mamba-based models. These findings underscore the potential of combining Mamba and transformer architectures for efficient and accurate hyperspectral image classification, offering new insights into advanced sequence modeling for remote sensing applications.
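Since the abstract only sketches the architecture at a high level, the following is a minimal, hypothetical PyTorch sketch of the hybrid design it describes: a stack of Mamba (SSM) blocks followed by a single self-attention block in the final layer, the placement reported to perform best. The use of the open-source mamba_ssm package and all hyperparameters (embedding width, depth, patch handling, class count) are illustrative assumptions, not the thesis's actual SMaT-HSI configuration.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency; requires a CUDA build of mamba-ssm

class HybridMambaTransformer(nn.Module):
    """Hypothetical Mamba-transformer hybrid for patch-wise HSI classification."""
    def __init__(self, in_bands=200, d_model=64, depth=4, n_classes=16):
        super().__init__()
        self.embed = nn.Linear(in_bands, d_model)  # per-pixel spectral embedding
        # All layers except the last are Mamba (SSM) blocks.
        self.mamba_layers = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
             for _ in range(depth - 1)]
        )
        # Final layer: one self-attention (transformer) block, mirroring the
        # best-performing placement reported in the abstract.
        self.attn = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, in_bands), e.g. a 7x7 spatial patch flattened
        # into 49 tokens, each token being one pixel's spectrum.
        h = self.embed(x)
        for layer in self.mamba_layers:
            h = h + layer(h)              # residual Mamba block
        h = self.attn(h)                  # global self-attention, final layer
        h = self.norm(h).mean(dim=1)      # pool tokens into one patch vector
        return self.head(h)               # class logits for the patch

# Example: classify a batch of 8 patches of 49 pixels with 200 bands each.
model = HybridMambaTransformer()
logits = model(torch.randn(8, 49, 200))   # -> (8, 16)

In this sketch each token is one pixel's full spectrum from a small spatial patch, so the Mamba layers scan a short spatial sequence while the final attention block mixes all tokens globally. The thesis's structure-aware state fusion mechanism, which couples neighboring hidden states inside the SSM itself, is not reproduced here.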

Keywords

hyperspectral image classification, deep learning, mamba, spectral-spatial learning
