Author: Li, Yaoxin
Date: 2025-09-17
URI: https://hdl.handle.net/10012/22466

Artificial intelligence algorithms have permeated virtually every facet of contemporary life, from personalized shopping recommendations and targeted advertising to search engine optimization and multimedia content delivery. Among these applications, video, as the predominant carrier of information in multimedia, occupies a position of exceptional significance. However, the application of AI algorithms to video analysis remains insufficiently mature, and comprehensive video content interpretation persists as a critical challenge. This dissertation presents an integrated framework for multi-level temporal understanding in video analysis, advancing a coherent progression from fundamental action recognition to temporal localization and ultimately to qualitative assessment.

The foundation of our integrated approach is to address fundamental limitations in current action recognition paradigms. We identify a critical gap in existing methodologies: trimmed approaches restrict model training to curated action segments, while untrimmed methods rely solely on video-level supervision. Both fail to exploit the inherent complementarity between action and non-action segments within complete temporal sequences. To address this limitation, we introduce a novel multi-stage contrastive learning architecture that hierarchically extracts motion-critical features through coarse-to-fine-to-finer temporal contrasting. This approach establishes a self-correcting learning regime in which action discriminability emerges from explicit comparisons against non-action references, effectively suppressing static bias amplification while enhancing temporal sensitivity.

Building directly on this foundation, our second research direction extends these temporal insights to the challenge of precise action localization within continuous video streams. The proposed framework incorporates a multi-modal fusion classifier with adaptive modality weighting, a Class-Semantic Attention mechanism for precise proposal generation, and cross-domain prototype alignment that enables knowledge transfer between trimmed and untrimmed paradigms. This advancement represents a natural progression from recognizing what actions occur to precisely determining when they occur along the temporal dimension. The system was validated through participation in the CVPR ActivityNet Temporal Action Localization challenge, achieving second place in both the 2021 and 2022 editions, with the action classification component attaining first place on the validation set, demonstrating the practical efficacy of our theoretical contributions.

The third component completes our multi-level temporal understanding framework by addressing how well actions are performed, moving beyond recognition and localization to qualitative assessment. This progression reflects the natural evolution of video understanding from basic detection to increasingly nuanced temporal analysis. We develop a novel Mamba-based framework that leverages selective state space models to capture long-range dependencies in sequential data. This approach enables fine-grained analysis of the temporal patterns and subtle motion nuances that determine action quality across domains including sports performance, physical rehabilitation, and skill assessment.
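As a rough illustration of the kind of computation such a selective state-space framework builds on, the sketch below shows a minimal Mamba-style layer whose recurrence parameters depend on the input, followed by a simple head that pools frame features into a single quality score. It is written in PyTorch; the module names (SelectiveSSMBlock, QualityHead), the state size, and the mean-pooling choice are illustrative assumptions and are not drawn from the dissertation itself.

    # Minimal, illustrative sketch of a selective state-space (Mamba-style) layer
    # for scoring action quality from per-frame features. All names and
    # hyperparameters here are hypothetical, not the dissertation's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelectiveSSMBlock(nn.Module):
        # Diagonal selective SSM: the step size and the B/C projections are computed
        # from the input, so the layer decides per frame what to keep or forget.
        def __init__(self, dim, state_dim=16):
            super().__init__()
            self.state_dim = state_dim
            # Fixed negative-real diagonal dynamics (one row of decay rates per channel).
            self.log_a = nn.Parameter(
                torch.log(torch.arange(1, state_dim + 1).float()).repeat(dim, 1))
            self.to_delta = nn.Linear(dim, dim)    # input-dependent step size
            self.to_b = nn.Linear(dim, state_dim)  # input-dependent input matrix
            self.to_c = nn.Linear(dim, state_dim)  # input-dependent readout matrix
            self.out = nn.Linear(dim, dim)

        def forward(self, x):                      # x: (batch, frames, dim)
            bsz, length, dim = x.shape
            a = -torch.exp(self.log_a)             # (dim, state_dim), strictly negative
            delta = F.softplus(self.to_delta(x))   # (batch, frames, dim), positive step
            b, c = self.to_b(x), self.to_c(x)      # (batch, frames, state_dim)
            h = x.new_zeros(bsz, dim, self.state_dim)
            outputs = []
            for t in range(length):                # explicit scan, written for clarity
                decay = torch.exp(delta[:, t].unsqueeze(-1) * a)          # (batch, dim, state_dim)
                drive = delta[:, t].unsqueeze(-1) * b[:, t].unsqueeze(1)  # (batch, dim, state_dim)
                h = decay * h + drive * x[:, t].unsqueeze(-1)             # update hidden state
                outputs.append((h * c[:, t].unsqueeze(1)).sum(-1))        # (batch, dim)
            return self.out(torch.stack(outputs, dim=1))                  # (batch, frames, dim)

    class QualityHead(nn.Module):
        # Pools the SSM outputs over time and regresses one quality score per clip.
        def __init__(self, dim):
            super().__init__()
            self.ssm = SelectiveSSMBlock(dim)
            self.score = nn.Linear(dim, 1)

        def forward(self, features):               # features: (batch, frames, dim)
            return self.score(self.ssm(features).mean(dim=1)).squeeze(-1)

    # Example: score a batch of 4 clips, each with 64 frames of 256-d features.
    scores = QualityHead(256)(torch.randn(4, 64, 256))   # scores: shape (4,)

The per-frame scan is written as an explicit Python loop for readability; practical selective state space implementations replace it with a hardware-efficient parallel scan, which underlies the efficiency comparison reported next.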
Experimental results demonstrate superior performance compared to transformer- and CNN-based methods in both accuracy and computational efficiency.

The collective contributions of this dissertation establish a comprehensive framework for multi-level temporal understanding in videos. Rather than representing isolated research directions, these three components form a coherent progression from fundamental action recognition (what), to precise temporal localization (when), and finally to qualitative assessment (how well). This integrated approach advances both the theoretical understanding and the practical capabilities of AI-driven video analysis, offering a unified perspective that bridges the gap between academic research and real-world applications across the spectrum of video understanding tasks. By addressing these complementary facets within a unified framework, our work establishes new foundations for temporal video understanding that can be applied across diverse application domains.

Language: en
Keywords: Action Quality Assessment; Multi-level Temporal Understanding; Multi-modal Fusion; Computer Vision; Video Analysis; Action Recognition; Temporal Action Localization
Title: Multi-level Temporal Understanding in Video Analysis: From Action Recognition to Quality Assessment
Type: Doctoral Thesis