From Far-Field Dynamics to Close-Up Confidence: Action Recognition Across Varying Camera Distances
dc.contributor.author | Buzko, Kseniia | |
dc.date.accessioned | 2025-09-22T13:07:10Z | |
dc.date.available | 2025-09-22T13:07:10Z | |
dc.date.issued | 2025-09-22 | |
dc.date.submitted | 2025-09-18 | |
dc.description.abstract | Human action recognition (HAR) refers to the task of identifying and classifying human actions within videos or sequences of images. This field has gained significant importance due to its diverse applicability across domains such as sports analytics, human-computer interaction, surveillance, and interpersonal communication. Accurate action recognition becomes especially difficult when the camera distance changes, because the cues that matter shift with scale. For instance, a close-up hinges on facial emotion (such as smiles and eye gaze), whereas a medium shot relies on hand gestures or objects being manipulated.

In the context of HAR, we distinguish two primary scenarios that illustrate this challenge. The first is the far-field setting, characterized by subjects positioned at a distance and often exhibiting rapid movement, which leads to frequent occlusions. This scenario is commonly observed in sports broadcasts, where capturing the game’s dynamics is essential. In contrast, the near-field setting involves subjects that are nearby and tend to remain relatively static. This setting enables the capture of subtle yet informative gestures, similar to those observed in presenter-focused videos. Although most studies treat these regimes separately, modern media (films, replays, vlogs) cut or zoom fluidly between them. An effective recognizer must therefore decide dynamically which cues to prioritize: facial emotion in tight close-ups, hand or torso motion in medium shots, and full-body dynamics in wide views. Despite substantial progress, current HAR pipelines rarely adapt across that zoom continuum.

This thesis therefore asks: What scale-specific hurdles confront human action recognition in far-field, near-field, and zoom-mixed scenarios, and how can insights from separate case studies keep recognition robust when the camera sweeps from full-body scenes to tight close-ups and back again? To answer, we contribute three scale-aware systems:

1. Hockey Action Identification and Keypose Understanding (HAIKYU) (far-field). For hockey broadcasts, we introduce temporal bounding-box normalization, which removes camera-induced scale jitter, and a 15-keypoint skeleton that adds stick endpoints. Combined with normalization, this improves Top-1 accuracy from 31% to 64%, showing that stick cues are indispensable for ice-hockey actions.

2. Confidence Fostering Identity-preserving Dynamic Transformer (CONFIDANT) (near-field). We curate a 38-class micro-gesture dataset and train an upper-body action recognizer that flags unconfident cues, such as folding arms, crossing fingers, and clasping hands. A diffusion-based video editor then rewrites these segments into confident counterparts, serving as a downstream demonstration of fine-grained recognition.

3. Scale-aware routing framework for mixed-zoom action recognition (Zoom-Gate) (zoom-mixed). A lightweight zoom score derived from the bounding-box area and the density of detected keypoints routes each tracklet to the specialist model best suited to that scale. Experiments confirm that this scale-aware routing, combined with context-specific skeletons, delivers robust performance across mixed-zoom datasets.

Collectively, these contributions demonstrate that coupling scale-aware preprocessing with context-specific skeletons can maintain pose-centric HAR reliability across the zoom spectrum.
The resulting frameworks open avenues for real-time segmentation, multi-view fusion, and ultimately a unified, scale-invariant action understanding pipeline. | |
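The zoom-based routing described for Zoom-Gate in the abstract can be pictured with a short sketch. The Python snippet below is a hypothetical illustration only, not code from the thesis: the Tracklet structure, the zoom_score weighting, and the routing thresholds are assumptions introduced to show how a score built from bounding-box area and detected-keypoint density could dispatch a tracklet to a near-field, medium, or far-field specialist.

```python
# Minimal sketch (not from the thesis): route a tracklet to a scale specialist
# using a zoom score built from bounding-box area and keypoint density.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Tracklet:
    boxes: List[Tuple[float, float, float, float]]        # (x1, y1, x2, y2) per frame
    keypoints: List[List[Tuple[float, float, float]]]     # (x, y, confidence) per frame
    frame_area: float                                      # frame width * height in pixels

def zoom_score(t: Tracklet, kp_conf_thresh: float = 0.3) -> float:
    """Higher score = tighter shot (large person box, few visible keypoints)."""
    # Mean fraction of the frame covered by the person box.
    areas = [max(0.0, x2 - x1) * max(0.0, y2 - y1) for x1, y1, x2, y2 in t.boxes]
    area_ratio = sum(areas) / (max(1, len(areas)) * t.frame_area)
    # Mean fraction of keypoints confidently detected; close-ups crop the body,
    # so fewer keypoints are visible.
    visible = [
        sum(1 for _, _, c in frame_kps if c >= kp_conf_thresh) / max(1, len(frame_kps))
        for frame_kps in t.keypoints
    ]
    kp_density = sum(visible) / max(1, len(visible))
    # Hypothetical combination: large box plus low keypoint density -> close-up.
    return 0.5 * area_ratio + 0.5 * (1.0 - kp_density)

def route(t: Tracklet,
          specialists: Dict[str, Callable[[Tracklet], str]],
          near_thresh: float = 0.6,
          far_thresh: float = 0.3) -> str:
    """Send the tracklet to the specialist model suited to its scale."""
    s = zoom_score(t)
    if s >= near_thresh:
        return specialists["near_field"](t)   # e.g. upper-body / micro-gesture model
    if s <= far_thresh:
        return specialists["far_field"](t)    # e.g. full-body skeleton model
    return specialists["medium"](t)           # fallback for medium shots
```

In this sketch a tight close-up (large box, few visible keypoints) yields a high score and goes to the near-field model, while a wide shot yields a low score and goes to the far-field model; the exact score definition and thresholds used in the thesis may differ.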
dc.identifier.uri | https://hdl.handle.net/10012/22496 | |
dc.language.iso | en | |
dc.pending | false | |
dc.publisher | University of Waterloo | en |
dc.subject | human action recognition | |
dc.subject | action classification | |
dc.subject | video understanding | |
dc.subject | skeleton-based recognition | |
dc.title | From Far-Field Dynamics to Close-Up Confidence: Action Recognition Across Varying Camera Distances | |
dc.type | Master Thesis | |
uws-etd.degree | Master of Applied Science | |
uws-etd.degree.department | Systems Design Engineering | |
uws-etd.degree.discipline | Systems Design Engineering | |
uws-etd.degree.grantor | University of Waterloo | en |
uws-etd.embargo.terms | 0 | |
uws.contributor.advisor | Clausi, David | |
uws.contributor.advisor | Chen, Yuhao | |
uws.contributor.affiliation1 | Faculty of Engineering | |
uws.peerReviewStatus | Unreviewed | en |
uws.published.city | Waterloo | en |
uws.published.country | Canada | en |
uws.published.province | Ontario | en |
uws.scholarLevel | Graduate | en |
uws.typeOfResource | Text | en |