From Far-Field Dynamics to Close-Up Confidence: Action Recognition Across Varying Camera Distances
dc.contributor.author | Buzko, Kseniia | |
dc.date.accessioned | 2025-09-22T13:07:10Z | |
dc.date.available | 2025-09-22T13:07:10Z | |
dc.date.issued | 2025-09-22 | |
dc.date.submitted | 2025-09-18 | |
dc.description.abstract | Human action recognition (HAR) refers to the task of identifying and classifying human actions within videos or sequences of images. This field has gained significant importance due to its diverse applicability across domains such as sports analytics, human-computer interaction, surveillance, and interpersonal communication. Accurate action recognition becomes especially difficult when the camera distance changes, because the cues that matter shift with scale. For instance, a close-up hinges on facial emotion (such as smiles and eye gaze), whereas a medium shot relies on hand gestures or objects being manipulated.

In the context of HAR, we distinguish two primary scenarios that illustrate this challenge. The first is the far-field setting, characterized by subjects positioned at a distance and often exhibiting rapid movement, which leads to frequent occlusions. This scenario is commonly observed in sports broadcasts, where capturing the game’s dynamics is essential. In contrast, the near-field setting involves subjects that are nearby and tend to remain relatively static. This setting enables the capture of subtle yet informative gestures, similar to those observed in presenter-focused videos. Although most studies treat these regimes separately, modern media (films, replays, vlogs) cut or zoom fluidly between them. An effective recognizer must therefore decide dynamically which cues to prioritize: facial emotion in tight close-ups, hand or torso motion in medium shots, and full-body dynamics in wide views. Despite substantial progress, current HAR pipelines rarely adapt across that zoom continuum.

This thesis therefore asks: What scale-specific hurdles confront human action recognition in far-field, near-field, and zoom-mixed scenarios, and how can insights from separate case studies keep recognition robust when the camera sweeps from full-body scenes to tight close-ups and back again? To answer, we contribute three scale-aware systems:

1. Hockey Action Identification and Keypose Understanding (HAIKYU) (far-field). For hockey broadcasts, we introduce temporal bounding-box normalization, which removes camera-induced scale jitter, and a 15-keypoint skeleton that adds stick endpoints. Combined with normalization, this improves Top-1 accuracy from 31% to 64%, showing that stick cues are indispensable for ice-hockey actions.

2. Confidence Fostering Identity-preserving Dynamic Transformer (CONFIDANT) (near-field). We curate a 38-class micro-gesture dataset and train an upper-body action recognizer that flags unconfident cues, such as folding arms, crossing fingers, and clasping hands. A diffusion-based video editor then rewrites these segments into confident counterparts, serving as a downstream demonstration of fine-grained recognition.

3. Scale-aware routing framework for mixed-zoom action recognition (Zoom-Gate) (zoom-mixed). A lightweight zoom score derived from the bounding-box area and the density of detected keypoints routes each tracklet to the specialist model best suited to that scale. Experiments confirm that this scale-aware routing, combined with context-specific skeletons, delivers robust performance across mixed-zoom datasets.

Collectively, these contributions demonstrate that coupling scale-aware preprocessing with context-specific skeletons can maintain pose-centric HAR reliability across the zoom spectrum.
The resulting frameworks open avenues for real-time segmentation, multi-view fusion, and ultimately a unified, scale-invariant action understanding pipeline. | |
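The zoom-based routing described for Zoom-Gate in the abstract can be pictured with a short sketch. The Python snippet below is a hypothetical illustration only, not code from the thesis: the Tracklet structure, the zoom_score weighting, and the routing thresholds are assumptions introduced to show how a score built from bounding-box area and detected-keypoint density could dispatch a tracklet to a near-field, medium, or far-field specialist.

```python
# Minimal sketch (not from the thesis): route a tracklet to a scale specialist
# using a zoom score built from bounding-box area and keypoint density.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Tracklet:
    boxes: List[Tuple[float, float, float, float]]        # (x1, y1, x2, y2) per frame
    keypoints: List[List[Tuple[float, float, float]]]     # (x, y, confidence) per frame
    frame_area: float                                      # frame width * height in pixels

def zoom_score(t: Tracklet, kp_conf_thresh: float = 0.3) -> float:
    """Higher score = tighter shot (large person box, few visible keypoints)."""
    # Mean fraction of the frame covered by the person box.
    areas = [max(0.0, x2 - x1) * max(0.0, y2 - y1) for x1, y1, x2, y2 in t.boxes]
    area_ratio = sum(areas) / (max(1, len(areas)) * t.frame_area)
    # Mean fraction of keypoints confidently detected; close-ups crop the body,
    # so fewer keypoints are visible.
    visible = [
        sum(1 for _, _, c in frame_kps if c >= kp_conf_thresh) / max(1, len(frame_kps))
        for frame_kps in t.keypoints
    ]
    kp_density = sum(visible) / max(1, len(visible))
    # Hypothetical combination: large box plus low keypoint density -> close-up.
    return 0.5 * area_ratio + 0.5 * (1.0 - kp_density)

def route(t: Tracklet,
          specialists: Dict[str, Callable[[Tracklet], str]],
          near_thresh: float = 0.6,
          far_thresh: float = 0.3) -> str:
    """Send the tracklet to the specialist model suited to its scale."""
    s = zoom_score(t)
    if s >= near_thresh:
        return specialists["near_field"](t)   # e.g. upper-body / micro-gesture model
    if s <= far_thresh:
        return specialists["far_field"](t)    # e.g. full-body skeleton model
    return specialists["medium"](t)           # fallback for medium shots
```

In this sketch a tight close-up (large box, few visible keypoints) yields a high score and goes to the near-field model, while a wide shot yields a low score and goes to the far-field model; the exact score definition and thresholds used in the thesis may differ.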
dc.identifier.uri | https://hdl.handle.net/10012/22496 | |
dc.language.iso | en | |
dc.pending | false | |
dc.publisher | University of Waterloo | en |
dc.subject | human action recognition | |
dc.subject | action classification | |
dc.subject | video understanding | |
dc.subject | skeleton-based recognition | |
dc.title | From Far-Field Dynamics to Close-Up Confidence: Action Recognition Across Varying Camera Distances | |
dc.type | Master Thesis | |
uws-etd.degree | Master of Applied Science | |
uws-etd.degree.department | Systems Design Engineering | |
uws-etd.degree.discipline | Systems Design Engineering | |
uws-etd.degree.grantor | University of Waterloo | en |
uws-etd.embargo.terms | 0 | |
uws.contributor.advisor | Clausi, David | |
uws.contributor.advisor | Chen, Yuhao | |
uws.contributor.affiliation1 | Faculty of Engineering | |
uws.peerReviewStatus | Unreviewed | en |
uws.published.city | Waterloo | en |
uws.published.country | Canada | en |
uws.published.province | Ontario | en |
uws.scholarLevel | Graduate | en |
uws.typeOfResource | Text | en |