Language Guided Out-of-Bounding Box Pose Estimation for Robust Ice Hockey Analysis

Date

2024-08-27

Advisor

Clausi, David
Rambhatla, Sirisha

Publisher

University of Waterloo

Abstract

Accurate estimation of human pose and the pose of interacting objects, such as hockey sticks, is fundamental in vision-driven hockey analytics and crucial for tasks like action recognition and player assessment. Estimating 2D keypoints from monocular video is challenging, particularly in fast-paced sports such as ice hockey, where motion blur, occlusions, bulky equipment, color similarities, and constant camera panning complicate accurate pose prediction. This thesis addresses these challenges with contributions on three fronts. First, recognizing the lack of an existing benchmark, we present a comparative study of four state-of-the-art human pose estimation approaches using a real-world ice hockey dataset. This analysis aims to understand the impact of each model on ice hockey pose estimation and investigate their respective advantages and disadvantages. Building on insights from this comparative study, we develop an ensemble model for jointly predicting player and stick poses. The ensemble comprises two networks: one trained from scratch to predict all keypoints, and another utilizing a unique transfer learning paradigm to incorporate knowledge from large-scale human pose datasets. Despite achieving promising results, we observe that these top-down approaches yield suboptimal outcomes due to constraints such as requiring all keypoints to be within a bounding box and accommodating only one player per bounding box. To overcome these issues, we introduce an image- and text-based multimodal solution called TokenCLIPose, which predicts stick keypoints without encapsulating them within a bounding box. By focusing on capturing only the player in a bounding box and treating their stick as missing, our model predicts out-of-bounding-box keypoints. To incorporate the context of the missing keypoints, we use keypoint-specific text prompts to leverage the rich semantic representations provided by language.
The findings of this thesis advance the state of the art in 2D pose estimation for ice hockey, outperforming existing methods by 2.6% on our dataset, and provide a robust foundation for further developments in vision-driven sports analytics.
