Novel Motion-Aware Strategies for Efficient and Accurate Video Analytics

Gebotys, Brennan

Files

final_thesis.pdf (5.15 MB)

Date

2022-05-05

Authors

Gebotys, Brennan

Advisor

Wong, Alexander
Clausi, David

Publisher

University of Waterloo

Abstract

Recent advances in machine learning strategies have led to improved results across a variety of fields. A field that would benefit greatly from improved machine learning strategies is video analytics: the analysis of video data. Two important applications are pose estimation, which aims to identify the pose of a person in a video, and action recognition, which aims to identify the action performed in a video. However, key problems remain, such as how to train a pose estimation model with a small number of annotations and how to design an action recognition model to achieve the highest possible accuracy. This thesis explores how effectively leveraging motion information can enable strategies that solve both of these problems.
The first problem is that pose estimation models require a large number of pose annotations to achieve high accuracy, and these annotations can be expensive to collect. While a naive approach is to annotate a single frame at a time, researchers have investigated how modifying the model training and generating more annotations can reduce the number of annotations required. However, all of these approaches still include requirements that make annotation collection difficult. This thesis introduces a motion-aware pose annotation strategy called POse annotation using Optical Flow (POOF), which explores how motion information can reduce the number of annotations required without any additional constraints. We show that, with only a small number of annotations, using POOF's annotations can achieve a +52% improvement in accuracy compared to training on the small number of annotations alone. By reducing the number of annotations required, POOF should enable pose estimation models to be more easily applied to many more real-world problems.
The second problem is that, because there are so many possible design choices, it is difficult to design an action recognition model's architecture to achieve the highest possible accuracy. While state-of-the-art attention mechanisms are a popular choice and have achieved accurate results, a key shortcoming is that they do not leverage any motion information. Motivated by this, this thesis explores how motion can be leveraged with these attention-based mechanisms by introducing a Motion-Aware Attention mechanism called M2A, which explicitly leverages both attention and motion information. We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to improvements of +15% to +26% in top-1 accuracy across different backbone architectures, with only a small increase in computational complexity. By showing how motion mechanisms can be both accurate and efficient, M2A should enable action recognition solutions to be applied to real-world problems sooner.