Efficient Deep Learning-Driven Systems for Real-Time Video Expression Recognition
The ability to detect, recognize, and interpret facial expressions is an important human skill, given the abundance of social interactions we face daily, yet it is one that most people take for granted. As social animals, we rely on expression understanding not only to gauge others' current emotional states but also to recognize conversational cues such as level of interest, speaking turns, and degree of comprehension. For individuals with autism spectrum disorder, a core challenge is an impaired ability to infer other people's emotions from their facial expressions, which can hinder creating and sustaining meaningful, positive relationships and lead to difficulty integrating into society and a higher prevalence of depression and loneliness. With significant recent advances in machine learning, however, one potential solution is to leverage assistive technology to help these individuals better recognize facial expressions. Such a technology requires reasonable accuracy in order to provide users with correct information, but it must also satisfy a real-time constraint to remain relevant and seamless in a social setting. Due to the dynamic and transient nature of human facial expressions, a key challenge during classification is using temporal information to provide additional context to a scene. Because many applications must preserve the real-time aspect, this temporal information must be leveraged efficiently. Consequently, we explore the dynamic and transient nature of facial expressions through a novel deep time-windowed convolutional neural network design called TimeConvNets, which encodes spatiotemporal information efficiently.
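As a rough illustration of the time-windowed idea (a minimal sketch, not the thesis's actual TimeConvNet architecture), consecutive video frames can be grouped into overlapping windows so a classifier sees short-term temporal context rather than a single still image. The `window_size` and `stride` parameters below are hypothetical hyperparameters chosen for the example:

```python
from collections import deque

def time_windows(frames, window_size=3, stride=1):
    """Yield overlapping windows of consecutive frames.

    Each window groups `window_size` adjacent frames, preserving their
    temporal order, so a downstream network can encode spatiotemporal
    information. Illustrative sketch only; `window_size` and `stride`
    are hypothetical values, not taken from the thesis.
    """
    window = deque(maxlen=window_size)
    for i, frame in enumerate(frames):
        window.append(frame)
        # Yield once the window is full and its start index lands on the stride.
        if len(window) == window_size and (i - window_size + 1) % stride == 0:
            yield tuple(window)

# Example: six frame identifiers, windows of 3 with stride 1
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
print(list(time_windows(frames)))
```

In a real pipeline each window would be stacked into a single input tensor; the sketch uses frame identifiers only to keep the windowing logic visible.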
We compare against other methods capable of leveraging temporal information and show that TimeConvNets provide a real-time solution that is both accurate and architecturally and computationally less complex. Even with the strong performance that the TimeConvNet architecture offers, additional architectural modifications tailored specifically to human facial expression classification can likely yield further performance gains. We therefore explore a human-machine collaborative design strategy to further shrink and optimize these facial expression classifiers. EmotionNet Nano was created and tailored specifically for expression classification on edge devices by combining human experience with the meticulousness and speed of machines. Experimental results on the CK+ facial expression benchmark dataset demonstrate that the proposed EmotionNet Nano networks achieve accuracy comparable to the state of the art while requiring significantly fewer parameters, and they can perform inference in real time, making them suitable for deployment on a variety of platforms, including mobile phones. Training these models requires a high-quality expression dataset, specifically one that retains temporal information between consecutive image frames. We introduce FaceParty as a solution: a more difficult dataset created by the modified aggregation of six public video facial expression datasets, with details provided for replication. We hope that models trained on FaceParty can achieve increased generalization to faces in the wild due to the nature of the dataset.
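The aggregation step behind a dataset like FaceParty can be sketched in outline (an assumption-laden illustration, not the thesis's actual procedure): clips from several source datasets are merged while each clip's frames stay together in temporal order, and dataset-specific labels are mapped onto a shared label set. The dataset names, label maps, and canonical label set below are all hypothetical:

```python
# Hypothetical canonical label set; the real class set depends on the datasets used.
CANONICAL_LABELS = {"anger", "happiness", "sadness", "surprise"}

def unify_label(raw_label, label_map):
    """Map a dataset-specific label to the canonical set, or None to drop it."""
    label = label_map.get(raw_label)
    return label if label in CANONICAL_LABELS else None

def aggregate(datasets):
    """Merge clips from several datasets into one collection.

    `datasets` is a list of (name, clips, label_map) tuples, where each
    clip is (frames, raw_label). Frames within a clip are kept together
    and in order, so temporal information between consecutive frames
    survives the merge. Illustrative sketch only.
    """
    merged = []
    for name, clips, label_map in datasets:
        for frames, raw_label in clips:
            label = unify_label(raw_label, label_map)
            if label is not None:  # drop clips with no canonical equivalent
                merged.append({"source": name, "frames": frames, "label": label})
    return merged
```

Keeping clips intact rather than shuffling individual frames is the key design point: a temporally aware model can then be trained on windows drawn from within each clip.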
James Lee (2021). Efficient Deep Learning-Driven Systems for Real-Time Video Expression Recognition. UWSpace. http://hdl.handle.net/10012/16686