Towards Human-Centered AI-Powered Assistants for the Visually Impaired

Wang, Linda

Files

Wang_Linda.pdf (5.34 MB)

Date

2020-09-02

Authors

Wang, Linda

Advisor

Wong, Alexander

Publisher

University of Waterloo

Abstract

Artificial intelligence has become ubiquitous in today's society, aiding us in many everyday tasks. Given the particular prowess of today's AI technologies in visual perception and speech recognition, an area where AI can have tremendous societal impact is assistive technologies for the visually impaired. Although assisting the visually impaired with tasks such as environment navigation and item localization improves independence and autonomy, it also raises concerns over privacy. Taking privacy of personal data into consideration, we present the design of a human-centered AI-powered assistant for object localization for impaired vision (OLIV). OLIV integrates multi-modal perception (custom-designed visual scene understanding and speech recognition and synthesis) to assist the visually impaired in locating misplaced items in indoor environments. OLIV comprises three main components: speech recognition, custom-designed visual scene understanding, and synthesis. Speech recognition allows users to query and interact with the system on their own, increasing their level of independence. Visual scene understanding performs on-device object detection and depth estimation to build up a representation of the surrounding 3D scene. Synthesis then combines the detected objects, along with their locations and depths, with the user’s intent to construct a semantic description that is conveyed verbally via speech synthesis. An important component of OLIV is scene understanding. Current state-of-the-art deep neural networks for object detection and depth estimation have been shown to achieve superior performance, but require high computation and memory, making them cost prohibitive for on-device operation. On-device operation is necessary to address privacy concerns related to misuse of personal data: by performing scene understanding on-device, data captured by the camera remains on the device.
To address the challenge of high computation and memory requirements, two different architecture design exploration approaches, micro-architecture exploration and a human-machine collaborative design strategy, are taken to design efficient neural networks with an optimal trade-off between accuracy, speed and size. The micro-architecture exploration approach resulted in a highly compact single shot network architecture for object detection. The human-machine collaborative design strategy resulted in a highly compact densely-connected encoder-decoder network architecture for monocular depth estimation. In experiments on two indoor datasets chosen to simulate the environments OLIV operates in, the object detection and depth estimation networks achieved CPU speeds of 17 FPS and 9.35 FPS and sizes of 6.99 and 3.46 million parameters, respectively, while maintaining comparable accuracy. Small size and high speed are important for on-device scene understanding, which lets OLIV provide more private assistance for the visually impaired.
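As a rough illustration of the synthesis step described in the abstract (combining detected objects, their locations and depths with the user's intent), the following minimal Python sketch may help. It is not the thesis implementation: the `Detection` fields, the `horizontal_position` and `describe` helpers, the three-way left/ahead/right split and the sentence wording are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical output of the scene understanding stage for one object."""
    label: str       # lowercase class name from the object detector
    x_center: float  # horizontal box center, normalized to [0, 1]
    depth_m: float   # estimated distance to the object, in meters


def horizontal_position(x_center: float) -> str:
    """Map a normalized horizontal coordinate to a coarse spoken direction."""
    if x_center < 1 / 3:
        return "to your left"
    if x_center > 2 / 3:
        return "to your right"
    return "ahead of you"


def describe(query: str, detections: List[Detection]) -> str:
    """Build the sentence a speech synthesizer would read out for a query."""
    matches = [d for d in detections if d.label == query.lower()]
    if not matches:
        return f"I could not find the {query} in view."
    # Assumption: when several instances match, report the nearest one.
    nearest = min(matches, key=lambda d: d.depth_m)
    direction = horizontal_position(nearest.x_center)
    return f"The {nearest.label} is about {nearest.depth_m:.1f} meters away, {direction}."
```

For example, `describe("mug", [Detection("mug", 0.8, 1.5)])` returns a sentence placing the mug 1.5 meters away to the user's right. In a full system, the query would come from speech recognition and the returned string would be passed to speech synthesis.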