Fusion of Estimated Depth and RGB Features for Improved Grasp-Type Selection of Novel Objects

Asghar, Ali, 1994-

Files

Asghar_Ali.pdf (3.75 MB)

Date

2022-09-01

Authors

Asghar, Ali, 1994-

Advisor

Kofman, Jonathan

Publisher

University of Waterloo

Abstract

Prostheses can alleviate some of the challenges faced by upper limb amputees in performing activities of daily living. However, electric-powered prosthetic hands have not seen much improvement over the past decade. Unintuitive interfaces for selecting grasp types have resulted in low user satisfaction and high abandonment rates. Recently, efforts have been made to automate the grasp type selection process by collecting visual data, such as Red, Green, Blue (RGB) images or depth data of the object to be grasped, and classifying the object into the desired grasp type. This effort has been greatly aided by the advent of Deep Convolutional Neural Networks (DCNNs), which have been trained on examples of objects and their desired grasp types. However, the biggest challenge is to improve the generalization capabilities of DCNN models, so that they can efficiently classify novel objects, i.e., objects that the model was not trained on. Combining RGB and depth data has been shown to improve model generalization; however, common methods of acquiring depth data require bulky hardware that cannot be installed on a prosthetic hand for practical applications. Therefore, this research focused on estimating depth maps through pre-trained models developed using RGB input images acquired from a single compact RGB camera, instead of depth maps acquired from bulky dedicated hardware such as an RGB-D camera. An object-detector-based DCNN architecture was used to detect grasp types of objects along with their bounding boxes in cluttered scenes. To combine the RGB and estimated depth data, this research used a novel method to fuse RGB feature maps and estimated depth feature maps. To train the DCNN, a dataset was created with images of objects in a cluttered scene from the viewpoint of a camera mounted on a prosthesis.
Every graspable object in each image was annotated with a bounding box and assigned one of two grasp types: neutral wrist palmar or pronated wrist precision. Different methods of encoding single-channel depth maps as three-channel depth data were evaluated, including duplication, surface normal encoding, jet colormap encoding, and HHA encoding. Moreover, different strategies to fuse RGB feature maps with estimated depth feature maps were also evaluated. Experiments showed that the developed model was capable of operating in real-world scenarios, such as in cluttered scenes with multiple graspable objects. Compared to training the model on only RGB data, there was an increase of up to 12.7% in common metrics used to evaluate the model’s generalization capabilities when the RGB data was fused with estimated depth data. This is the first work to demonstrate improvement in DCNN model performance for the task of detecting grasp types of novel objects by using fusion of RGB and estimated depth-map features. The improvement in performance of the model that was trained using estimated depth data exceeds that of methods that require dedicated hardware to acquire depth data by up to 2.8%. The proposed model can be incorporated into the control schemes of upper limb prostheses, without the need for dedicated hardware.
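
Of the depth encodings listed in the abstract, duplication is the simplest to illustrate: the single depth channel is copied into all three channels so that it matches the input shape expected by an RGB-pretrained network. The following is a minimal sketch under stated assumptions, not the thesis implementation: it assumes NumPy, and the min-max scaling to 8 bits and the function names `normalize_depth` and `encode_by_duplication` are hypothetical choices made only for this example.

```python
import numpy as np


def normalize_depth(depth):
    """Min-max scale a single-channel depth map to the range [0, 1].

    The scaling scheme is an assumption for this sketch.
    """
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        # A constant depth map carries no contrast; return all zeros.
        return np.zeros_like(depth)
    return (depth - lo) / (hi - lo)


def encode_by_duplication(depth):
    """Return an (H, W, 3) uint8 image with the depth channel repeated three times."""
    scaled = np.round(normalize_depth(depth) * 255).astype(np.uint8)
    # Add a trailing channel axis, then repeat it to get three identical channels.
    return np.repeat(scaled[..., np.newaxis], 3, axis=2)


# Toy 2x2 depth map (arbitrary units), standing in for an estimated depth map.
depth_map = np.array([[0.5, 1.0], [1.5, 2.5]])
encoded = encode_by_duplication(depth_map)
```

Here `encoded` has shape `(2, 2, 3)`, and its three channels are identical. The other encodings named in the abstract (surface normal, jet colormap, HHA) instead place different information in each channel and are not shown here.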