Development of a Semantic Model and Synthetic Dataset for Multi-Grasp Affordance Detection for Application to Vision-Based Upper-Limb Prosthetic Grasping


Date

2024-05-27

Authors

Ng, Nathan

Publisher

University of Waterloo

Abstract

Current upper-limb prosthetic grasping methods are predominantly myoelectric: surface electromyogram (sEMG) pattern recognition is used to predict a grasp type for the prosthetic hand. The same sEMG patterns are also used to detect the intent to perform a grasping action and to control the overall movements of the prosthetic arm. Because this control strategy is coupled, grasp-type predictions can be inaccurate, particularly when the sEMG pattern for a grasp type resembles the patterns used to move the prosthetic arm or to select other grasp types. Recent vision-based prosthetic grasping methods address this coupling by adding a camera system that captures an RGB image of an object and a convolutional neural network (CNN) that predicts a grasp type, while the intent to move the prosthetic arm and perform the grasping action is determined independently through sEMG pattern recognition. Unlike myoelectric prostheses, vision-based prostheses can predict a suitable grasp type from the features of an object (e.g., its shape). However, current vision-based grasping methods are limited in that each object can be grasped with only a single grasp type, regardless of the object's shape, the environmental context, and the available tasks. Recent robotic grasping applications instead use grasp affordance detection to identify the regions of an object that can be grasped for a given task. By adapting grasp affordance detection to a vision-based prosthetic device, multiple task-oriented grasp-type predictions become possible for each object.

To improve the vision system in vision-based prostheses, this thesis adapts grasp affordance detection methods from robotic grasping applications. Grasp affordances, represented as grasp-type and task regions, are predicted with instance segmentation models, which use RGB images to localize objects and their grasp affordances with bounding boxes and image mask segmentation. Because no existing instance segmentation model or dataset supports the simultaneous detection of objects and their grasp affordances, the Multi-Affordance Detection Network (MAD-Net) model and the Multi-Object Multi-Grasp-Affordance (MOMA) synthetic dataset were developed as part of this thesis research. Unlike current vision-based prosthetic grasping methods, MAD-Net can detect objects and their grasp affordances in multi-object RGB scenes. MAD-Net was derived from Mask R-CNN, a common baseline model for instance segmentation; most instance segmentation models are derived from Mask R-CNN, since its additional mask prediction head can convert an object detection model into an instance segmentation model. The MOMA synthetic dataset is a collection of 20K RGB images generated by placing random images of objects on random background images. Each generated image was automatically annotated with the instances of objects and their grasp affordances (grasp-type and task regions); the single-object RGB images used for dataset generation were captured manually with a camera and then manually annotated.

The mean average precision (mAP) metric is used to evaluate MAD-Net and the other instance segmentation models on the MOMA dataset. mAP is a good indicator of model performance because it measures how accurately the predicted bounding box and image mask locations match the ground-truth annotations. MAD-Net outperformed all the other instance segmentation models across all detection categories (objects, grasp types, tasks) on the validation datasets, and maintained mAP scores comparable to the other models on the test datasets. In all cases, MAD-Net outperformed Mask R-CNN, most notably in the grasp-type detection category, where MAD-Net achieved a 10% higher mAP score. Training objects and their grasp affordances jointly on the MOMA dataset also reduced the total training time by 50%. Since MAD-Net outperformed Mask R-CNN, the joint detection of objects and their grasp affordances is a feasible solution for the vision system of vision-based prostheses. Although the proposed vision system produces multiple task-oriented grasp types for a single object, modern myoelectric prostheses can select a grasp type from a small set of pre-programmed grasp types. A grasp database can also be implemented alongside the proposed vision system, which prosthetic users can continuously update with new, unseen objects and their corresponding task-oriented grasp types.
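The abstract does not spell out the MAD-Net architecture, but the Mask R-CNN baseline it is derived from can be sketched with torchvision. The snippet below is a minimal sketch, not the MAD-Net model itself: it shows the standard recipe for replacing the box and mask prediction heads of a pre-trained Mask R-CNN so that they cover a combined label space of objects, grasp types, and tasks, followed by one toy training step. The number of categories and the toy target are illustrative assumptions.

# Minimal sketch (not the MAD-Net architecture): fine-tuning a torchvision
# Mask R-CNN baseline on a combined label space of objects, grasp types,
# and tasks. NUM_CLASSES and the toy target below are assumptions.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 1 + 40  # background + assumed object/grasp-type/task categories


def build_baseline(num_classes: int = NUM_CLASSES):
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Swap the box classifier head for one covering the combined label space.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Swap the mask head as well; 256 is the usual hidden-layer width.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model


if __name__ == "__main__":
    model = build_baseline()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    # One toy training step: in train mode the model returns a dict of losses.
    images = [torch.rand(3, 480, 640)]
    masks = torch.zeros(1, 480, 640, dtype=torch.uint8)
    masks[0, 60:220, 50:200] = 1
    targets = [{
        "boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),  # [x1, y1, x2, y2]
        "labels": torch.tensor([1]),
        "masks": masks,
    }]
    losses = model(images, targets)
    sum(losses.values()).backward()
    optimizer.step()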
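The cut-and-paste style of synthetic data generation described in the abstract (random object images composited onto random backgrounds, with annotations produced automatically) can be sketched as follows. The directory layout, file naming, and annotation format are assumptions for illustration; the actual MOMA pipeline also carries grasp-type and task region annotations for each cutout, which are omitted here for brevity.

# Minimal sketch of cut-and-paste synthetic image generation, assuming RGBA
# object cutouts whose alpha channel marks the object region. Directory names
# and the annotation layout are illustrative assumptions only.
import json
import random
from pathlib import Path

import numpy as np
from PIL import Image

CUTOUTS_DIR = Path("cutouts")          # hypothetical RGBA object cutouts
BACKGROUNDS_DIR = Path("backgrounds")  # hypothetical RGB background photos


def compose(num_objects: int = 3, size: tuple = (640, 480)):
    """Paste random cutouts onto a random background; return the image and
    per-instance labels with COCO-style [x, y, w, h] bounding boxes."""
    background = random.choice(list(BACKGROUNDS_DIR.glob("*.jpg")))
    canvas = Image.open(background).convert("RGB").resize(size)
    annotations = []
    for _ in range(num_objects):
        cutout_path = random.choice(list(CUTOUTS_DIR.glob("*.png")))
        cutout = Image.open(cutout_path).convert("RGBA")
        # Shrink the cutout if needed so it always fits on the canvas.
        scale = min(size[0] / (2 * cutout.width), size[1] / (2 * cutout.height), 1.0)
        if scale < 1.0:
            cutout = cutout.resize((max(1, int(cutout.width * scale)),
                                    max(1, int(cutout.height * scale))))
        x = random.randint(0, size[0] - cutout.width)
        y = random.randint(0, size[1] - cutout.height)
        canvas.paste(cutout, (x, y), mask=cutout)  # alpha band acts as the mask
        alpha = np.array(cutout)[..., 3] > 0
        ys, xs = np.nonzero(alpha)
        annotations.append({
            "label": cutout_path.stem,
            "bbox": [x + int(xs.min()), y + int(ys.min()),
                     int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)],
        })
    return canvas, annotations


if __name__ == "__main__":
    image, anns = compose()
    image.save("synthetic_0000.jpg")
    Path("synthetic_0000.json").write_text(json.dumps(anns, indent=2))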
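The mAP evaluation described above follows the usual COCO-style protocol; one way to compute it is with the torchmetrics implementation, as in the minimal sketch below. The single prediction/target pair is a toy placeholder, not data from the thesis, and iou_type="segm" with mask tensors would score the segmentation masks instead of the boxes.

# Minimal sketch of COCO-style mAP scoring with torchmetrics; the toy
# prediction/target pair below is a placeholder, not data from the thesis.
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")  # use "segm" to score masks

preds = [{
    "boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),  # [x1, y1, x2, y2]
    "scores": torch.tensor([0.90]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 65.0, 205.0, 215.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
results = metric.compute()
print(results["map"])     # mAP averaged over IoU thresholds 0.50:0.95
print(results["map_50"])  # mAP at IoU threshold 0.50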

Keywords

computer vision, prosthetic grasping, convolutional neural network, grasp affordance detection, synthetic dataset, image processing, grasp-type detection, instance segmentation
