3D Ground Truth Generation Using Pre-Trained Deep Neural Networks
Date
2019-05-24
Authors
Lee, Jungwook
Advisor
Waslander, Steven
Publisher
University of Waterloo
Abstract
Training 3D object detectors on publicly available data has been limited to small datasets
due to the large amount of effort required to generate annotations. The difficulty of labeling
in 3D using 2.5D sensors, such as LIDAR, is attributed to the high spatial reasoning skills
required to deal with occlusion and partial viewpoints. Additionally, the current methods
to label 3D objects are cognitively demanding due to frequent task switching. Reducing
both task complexity and the amount of task switching done by annotators is key to
reducing the effort and time required to generate 3D bounding box annotations. We
therefore seek to reduce the burden on the annotators by leveraging existing 3D object
detectors using deep neural networks.
This work introduces a novel ground truth generation method that combines human
supervision with pre-trained neural networks to generate per-instance 3D point cloud
segmentation, 3D bounding boxes, and class annotations. The annotators provide object
anchor clicks, which act as seeds to generate instance segmentation results in 3D. The
points belonging to each instance are then used to regress object centroids, bounding box
dimensions, and object orientation. The deep neural network model used to generate the
segmentation masks and bounding box parameters is based on the PointNet architecture.
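The click-to-box flow described above can be illustrated with a small sketch. It is not the PointNet-based model used in this work: a fixed-radius crop around the click stands in for the learned instance segmentation, and a PCA fit on the ground-plane footprint stands in for the learned regression of centroid, dimensions, and orientation. The function names `crop_around_click` and `fit_box`, the 3 m radius, and the choice of x-y as the ground plane are all assumptions made for illustration.

```python
import numpy as np


def crop_around_click(points, click_xyz, radius=3.0):
    """Keep points within `radius` metres of the anchor click.

    Hypothetical stand-in for the learned instance segmentation step.
    """
    dist = np.linalg.norm(points[:, :3] - np.asarray(click_xyz), axis=1)
    return points[dist < radius]


def fit_box(instance_points):
    """Fit an upright 3D box to one instance.

    Geometric stand-in for the learned box regression: the heading comes
    from the principal axis of the x-y footprint, the dimensions from the
    point extents in the box frame. The heading is only defined modulo pi.
    """
    xy = instance_points[:, :2]
    mean_xy = xy.mean(axis=0)

    # Principal axis of the footprint gives a heading estimate.
    cov = np.cov((xy - mean_xy).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])

    # Express the footprint in the box frame (x along the major axis).
    c, s = np.cos(yaw), np.sin(yaw)
    world_to_box = np.array([[c, s], [-s, c]])
    local = (xy - mean_xy) @ world_to_box.T

    lo, hi = local.min(axis=0), local.max(axis=0)
    length, width = hi - lo

    # Box centre is the midpoint of the extents, mapped back to the world frame.
    centre_xy = mean_xy + world_to_box.T @ ((lo + hi) / 2.0)
    z_min, z_max = instance_points[:, 2].min(), instance_points[:, 2].max()
    centre = np.array([centre_xy[0], centre_xy[1], (z_min + z_max) / 2.0])

    return {
        "centre": centre,
        "dims": (float(length), float(width), float(z_max - z_min)),
        "yaw": float(yaw),
    }
```

In practice an annotator's click would be passed to `crop_around_click`, and the resulting points to `fit_box`. A learned model is needed precisely where this geometric version fails: under occlusion and partial viewpoints, the visible points do not span the full object, so extents and principal axes underestimate the true box.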
We develop our approach on the KITTI dataset and use it to analyze the quality
of the generated ground truth. The neural network model is trained on the KITTI training
split, and the 3D bounding box outputs are generated using annotation clicks collected
from the validation split. The validation split of the KITTI detection dataset contains 3712
frames of point cloud and image scenes, and it took 16.35 hours to label with the proposed
method. Based on these results, our approach is 19 times faster than the latest published
3D object annotation scheme. Additionally, it is found that the annotators spent less
time per object as the number of objects in the scene increases, making the method very
efficient for multi-object labeling. Furthermore, the quality of the generated 3D bounding boxes,
using the labeling method, is compared against the KITTI ground truth. It is shown that
the model performs on par with the current state-of-the-art 3D detectors and the labeling
procedure does not negatively impact the output quality of the bounding boxes. Lastly, the
proposed scheme is applied to previously unseen data from the Autonomoose self-driving
vehicle to demonstrate the generalization capabilities of the network.
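The labeling figures quoted above translate into a per-frame cost with a line of arithmetic, sketched below. The frame count and total hours come from the abstract. The last step is an assumption: it treats the 19-times figure as applying uniformly to total labeling time, which the abstract does not state, so the implied baseline is only a rough illustration.

```python
# Figures quoted in the abstract.
frames_labeled = 3712
hours_spent = 16.35

# Average annotation time per frame, in seconds.
seconds_per_frame = hours_spent * 3600 / frames_labeled
print(f"{seconds_per_frame:.2f} s per frame")

# ASSUMPTION: the 19x speed-up applies uniformly to total labeling time.
# Under that reading, the same frames would take this long with the
# comparison scheme.
speedup = 19
implied_baseline_hours = hours_spent * speedup
print(f"{implied_baseline_hours:.2f} h implied for the comparison scheme")
```

This works out to roughly 16 seconds of annotator time per frame on average, covering all objects in the frame.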
Keywords
Machine Learning, Computer Vision, Autonomous Driving, Deep Learning, Object Detection, Data Mining