Design of Binocular Stereo Vision System Via CNN-based Stereo Matching Algorithm

Jiao, Yan

Design of Binocular Stereo Vision System Via CNN-based Stereo Matching Algorithm

Files

Jiao_Yan.pdf (15 MB)

Date

2020-08-06

Authors

Jiao, Yan

Advisor

Ho, Pin-Han
Tang, Quentin

Publisher

University of Waterloo

Abstract

Stereo vision is one of the representative technologies in the 3D camera, using multiple cameras to perceive the depth information in the three-dimensional space. The binocular one has become the most widely applied method in stereo vision. So in our thesis, we design a binocular stereo vision system based on an adjustable narrow-baseline stereo camera, which can simultaneously capture the left and right images belonging to a stereo image pair. The camera calibration and rectification techniques are firstly performed to get rectified stereo pairs, serving as the input to the subsequent step, that is, searching the corresponding points between the left and right images. The stereo matching algorithm resolves the correspondence problem and plays a crucial part in our system, which produces disparity maps targeted at predicting the depths with the help of the triangulation principle. We focus on the first stage of this algorithm, proposing a CNN-based approach to calculating the matching cost by measuring the similarity level between two image patches. Two kinds of network architectures are presented and both of them are based on the siamese network. The fast network employs the cosine metric to compute the similarity level at a satisfactory accuracy and processing speed. While the slow network is aimed at learning a new metric, making the disparity prediction slightly more precise but at the cost of spending way more image handling time and counting on more parameters. The output of either network is regarded as the initial matching cost, followed by a series of post-processing methods, including cross-based cost aggregation as well as semi-global cost aggregation. With the trick of Winner-Take-All (WTA), the raw disparity map is attained and it will undergo further refinement procedures containing interpolation and image filtering. The above networks are trained and validated on three standard stereo datasets: Middlebury, KITTI 2012, and KITTI 2015. The contrast tests of CNN-based methods and census transformation have demonstrated that the former approach outperforms the later one on the mentioned datasets. The algorithm based on the fast network is adopted in our devised system. To evaluate the performance of a binocular stereo vision system, two types of error criteria are come up with, acquiring the proper range of working distance under diverse baseline lengths.