|dc.description.abstract||Autonomously operating vehicles are being developed to take over human supervision in applications such as search and rescue, surveillance, exploration and scientific data collection. For a vehicle to operate autonomously, it is important for it to predict its location with respect to its surrounding in order to make decisions about its next movement. Simultaneous Localization and Mapping (SLAM) is a technique that utilizes information from multiple sensors to not only estimate the vehicle's location but also simultaneously build a map of the environment. Substantial research efforts are being devoted to make pose predictions using fewer sensors. Currently, laser scanners, which are expensive, have been used as a primary sensor for environment perception as they measure obstacle distance with good accuracy and generate a point-cloud map of the surrounding. Recently, researchers have used the method of triangulation to generate similar point-cloud maps using only cameras, which are relatively inexpensive. However, point-clouds generated from cameras have an unobservable scale factor. To get an estimate of scale, measurements from an additional sensor such as another camera (stereo configuration), laser scanners, wheel encoders, GPS or IMU, can be used. Wheel encoders are known to suffer from inaccuracies and drifts, using laser scanners is not cost effective, and GPS measurements come with high uncertainty. Therefore, stereo-camera and camera-IMU methods have been topics of constant development for the last decade.
A stereo-camera pair is typically used with a graphics processing unit (GPU) to generate a dense environment reconstruction. The scale is estimated from the pre-calculated base-line (distance between camera centers) measurement. However, when the environment features are far away, the base-line becomes negligible to be effectively used for triangulation and the stereo-configuration reduces to monocular. Moreover, when the environment is texture-less, information from visual measurements only cannot be used. An IMU provides metric measurements but suffers from significant drifts. Hence, in a camera-IMU configuration, an IMU typically is used only for short-durations, i.e. in-between two camera frames. This is desirable as it not only helps to estimate the global scale, but also to give a pose estimate during temporary camera failure. Due to these reasons, a camera-IMU configuration is being increasingly used in applications such as in Unmanned Aerial Vehicles (UAVs) and Augmented/ Virtual Reality (AR/VR).
This thesis presents a novel method for visual-inertial odometry for land vehicles which is robust to unintended, but unavoidable bumps, encountered when an off-road land vehicle traverses over potholes, speed-bumps or general change in terrain. In contrast to tightly-coupled methods for visual-inertial odometry, the joint visual and inertial residuals is split into two separate steps and the inertial optimization is performed after the direct-visual alignment step. All visual and geometric information encoded in a key-frame are utilized by including the inverse-depth variances in the optimization objective, making this method a direct approach. The primary contribution of this work is the use of epipolar constraints, computed from a direct-image alignment, to correct pose prediction obtained by integrating IMU measurements, while simultaneously building a semi-dense map of the environment in real-time. Through experiments, both indoor and outdoor, it is shown that the proposed method is robust to sudden spikes in inertial measurements while achieving better accuracy than the state-of-the art direct, tightly-coupled visual-inertial fusion method. In the future, the proposed method can be augmented with loop-closure and re-localization to enhance the pose prediction accuracy. Further, semantic segmentation of point-clouds can be useful for applications such as object labeling and generating obstacle-free path.||en