|For the last decade, deep learning (DL) has emerged as a new effective machine learning approach that is capable of solving difficult challenges. Due to their increasing effectiveness, DL approaches have been applied widely in commercial products such as social media platforms and self-driving cars. Such widespread application in critical areas means that mistakes caused by bugs in such DL systems would lead to serious consequences. Our research focuses on improving the reliability of such DL systems.
At a high level, the DL systems development process starts with labeled data. This data is then used to train the DL model with some training methods. Once the model is trained, it can be used to create predictions for some unlabeled data in the inference stage. In this thesis, we present testing and analysis techniques that help improve the DL system reliability for all stages.
In the first work, CRADLE, we improve the reliability of the DL system inference by applying differential testing to find bugs in DL libraries. One key challenge of testing DL libraries is the difficulty of knowing the expected output of DL libraries given an input instance. We leverage equivalent DL libraries to overcome this challenge. CRADLE focuses on finding and localizing bugs in DL software libraries by performing cross-implementation inconsistency checking to detect bugs, and leveraging anomaly propagation tracking and analysis to localize faulty functions that cause the bugs. CRADLE detects 12 bugs in three libraries (TensorFlow, CNTK, and Theano), and highlights functions relevant to the causes of inconsistencies for all 104 unique inconsistencies.
Our second work is the first to study the variance of DL systems training and the awareness of this variance among researchers and practitioners. Our experiments show large overall accuracy differences among identical training runs. Even after excluding weak models, the accuracy difference is 10.8%. In addition, implementation-level factors alone cause the accuracy difference across identical training runs to be up to 2.9%. Our researcher and practitioner survey shows that 83.8% of the 901 participants are unaware of or unsure about any implementation-level variance. This work raises awareness of DL training variance and directs SE researchers to challenging tasks such as creating deterministic DL implementations to facilitate debugging and improving the reproducibility of DL software and results.
DL systems perform well on static test sets coming from the same distribution as training sets but may not be robust in real-world deployments because of the fundamental assumption that the training data represents the real world data well. In cases where the training data misses samples from the real-world distribution, it is said to contain blindspots. In practice, it is more likely a training dataset contains weakspots (i.e., a weaker form of blindspots, where the training data contains some samples that represent the real world but it does not contain enough). In the third work, we propose a new procedure to detect weakspots in training data and to improve the DL system with minimum labeling effort. This procedure leverages the variance of the DL training process to detect highly varying data samples that could indicate the weakspots. Metrics that measure such variance can also be used to rank new samples to prioritize the labeling of additional training data that can improve the DL system accuracy when applied to the real world. Our evaluation shows that, in scenarios where the weakspots are severe, our procedure improves the model accuracy on weakspot samples by 25.2% requiring 2% of additional training data. This is an improvement of 4.5 percentage points compared to the traditional single model metric with the same amount of additional training data.