Leveraging Machine Learning to Improve Software Reliability
Finding software faults is a critical task during the lifecycle of a software system. While traditional software quality control practices such as statistical defect prediction, static bug detection, regression test, and code review are often inefficient and time-consuming, which cannot keep up with the increasing complexity of modern software systems. We argue that machine learning with its capability in knowledge representation, learning, natural language processing, classification, etc., can be used to extract invaluable information from software artifacts that may be difficult to obtain with other research methodologies to improve existing software reliability practices such as statistical defect prediction, static bug detection, regression test, and code review. This thesis presents a suite of machine learning based novel techniques to improve existing software reliability practices for helping developers find software bugs more effective and efficient. First, it introduces a deep learning based defect prediction technique to improve existing statistical defect prediction models. To build accurate prediction models, previous studies focused on manually designing features that encode the statistical characteristics of programs. However, these features often fail to capture the semantic difference of programs, and such a capability is needed for building accurate prediction models. To bridge the gap between programs' semantics and defect prediction features, this thesis leverages deep learning techniques to learn a semantic representation of programs automatically from source code and further build and train defect prediction models by using these semantic features. We examine the effectiveness of the deep learning based prediction models on both the open-source and commercial projects. Results show that the learned semantic features can significantly outperform existing defect prediction models. Second, it introduces an n-gram language based static bug detection technique, i.e., Bugram, to detect new types of bugs with less false positives. Most of existing static bug detection techniques are based on programming rules inferred from source code. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs. To solve this issue, this thesis proposes Bugram, which leverages n-gram language models instead of rules to detect bugs. Specifically, Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware. We examine the effectiveness of our approach on the latest versions of 16 open-source projects. Results show that Bugram detected 25 new bugs, 23 of which cannot be detected by existing rule-based bug detection approaches, which suggests that Bugram is complementary to existing bug detection approaches to detect more bugs and generates less false positives. Third, it introduces a machine learning based regression test prioritization technique, i.e., QTEP, to find and run test cases that could reveal bugs earlier. Existing test case prioritization techniques mainly focus on maximizing coverage information between source code and test cases to schedule test cases for finding bugs earlier. While they often do not consider the likely distribution of faults in the source code. However, software faults are not often equally distributed in source code, e.g., around 80\% faults are located in about 20\% source code. Intuitively, test cases that cover the faulty source code should have higher priorities, since they are more likely to find faults. To solve this issue, this thesis proposes QTEP, which leverages machine learning models to evaluate source code quality and then adapt existing test case prioritization algorithms by considering the weighted source code quality. Evaluation on seven open-source projects shows that QTEP can significantly outperform existing test case prioritization techniques to find failed test cases early. Finally, it introduces a machine learning based approach to identifying risky code review requests. Code review has been widely adopted in the development process of both the proprietary and open-source software, which helps improve the maintenance and quality of software before the code changes being merged into the source code repository. Our observation on code review requests from four large-scale projects reveals that around 20\% changes cannot pass the first round code review and require non-trivial revision effort (i.e., risky changes). In addition, resolving these risky changes requires 3X more time and 1.6X more reviewers than the regular changes (i.e., changes pass the first code review) on average. This thesis presents the first study to characterize these risky changes and automatically identify these risky changes with machine learning classifiers. Evaluation on one proprietary project and three large-scale open-source projects (i.e., Qt, Android, and OpenStack) shows that our approach is effective in identifying risky code review requests. Taken together, the results of the four studies provide evidence that machine learning can help improve traditional software reliability such as statistical defect prediction, static bug detection, regression test, and code review.
Cite this version of the work
Song Wang (2019). Leveraging Machine Learning to Improve Software Reliability. UWSpace. http://hdl.handle.net/10012/14334