Software Bug Detection Using the N-gram Language Model
MetadataShow full item record
Over the years many techniques have been proposed to infer programming rules in order to improve software reliability. The techniques use violations of these programming rules to detect software defects. This thesis introduces an approach, NGDetection, which models a software’s source code using the n-gram language model in order to find bugs and refactoring opportunities in a number of open source Java projects. The use of the n-gram model to infer programming rules for software defect detection is a new domain for the application of the n-gram model. In addition to the n-gram model, NGDetection leverages two additional techniques to address limitations of existing defect detection techniques. First, the approach infers combined programming rules, which are a combination of infrequent programming rules with their related programming rules, to detect defects in a way other approaches cannot. Second, the approach integrates control flow into the n-gram model which increases the accuracy of defect detection. The approach is evaluated on 14 open source Java projects which range from 36 thousand lines of code (KLOC) to 1 million lines of code (MLOC). The approach detected 310 violations in the latest version of the projects, 108 of which are useful violations, i.e., 43 bugs and 65 refactoring opportunities. Of the 43 bugs, 32 were reported to the developers and the remaining are in the process of being reported. Among the reported bugs, 2 have been confirmed by the developers, while the rest await confirmation. For the 108 usefulviolations, at least 26 cannot be detected by existing techniques.