Abstraction Mechanism on Neural Machine Translation Models for Automated Program Repair
Bug fixing is a time-consuming task in software development. Automated program repair tools are created to fix programs with little or no human effort. Many existing tools are based on the generate-and-validate (G&V) approach, an automated program repair technique that generates a list of repair candidates and then selects the correct candidates as output. Another approach learns the repair process with machine learning models and then generates the candidates. One machine-learning-based approach is the end-to-end approach, which passes the input source code directly to the machine learning model and generates repair candidates in source code as output. This approach faces several challenges, such as a large vocabulary, a high rate of out-of-vocabulary (OOV) tokens, and difficulties in embedding learning. We propose an abstraction-and-reconstruction technique on top of end-to-end approaches that converts the training source code into templates to alleviate these problems. We train the machine learning model with abstracted bug-fix pairs from open-source projects. The abstraction process converts the source code into templates before passing it to the model. After training is complete, we use the trained model to predict the fix templates of new bugs. The output of the model is passed to the reconstruction layer to obtain the source code patch candidates. We train the machine learning model with 470,085 bug-fix pairs collected from the top 1,000 Python projects on GitHub. We use the QuixBugs dataset as the test set to evaluate the results; each fix is verified against the test cases provided by the QuixBugs dataset. We choose the traditional end-to-end approach as the baseline and compare it with the abstraction model. The accuracy of generating correct bug fixes increases from 25% to 57.5%, while the training time drops from 5.7 hours to 1.63 hours.
The overhead introduced by the reconstruction model is 218 milliseconds on average, or 23.32%, which is negligible compared to the time saved in training, 4.07 hours or 71.4%. We performed a deep analysis of the results and identified three reasons that may explain why the abstraction model outperforms the baseline. Compared to existing works, our approach has a complete reconstruction process that converts templates back to source code. This shows that adding a layer of abstraction increases the accuracy and reduces the training time of machine-learning-based automated bug repair tools.
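To make the abstraction-and-reconstruction idea concrete, the following is a minimal sketch of the kind of template layer the abstract describes: identifiers and literals in Python source are replaced with numbered placeholders, and the mapping is kept so that a predicted fix template can be converted back to source code. The placeholder naming (`ID_n`, `LIT_n`) and the token-level approach are illustrative assumptions, not the thesis's actual scheme.

```python
import io
import keyword
import tokenize

def abstract_code(source):
    """Turn Python source into a template by replacing identifiers and
    literals with numbered placeholders; return (template, mapping)."""
    fwd = {}  # original token string -> placeholder
    template = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # same identifier always maps to the same placeholder
            template.append(fwd.setdefault(tok.string, f"ID_{len(fwd)}"))
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            template.append(fwd.setdefault(tok.string, f"LIT_{len(fwd)}"))
        elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # drop layout tokens for this single-line sketch
        else:
            template.append(tok.string)  # keywords and operators stay as-is
    # invert the mapping for the reconstruction step
    return " ".join(template), {v: k for k, v in fwd.items()}

def reconstruct(template, mapping):
    """Substitute placeholders in a (possibly model-predicted) template
    back with the concrete tokens recorded during abstraction."""
    return " ".join(mapping.get(t, t) for t in template.split())
```

For example, `abstract_code("x = foo(x, 10)\n")` yields the template `ID_0 = ID_1 ( ID_0 , LIT_2 )`; feeding that template and the returned mapping to `reconstruct` recovers the token sequence `x = foo ( x , 10 )`. Because the model sees only a small, closed set of placeholders plus keywords and operators, the vocabulary shrinks and OOV identifiers disappear, which is the effect the abstract attributes to the abstraction layer.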
Cite this version of the work
Moshi Wei (2019). Abstraction Mechanism on Neural Machine Translation Models for Automated Program Repair. UWSpace. http://hdl.handle.net/10012/15119