Filtering overfitted automatically-generated patches by using automated test generation
MetadataShow full item record
"Generate-and-Validate'' (G&V) approaches to automatic program repair first generate candidate patches and then validate the patches against a test suite. Current G&V tools accept the first patch that passes all the test cases and are at risk of making a mistake: they can choose an ultimately buggy patch while leaving a correct patch unfound. These mistakes are often due to developer test suites being insufficient to correctly validate a patch. In our approach, we aim to improve existing test suites with automatic test generation. To circumvent the oracle problem, we compare the behavior of the buggy program with the behavior of the newly patched program, and if the patched program fails more, then the patch is considered to be overfitted and it is filtered. We evaluate our approach on 441 patches (both overfitted and correct) from three automatic repair systems and show that 67% (279/417) of overfitted patches are filtered. In addition, by further exploring the search space of patches for one of the defects, our approach filters overfitted patches until the correct patch is found and accepted. We also conduct a post-mortem relevance analysis of automatically-generated test cases to evaluate (1) how many of the test cases should aid the developer in manual debugging, (2) how many of the test cases should filter out more overfitted patches if better oracles are used, and (3) how many of the test cases are relevant to filtering overfitted patches (i.e., how many of them execute lines of code patched by SPR---one of the automatic repair systems in our study). The analysis shows that up to 40% of the automatically-generated test cases per bug can help developers conduct manual debugging and can filter out more overfitted patches if automatic repair systems are empowered with better oracles.
Cite this work
Alexey Zhikhartsev (2018). Filtering overfitted automatically-generated patches by using automated test generation. UWSpace. http://hdl.handle.net/10012/12874