Evaluating Deep Learning-based Vulnerability Detection Models on Realistic Datasets
The impact of software vulnerabilities on daily-used software systems is alarming. Despite numerous proposed deep learning-based models to automate vulnerability detection, the detection of software vulnerabilities remains a significant issue. While some techniques report precision/recall scores as high as 99%, our experience leads us to believe that these models may underperform in realistic settings, specifically when vulnerability detection models are evaluated on the entire source code repository of a project. Therefore, in this thesis, we create a more comprehensive vulnerability detection dataset (i.e., Comp-Vul), which aims to accurately represent the realistic settings in which vulnerability detection models are deployed.

We then evaluate the performance of two state-of-the-art deep learning-based models, LineVul and DeepWukong, on the Comp-Vul dataset. Our results show that the performance of both models drops drastically, with precision dropping by 86%-95% and F1 score dropping by 88%-91%. Our further investigation shows that the ratio of vulnerable to non-vulnerable samples in the evaluation dataset significantly impacts the performance metrics of these models. When we visualize the embeddings produced by the models, we find a substantial overlap between vulnerable and non-vulnerable samples. This shows that these models have difficulty distinguishing between vulnerable and non-vulnerable samples in the Comp-Vul dataset, resulting in a high number of false positives.

We introduce a new program slice-level vulnerability detection technique named SliceVul, which leverages the powerful capabilities of Transformers and incorporates semantic properties of source code programs such as data- and control-flow information. Our approach outperforms the existing state-of-the-art program slice-level vulnerability detection model, DeepWukong, when evaluated on the Comp-Vul dataset.
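To illustrate the imbalance effect described above, the sketch below shows how precision can collapse when the ratio of vulnerable to non-vulnerable samples changes, even if the model's per-class behavior stays fixed. The numbers (recall, false-positive rate, sample counts) are hypothetical and chosen only to demonstrate the effect; they are not results from the thesis.

```python
def precision(n_vul, n_nonvul, recall, fpr):
    """Precision of a detector with a fixed recall and false-positive rate."""
    tp = recall * n_vul      # true positives among vulnerable samples
    fp = fpr * n_nonvul      # false positives among non-vulnerable samples
    return tp / (tp + fp)

# Balanced benchmark: 1,000 vulnerable vs. 1,000 non-vulnerable samples.
balanced = precision(1_000, 1_000, recall=0.90, fpr=0.05)

# Realistic repository: the same 1,000 vulnerable samples buried in
# 100,000 non-vulnerable ones.
realistic = precision(1_000, 100_000, recall=0.90, fpr=0.05)

print(f"balanced:  {balanced:.2f}")   # -> 0.95
print(f"realistic: {realistic:.2f}")  # -> 0.15
```

With the same recall (0.90) and false-positive rate (0.05), precision falls from about 0.95 on the balanced set to about 0.15 on the imbalanced one, mirroring the drastic drops the abstract reports when models are evaluated on whole repositories.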
Our study argues that accurately identifying vulnerabilities using deep learning remains a challenging task that requires improved approaches to model evaluation and design. Further research and development, complemented by realistic evaluation datasets, is required to enhance the performance of these methods.
Cite this version of the work
Krishna Kanth Arumugam (2023). Evaluating Deep Learning-based Vulnerability Detection Models on Realistic Datasets. UWSpace. http://hdl.handle.net/10012/19471