
Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits

dc.contributor.author: Keshav Ram, Achyudh Ram
dc.date.accessioned: 2020-07-15T20:27:54Z
dc.date.available: 2020-07-15T20:27:54Z
dc.date.issued: 2020-07-15
dc.date.submitted: 2020-07-06
dc.description.abstract: Public vulnerability databases such as CVE and NVD account for only 60% of security vulnerabilities present in open-source projects and are known to suffer from inconsistent quality. Over the last two years, there has been considerable growth in the number of known vulnerabilities across projects available in various repositories such as NPM and Maven Central. However, public vulnerability management databases such as NVD suffer from poor coverage and are too slow to add new vulnerabilities. Such an increasing risk calls for a mechanism to promptly infer the presence of security threats in open-source projects. In this thesis, we seek to address this problem by treating the identification of security-relevant commits as a classification task. Since existing literature on neural networks for commit classification is sparse, we first turn to document classification for inspiration. Extensive research in this domain, on the other hand, has resulted in increasingly complex neural models, with a number of researchers questioning the necessity of such architectures. We conduct a large-scale reproducibility study of several recent neural network models, and show that well-executed, simpler models are quite effective for document classification. We find that a simple bi-directional LSTM with regularization yields competitive accuracy and F1 on four benchmark document classification datasets. Based on trends in document classification and the domain-specific peculiarities of commit classification, we build a family of hierarchical neural network models for the identification of security-relevant commits. We evaluate five different input representations and show that models that learn on tokens extracted from the commit diff are simpler and more effective than models that learn from path-contexts extracted from the AST. We also show that providing the models with contextual information through features extracted from the source code improves accuracy and F1 further, and discuss why path-based models might not capture any additional information compared to token-based models for this task. Finally, we make a case for reporting standard deviation of test scores across multiple runs in order to avoid erroneous conclusions and establish robust baselines. [en]
dc.identifier.uri: http://hdl.handle.net/10012/16061
dc.language.iso: en [en]
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.subject: security vulnerabilities [en]
dc.subject: security-relevant commits [en]
dc.subject: neural networks [en]
dc.subject: regularization [en]
dc.subject: path-based representations [en]
dc.subject: open source software [en]
dc.title: Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits [en]
dc.type: Master Thesis [en]
uws-etd.degree: Master of Mathematics [en]
uws-etd.degree.department: David R. Cheriton School of Computer Science [en]
uws-etd.degree.discipline: Computer Science [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws.contributor.advisor: Nagappan, Meiyappan
uws.contributor.advisor: Lin, Jimmy
uws.contributor.affiliation1: Faculty of Mathematics [en]
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle
Name: KeshavRam_AchyudhRam.pdf
Size: 1.73 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission