dc.contributor.author: Watson, Daniel
dc.description.abstract: Public software repositories such as GitHub make the development history of an open source software system transparent. Source code commits, discussions about new features and bugs, and code reviews are stored and carefully attributed to the appropriate developers. However, governments may seek to analyze these repositories to identify citizens who contribute to projects they disapprove of, such as those involving cryptography or social media. While developers who seek anonymity may contribute under assumed identities, their body of public work may be characteristic enough to betray who they really are. The ability to contribute anonymously to public bodies of knowledge is extremely important to the future of technological and intellectual freedoms. Just as in security hacking, the only way to protect vulnerable individuals is to demonstrate the means and strength of available attacks, so that those concerned know of the need and can develop the means to protect themselves. In this work, we present a method to de-anonymize source code contributors based on the authors' intrinsic programming style. First, we present a partial replication study in which we attempt to de-anonymize a large number of entries to the Google Code Jam competition. We base our approach on Caliskan-Islam et al. (2015), but modify the feature set and modelling strategy for scalability and feature-selection robustness. We did not reach the 0.98 F1 reported in that prior work, but obtained a still reasonable 0.71 F1 under identical experimental conditions, and 0.88 F1 given more data from the same set. Second, we present an exploratory study focused on de-anonymizing programmers who have contributed to a repository, using other commits from the same repository as training data. We train random-forest classifiers using programmer data collected from 37 medium to large open-source repositories.
Given a choice between active developers in a project, we were able to correctly determine authorship of a given function about 75% of the time, without the use of identifying metadata or comments. We were also able to correctly validate a contributor as the author of a questioned function with 80% recall and 65% precision. This exploratory study provides empirical support for our approach. Finally, we present the results of a similar but more difficult study in which we attempt to de-anonymize a repository in the same manner, but without using the target repository as training data. To do this, we gather as much training data as possible from the repository's contributors through the GitHub API. We evaluate our technique on 3 repositories: Bitcoin, Ethereum (cryptocurrencies) and TrinityCore (a game engine). Our results in this experiment contrast starkly with those of the intra-repository study: accuracies were 35% for Bitcoin, 22% for Ethereum, and 21% for TrinityCore, with candidate set sizes of 6, 5, and 7 respectively. Our results indicate that we can do somewhat better than random guessing, even under difficult experimental conditions, but they also indicate some fundamental issues with the state of the art of code stylometry. In this work we present our methodology, results, and some comments on past empirical studies, the difficulties we faced, and likely hurdles for future work in the area. (en)
dc.publisher: University of Waterloo (en)
dc.subject: code stylometry (en)
dc.subject: authorship attribution (en)
dc.subject: mining software repositories (en)
dc.subject: software engineering (en)
dc.subject: machine learning (en)
dc.title: Source Code Stylometry and Authorship Attribution for Open Source (en)
dc.type: Master Thesis (en)
dc.pending: false
… R. Cheriton School of Computer Science (en)
… Science (en)
… of Waterloo (en)
uws-etd.degree: Master of Science (en)
uws.contributor.advisor: Godfrey, Michael
uws.contributor.affiliation1: Faculty of Mathematics (en)


All items in UWSpace are protected by copyright, with all rights reserved.
