Attaching Social Interactions Surrounding Software Changes to the Release History of an Evolving Software System
Open source software is designed, developed and maintained by means of electronic media. These media carry discussions on a variety of issues reflecting the evolution of a software system, such as reports on bugs and their fixes, new feature requests, design changes, refactoring tasks, and test plans. Often this valuable information is simply buried as plain text in mailing list archives. We believe that email interactions collected prior to a product release are related to its source code modifications or, if they do not immediately correlate with change events in the current release, may affect changes in future revisions. In this work, we propose a method to reason about the nature of software changes by mining electronic mailing list archives and correlating them with source code changes. Our approach is based on the assumption that developers use meaningful names and their domain knowledge when defining source code identifiers, such as classes and methods. We employ natural language processing techniques to find similarity between the source code change history and the history of public interactions surrounding these changes. Exact string matching is applied to find a set of common concepts between the discussion vocabulary and the changed code vocabulary. We apply our correlation method to two software systems, LSEdit and Apache Ant. The results of these exploratory case studies provide evidence of similarity between the content of free-form text emails among developers and the actual modifications in the code. We identify a set of correlation patterns between discussion and changed code vocabularies and discover that some releases referred to as minor should instead fall under the major category. These patterns can be used to estimate the type of a change and the time needed to implement it.
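As an illustration of the exact string matching step described above, the following is a minimal sketch, not the authors' implementation. It assumes that identifiers are split on camelCase and underscore boundaries and that a small stop-word list is removed from the email text; neither choice is stated in the abstract. The function names, the stop-word list, and the sample inputs are all hypothetical.

```python
import re

# Hypothetical stop-word list; the abstract does not specify one.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "is", "for", "on"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def code_vocabulary(changed_identifiers):
    """Collect the terms that occur in identifiers changed in a release."""
    vocabulary = set()
    for identifier in changed_identifiers:
        vocabulary.update(split_identifier(identifier))
    return vocabulary


def discussion_vocabulary(emails):
    """Collect lowercase words from email bodies, minus stop words."""
    vocabulary = set()
    for body in emails:
        vocabulary.update(re.findall(r"[a-z]+", body.lower()))
    return vocabulary - STOP_WORDS


def common_concepts(emails, changed_identifiers):
    """Exact string match: terms present in both vocabularies."""
    return discussion_vocabulary(emails) & code_vocabulary(changed_identifiers)


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    emails = ["The parser fails when the build file has a nested target element."]
    identifiers = ["parseBuildFile", "TargetElement", "XMLParser"]
    print(sorted(common_concepts(emails, identifiers)))
```

Because the match is exact, the code term "parse" does not match the email word "parser"; only "parser" (from `XMLParser`) is counted. A stemming step would merge such variants, but that would be a different technique from the exact matching named in the abstract.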