Information Theoretic Evaluation of Change Prediction Models for Large-Scale Software
During software development and maintenance, as a software system evolves, changes are made and bugs are fixed in various files. In large-scale systems, file histories are stored in software repositories, such as CVS, which record modifications. By studying software repositories, we can learn about open source software development processes. Knowing in advance where these changes will happen lets managers and developers concentrate on those files. Because the software development process is unpredictable, proposing an accurate change prediction model is hard. It is even harder to compare different models against the actual model of changes, which is not available.

In this thesis, we first analyze the information generated during the development process, which can be obtained by mining the software repositories. We observe that the change data follows a Zipf distribution and exhibits self-similarity. Based on the extracted data, we then develop three probabilistic models to predict which files will have changes or bugs. One purpose of creating these models is to rank the files of a system by how susceptible they are to faults.

The first model is Maximum Likelihood Estimation (MLE), which simply counts the number of events (i.e., changes or bugs) that occur in each file and normalizes the counts to compute a probability distribution. The second model is Reflexive Exponential Decay (RED), in which we postulate that the predictive rate of modification in a file is incremented by any modification to that file and then decays exponentially. Each new bug occurring in that file adds a new exponential effect on top of the existing one. The third model is called RED Co-Changes (REDCC). With each modification to a given file, the REDCC model not only increments that file's predictive rate, but also increments the rate of other files that are related to the given file through previous co-changes.
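As a rough illustration of the first two models, the following Python sketch computes an MLE distribution and a RED-style distribution from a list of events. The event format `(time, file)`, the function names, the `decay` parameter, and the unit increment per event are all assumptions made for this example; the abstract does not give the thesis's exact formulation, and REDCC is not shown.

```python
import math
from collections import Counter

def mle_distribution(events):
    """MLE sketch: count events per file and normalize the counts."""
    counts = Counter(f for _, f in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f: c / total for f, c in counts.items()}

def red_distribution(events, now, decay=0.5):
    """RED-style sketch (assumed form): each past event adds a unit
    increment to its file's rate, which then decays exponentially
    with the time elapsed since the event."""
    rates = Counter()
    for t, f in events:
        if t <= now:
            rates[f] += math.exp(-decay * (now - t))
    total = sum(rates.values())
    if total == 0:
        return {}
    return {f: r / total for f, r in rates.items()}

# Hypothetical history: a.c changed often long ago, b.c changed recently.
events = [(0.0, "a.c"), (1.0, "a.c"), (2.0, "a.c"), (9.0, "b.c"), (10.0, "b.c")]
mle = mle_distribution(events)
red = red_distribution(events, now=10.0, decay=0.5)
```

On this made-up history, the count-based model ranks `a.c` first because it has more events overall, while the decay-based model ranks `b.c` first because its events are recent.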

We then present an information-theoretic approach to evaluate the performance of different prediction models. In this approach, the closeness of a model's distribution to the actual, unknown probability distribution of the system is measured using cross entropy. We evaluate our prediction models empirically using the proposed information-theoretic approach on six large open source systems. Based on this evaluation, we observe that, of our three prediction models, the REDCC model predicts the distribution that is closest to the actual distribution for all the studied systems.
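The evaluation idea can be sketched as follows. Because the actual distribution is unknown, this example estimates cross entropy as the average of -log2 q(f) over files f observed to change later, where q is the model's distribution. The function name, the `floor` value used to avoid log(0) for files the model assigns zero probability, and the sample data are assumptions for illustration, not the thesis's procedure.

```python
import math

def cross_entropy(model, observed_files, floor=1e-6):
    """Empirical cross entropy in bits per event.

    model: dict mapping file -> predicted probability.
    observed_files: files that actually changed in the evaluation period.
    floor: assumed minimum probability for unseen files (a crude
           simplification; it leaves the model slightly unnormalized).
    """
    total = 0.0
    for f in observed_files:
        total += -math.log2(max(model.get(f, 0.0), floor))
    return total / len(observed_files)

# Two hypothetical models scored against the same observed changes.
model_a = {"a.c": 0.5, "b.c": 0.5}
model_b = {"a.c": 0.1, "b.c": 0.9}
observed = ["b.c"] * 9 + ["a.c"]

ce_a = cross_entropy(model_a, observed)
ce_b = cross_entropy(model_b, observed)
```

A lower value means the model's distribution is closer to the observed behavior; here `model_b` scores lower than `model_a` because it puts most of its probability on the file that actually changed most often.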