Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study

dc.contributor.author: Kaushik, Nilam
dc.date.accessioned: 2012-01-06T21:17:42Z
dc.date.available: 2012-01-06T21:17:42Z
dc.date.issued: 2012-01-06T21:17:42Z
dc.date.submitted: 2011-12-23
dc.description.abstract: Open source projects incorporate bug triagers to help with the task of bug report assignment to developers. One of the tasks of a triager is to identify whether an incoming bug report is a duplicate of a pre-existing report. To detect duplicate bug reports, a triager relies either on memory and experience or on the search capabilities of the bug repository. Both of these approaches can be time consuming for the triager and may also lead to the misidentification of duplicates. It has also been suggested that duplicate bug reports are not necessarily harmful; instead, they can complement each other to provide additional information for developers to investigate the defect at hand. This motivates the need for automated or semi-automated techniques for duplicate bug detection. In the literature, two main approaches have been proposed to solve this problem. The first approach is to prevent duplicate reports from reaching developers by automatically filtering them, while the second approach provides the triager with a list of the top-N similar bug reports, allowing the triager to compare the incoming bug report with the ones in the list. Previous works have tried to enhance the quality of the suggested lists, but the approaches either suffered from a poor recall rate or incurred additional runtime overhead, making the deployment of a retrieval system impractical. To the best of our knowledge, little work has been done on an exhaustive comparison of the performance of different Information Retrieval models (especially using more recent techniques such as topic modeling) on this problem, or on understanding the effectiveness of different heuristics across various application domains.
In this thesis, we compare the performance of word based models (derivatives of the Vector Space Model) such as TF-IDF and Log-Entropy with that of topic based models such as Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA) and Random Indexing (RI). We leverage heuristics that incorporate exception stack frames, surface features, summary and long description from the free-form text in the bug report. We perform experiments on subsets of bug reports from Eclipse and Firefox and achieve recall rates of 60% and 58%, respectively. We find that word based models, in particular a Log-Entropy based weighting scheme, outperform topic based ones such as LSI and LDA. Using historical bug data from Eclipse and NetBeans, we determine the optimal time frame for a desired level of duplicate bug report coverage. We realize an Online Duplicate Detection Framework that uses a sliding window of a constant time frame as a first step towards simulating incoming bug reports and recommending duplicates to the end user. [en]
dc.identifier.uri: http://hdl.handle.net/10012/6439
dc.language.iso: en [en]
dc.pending: false [en]
dc.publisher: University of Waterloo [en]
dc.subject: duplicate [en]
dc.subject: bug [en]
dc.subject.program: Electrical and Computer Engineering [en]
dc.title: Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study [en]
dc.type: Master Thesis [en]
uws-etd.degree: Master of Applied Science [en]
uws-etd.degree.department: Electrical and Computer Engineering [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle

Name: Kaushik_Nilam.pdf
Size: 5.25 MB
Format: Adobe Portable Document Format
