SWE-bench-secret: Automating AI Agent Evaluation for Software Engineering Tasks

dc.contributor.author: Kio, Godsfavour
dc.date.accessioned: 2025-01-21T18:59:25Z
dc.date.available: 2025-01-21T18:59:25Z
dc.date.issued: 2025-01-21
dc.date.submitted: 2025-01-20
dc.description.abstract: The rise of large language models (LLMs) has sparked significant interest in their application to software engineering tasks. However, as new and more capable LLMs emerge, existing evaluation benchmarks (such as HumanEval and MBPP) are no longer sufficient for gauging their potential. While benchmarks like SWE-bench and SWE-bench-java provide a foundation for evaluating these models on real-world challenges, publicly available datasets face potential contamination risks, compromising their reliability for assessing generalization. To address these limitations, we introduce SWE-bench-secret, a private dataset carefully selected to evaluate AI agents on software engineering tasks spanning multiple years, including some originating after the models’ training data cutoff. Derived from three popular GitHub repositories, it comprises 457 task instances designed to mirror SWE-bench’s structure while maintaining strict data secrecy. Evaluations on a lightweight subset, called SWE-Secret-Lite, reveal significant performance gaps between public and private datasets, highlighting the increased difficulty models face when dealing with tasks that extend beyond familiar patterns found in publicly available data. Additionally, we provide a secure mechanism that allows researchers to submit their agents for evaluation without exposing the dataset. Our findings emphasize the need for improved logical reasoning and adaptability in AI agents, particularly when confronted with tasks that lie outside well-known public training data distributions. By introducing a contamination-free evaluation framework and a novel secret benchmark, this work strengthens the foundation for advancing benchmarking methodologies and promoting the development of more versatile, context-aware AI agents.
dc.identifier.uri: https://hdl.handle.net/10012/21398
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: benchmarking
dc.subject: large language models (LLMs)
dc.subject: AI agents
dc.title: SWE-bench-secret: Automating AI Agent Evaluation for Software Engineering Tasks
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Nagappan, Meiyappan
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Kio_Godsfavour.pdf
Size: 652.59 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission