Author: Mathews, Noble Saji
Dates: 2025-01-17; 2025-01-17; 2025-01-17; 2025-01-13
URI: https://hdl.handle.net/10012/21379

Abstract:

Large Language Models (LLMs) like GPT-4 and Llama 3 are transforming software development by automating code generation and test case creation. This thesis investigates two pivotal aspects of LLM-assisted development: the integration of Test-Driven Development (TDD) principles into code generation workflows and the limitations of LLM-based test-generation tools in detecting bugs.

LLMs have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit by requiring developers to write tests before the functional code, ensuring alignment with the initial problem statement. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This thesis investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate the hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statement improves code generation outcomes, using established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently show that including test cases leads to a higher success rate in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.

As we progress toward AI-native software engineering, a logical follow-up question arises: why not allow LLMs to generate these tests as well? A growing body of research and commercial tools now focuses on automated test case generation using LLMs. However, a concerning trend is that these tools often generate tests by inferring requirements from the code itself, which runs counter to the principles of TDD and raises questions about how the tools behave when the implicit assumption that the code under test is correct does not hold. We therefore set out to critically examine whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or instead unintentionally validate faulty code. Since bugs are exposed only by failing test cases, we explore the question: can these tools truly achieve the intended objectives of software testing when their test oracles are designed to pass? Using real human-written buggy code as input, we evaluate these tools and show how LLM-generated tests can fail to detect bugs and, more alarmingly, how the tools' design can worsen the situation by validating buggy behaviour in the generated test suite and rejecting bug-revealing tests. These findings raise important questions about the validity of the design behind LLM-based test generation tools and about their impact on software quality and test suite reliability.

Together, these studies provide critical insights into the promise and pitfalls of integrating LLMs into software development processes, offering guidelines for improving their reliability and their impact on software quality.

Language: en
Keywords: llm; code generation; testing; tdd; software engineering
Title: Code Generation and Testing in the Era of AI-Native Software Engineering
Type: Master Thesis