Mitigating Risks to Dependability from Vibe-Coding C for Embedded Systems

Dunne, Murray

Files

Dunne_Murray.pdf (1.85 MB)

Date

2026-05-13

Authors

Dunne, Murray

Advisor

Fischmeister, Sebastian

Publisher

University of Waterloo

Abstract

Vibe coding is the process of using a Large Language Model (LLM) to iteratively generate software code. It is popular, with 36% of workers at technology companies reporting adoption of generative artificial intelligence for software engineering in 2024 [1]. At this rate of use, LLM-generated code is quickly becoming part of the embedded systems that comprise our everyday cyber-physical infrastructure. Most of this infrastructure is built on C code [2]. LLM-generated C code poses threats to dependability, exhibiting faults such as buffer overflows, out-of-bounds writes, integer overflows, and more. In this work, we contribute methods for improving the dependability of these systems in three key parts: providing a real-world benchmark dataset for evaluating LLM-generated C code, protecting LLM code generation from poisoning attacks, and detecting changes in production embedded systems through power side-channel analysis.

This work begins with an examination and categorization of weaknesses in LLM-generated C code for embedded systems networking. Our findings suggest that LLMs perform poorly at programming tasks involving direct interactions with memory. Scores on existing benchmarks for LLM-generated C code do not adequately express this difficulty, as these benchmarks do not include sufficiently realistic C programming challenges. To support future testing of LLMs, we introduce EmbedEvalC, a dataset of C coding challenges that provides a benchmark against which LLMs can be evaluated on real-world tasks.

Retrieval-Augmented Code Generation (RACG) is an essential tool for vibe coding, but presents new threats to dependability from poisoning attacks. If an attacker can cause a RACG system to retrieve their crafted documents, they can induce the LLM to generate code with weaknesses.
To detect this attack, we introduce canary functions, a process by which specific functions in the codebase are regenerated and re-tested to determine whether the addition of new documents induces new weaknesses.

Finally, we consider the black-box setting where a systems integrator seeks to detect unexpected changes in embedded firmware. Such changes will only become more common with the proliferation of vibe coding. We suggest using power side-channel analysis to provide a feedback mechanism to a fuzzer in order to determine if a fuzzing input has caused a new response from the system. We show that responses involving five or more memory-interacting instructions are consistently detectable.

In this work, we suggest a collection of techniques to mitigate risks to the dependability of embedded systems posed by LLM-generated C code.

Abstract Citations:

[1] Alex Singla, Alexander Sukharevsky, Lareina Yee, Michael Chui, and Bryce Hall, "The state of AI: How organizations are rewiring to capture value," McKinsey & Company, March 2025.

[2] P. Soulier, D. Li, and J. R. Williams, "A Survey of Language-Based Approaches to Cyber-Physical and Embedded System Development," Tsinghua Science and Technology, vol. 20, no. 2, pp. 130–141, 2015.