Mitigating Risks to Dependability from Vibe-Coding C for Embedded Systems
Advisor
Fischmeister, Sebastian
Publisher
University of Waterloo
Abstract
Vibe coding is the process of using a Large Language Model (LLM) to iteratively generate
software code. It is popular, with 36% of workers at technology companies reporting
adoption of generative artificial intelligence for software engineering in 2024 [1]. At this
rate of use, LLM-generated code is quickly becoming part of the embedded systems that
comprise our everyday cyber-physical infrastructure. Most of this infrastructure is built
on C language code [2]. LLM-generated C code poses threats to dependability, exhibiting
faults such as buffer overflows, out-of-bounds writes, integer overflows, and more. In this
work, we contribute methods for improving the dependability of these systems in three key
parts: providing a real-world benchmark dataset for evaluating LLM-generated C code,
protecting LLM code generation from poisoning attacks, and detecting changes in production
embedded systems through power side-channel analysis.
This work begins with an examination and categorization of weaknesses in LLM-generated
C code for embedded systems networking. Our findings suggest that LLMs
perform poorly at programming tasks involving direct interactions with memory. Scores
on existing benchmarks for LLM-generated C code do not adequately express this difficulty, as these
benchmarks do not include sufficiently realistic C programming challenges. To support
future testing of LLMs, we introduce EmbedEvalC, a dataset of C coding challenges to
provide a benchmark against which LLMs can be evaluated on real-world tasks.
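As a minimal illustration of the kind of memory-interacting task discussed above, the sketch below parses a length-prefixed network frame into a fixed-size buffer. It is not taken from the thesis or from EmbedEvalC; the frame layout, the `PAYLOAD_MAX` size, and the `parse_frame` name are assumptions made for this example. A faulty version of this routine would trust the length byte and copy without the two bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MAX 32u  /* hypothetical fixed buffer size */

/* Hypothetical frame layout: byte 0 = payload length, bytes 1.. = payload.
 * Returns the number of payload bytes copied into out, or -1 if malformed. */
int parse_frame(const uint8_t *frame, size_t frame_len,
                uint8_t out[PAYLOAD_MAX])
{
    assert(frame != NULL && out != NULL);

    if (frame_len < 1u) {
        return -1;              /* no length byte present */
    }

    size_t claimed = frame[0];  /* length field controlled by the sender */

    /* A faulty version would call memcpy(out, frame + 1, claimed) here.
     * The first check prevents an out-of-bounds write to out; the second
     * prevents an out-of-bounds read past the end of frame. */
    if (claimed > PAYLOAD_MAX) {
        return -1;
    }
    if (claimed > frame_len - 1u) {
        return -1;
    }

    memcpy(out, frame + 1, claimed);
    return (int)claimed;
}
```

Both checks are needed: the length byte can exceed the destination buffer, and it can also exceed the number of bytes actually received.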
Retrieval-Augmented Code Generation (RACG) is an essential tool for vibe coding, but
presents new threats to dependability from poisoning attacks. If an attacker can cause a
RACG system to retrieve their crafted documents, they can induce the LLM to generate
code with weaknesses. To detect this attack, we introduce canary functions: specific
functions in the codebase that are regenerated and re-tested to determine whether
the addition of new documents induces new weaknesses.
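The comparison step of such a check could look roughly like the sketch below. It assumes that regeneration and testing have already been run and that each canary function's findings are summarized as a bitmask of weakness classes; the record layout, the weakness flags, and the function names are hypothetical and are not the thesis's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical weakness classes, one bit each. */
enum {
    WEAK_BUFFER_OVERFLOW  = 1u << 0,
    WEAK_OOB_WRITE        = 1u << 1,
    WEAK_INTEGER_OVERFLOW = 1u << 2
};

/* Test results for one canary function (hypothetical record layout). */
struct canary_result {
    const char *name;  /* canary function identifier */
    uint32_t before;   /* weaknesses found before the new documents were added */
    uint32_t after;    /* weaknesses found after regenerating with them */
};

/* Weakness classes present after the update but not before. */
uint32_t new_weaknesses(const struct canary_result *r)
{
    assert(r != NULL);
    return r->after & ~r->before;
}

/* Returns true if any canary gained a weakness. If first_hit is non-NULL,
 * it is set to the first canary that did. */
bool update_is_suspicious(const struct canary_result *results, size_t n,
                          const struct canary_result **first_hit)
{
    assert(results != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (new_weaknesses(&results[i]) != 0u) {
            if (first_hit != NULL) {
                *first_hit = &results[i];
            }
            return true;
        }
    }
    return false;
}
```

Only weaknesses that appear after the update count as a signal, so a canary that was already weak before the new documents arrived does not by itself flag them.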
Finally, we consider the black-box setting where a systems integrator seeks to detect
unexpected changes in embedded firmware. Such changes will only become more common
with the proliferation of vibe coding. We suggest using power side-channel analysis to
provide a feedback mechanism to a fuzzer in order to determine if a fuzzing input has
caused a new response from the system. We show that responses involving five or more
memory-interacting instructions are consistently detectable.
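One simple way to turn power traces into fuzzer feedback is to compare each new trace against the traces seen so far and report a new response when none is close. The sketch below uses a sum of squared differences with a fixed threshold. The distance measure, the trace length, the threshold, and the function names are all assumptions for illustration and may differ from the detection method used in the thesis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRACE_LEN 4u  /* hypothetical samples per trace, kept tiny for illustration */

/* Sum of squared differences between two power traces. */
double trace_distance(const double a[TRACE_LEN], const double b[TRACE_LEN])
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < TRACE_LEN; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* True if trace is farther than threshold from every known trace,
 * i.e. the fuzzing input appears to have caused a new response. */
bool is_new_response(const double trace[TRACE_LEN],
                     const double known[][TRACE_LEN], size_t n_known,
                     double threshold)
{
    for (size_t i = 0; i < n_known; i++) {
        if (trace_distance(trace, known[i]) <= threshold) {
            return false;
        }
    }
    return true;
}
```

In a fuzzing loop, an input whose trace is reported as new would be kept for further mutation, and its trace would be added to the known set.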
In this work, we suggest a collection of techniques to mitigate risks to the dependability
of embedded systems posed by LLM-generated C code.
Abstract Citations:
[1] A. Singla, A. Sukharevsky, L. Yee, M. Chui, and B. Hall, “The state of AI: How organizations are rewiring to capture value,” McKinsey & Company, Mar. 2025.
[2] P. Soulier, D. Li, and J. R. Williams, “A Survey of Language-Based Approaches to Cyber-Physical and Embedded System Development,” Tsinghua Science and Technology, vol. 20, no. 2, pp. 130–141, 2015.