Title: Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs
Author: Elayat, Omar
Type: Master Thesis
Dates: 2025-08-19; 2025-08-14
URI: https://hdl.handle.net/10012/22203

Abstract: The widespread adoption of Large Language Models (LLMs) in various applications has pushed the demand for efficient hardware acceleration beyond the capabilities of traditional platforms. Due to their highly parallel architecture and ease of deployment, Field Programmable Gate Arrays (FPGAs) are widely used to accelerate LLMs. However, FPGAs' on-chip memory resources are too limited to accommodate the trained models. While existing FPGA-based solutions have demonstrated promising throughput and energy efficiency, they often rely on abundant fabric resources, assume high-bandwidth devices that are not suitable for deployment at the edge, or employ highly customized acceleration architectures that do not scale with advances in LLM architectures. This thesis addresses these challenges by proposing a novel on-chip resource manager architecture for integer encoder-based transformer inference, with a focus on Bidirectional Encoder Representations from Transformers (BERT) models. We target resource-constrained FPGAs with limited memory bandwidth. We show that, through structured operation scheduling and resource sharing, significant performance improvements can be achieved. The proposed resource-shared infrastructure is also designed to be modular, allowing newly introduced computation blocks to be easily integrated into the accelerator without requiring major modifications or incurring additional off-chip data movement. Demonstrated on a fully quantized integer-only variant of the BERT model as a representative workload, the proposed system achieves a 2.32x latency improvement over the baseline custom accelerator, 1.17x over the Jetson Orin Nano GPU, and at least 23.63x over a CPU. The design is validated on two FPGAs: the PYNQ-Z1 as a low-end proof of concept and the KV260 as a mid-range deployment target.

Language: en
Keywords: FPGA; LLMs; Accelerator