Language Model Inference on FPGA with Integer-only Operations
Date
2025-04-17
Authors
Advisor
Kapre, Nachiket
Patel, Hiren
Publisher
University of Waterloo
Abstract
Large Language Models (LLMs) currently dominate Artificial Intelligence (AI) applications, but their deployment at the edge is limited by their computational complexity and power consumption. This thesis addresses this challenge by investigating integer-only acceleration of transformer models on FPGAs, focusing on the BERT architecture. We demonstrate that removing floating-point operations from the inference pipeline, particularly from non-linear functions such as GELU, Softmax, and Layer Normalization, improves performance without sacrificing accuracy.
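
To make the idea concrete, the sketch below shows one common way a non-linear function can be made integer-only in hardware: because an INT8 activation takes only 256 possible values, GELU can be precomputed offline into a small lookup table, so the inference path performs a single table read per element. This is an illustrative example under assumed quantization scales (kInScale, kOutScale are hypothetical placeholders), not the thesis's actual implementation.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <iostream>

// Hypothetical quantization scales (assumptions, not values from the thesis).
constexpr float kInScale  = 0.05f;   // dequantize: x = q * kInScale
constexpr float kOutScale = 0.05f;   // requantize: q_out = round(y / kOutScale)

// Build a 256-entry lookup table offline (in float), so that inference
// itself touches only integers: an INT8 activation indexes directly into
// precomputed INT8 GELU outputs.
std::array<int8_t, 256> build_gelu_lut() {
    std::array<int8_t, 256> lut{};
    for (int q = -128; q < 128; ++q) {
        float x = q * kInScale;
        // Exact GELU via erf; evaluated only at table-build time, never on the device.
        float y = 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
        int qy = static_cast<int>(std::lround(y / kOutScale));
        qy = std::min(127, std::max(-128, qy));
        lut[static_cast<uint8_t>(q)] = static_cast<int8_t>(qy);
    }
    return lut;
}

int main() {
    const auto lut = build_gelu_lut();
    // Integer-only inference path: one table read per activation.
    int8_t q_in  = 25;                                  // represents x = 1.25
    int8_t q_out = lut[static_cast<uint8_t>(q_in)];
    std::cout << "GELU(" << q_in * kInScale << ") ~= "
              << q_out * kOutScale << "\n";
    return 0;
}

A polynomial or shift-based approximation (as in I-BERT-style schemes) is an alternative when table storage is undesirable; the lookup table is simply the easiest integer-only variant to illustrate.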
Our pipelined, batched architecture processes multiple sequences in parallel and makes efficient use of FPGA resources. It achieves a 2.6x throughput improvement over single-sequence inference and at least a 10x speedup over offloading inference to the CPU. Experimental results show that, with INT8 quantization, our implementation achieves accuracy comparable to floating-point models on the GLUE benchmark tasks. These findings show that integer-only transformer inference on FPGAs is a feasible way to deploy complex language models on resource-constrained edge devices, opening the door to privacy-conscious, low-latency AI applications.
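
For reference, the following sketch illustrates the kind of arithmetic INT8 quantization implies for the linear layers: INT8 operands accumulate into an INT32 register, and the requantization scale is folded offline into a fixed-point multiplier and shift, so no floating-point operations remain at inference time. The scales (kActScale, kWScale, kOutScale) and the shift amount are assumed for illustration and do not come from the thesis.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical tensor scales (assumptions, not values from the thesis).
constexpr float kActScale = 0.02f;   // INT8 activation scale
constexpr float kWScale   = 0.01f;   // INT8 weight scale
constexpr float kOutScale = 0.04f;   // INT8 output scale

int8_t clamp_i8(int32_t v) {
    return static_cast<int8_t>(std::min<int32_t>(127, std::max<int32_t>(-128, v)));
}

int main() {
    // Toy dot product: every operation on the inference path is integer.
    std::vector<int8_t> act = {100, -30, 90, 70};
    std::vector<int8_t> wgt = {40, 120, -80, 30};

    int32_t acc = 0;                                     // INT32 accumulator
    for (size_t i = 0; i < act.size(); ++i)
        acc += static_cast<int32_t>(act[i]) * wgt[i];

    // Requantize: the combined scale kActScale*kWScale/kOutScale is folded
    // offline into a fixed-point multiplier and shift, so no floats at run time.
    const float   real_mult = kActScale * kWScale / kOutScale;              // offline only
    const int32_t shift     = 15;
    const int32_t fp_mult   = static_cast<int32_t>(std::lround(real_mult * (1 << shift)));

    int64_t scaled = static_cast<int64_t>(acc) * fp_mult;
    int8_t  q_out  = clamp_i8(static_cast<int32_t>(scaled >> shift));

    std::cout << "acc=" << acc << "  q_out=" << static_cast<int>(q_out)
              << "  (~" << q_out * kOutScale << " in real units)\n";
    return 0;
}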
Keywords
Large Language Models, Artificial Intelligence, FPGA, Inference Acceleration