Language Model Inference on FPGA with Integer-only Operations

Date

2025-04-17

Advisor

Kapre, Nachiket
Patel, Hiren

Publisher

University of Waterloo

Abstract

Large Language Models (LLMs) currently dominate Artificial Intelligence (AI) applications, but their computational complexity and power consumption limit their deployment on edge devices. This thesis addresses that challenge by investigating integer-only acceleration of transformer models on FPGAs, focusing on the BERT architecture. We demonstrate that removing floating-point operations from the inference pipeline, particularly from non-linear functions such as GELU, Softmax, and Layer Normalization, improves performance without sacrificing accuracy. Our pipelined, batched architecture processes multiple sequences in parallel and makes efficient use of FPGA resources. We achieve a 2.6x throughput improvement over single-sequence inference and at least a 10x speedup over offloading inference to a CPU. Experimental results show that, with INT8 quantization, our implementation achieves accuracy comparable to floating-point models on the GLUE benchmark tasks. These findings demonstrate that integer-only transformer inference on FPGAs is a feasible way to deploy complex language models on resource-constrained edge devices, opening the door to privacy-conscious, low-latency AI applications.
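
The thesis's kernels are not reproduced in this abstract. As an illustration of the integer-only idea it describes, the sketch below computes Softmax using only fixed-point integer arithmetic, in the style of I-BERT's second-order polynomial approximation of exp(). All function names, scales, and constants here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def int_poly_exp(q, scale):
    """Approximate exp(q * scale) for q <= 0 with integer arithmetic.

    Writes x = q * scale as x = (-ln 2) * z + r with r in (-ln 2, 0],
    evaluates exp(r) with the polynomial a*(r + b)^2 + c, then applies
    the 2^(-z) factor as a right shift. Returns (values, output scale).
    """
    ln2 = 0.6931471805599453
    a, b, c = 0.3585, 1.353, 0.344   # I-BERT polynomial coefficients

    q_ln2 = int(np.floor(ln2 / scale))    # ln 2 expressed in the input scale
    z = np.floor_divide(-q, q_ln2)        # how many multiples of ln 2 to remove
    r = q + z * q_ln2                     # remainder, in (-ln 2, 0]

    q_b = int(np.floor(b / scale))
    q_c = int(np.floor(c / (a * scale * scale)))
    q_poly = (r + q_b) ** 2 + q_c         # integer polynomial evaluation
    q_exp = np.right_shift(q_poly, z)     # multiply by 2^(-z) via shift
    return q_exp, a * scale * scale

def int_softmax(q, scale, out_bits=8):
    """Integer-only softmax over the last axis of quantized logits q."""
    q = np.asarray(q, dtype=np.int64)
    q = q - q.max(axis=-1, keepdims=True)   # shift so every input is <= 0
    q_exp, _ = int_poly_exp(q, scale)       # shared scale cancels in the ratio
    denom = q_exp.sum(axis=-1, keepdims=True)
    return (q_exp << out_bits) // denom     # probabilities scaled by 2^out_bits

# Quantized logits at scale 1/16, i.e. real values [3.0, 2.0, 0.5];
# the output approximates softmax(x) * 256.
print(int_softmax([48, 32, 8], scale=1.0 / 16))
```

Shifting by the row maximum keeps every input to the exponential non-positive, so the power-of-two factor reduces to a right shift, an operation that maps naturally onto FPGA logic without any floating-point hardware.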

Keywords

Large Language Models, Artificial Intelligence, FPGA, Inference Acceleration
