Language Model Inference on FPGA with Integer-only Operations
Date
2025-04-17
Authors
Advisor
Kapre, Nachiket
Patel, Hiren
Publisher
University of Waterloo
Abstract
Large Language Models (LLMs) currently dominate Artificial Intelligence (AI) applications, but their deployment at the edge is limited by their computational complexity and power consumption. This thesis addresses this challenge by investigating integer-only acceleration of transformer models on FPGAs, focusing on the BERT architecture. We demonstrate that removing floating-point operations from the inference pipeline, particularly from non-linear functions such as GELU, Softmax, and Layer Normalization, improves performance without sacrificing accuracy.
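
To make the idea concrete, the sketch below shows one common way a non-linear function can be made integer-only in hardware: because an INT8 activation takes only 256 possible values, GELU can be precomputed offline into a small lookup table, so the inference path performs a single table read per element. This is an illustrative example under assumed quantization scales (kInScale, kOutScale are hypothetical placeholders), not the thesis's actual implementation.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <iostream>

// Hypothetical quantization scales (assumptions, not values from the thesis).
constexpr float kInScale  = 0.05f;   // dequantize: x = q * kInScale
constexpr float kOutScale = 0.05f;   // requantize: q_out = round(y / kOutScale)

// Build a 256-entry lookup table offline (in float), so that inference
// itself touches only integers: an INT8 activation indexes directly into
// precomputed INT8 GELU outputs.
std::array<int8_t, 256> build_gelu_lut() {
    std::array<int8_t, 256> lut{};
    for (int q = -128; q < 128; ++q) {
        float x = q * kInScale;
        // Exact GELU via erf; evaluated only at table-build time, never on the device.
        float y = 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
        int qy = static_cast<int>(std::lround(y / kOutScale));
        qy = std::min(127, std::max(-128, qy));
        lut[static_cast<uint8_t>(q)] = static_cast<int8_t>(qy);
    }
    return lut;
}

int main() {
    const auto lut = build_gelu_lut();
    // Integer-only inference path: one table read per activation.
    int8_t q_in  = 25;                                  // represents x = 1.25
    int8_t q_out = lut[static_cast<uint8_t>(q_in)];
    std::cout << "GELU(" << q_in * kInScale << ") ~= "
              << q_out * kOutScale << "\n";
    return 0;
}

A polynomial or shift-based approximation (as in I-BERT-style schemes) is an alternative when table storage is undesirable; the lookup table is simply the easiest integer-only variant to illustrate.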
Our pipelined, batched architecture processes multiple sequences in parallel and makes efficient use of FPGA resources. It achieves a 2.6x throughput improvement over single-sequence inference and at least a 10x speedup over offloading inference to the CPU. Experimental results show that, with INT8 quantization, our implementation achieves accuracy comparable to floating-point models on the GLUE benchmark tasks. These findings show that integer-only transformer inference on FPGAs is a feasible way to deploy complex language models on resource-constrained edge devices, opening the door to privacy-conscious, low-latency AI applications.
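
For reference, the following sketch illustrates the kind of arithmetic INT8 quantization implies for the linear layers: INT8 operands accumulate into an INT32 register, and the requantization scale is folded offline into a fixed-point multiplier and shift, so no floating-point operations remain at inference time. The scales (kActScale, kWScale, kOutScale) and the shift amount are assumed for illustration and do not come from the thesis.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical tensor scales (assumptions, not values from the thesis).
constexpr float kActScale = 0.02f;   // INT8 activation scale
constexpr float kWScale   = 0.01f;   // INT8 weight scale
constexpr float kOutScale = 0.04f;   // INT8 output scale

int8_t clamp_i8(int32_t v) {
    return static_cast<int8_t>(std::min<int32_t>(127, std::max<int32_t>(-128, v)));
}

int main() {
    // Toy dot product: every operation on the inference path is integer.
    std::vector<int8_t> act = {100, -30, 90, 70};
    std::vector<int8_t> wgt = {40, 120, -80, 30};

    int32_t acc = 0;                                     // INT32 accumulator
    for (size_t i = 0; i < act.size(); ++i)
        acc += static_cast<int32_t>(act[i]) * wgt[i];

    // Requantize: the combined scale kActScale*kWScale/kOutScale is folded
    // offline into a fixed-point multiplier and shift, so no floats at run time.
    const float   real_mult = kActScale * kWScale / kOutScale;              // offline only
    const int32_t shift     = 15;
    const int32_t fp_mult   = static_cast<int32_t>(std::lround(real_mult * (1 << shift)));

    int64_t scaled = static_cast<int64_t>(acc) * fp_mult;
    int8_t  q_out  = clamp_i8(static_cast<int32_t>(scaled >> shift));

    std::cout << "acc=" << acc << "  q_out=" << static_cast<int>(q_out)
              << "  (~" << q_out * kOutScale << " in real units)\n";
    return 0;
}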
Keywords
Large Language Models, Artificial Intelligence, FPGA, Inference Acceleration