Learning Design Parameters to Build Application Customizable Network-on-Chips for FPGAs

Date

2025-01-20

Advisor

Kapre, Nachiket

Publisher

University of Waterloo

Abstract

We can exploit the configurability of Field Programmable Gate Arrays (FPGAs) and maximize the performance of communication-intensive FPGA applications by designing specifically customized Network-on-Chips (NoCs) using Machine Learning (ML). As transistor density growth stalls, NoCs play an increasingly critical role in the deployment of FPGA applications for modern-day use cases. Unlike Application-Specific Integrated Circuits (ASICs), FPGAs are configurable, which allows the design of application-aware NoCs that can outperform statically configured NoCs in both performance and efficiency. The conventional NoC design process is typically centered on universally sound, one-size-fits-all design decisions and does not take the underlying application into account. In contrast, we present application-aware designs that learn their NoC parameters by casting the NoC design space as a function of application performance using ML algorithms. The complex and non-obvious relationships between the large search space of NoC parameters and the performance of the underlying FPGA application necessitate a more efficient approach than manual hand-tuning or brute-force search. Modern ML algorithms have demonstrated a remarkable ability to generalize to complex representations of the world by extracting high-order, non-linear features from complex inputs. In this thesis, we identify 1) NoC topology, 2) flow control, and 3) regulation rate as the key NoC design variables that have the strongest influence on application performance, and we leverage two primary ML methodologies: 1) stochastic gradient-free evolutionary learning and 2) gradient-based supervised learning. First, we present NoC designs based on the Butterfly Fat Tree (BFT) topology and lightweight flow control. These BFT-based NoCs can customize their bisection bandwidth to match the application being routed while providing features such as in-order delivery and bounded packet delivery times.
We present the design of routers with 1) latency-insensitive interfaces, coupled with 2) a deterministic routing policy and 3) round-robin scheduling at NoC ports. Evaluated under various conditions, our NoC designs deliver up to 3x lower latency and 6x higher throughput. We also learn the routing policy on a per-switch basis in an application-aware manner using Maximum Likelihood Estimation (MLE), decreasing latencies by a further ~1.1--1.7x over the static policy. Second, we overcome the pessimism in the routing analysis of timing-predictable NoCs through the use of "hybrid" application-customized NoCs. HopliteBuf NoCs leverage stall-free FIFOs as a measure of flow control under token bucket regulation. In the worst case, the static analysis can yield very large FIFO size and latency bounds. Alternatively, HopliteBP uses lightweight backpressure as flow control under similar injection regulation. However, it suffers from severely pessimistic static analysis due to the propagation of backpressure to other switches. We show that a hybrid FPGA NoC that seamlessly composes both design styles on a per-switch basis delivers the best of both worlds. We learn the switch configuration, specifically for the application being routed, through a novel evolutionary algorithm based on MLE. We demonstrate ~1--6.8x lower routing latencies and ~2--3x improvements in feasibility, while consuming only ~1--1.5x more FPGA resources. Third, we further improve the routability of a workload on the hybrid Buf-BP Hoplite NoC by learning to tune regulation rates for each traffic trace. We model the regulation space as a multivariate Gaussian distribution. We capture critical dependencies between the parameters of the multivariate distribution using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES).
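As a rough illustration of the injection regulation mentioned above, the following is a minimal sketch of a generic token bucket regulator. It is not taken from the thesis: the class name `TokenBucket`, the helper `count_injections`, and the parameters `rate` (tokens added per cycle) and `burst` (bucket capacity) are assumptions chosen for illustration. Under this model, a client that always has traffic to send can inject at most `burst + rate * cycles` packets over `cycles` cycles.

```python
class TokenBucket:
    """Generic token bucket regulator (illustrative; not the thesis design)."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per cycle (assumed parameter)
        self.burst = burst    # maximum tokens the bucket can hold (assumed)
        self.tokens = burst   # the bucket starts full

    def try_inject(self):
        """Consume one token and return True if a packet may be injected."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def tick(self):
        """Advance one cycle: refill the bucket, capped at the burst size."""
        self.tokens = min(self.burst, self.tokens + self.rate)


def count_injections(rate, burst, cycles):
    """Count packets injected by a client that tries to send every cycle."""
    bucket = TokenBucket(rate, burst)
    sent = 0
    for _ in range(cycles):
        if bucket.try_inject():
            sent += 1
        bucket.tick()
    return sent
```

Lowering `rate` or `burst` throttles a client more aggressively, which is the kind of per-trace knob that the regulation-rate tuning described in the abstract would adjust.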
We also propose nested learning, which learns switch configurations and regulation rates in tandem, further lowering cost-constrained latency by ~1.5x and increasing injection rates by ~3.1x. Finally, we propose a Graph Neural Network (GNN) based framework to accurately predict NoC performance with sub-second latency for a variety of FPGA NoC designs and applications. Application-aware NoC design can include thousands of incremental updates to the NoC design space, with each step requiring a performance evaluation of the NoC configuration using slow and expensive conventional tooling. This presents a bottleneck in the adoption of application-aware FPGA NoC design. Instead of spending up to tens of minutes of wall-clock time simulating the NoC design at each step, we present a GNN-based framework to encode any FPGA NoC and any FPGA application into graphs. We create a dataset of over 1.5 million samples to train GNNs to predict NoC routing latencies. GNNs accelerate benchmarking run-times by up to ~148x (~506x on GPU) while preserving accuracies as high as 97.2%. Through this work, we observe that application-aware NoCs designed using ML algorithms such as MLE and CMA-ES can decrease routing latency by ~2.5--10.2x, increase workload feasibility by ~2--3x, and increase injection rates by up to ~3.1x. By leveraging GNNs trained using supervised learning, we can accelerate the design time of such NoCs by up to ~4.3x.
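To make the idea of learning a per-switch configuration with an MLE-based evolutionary search more concrete, here is a minimal sketch under stated assumptions. It is not the algorithm from the thesis: it is a generic estimation-of-distribution loop in which each switch has an independent probability of being configured as Buf rather than BP, and that probability is re-estimated each generation as the maximum likelihood estimate (the sample mean) over the best-performing samples. The cost function `toy_latency` and its latency tables are hypothetical stand-ins for a real NoC simulator or static analysis.

```python
import random

# Hypothetical per-switch latency cost for each design style (illustrative only).
LATENCY_BUF = [3, 1, 4, 1, 5, 9, 2, 6]
LATENCY_BP = [2, 7, 1, 8, 2, 8, 1, 8]


def toy_latency(config):
    """Stand-in for a NoC evaluation: config[i] is 1 for Buf, 0 for BP."""
    return sum(LATENCY_BUF[i] if bit else LATENCY_BP[i]
               for i, bit in enumerate(config))


def learn_switch_config(n_switches, cost, generations=30, population=40,
                        elite=10, seed=0):
    """Search per-switch Buf/BP choices by refitting Bernoulli parameters."""
    rng = random.Random(seed)
    prob_buf = [0.5] * n_switches  # P(switch i is Buf), initially unbiased
    best, best_cost = None, float("inf")
    for _ in range(generations):
        # Sample a population of candidate configurations.
        samples = [[1 if rng.random() < p else 0 for p in prob_buf]
                   for _ in range(population)]
        samples.sort(key=cost)
        if cost(samples[0]) < best_cost:
            best, best_cost = samples[0], cost(samples[0])
        elites = samples[:elite]
        # The MLE of a Bernoulli parameter is the sample mean over the elites.
        for i in range(n_switches):
            mle = sum(s[i] for s in elites) / elite
            # Clamp so that no choice becomes impossible to sample.
            prob_buf[i] = min(0.95, max(0.05, mle))
    return best, best_cost
```

Calling `learn_switch_config(8, toy_latency)` returns the best configuration found and its cost. In a real flow the cost function would be replaced by a latency or feasibility evaluation of the routed application, and the independence assumption between switches is a simplification of this sketch.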

Keywords

machine learning, artificial intelligence, FPGA, network-on-chips
