NengoFPGA: an FPGA Backend for the Nengo Neural Simulator
Loading...
Date
2019-08-22
Authors
Morcos, Benjamin
Advisor
Kapre, Nachiket
Journal Title
Journal ISSN
Volume Title
Publisher
University of Waterloo
Abstract
Low-power, high-speed neural networks are critical for providing deployable embedded AI
applications at the edge. We describe a Xilinx FPGA implementation of Neural Engineering
Framework (NEF) networks with online learning that outperforms mobile Nvidia GPU
implementations by an order of magnitude or more. Specifically, we provide an embedded
Python-capable PYNQ FPGA implementation supported with a Xilinx Vivado High-Level
Synthesis (HLS) workflow that allows sub-millisecond implementation of adaptive neural
networks with low-latency, direct I/O access to the physical world. The outcome of this
work is NengoFPGA, a seamless and user-friendly extension to the neural compiler Python
package Nengo. To reduce memory requirements and improve performance we tune the
precision of the different intermediate variables in the code to achieve competitive absolute
accuracy against slower and larger floating-point reference designs. The online learning
component of the neural network exploits immediate feedback to adjust the network weights
to best support a given arithmetic precision. As the space of possible design configurations
of such quantized networks is vast and is subject to a target accuracy constraint, we use
the Hyperopt hyper-parameter tuning tool instead of manual search to find Pareto optimal
designs. Specifically, we are able to generate the optimized designs in under 500 short
iterations of Vivado HLS C synthesis before running the complete Vivado place-and-route
phase on that subset, a much longer process not conducive to rapid exploration. For neural
network populations of 64–4096 neurons and 1–8 representational dimensions our optimized
FPGA implementation generated by Hyperopt has a speedup of 10–484× over a competing
cuBLAS implementation on the Jetson TX1 GPU while using 2.4–9.5× less power. Our
speedups are a result of HLS-specific reformulation (15× improvement), precision adaptation
(3× improvement), and low-latency direct I/O access (1000× improvement).
Description
Keywords
neural networks, FPGA, nengo, high-level synthesis