Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs

Mellat, Alireza

dc.contributor.author	Mellat, Alireza
dc.date.accessioned	2022-01-27 18:13:07 (GMT)
dc.date.available	2022-01-27 18:13:07 (GMT)
dc.date.issued	2022-01-27
dc.date.submitted	2022-01-24
dc.identifier.uri	http://hdl.handle.net/10012/17993
dc.description.abstract	Coarse-grained reconfigurable array (CGRA) overlays can improve dataflow kernel throughput by an order of magnitude over Vivado HLS on the Xilinx Alveo U280. This is possible through a combination of a carefully floorplanned, high-frequency design (645-768 MHz Torus, 788-856 MHz Mesh, 583-746 MHz BFT) and a scalable, communication-aware compiler. Our CGRA architecture provides configurable Processing Element (PE) functionality, backed by a configurable number of communication channels to match application demands. Compared to recent FPGA overlays such as the 4×4 ADRES and HyCUBE implementations in CGRA-ME, our design runs at up to 3.4× higher clock frequency while scaling to an orders-of-magnitude larger array size of 19×69 on the Xilinx Alveo U280. We propose a novel topology-agnostic placer that formulates CGRA placement as an integer linear programming (ILP) problem. The ILP placer optimizes placement regardless of topology, and even for non-linear objective functions, by taking pre-computed placement costs as inputs to the ILP formulation. It reduces quadratic placement wirelength by up to 37% compared to the commonly used simulated annealing approach, but increases runtime from less than a minute to hours. Our communication-aware compiler targets HLS objectives such as initiation interval (II) and minimizes communication cost using an ILP formulation. Unlike SDC schedulers in FPGA HLS tools, we treat data movement as a first-class citizen by encoding the space and time resources of the communication network in the ILP formulation. Given the same constraints on operational resources as Vivado HLS, we can retain our target II and achieve up to 9.2× higher frequency. We compare Torus and Mesh topologies and show that Mesh has lower latency per area than Torus on the same benchmarks.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	CGRA	en
dc.subject	FPGA	en
dc.subject	Compiler	en
dc.subject	RTL	en
dc.subject	Overlay	en
dc.title	Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs	en
dc.type	Master Thesis	en
dc.pending	false
uws-etd.degree.department	Electrical and Computer Engineering	en
uws-etd.degree.discipline	Electrical and Computer Engineering	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Master of Applied Science	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Kapre, Nachiket
uws.contributor.affiliation1	Faculty of Engineering	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: Mellat_Alireza.pdf
Size: 1.644Mb
Format: PDF
