Reconfigurable devices, comprising AIE CGRAs, FPGA fabric, HBM stacks, a system-wide NoC, and an ARM processing subsystem, offer diverse design options due to their heterogeneous nature. Design tools such as Vitis use a push-button approach, where application RTL is generated via HLS and then undergoes synthesis, placement, and routing in Vivado. This method often yields sub-optimal power, performance, and area (PPA) because high-level design semantics, such as processing element structure, composition, memory hierarchy, and interconnect, are lost during implementation. The challenge, therefore, lies in designing accelerators that fully use FPGA resources and leverage design and device characteristics, while still preserving designer productivity.
This presentation will focus on a few projects in our team (the AMD FPGA architecture group) that tackle problems in diverse domains by exploiting the semantics of the problem being mapped and the specifics of the architecture to which it is mapped.
Existing SpMV accelerators on HBM-enabled FPGA platforms do not scale well, leaving much of the available HBM bandwidth unused, and physically unaware system design prevents them from achieving a high frequency of operation. To address these issues, we propose a modular and lean SpMV dataflow accelerator and implement it on FPGA fabric using our "Atoms" methodology. This is the first work that can use the entire bandwidth of commercial HBM-enabled FPGAs while surpassing all reported frequencies. The "Atoms" methodology exploits design semantics to generate efficient floorplans: we decompose the design into smaller building blocks communicating over latency-insensitive elastic channels, and, to navigate the heterogeneous canvas that modern FPGAs present, we add as many elastic buffers as needed so that communication never becomes the frequency limiter. We expect this pattern to apply to a wide variety of other domains as well.
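To make the elastic-channel idea concrete, here is a minimal behavioral sketch in Python (our own toy model, not the Atoms tooling itself): a latency-insensitive channel is modeled as a bounded ready/valid FIFO, and adding more buffer stages changes only the latency, never the data stream. This is precisely the property that lets a floorplanner insert pipeline stages on long routes without re-verifying the design's function.

```python
# Toy model of a latency-insensitive elastic channel (hypothetical, for
# illustration only). Extra buffer stages affect timing, not the stream.
from collections import deque

class ElasticChannel:
    """A ready/valid channel with a configurable number of buffer stages."""
    def __init__(self, stages: int):
        self.buf = deque(maxlen=stages)   # capacity = number of elastic buffers

    def can_push(self) -> bool:           # "ready" seen by the producer
        return len(self.buf) < self.buf.maxlen

    def push(self, token):
        assert self.can_push()
        self.buf.append(token)

    def can_pop(self) -> bool:            # "valid" seen by the consumer
        return len(self.buf) > 0

    def pop(self):
        assert self.can_pop()
        return self.buf.popleft()

def run(stages: int, data):
    """Producer -> channel -> consumer; returns what the consumer saw."""
    chan, out, pending = ElasticChannel(stages), [], list(data)
    while pending or chan.can_pop():
        if chan.can_pop():                # consumer fires whenever data is valid
            out.append(chan.pop())
        if pending and chan.can_push():   # producer fires whenever channel is ready
            chan.push(pending.pop(0))
    return out

data = list(range(8))
# Same stream regardless of how many buffer stages the floorplan required.
assert run(1, data) == run(5, data) == data
```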
The second project focuses on streaming neural networks, specifically the FINN framework. FINN takes a high-level description of a network, generates RTL, and then carries out FPGA implementation. Depending on the resource budget, FINN can generate a range of designs with varying throughput and resource requirements. One of the key building blocks in a FINN-generated network is the streaming Matrix-Vector multiplication Unit (MVU). We propose to design the MVU in a structured way so that we can extract maximum performance from the device resources. DSP blocks can achieve close to 1 GHz on the latest Versal FPGAs, and we aim to generate MVU units that also operate close to this limit. We plan to create an overlay MVU (with high fmax) that is instruction-programmable yet does not exhibit the overheads associated with typical overlays. All the blocks in our overlay MVU are highly customized for FINN: the DSP-based dot-engine ALU, the register files for activations and weights, and the instruction memory. Our approach relies on elastic communication between building blocks so that we can insert pipeline stages even after blocks are placed on the FPGA fabric.
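The sketch below (our own reference model; the names are not FINN's actual API) shows the streaming MVU dataflow behaviorally: weights are held in a local register file, activation vectors stream in, and a dot-engine folds each row across a fixed number of PE lanes, the way a DSP cascade accumulates partial products over several cycles. The folding factor trades cycles for DSP count.

```python
# Behavioral sketch of a folded streaming matrix-vector unit (MVU).
# Hypothetical model for illustration, not FINN's implementation.
import numpy as np

def streaming_mvu(weights: np.ndarray, activations, pe: int):
    """Yield one output vector per streamed activation vector.

    weights:     (rows, cols) matrix held on-chip in the weight register file
    activations: iterable of length-`cols` vectors streamed in
    pe:          number of parallel dot-engine lanes (folding factor)
    """
    rows, cols = weights.shape
    assert cols % pe == 0, "columns must fold evenly across PE lanes"
    for vec in activations:
        acc = np.zeros(rows)
        # Each "cycle" consumes `pe` elements per row, mimicking a DSP
        # dot-engine that accumulates partial products over cols/pe cycles.
        for c in range(0, cols, pe):
            acc += weights[:, c:c + pe] @ vec[c:c + pe]
        yield acc

W = np.random.default_rng(0).integers(0, 4, size=(4, 8)).astype(float)
xs = [np.ones(8), np.arange(8.0)]
for y, x in zip(streaming_mvu(W, xs, pe=4), xs):
    assert np.allclose(y, W @ x)   # matches a plain matrix-vector product
```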
The third project concerns packet processing using an FPGA networking overlay, also referred to as the Packet Processing Engine (PPE). The PPE is instruction-programmable, and AMD's compiler can compile networking workloads (expressed in eBPF) onto it. We are currently exploring ways to customize the overlay once a networking workload has been compiled on top of it. Our hope is to generate workload-specific PPEs mapped onto FPGA fabric so that we do not have to pay the "overlay tax".
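As a toy illustration of post-compilation overlay specialization (the opcode and unit names below are hypothetical, not the actual PPE ISA): once a workload has been compiled, any functional unit whose opcodes never appear in the program can be pruned from the generated overlay, reclaiming fabric resources that a general-purpose overlay would otherwise waste.

```python
# Toy sketch: specialize an overlay to one compiled program by dropping
# unused functional units. All opcode/unit names here are hypothetical.
FULL_OVERLAY = {
    "alu":    {"add", "sub", "and", "or", "xor"},
    "shift":  {"lsl", "lsr"},
    "hash":   {"crc32"},
    "branch": {"beq", "bne", "jmp"},
}

def specialize(program: list[str]) -> dict[str, set[str]]:
    """Keep only the functional units (and opcodes) the program touches."""
    used = {insn.split()[0] for insn in program}
    return {
        unit: ops & used
        for unit, ops in FULL_OVERLAY.items()
        if ops & used                      # drop the whole unit if unused
    }

# e.g. a parser loop compiled from eBPF that never hashes or shifts:
prog = ["add r1 r2", "and r1 0xff", "bne r1 done", "jmp loop"]
print(specialize(prog))
# e.g. -> {'alu': {'add', 'and'}, 'branch': {'bne', 'jmp'}}  (no shift/hash)
```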
Finally, we present how some aspects of the problem faced by verification customers are amenable to structured implementation, and demonstrate how the ideas discussed so far help us significantly improve performance and lower resource usage for those workloads.
Over the long term, we expect to develop a set of domain-specific, optimized implementation flows that exploit a handful of basic concepts. Since the entire flow (including implementation) is aware of how the physical architecture of the FPGA interacts with the specific design's needs, we expect it to yield higher-performance implementations than simply handing off RTL produced by some general-purpose HLS engine.
Dinesh Gaitonde received his Bachelor's and Master's degrees from IIT Bombay and his PhD in Electrical Engineering from CMU. He is currently a Senior Fellow at AMD (Xilinx), focusing on FPGA architectures, applications, and implementation algorithms. Prior to AMD, he worked at Motorola and Synopsys as an EDA researcher. His interests include FPGAs and other reconfigurable fabrics, high-performance computing on reconfigurable fabrics, and EDA for FPGAs.
Abhishek Jain received his PhD in computer engineering from Nanyang Technological University, Singapore, in 2016, after which he was a postdoc at Lawrence Livermore National Laboratory. Since 2018, he has been an architect with Xilinx USA. His research interests include computer architecture, FPGAs, high-performance accelerators, and domain-specific FPGA overlays.
FPGA Device and Floorplan-aware Accelerator Implementation via Domain-specific Tooling
Computer Architecture Seminar
Location: EER 3.646
Speakers: Abhishek Kumar Jain & Dinesh Gaitonde