Compiling Code Accelerators for FPGAs Walid A. Najjar University of California Riverside Computer Science & Engineering Riverside, California +1.951.827.4406 najjar@cs.ucr.edu ABSTRACT This tutorial addresses the challenges and opportunities presented by compiled FPGA-based code accelerators. In recent years we have witnessed a fast growth of both size and speed of FPGAs. These had been initially designed and marketed as convenient devices for "glue logic." Later, they became used as fast prototyping platforms. As their size and speed grew, they have been used for the short time to market they can afford. Lately, their size and speed have made them attractive as code accelerator. While the clock speed achievable on a typical FPGA design is about an order of magnitude lower than that on a typical CPU, their advantage comes from two sources: (1) Large degree of instruction and loop level parallelism. Parallel loops can typically be unrolled by factors ranging in the 100s. (2) Increased efficiency of hardware execution. The streaming of the data through a dedicated circuit eliminates a large number of support operations such as data fetch, address calculations, index management, loop control, etc. The combined higher efficiency and parallelism of hardware execution on FPGAs has been shown to result in speedups ranging from the 10s to the 1,000s over traditional processor on frequently executed code segments. However, the main obstacle to wider acceptance of this technology is programmability. FPGAs are typically programmed using Hardware Description Languages (HDLs), which poses two problems: Traditional application developers are typically not HDL designers, and HDLs are not well suited for algorithm implementation. Furthermore, the FPGA is an amorphous mass of logic on which the compiler must create a data-path and schedule the computation. Such a task requires the harnessing of technologies developed for parallelizing compilers as well as those developed for high-level synthesis. The main challenge that faces HLL to HDL translation is the paradigm shift from the stored program model to a value-based, data-driven execution - that is, from temporal to spatial execution. The task of an FPGA compiler is to generate both the data path and the sequence of operations (control flow) on that data path. The lack of architectural structure on the FPGA presents a number of opportunities for the compiler: (1) The available parallelism, instruction loop and thread, is very high and limited only by the size of the FPGA or the I/O bandwidth to the chip. (2) On-chip storage can be configured at will. (3) Circuit customization allows the compiler to reduce the circuit size as well as the clock duration. Optimizing compilers for traditional processors have benefited from several decades of extensive research that has led to extremely powerful tools. Similarly, electronic design automation (EDA) tools have also benefited from several decades of research and development, leading to powerful tools that can translate VHDL and Verilog code, and recently SystemC code, into efficient circuits. However, little work has been done to combine these two approaches. The Riverside Optimizing Compiler for Configurable Computing (ROCCC) is a C to VHDL compiler that targets the automatic generation of FPGA-based accelerators. ROCCC optimizes and parallelizes the most frequently executed loops for mapping as circuits on the FPGA. A host processor then manages the streaming of data through that circuit. The overall aim of ROCCC is to (1) bridge the performance gap between compiled and hand-written code and (2) apply extensive compile-time transformations on multi-dimensional arrays and non-trivial loop nests. Such transformations would be too complex for a human programmer to handle in a reasonable time. The objectives of the ROCCC optimizations are: (1) maximize the parallelism in the circuit as well as the clock rate at which it operates (2) minimize the number of off-chip memory accesses as well as the area of the circuit. This tutorial will address the issues of compiling a high-level language to generate FPGA-based code accelerators. It will take a look at the whole field with a special emphasis on the ROCCC compiler toolset. Tutorial outline: 1. FPGA code acceleration - An opportunity 2. Platform models - Why they matter 3. Compiling to FPGAs - The challenges 4. The ROCCC approach 5. The ROCCC toolset 6. Future outlook - Hardware, software and system support