# High-level programming of data-centric reconfigurable dataflow systems

Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft Honorary Visiting Professor at the Department of Computing, Imperial College London

#### MAXELER Technologies

MAXIMUM PERFORMANCE COMPUTING

*HLPP 2019*, Linköping University, Linköping, Sweden, 4 July 2019

## Think Ångströms not nanometers

We should steer the movements of almost each individual electron to solve our specific problem



## The Data Movement Challenge



| Processor Technology         | 40 nm  | 10nm   |  |
|------------------------------|--------|--------|--|
| Vdd (nominal)                | 0.9 V  | 0.7 V  |  |
| DFMA energy                  | 50 pJ  | 7.6 pJ |  |
| 64b 8 KB SRAM Rd             | 14 pJ  | 2.1 pJ |  |
| Wire energy (256 bits, 10mm) | 310 pJ | 174 pJ |  |



(Courtesy: NVIDIA and ITRS)

Moving data off-chip will cost ~200x more energy and is also much slower

Wires that carry the data (and instructions, if any) at all levels should be considered seriously





#### **Build Computers for your Problem and Data**



Computing in Time: Follow a recipe step by step one at the time Computing in Space: Build a "recipe specific" factory with multiple paths performed simultaneously

*Efficient, predictable, reliable "mass production" of huge data amounts* 

At each clock tick all data in processing move one stage ahead -> massive throughput





#### The Combined Control/DataFlow System



\* System 1 and System 2 are based on D Kahneman, "Thinking Fast Thinking Slow", Nobel Prize in Economics, 2002



Create customized mega accelerators with massive inherent throughputs



1. Describe Conjugate Gradient as dataflow graph

2. Compile dataflow structure and load to hardware



**Custom Accelerator** 



#### **From Equations to Dataflow Hardware**



Technologie AVIALIM PERFORMANCE COMPLITIN



## **Maxeler's DataFlow Engines (DFEs)**





## **Application Level Components**





## The Computational Model (DFE sub-system)

- Spatial arithmetic chip "hardware" substrate with
  - flexible arithmetic units and
  - programmable interconnect
  - (looks like FPGAs but is not limited to)
- Programmable Static Dataflow in 2D Hardware
- Systolic Execution at Kernel level
- Streaming Custom Computing at system level
- GALS\* kernel-to-kernel comm.

\* GALS – Globally Asynchronous Locally Synchronous





11

## The Computational Model (SW suite)

- Complete compilation toolchain (MaxCompiler)
- Dedicated design methodology
- Integrated simulation and debug environment
- Fully integrated in Linux (CentOS)
  - runtime system (MaxelerOS and SLiC)
  - low level software support

Help designers focus on the data/algorithm and the system architecture for rapid development

[1] Nils Voss, Tobias Becker, Oskar Mencer, Georgi Gaydadjiev: Rapid Development of Gzip with MaxJ. ARC 2017: 60-71



Make everything as simple as possible, but not simpler.

— Albert Einstein —

AZQUOTES



## The Computational Model (Memory types)

- Only three explicit basic memory types
  - Scalars (always exposed to the CPU)
  - Fast Memory (FMEM): small and fast (on-chip)
  - Large Memory (LMEM): large and slow (off-chip)
- There are, however, always exceptions, an example
  - VU9P on MAX5 has BRAM and URAM on chip
  - both constitute FMEM FPGA slices in the package
  - correct placement is a key
  - dedicated algorithms required [2]



Make everything as simple as possible, but not simpler.

— Albert Einstein —

AZQUOTES

[2] Nils Voss, Pablo Quintana, Oskar Mencer, Wayne Luk, Georgi Gaydadjiev: Memory Mapping for Multi-die FPGAs. FCCM 2019: pp78-86





### Multiple platforms, single DFE abstraction

## Application and MaxJ





#### **Optimizations at all levels**



and more, e.g., trade/hide Communication (Time) for/behind Computation (Space)



- Example:  $x^2 + 30$
- where x is a long 1D vector







### MaxJ Basics: What is a DFEVar?

- Connection between operators in the dataflow graph
- An edge of the dataflow graph
- Stream of data elements of a certain type and size
  - (long vector)
- Physically it is a set of wires in the hardware
- It looks like a variable in MaxJ code
- BUT IS NOT A VARIABLE! (in traditional CS sense)



## MaxJ Basics: Java meta-programming

- You can use the full power of Java to write a program that generates the dataflow graph
- Java variables can be used as constants in hardware
  - int y; DFEVar x; x = x + y;
- Hardware variables can not be read in Java!
  - Cannot do: int y; DFEVar x; y = x;
- Java conditionals and loops choose how to generate hardware → not make run-time decisions
- Once you execute your Java program the generated graph is what exist in your application (not the Java)
- We do not execute Java on the DFE!



#### MaxJ: Moving Average of three numbers

#### Dataflow computing in hardware using a language you know





#### What about branches



## MaxJ Basics: Java meta-programming

- You describe a dataflow graph generation using Java syntax
- Objects in the MaxCompiler API are used to generate hardware or configure the hardware/the build process
- Java API is crafted to ease the generation of massive dataflow graphs
  - Object Orientation possible and encouraged (e.g., using KernelLibs)
  - You can write generic code which optimises itself on the fly
  - You can write optimisation libraries, e.g., MaxPower
  - Many normal Java libraries can be used, e.g., JUnit





### MaxJ Intro: Scheduling

- The dataflow graph in a kernel is statically scheduled and will be executed simultaneously in a parallel fashion
- Operations have inherent latencies
  - If different data paths meet, they need to be balanced and delays (FIFOs) are inserted
  - The scheduler tries to minimise the costs of implementing those delays in terms of FMEM
  - You can add manual scheduling constraints with stream.offset()





#### MaxJ Intro: Scheduling

DFEVar x = io.input("x", type);
DFEVar y;

y = (x + x) \* x;

io.output("y", y, type);







#### MaxJ Intro: Scheduling

DFEVar x = io.input("x", type);
DFEVar y;

y = (x + x) \* stream.offset(x, 1);

io.output("y", y, type);







## MaxJ Intro: Working with loop counters

• How can we implement this in MaxJ?

```
for (int i = 0; i < N; i++) {
    q[i] = p[i] + i;</pre>
```

• How about this?

```
DFEVar p = io.input("p", dfeInt(30));
DFEVar i = io.input("i", dfeInt(30));
DFEVar q = p + i;
io.output("q", q, dfeInt(31));
```



**Yes...** But, now we need to create an array *i* in software and stream it to the DFE as well

25



## MaxJ Intro: Working with loop counters



- Could produce it directly on the DFE itself
- Some HW resources will be used to do so

```
DFEVar p = io.input("p", dfeInt(31));
DFEVar i = control.count.simpleCounter(32, N);
DFEVar q = p + i;
io.output("q", q, dfeInt(32));
Half as many inputs
Less data transfer
```

- Counters can be used to generate sequences of numbers
- Complex counters can have strides, wrap points, triggers:
  - e.g., *if* (*y*==10) *y*=0; *else if* (*en*==1) *y*=*y*+2;





## MaxJ Intro: Programming Components

- MaxCompiler Java-driven dataflow compiler
- **SLiC Interface** CPU integration
- MaxelerOS optimized DFE <-> CPU link
- Seamless simulation environment

| MaxCompiler - MovingAverageWeighted_MovingAverageWeightedKernel_original.pxg - MaxIDE                                                                                                                                                                                                                     |                                                                                                                                                           |  |  |  |  |  |  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|
| <u>F</u> ile <u>E</u> dit <u>N</u> avigate Se <u>a</u> rch <u>P</u> roject <u>R</u> un <u>W</u> indow <u>H</u> elp                                                                                                                                                                                        |                                                                                                                                                           |  |  |  |  |  |  |
| 📄 📬 🖷 🐘 📩 🚳 🚺 tutorial-chap03-example3 > Simulation                                                                                                                                                                                                                                                       | • *Þ ↔• ↔• 🖻 •                                                                                                                                            |  |  |  |  |  |  |
| 🕝 🕑 CpuStreamKernel.maxj 📝 MovingAverageWeightedKernel.maxj 🕄                                                                                                                                                                                                                                             | 7 MovingAverageWeighted_MovingAverageWeightedKernel_original.pxg 🛿 🧧 🗖                                                                                    |  |  |  |  |  |  |
| <pre>BE DFEVar prevWeighted = prev*weight0;<br/>DFEVar nextWeighted = next*weight2;<br/>DFEVar xWeighted = x*weight1;<br/>DFEVar divisor = withinBounds ? constant.v<br/>DFEVar sum = prevWeighted + xWeighted + ne<br/>DFEVar result = sum / divisor;<br/>io.output("y", result, dfeFloat(8, 24));</pre> | R R   Navigation   MovingAverageWeightedMan   MovingAverageWeightedKern   Selected Node Properties   NodeMul   24   a: hwFloat(8, 24)   b: hwFloat(8, 24) |  |  |  |  |  |  |
|                                                                                                                                                                                                                                                                                                           | MAXELER 🖉 🖷 🖏                                                                                                                                             |  |  |  |  |  |  |





#### **DFE Resource Usage Reporting**

- Allows you to see what lines of code are
- using what resources and focus optimization
- Separate reports for each kernel and for the manager



| LUTS   | FFs    | BRAMs   | DSPs    | : | MyKernel.java                               |                       |
|--------|--------|---------|---------|---|---------------------------------------------|-----------------------|
| 727    | 871    | 1.0     | 2       | : | resources used by this file                 | Different operations  |
| 0.24%  | 0.15%  | 0.09%   | 0.10%   | : | % of available                              | use different         |
| 71.41% | 61.82% | 100.00% | 100.00% | : | % of total used                             |                       |
| 94.29% | 97.21% | 100.00% | 100.00% | : | % of user resources                         | resources             |
|        |        |         |         | : |                                             |                       |
|        |        |         |         | : | public class MyKernel extends Kernel        | {                     |
|        |        |         |         | : | public MyKernel (KernelParameters p         | arameters) {          |
|        |        |         |         | : | super(parameters);                          |                       |
| 1      | 31     | 0.0     | 0       | : | DFEVar p = io.input("p", dfeFloa            | t(8,24));             |
| 2      | 9      | 0.0     | 0       | : | DFEVar q = io.input("q", dfeUInt            | .(8));                |
|        |        |         |         | : | DFEVar offset = io.scalarInput("            | offset", dfeUInt(8)); |
| 8      | 8      | 0.0     | 0       | : | DFEVar addr = offset + q;                   |                       |
| 18     | 40     | 1.0     | 0       | : | DFEVar v = mem.romMapped("table", addr,     |                       |
|        |        |         |         | : | dfeFloat(8,24), 256);                       |                       |
| 139    | 145    | 0.0     | 2       | : | p = p * p;                                  |                       |
| 401    | 541    | 0.0     | 0       | : | p = p + v;                                  |                       |
|        |        |         |         | : | <pre>io.output("r", p, dfeFloat(8,24)</pre> | );                    |
|        |        |         |         | : | }                                           |                       |
|        |        |         |         | : | }                                           |                       |
|        |        |         |         |   |                                             |                       |

28

LUT/FFs



#### **Optimization Feedback**

 MaxCompiler gives detailed latency and area annotation back to the programmer



• Evaluate precise effect of code on latency and chip area



## **FPGA HW vs Dataflow System Design**

- DFEs currently use FPGAs
  - Maxeler MAX4C, MAX4N, MAX5C
  - Xilinx Alveo
  - Amazon EC2 F1 instance
  - (but any suitable coarser-grain technology will do)
- HPC FPGA development often focus on kernels
  - e.g., accelerate Matrix Multiply, FFT, Convolution
- I/O and memory bottlenecks are often ignored
- Dataflow looks at the complete application
  - Customise Dataflow
  - Reduce Bandwidth requirements
  - Balance the entire system
  - Maximize Throughput



## VHDL vs MaxJ example (HMS CERN)

```
class Jet extends KernelLib{
   //... constructors etc.
   public static Jet add(Jet lJet, Jet rJet) {
     DFEVar energy = lJet.energy() + rJet.energy();
     DFEVar ecal = lJet.ecal() + rJet.ecal();
     DFEVar valid = lJet.valid() & rJet.valid();
     return new Jet(lJet.kernel(), energy, ecal,
     valid);
  }
}
```

```
1
```

- MaxJ code is just Software
- No need to keep track on exact bit sizes, fixed point positions, etc
- Other kernels benefit even more
  - Bitonic Sort is 500 VHDL lines versus 130 in MaxJ

```
ARCHITECTURE behavioral OF JetSum IS
  SIGNAL jetTmp : tJet := cEmptyJet;
  SIGNAL EnergyTmp , EcalTmp :
STD LOGIC VECTOR ( 21 DOWNTO 0 );
BEGIN
  PROCESS( clk )
  BEGIN
    IF ( RISING EDGE ( clk ) ) THEN
      IF( NOT jetIn1.DataValid ) THEN
        jetOut <= cEmptyJet;</pre>
      ELSE
        jetOut.Energy( 15 DOWNTO 0 ) <=
            jetIn1.Energy + jetIn2.Energy;
        jetOut.Ecal( 15 DOWNTO 0 ) <=
             jetIn1.Ecal + jetIn2.Ecal;
        jetOut.DataValid <= TRUE;</pre>
      END IF;
    END IF;
  END PROCESS;
END ARCHITECTURE behavioral;
```



#### **System Example: Decelerate to Accelerate**



But what about the required effort?

## Easy it is not (and not really new)

Slotnick's law (of effort):

"The parallel approach to computing does require that some *original thinking* be done **about numerical analysis and data management** in order to secure efficient use.

In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage."

Daniel Slotnick (1931-1985) Chief Architect of Illiac IV





Turn brain power into performance





#### **Non Traditional Design Process**



#### **Global Weather Simulation with DFEs in China**

An order of magnitude improvement over the Linpack-driven supercomputer technology

- L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, published at FPL 2013
- Joint research with Imperial College and Tsinghua University
- Simulating the atmosphere using the shallow water equation

| Platform       | Speedup | Energy<br>Efficiency |
|----------------|---------|----------------------|
| 6 Core CPU     | 1x      | 1x                   |
| Tianhe-1A Node | 23x     | 15x                  |
| Maxeler MPC-X  | 330x    | 145x                 |



- (a) The cubed-sphere mesh
- (b) The computational domain
- Fig. 1. Mesh and computational domain.





## **Open Research Questions**

- How to compare against general purpose machines?
- How to validate results as "good enough"?
- How to forecast the required "Dan Slotnick's Effort"?
- How to significantly decrease P&R times?
- How to create a much better silicon substrate?
- How to educate the *think* → *model* → *program* mindset?
- Tools, tools, tools, ...

(we need the help of Parallel Programming experts, you)

All of the above while dealing with the growing impact of Quantum Mechanical Effects on running software







#### **Over 150 Maxeler University Program Members**



Technologies

## **Maxeler Applications Gallery**

Dataflow Apps and Analytics for Machine Learning

http://appgallery.maxeler.com/



#### Dataflow Engine (DFE) Ecosystem

- With over 150 universities in our university program, we decided to create an app gallery to enable the community to share applications, examples, demos, ...
- The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014. see

http://cc.doc.ic.ac.uk/openspl16

- Top 10 APPS:
  - Correlation: in real-time, pairwise, on 6,000 streams
  - > 100% Guaranteed Packet Capture
  - > Webserver, cache and load balancing
  - ➢ HESTON Option pricer
  - ➤ N-body simulation
  - ➤ Regex matching (e.g. for Security)
  - Brain network simulation
  - Quantum Chromo-Dynamics kernel
  - ➤ Seismic Imaging
  - Realtime Classification

## Some links with more information

Maxeler Multiscale Dataflow Computing: <u>https://www.maxeler.com/technology/dataflow-computing/</u>

Computing in Space explained by Mike Flynn: http://www.openspl.org/what-is-openspl/

Computing in Space Course at Imperial College: <u>http://cc.doc.ic.ac.uk/openspl16/</u>

Exciting Applications for DFEs (and JDFEs): <a href="http://appgallery.maxeler.com">http://appgallery.maxeler.com</a>

Maxeler DFEs on AWS EC2 F1: <u>https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-</u> c66ab2ba202a

Maxeler and Xilinx Alveo collaboration: https://www.xilinx.com/products/boards-and-kits/alveo.html



georgi@maxeler.com







#### Questions



