TDDD16 Compilers and Interpreters (opt.) TDDB44 Compiler Construction

cisc



# **Code Generation** for RISC and Instruction-Level Parallel **Processors**

**RISC/ILP Processor Architecture Issues** Instruction Scheduling **Register Allocation Phase Ordering Problems** Integrated Code Generation Peter Fritzson, Christoph Kessler IDA, Linköpings universitet, 2008.



**CISC vs. RISC** Instruction-Level Parallel (ILP) architectures Single-Issue: (can start at most one instruction per clock cycle) RISC Complex Instruction Set Computer Reduced Instruction Set Computer Simple, pipelined RISC processors Memory operands for arithmetic and Arithmetic/logical operations only on with one or multiple functional units logical operations possible registers • e.g. ARM9E, DLX add r1, r2, r1 load (r1), r4 load r3+disp, r5 mul r4, r5 ■ M(r1+r2) ← M(r1+r2) \* M(r3+disp) Multiple-Issue: (can start several instructions per clock cycle) store r5, (r1) Superscalar processors Many instructions Few, simple instructions e.g. Sun SPARC, MIPS R10K, Alpha 21264, IBM Power2, Pentium Complex instructions Many registers, all general-purpose typically 32 ... 256 registers Few registers, not symmetric VLIW processors (Very Long Instruction Word) Variable instruction size Fixed instruction size and format e.g. Multiflow Trace, Cydrome Cydra-5, Intel i860, Instruction decoding (often done in microcode) takes much silicon overhead Instruction decoding hardwired HP Lx. Transmeta Crusoe: most DSPs, e.g. Philips Trimedia TM32, TI 'C6x EPIC processors (Explicitly Parallel Instruction Computing) Example: 80x86, 680x0 Example: POWER, HP-PA RISC, MIPS, ARM, SPARC • e.g. Intel Itanium family (IA-64)





| Processor with Super-Pipelining                                                                                                                                           |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A new instruction can begin before the previous one is finished.                                                                                                          |
| Thus you manage on average 3 instr/cycle when the pipeline is full.                                                                                                       |
| Processor cycle no. 1 2 3 4 5 6 7 8 9 10 11<br>Instruction 1 starts $R_1 D_1 E_1 S_1 $ Instruction 1 ready<br>$R_2 D_2 E_2 S_2$<br>$R_3 D_3 E_3 S_3$<br>$R_4 D_4 E_4 S_4$ |
| R <sub>5</sub> D <sub>5</sub> E <sub>5</sub> S <sub>5</sub> R= Instr. retrieval         R <sub>6</sub> D <sub>6</sub> E <sub>6</sub> S <sub>6</sub>                       |
| D= Instr. decoding                                                                                                                                                        |
| E= Execution                                                                                                                                                              |
| S= Store result                                                                                                                                                           |
|                                                                                                                                                                           |



























































## **Global Register Allocation**

- Register Allocation: Determines values (variables, temporaries, constants) to be kept when in registers
- Register Assignment: Determine in which physical register such a value should reside.
- Essential for Load-Store Architectures
- Reduce memory traffic (→ memory / cache latency, energy)
- Limited resource
- Values that are alive simultaneously cannot be kept in the same register
- Strong interdependence with instruction scheduling
  - scheduling determines live ranges
  - spill code needs to be scheduled
- Local register allocation (for a single basic block) can be done in linear time (see previous lecture)
- Global register allocation on whole procedure body (with minimal spill code) is NP-complete. Can be modeled as a graph coloring problem [Ershov'62] [Cocke'71].
  - C. Kessler IDA, Linköpings universitet. 37 TDDB44: Code Generation for RISC and ILP Proce

## When do Register Allocation

- Register allocation is normally performed at the end of global optimization, when the final structure of the code and all potential use of registers is known.
- It is performed on abstract machine code where you have access to an unlimited number of registers or some other intermediary form of program.
- The code is divided into sequential blocks (basic blocks) with accompanying control flow graph.

## **Live Range**

1

X

- (Here, variable = program variable or temporary)
- A variable is being defined at a program point if it is written (given a value) there.
- A variable is **used** at a program point if it is read (referenced in an expression) there.
- A variable is **live** at a point if it is referenced there or at some following point that has not (may not have) been preceded by any definition.
- A variable is reaching a point if an (arbitrary) definition of it, or usage (because a variable can be used before it is defined) reaches the point.
- A variable's **live range** is the area of code (set of instructions) where the variable is both alive and reaching.
  - does not need to be consecutive in program text.



# Register Allocation vs Graph Coloring Register allocation can be compared with the classic coloring problem. That is, to find a way of coloring - with a maximum of *k* colors - the interference graph which does not assign the same color to two adjacent nodes. *k* = the number of registers. On a RISC-machine there are, for example, 16 or 32 general registers. Certain methods use some registers for other tasks. e.g., for spill code. Determining whether a graph is colorable using *k* colors is NP-complete for k>3 In other words, it is unmanageable always to find an optimal solution.









# **Register Allocation for Loops (1)**



### Interference graphs have some weaknesses:

- Imprecise information on how and when live ranges interfere.
- No special consideration is taken of loop variables' live ranges (except when calculating priority).
- Instead, in a cyclic interval graph:
  - The time relationships between the live ranges are explicit.
  - Live ranges are represented for a variable whose live range crosses iteration limits by cyclic intervals.
- Notation for cyclic live intervals for loops:
  - Intervals for loop variables which do not cross the iteration limit are included precisely once.
  - Intervals which cross the iteration limit are represented as an interval pair, cyclic interval: ([0, t), [t, t<sub>end</sub>])







## X Live Range Coalescing/Combining (Reduces Register Needs) For a copy instruction $s_i \leftarrow s_i$ • where si and sj do not interfere • and si and sj are not rewritten after the copy operation Merge si and sj: • patch (rename) all occurrences of si to sj • update the register interference graph and remove the copy operation. s2 ← ... s3 ← ... s3 ← s2 \$3 ( \$3 ... s3 ... ... s3 ...









