DF00100 Advanced Compiler Construction TDDC86 Compiler Optimizations and Code Generation



## Instruction-Level Parallel Processor Architectures

## **Instruction Scheduling**

Local and Global Scheduling

Christoph Kessler, IDA, Linköpings universitet, 2014. DF00100 Advanced Compiler Construction TDDC86 Compiler Optimizations and Code Generation

## **RISC and Instruction-Level** Parallel Target Architectures

Christoph Kessler, IDA, Linköpings universitet, 2014



X

## **Pipelined RISC Architectures**

- A single instruction is issued per clock cycle
- Possibly several parallel functional units / resources
- Execution of different phases of subsequent instructions overlaps in time. This makes them prone to:
  - data hazards (may have to delay op until operands ready),
  - control hazards (may need to flush pipeline after wrongly predicted branch),
  - structural hazards (required resource(s) must not be occupied)
- Static scheduling (insert NOPs to avoid hazards)
- vs. Run-time treatment by automatic hazard detection + pipeline stalling

| L                                        |   | IF      |      | $I_1$ | 1 | IF <sub>1</sub> |        |          |         |        |
|------------------------------------------|---|---------|------|-------|---|-----------------|--------|----------|---------|--------|
| L                                        |   | ID      |      | $I_2$ | 2 | IF <sub>2</sub> | $ID_1$ |          |         |        |
| L                                        |   | EX      |      | $I_3$ | 3 | IF <sub>3</sub> | $ID_2$ | $EX_1$   |         |        |
| L                                        |   |         |      | $I_4$ | 4 | IF <sub>4</sub> | $ID_3$ | $EX_2$   | $MEM_1$ |        |
| L                                        |   | MEM/EX2 |      | $I_5$ | 5 | IF <sub>5</sub> | $ID_4$ | $EX_3$   | $MEM_2$ | $WB_1$ |
| L                                        | Ļ | WB      | time | $I_6$ | 6 | IF <sub>6</sub> | $ID_5$ | $EX_{4}$ | $MEM_3$ | $WB_2$ |
| C. Kessler, IDA, Linköpings universitet. |   |         |      |       |   |                 |        |          |         |        |



## Instruction Scheduling (1)



 Map instructions to time slots on issue units (and resources), such that no hazards occur
 → Global reservation table, resource usage map





## Superscalar processor

- Run-time scheduling by instruction dispatcher
  - convenient (sequential instruction stream as usual)
  - limited look-ahead buffer to analyze dependences, reorder instr.
  - high silicon overhead, high energy consumption



# Dual-Issue (w=2) • Example (1): Linear code "mul R1,...; add ...,R2" expands to mul R1,... add ...,R2 ... No data dependence • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ... • ...





## 2



## **Clustered VLIW processor**

- E.g., TI C62x, C64x DSP processors
- Register classes
- Parallel execution constrained by operand residence



## **EPIC architectures**

- Based on VLIW
- Compiler groups instructions to LIW's (bundles, fetch units)
- Compiler takes care of resource and latency constraints
- Compiler marks sequences of independent instructions as instruction groups by inserting delimiters (stop bits)
- Dynamic scheduler assigns resources and reloads new bundles as required



A 1 B 1 C 0 D 1 E 1 F 0 G 1 H 0











| MRIS: Space-optimal scheduling                                                                                                                                                 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| (a) based on postorder traversal of the DAG                                                                                                                                    |
| Special case: tree: space-opt. schedule in linear time       [Sethi, Ullman '70]         Special case: vector tree (node size attribute): space-opt. $O(n \log n)$ [Rauber'90] |
| Special case: series-par. DAG: space-opt. schedule in pol. time [ $\underset{W}{G}$ üttler '81]                                                                                |
| General DAG, contiguous schedules ( $\leq 2^n$ )• Random dfs [K., Paul, Rauber '91]• Enumeration with DC strategy [K., Rauber '93/'95]                                         |
|                                                                                                                                                                                |
| <ul><li>(c) based on finding instruction lineages in the DAG</li><li>heuristic method by [Govindarajan et al. '00]</li></ul>                                                   |
| → see separate lecture on MRIS<br>C.Kessler, IDA, Linköpings universitet. 23                                                                                                   |















## Greedy List Scheduling for VLIW (1)

A greedy heuristic for list scheduling fills in one step as many slots in a VLIW word as possible with ready instructions of the zeroindegree set.





## **Local Scheduling Heuristics**

#### List Scheduling Heuristics

- Deepest Level First (a.k.a. highest level first etc.)
  - Select, among ready instructions, one with longest accumulated latency on a path towards any dependence sink (root node)
  - Forward vs Backward scheduling

#### Critical Path Scheduling

- Detect a critical path (longest accumulated latency) in the DAG, schedule its nodes → partial schedule, and remove them from the DAG.
- Repeat until DAG is empty, splicing in new nodes between scheduled ones as appropriate, or inserting fresh time slots where needed



## **Scheduling Branch Instructions**

#### Delayed Branch

- Effect of conditional branch on program counter is delayed
- 1 or more instructions after a branch instruction are always executed, independent of the condition outcome
  - ► SPARC, HP PA-RISC: 1 delay slot
  - > TI 'C62x: 5 delay slots
- Scheduling: Fill delay slots with useful instruction if there is one, otherwise with NOP
- Heuristic for finding candidate instructions:
  - Instructions from same basic block that are not control dependent on the branch and that the condition is not data dependent of
  - 2. Instructions from most likely branch target basic block for speculative execution
  - See e.g. [Muchnick Ch. 17.1.1] for further details

# Remark: Hardware Support for Branch Delay Slot Filling

- Nullification feature (SPARC) for conditional branch delay slots in one branch direction
- activated by setting a bit in the instruction opcode
- delay slot instruction executed only if a conditional branch is taken treated as NOP if not taken.
- not applicable to "branch always": fixed delay slot





developed for VLIW architectures [Fisher'81] [Ellis'85]

idea: enlarge the scope of local scheduling to traces

trace = acyclic path of basic blocks in the CFG

track execution frequencies for BB's/traces (e.g., profiling)

- idea: make the most frequent trace fast:
  - + virtually merge BB's in the most frequent trace schedule trace as one BB, e.g. by greedy VLIW list scheduling
  - + insert compensation code in less frequent side traces for correctness
    - $\rightarrow$  accept slowdown for side traces
    - $\rightarrow$  program lenght may grow (worst case: exponentially)
  - + continue same procedure with next frequent trace





## Trace Scheduling (4)

- Insertion of compensation code
  - **Case:** When moving an instruction *i1* to a successor block of *B* in the trace *T*



## **Trace Scheduling (5)**

#### Summary of cases:

Code reordering with insertion of compensation code Hoisting an assignment Interchange assignment and label Moving assignments across conditional branches

- Moving a branch
- interchange branches



## **Region Scheduling**

#### [Gupta/Soffa'90]

Idea: avoid idle cycles caused by regions with insufficient parallelism

Program region = one or several BB's that require the same control condition

Repeatedly apply a set of local code transformations:

- loop unrolling
- moving instructions from BB's with excessive parallelism into BB's with insufficient parallelism
- merging of regions
- to balance the degree of parallelism

Heuristic measure for average degree of parallelism in a region: # instructions(region) / length of critical path(region)

## Program regions for global scheduling

- Trace (see above)
  - A path of basic blocks
- Superblock
  - A trace with the restriction that there may be no branches into any of its basic blocks, except the first one
- Treegions = Extended Basic Blocks
  - An out-tree of basic blocks no branch into any of its basic blocks, except the first one
- Hyperblock
  - A single-entry, multiple exit region with internal control flow.
     As superblocks, but allow hammocks resolved by predication.
- All these regions are acyclic (but may be part of a cycle around)
- Traces and superblocks are "linear regions", while treegions and hyperblocks are "nonlinear regions"

Summary + Outlook: Instruction Scheduling

• usually, optimize for time (other important metrics: space, energy)  $\rightarrow$  see also lecture on energy-aware code generation

#### local methods

postorder traversals, forward/backward list scheduling, optimal methods  $\rightarrow$  see also lecture on space-optimal scheduling (MRIS)

#### global methods

- trace scheduling, percolation scheduling, region scheduling  $\rightarrow$  see also lecture on software pipelining
- interferences with instruction selection, register allocation, → phase-ordering problems
  - $\rightarrow$  see also lecture on integrated code generation
- interferences with data layout, exploit advanced addressing units, ...  $\rightarrow$  see also lecture on code generation for DSPs

## Further scheduling issues, not covered

- · creating and scheduling predicated code
- speculation (with and without hardware support) prefetching (load speculation), branch speculation, value speculation ...
- run-time scheduling, profile-driven scheduling
- automatic generation of instruction schedulers: finite state automata [Proebsting/Fraser: *Detecting Pipeline Hazards Quickly*, POPL'94], [Bala/Rubin MICRO-28, 1995]

e.g. the new GCC scheduler [Makarov, GCC Dev. Summit 2003]

## **Generation of Instruction Schedulers**

- Given: Instruction set with
  - reservation table for each instruction
- Set of resource-valid schedules = regular language over the alphabet of instructions
- Scheduling instr. A after B leads to a certain pipeline state (functional unit reservations and pending latencies of recently issued instructions)
- Scheduling A in pipeline state q leads to new pipeline state q'
- → Finite automaton ("Müller automaton") of all possible pipeline states and (appending) scheduling transitions
  - Or finite transducer → gives also the time offset for next instruction
- Precompute possible states + transitions → Scheduling much faster (table lookup instead of interpreting reservation table composition)
- Reversed automaton to allow insertions at any location
- Automata become huge! But can be optimized.

#### **Recommended Reading (global scheduling)**



- J. Fisher. Trace scheduling: A technique for global microcode compaction. *IEEE Trans. Computers*, 30(7):478–490, 1981.
- Paolo Faraboschi, Joseph A. Fisher, Cliff Young: Instruction Scheduling for Instruction Level Parallel Processors Proceedings of the IEEE, vol. 89 no. 11, Nov. 2001
- Daniel Kästner, Sebastian Winkel: ILP-based Instruction Scheduling for IA-64. Proc. ACM SIGPLAN LCTES-2001, June 2001
- Sebastian Winkel. Optimal Global Instruction Scheduling for the Itanium® Processor Architecture. Ph.D. thesis. Saarland University, Saarbrücken, Germany, 2004. ISBN 3-937436-01-6

# Recommended Reading (Generating Schedulers from Reservation Tables)

- T. Müller: Employing finite automata for resource scheduling. Proc. MICRO-26, 1993
- Proebsting, Fraser: Detecting pipeline structural hazards quickly. Proc. ACM POPL-1994
- Bala, Rubin: Efficient instruction scheduling using finite state automata. Proc. MICRO-28, 1995
- Eichenberger, Davidson: A reduced multi-pipeline machine description that preserves scheduling constraints. Proc. ACM PLDI-1996