DF00100 Advanced Compiler Construction TDDC86 Compiler optimizations and code generation

# Optimization and Parallelization of Sequential Programs

#### Lecture 7

#### Christoph Kessler IDA / PELAB Linköping University Sweden

Christoph Kessler, IDA, Linköpings universitet, 2014

#### Outline

Towards (semi-)automatic parallelization of sequential programs

- Data dependence analysis for loops
- Some loop transformations
  - Loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization
  - Doacross parallelization, Inspector-executor method
- Speculative parallelization (as time permits)
- Auto-tuning (later, if time)







### Loop Optimizations – General Issues

- Move loop invariant computations out of loops
- Modify the order of iterations or parts thereof

#### Goals:

- Improve data access locality
- Faster execution
- Reduce loop control overhead
- Enhance possibilities for loop parallelization or vectorization
- Only transformations that preserve the program semantics (its input/output behavior) are admissible
- Conservative (static) criterium: preserve data dependences
- Need data dependence analysis for loops

## Data Dependence Analysis for Loops

DF00100 Advanced Compiler Construction

TDDC86 Compiler optimizations and code generation

A more formal introduction

Christoph Kessler, IDA, Linköpings universitet, 2014

### Data Dependence Analysis – Overview

- Important for loop optimizations, vectorization and parallelization, instruction scheduling, data cache optimizations
- Conservative approximations to disjointness of pairs of memory accesses
   weaker than data-flow analysis
  - but generalizes nicely to the level of individual array element
- Loops, loop nests
  - Iteration space
  - Array subscripts in loops
  - Index space
- Dependence testing methods
- Data dependence graph
- Data + control dependence graph
  - Program dependence graph



## Data Dependence Graph



Data dependence graph for straight-line code ("basic block", no branching) is always *acyclic*, because relative execution order of statements is forward only.

Data dependence graph for a loop:

- Dependence edge S→T if a dependence may exist for some pair of instances (iterations) of S, T
- Cycles possible
- Loop-independent versus loop-carried dependences

| Example:                                                                            |                           |    | ×                |
|-------------------------------------------------------------------------------------|---------------------------|----|------------------|
| for (i=1; i <n; i++<="" th=""><th>) {</th><th>(s</th><th>(1) loop-carried</th></n;> | ) {                       | (s | (1) loop-carried |
| S1: a[i] = b[i] + a                                                                 | [i-1];                    | ٦  |                  |
| S2: b[i] = a[i];                                                                    |                           |    | loop-independent |
| } (assuming                                                                         | we know statically        | Ś  | $\sim$           |
| C. Kessler that arrays                                                              | a and b do not intersect) | C  | 2)               |













## For multidimensional arrays?



subscript-wise test vs. linearized indexing

for  $i \dots S_1 : \dots A[x[i], 2 * i] \dots S_2 : \dots A[y[i], 2 * i + 1] \dots$ 

 $S_1 : ... A[i,i]... S_2 : ... A[i,i+1]...$ 

 $A[i*(s_1+1)]$  $A[i*(s_1+1)+1]$ 

Moreover:

Hierarchical structuring of dependence tests [Burke/Cytron'86]

for *i* ...

## Survey of Dependence Tests gcd test separability test (gcd test for special case, exact) Banerjee-Wolfe test [Banerjee'88] rational solution in *ItS* Delta-test [Goff/Kennedy/Tseng'91] Power test [Wolfe/Tseng'91] Simple Loop Residue test [Maydan/Hennessy/Lam'91] Fourier-Motzkin Elimination [Maydan/Hennessy/Lam'91] Omega test [Pugh/Wonnacott'92]



## Some important loop transformations

- Loop normalization
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. Loop distribution / fission
- Strip-mining / loop tiling / blocking vs. Loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, Loop unswitching
- Scalar replacement, Scalar expansion
- Later: Software pipelining
- More: Cycle shrinking, Loop skewing, ...

























## **Remark on Locality Transformations**



- An alternative can be to change the data layout rather than the control structure of the program
  - Example: Store matrix B in transposed form, or, if necessary, consider transposing it, which may pay off over several subsequent computations
    - Finding the best layout for all multidimensional arrays is a NP-complete optimization problem [Mace, 1988]
  - Example: Recursive array layouts that preserve locality
    - Morton-order layout
    - Hierarchically tiled arrays
- In the best case, can make computations cache-oblivious
- Performance largely independent of cache size





















#### Remark on static analyzability (1) Static dependence information is always a (safe) overapproximation of the real (run-time) dependences • Finding out the real ones exactly is statically undecidable! · If in doubt, a dependence must be assumed → may prevent some optimizations or parallelization One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location Example: Pointer aliasing void mergesort (int\* a, int n) How could a static analysis tool (e.g., compiler) know { ... that the two recursive mergesort (a, n/2); calls read and write mergesort (a + n/2, n-n/2); disjoint subarrays of a?







### Goal of run-time parallelization



- Typical target: irregular loops
  - **for** ( i=0; i<n; i++)
    - a[i] = f(a[g(i)], a[h(i)], ...);
  - Array index expressions g, h... depend on run-time data
  - Iterations cannot be statically proved independent (and not either dependent with distance +1)

#### Principle:

At runtime, inspect g, h ... to find out the real dependences and compute a schedule for partially parallel execution

• Can also be combined with speculative parallelization

#### **Overview**

- Run-time parallelization of irregular loops
  - DOACROSS parallelization
  - Inspector-Executor Technique (shared memory)
  - Inspector-Executor Technique (message passing) \*
  - Privatizing DOALL Test \*
- Speculative run-time parallelization of irregular loops \*
  - LRPD Test \*
- General Thread-Level Speculation
  - Hardware support \*

\* = not covered in this course. See the references







## Inspector-Executor Technique (4)

Problem: Inspector remains sequential - no speedup

#### Solution approaches:

- Re-use schedule over subsequent iterations of an outer loop if access pattern does not change
  - amortizes inspector overhead across repeated executions
- Parallelize the inspector using doacross parallelization [Saltz,Mirchandaney'91]
- Parallelize the inspector using sectioning [Leung/Zahorjan'91]
  - compute processor-local wavefronts in parallel, concatenate
  - trade-off schedule quality (depth) vs. inspector speed
  - Parallelize the inspector using bootstrapping [Leung/Z.'91]
  - Start with suboptimal schedule by sectioning, use this to execute the inspector → refined schedule

## Thread-Level Speculation

DF00100 Advanced Compiler Construction

TDDC86 Compiler optimizations and code generation

Christoph Kessler, IDA, Linköpings universitet, 2014

### Speculatively parallel execution

- For automatic parallelization of sequential code where dependences are hard to analyze statically
  - Works on a task graph
  - · constructed implicitly and dynamically
- Speculate on:
  - control flow, data independence, synchronization, values We focus on thread-level speculation (TLS) for CMP/MT processors. Speculative instruction-level parallelism is not considered here.
- Task:
  - statically: Connected, single-entry subgraph of the controlflow graph
    - Basic blocks, loop bodies, loops, or entire functions
  - dynamically: Contiguous fragment of dynamic instruction stream within static task region, entered at static task entry







## Selecting Tasks for Speculation



#### Small tasks:

- too much overhead (task startup, task retirement)
- low parallelism degree
- Large tasks:
  - higher misspeculation probability
  - higher rollback cost
  - many speculations ongoing in parallel may saturate the resources

#### Load balancing issues

- avoid large variation in task sizes
- Traversal of the program's control flow graph (CFG)
  - Heuristics for task size, control and data dep. speculation

#### **TLS Implementations**

#### Software-only speculation

- for loops [Rauchwerger, Padua '94, '95]
- ...

#### Hardware-based speculation

- Typically, integrated in cache coherence protocols
- Used with multithreaded processors / chip multiprocessors for automatic parallelization of sequential legacy code
- If source code available, compiler may help e.g. with identifying suitable threads



## Some references on Dependence Analysis Loop optimizations and Transformations H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM press, 1990. M. Wolfe: High-Performance Compilers for Parallel Computing, Addison-Wesley, 1996. R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

Idiom recognition and algorithm replacement:

- C. Kessler: Pattern-driven automatic parallelization. Scientific Programming 5:251-274, 1996.
- A. Shafiee-Sarvestani, E. Hansson, C. Kessler: Extensible recognition of algorithmic patterns in DSP programs for automatic paral-lelization. *Int. J. on Parallel Programming*, 2012.

## Some references on run-time parallelization



- R. Cytron: Doacross: Beyond vectorization for multiprocessors. Proc. ICPP-1986
- D. Chen, J. Torrellas, P. Yew: An Efficient Algorithm for the Run-time Parallelization of DO-ACROSS Loops, Proc. IEEE Supercomputing Conf., Nov. 2004, IEEE CS Press, pp. 518-527
- R. Mirchandaney, J. Saltz, R. M. Smith, D. M. Nicol, K. Crowley: Principles of run-time suppor for parallel processors, Proc. ACM Int. Conf. on Supercomputing, July 1988, pp. 140-152.
- J. Saltz and K. Crowley and R. Mirchandaney and H. Berryman: Runtime Scheduling and Execution of Loops on Message Passing Machines, *Journal on Parallel and Distr. Computing* 8 (1990): 303-312.
- J. Saltz, R. Mirchandaney: The preprocessed doacross loop. Proc. ICPP-1991 Int. Conf. on Parallel Processing.
- S. Leung, J. Zahorjan: Improving the performance of run-time parallelization. Proc. ACM PPoPP-1993, pp. 83-91.
- Lawrence Rauchwerger, David Padua: The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization. Proc. ACM Int. Conf. on Supercomputing, July 1994, pp. 33-45.
- Lawrence Rauchwerger, David Padua: The LRPD Test: Speculative Run-Time Parallelizatic of Loops with Privatization and Reduction Parallelization. Proc. ACM SIGPLAN PLDI-95, 1995, pp. 218-232.

## Some references on speculative execution / parallelization

- T. Vijaykumar, G. Sohi: Task Selection for a Multiscalar Processor. Proc. MICRO-31, Dec. 1998.
- J. Martinez, J. Torrellas: Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors. Proc. WMPI at ISCA, 2001.
- F. Warg and P. Stenström: Limits on speculative module-level parallelism in imperative and object-oriented programs on CMP platforms. Pr. IEEE PACT 2001.
   P. Marcuello and A. Gonzalez: Thread-spawning schemes for speculative
- multithreading. Proc. HPCA-8, 2002.
- J. Steffan et al.: Improving value communication for thread-level speculation. HPCA-8, 2002.
- M. Cintra, J. Torrellas: Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. HPCA-8, 2002.
   Eradik Warn and Par Stansteins Innovations especulation through learning through the standard standard
- Fredrik Warg and Per Stenström: Improving speculative thread-level parallelism through module run-length prediction. Proc. IPDPS 2003.
   F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chin
- F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chip Multiprocessors. PhD thesis, Chalmers TH, Gothenburg, June 2006.
   T. Ohsawa et al.: Pinot: Speculative multi-threading processor architecture
- T. Ohsawa et al.: Pinot: Speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities. Proc. MICRO-38, 2005.
   Kessler, IDA. Linköpings universitet.