







# Parallel Computer

A parallel computer is a computer consisting of

- two or more processors that can cooperate and communicate to solve a large problem faster,
- one or more memory modules,
- an interconnection network that connects processors with each other and/or with the memory modules.

Multiprocessor: tightly connected processors, e.g. shared memory

Multicomputer: more loosely connected, e.g. distributed memory

C. Kessler, IDA, Linköpings universitet.

### Parallel Computer Architecture Concepts

Classification of parallel computer architectures:

- by control structure
  - SISD, SIMD, MIMD
- by memory organization
  - in particular, distributed memory vs. shared memory
- by interconnection network topology













### More about Interconnection Networks

- Hypercube, Crossbar, Butterfly, Hybrid networks ... → TDDC78
- Switching and routing algorithms
- Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (worst-case impact of node or switch failure)
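As a small illustration of the properties listed above, the following sketch computes node degree, number of links and longest path for a hypercube. It assumes the standard definition of a d-dimensional hypercube (2^d nodes, two nodes are linked if their binary labels differ in exactly one bit); the function names are chosen here for illustration only.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of `node`: flip each of the `dim` label bits in turn."""
    return [node ^ (1 << k) for k in range(dim)]


def hypercube_properties(dim):
    """Basic cost and latency figures of a dim-dimensional hypercube."""
    nodes = 2 ** dim
    return {
        "nodes": nodes,
        "degree": dim,                # one link per label bit
        "links": nodes * dim // 2,    # each link is shared by two nodes
        "diameter": dim,              # longest shortest path: all bits differ
    }


if __name__ == "__main__":
    print(hypercube_neighbors(5, 3))
    print(hypercube_properties(3))
```

Both degree and longest path grow only logarithmically in the number of nodes, while the number of links grows as (n log n)/2 for n nodes.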





(Fat-tree topology)

### The Challenge

- Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens of cores and hundreds of HW threads available per server
  - May even be heterogeneous (core types, accelerators)
  - Data locality matters
  - Large clusters for HPC and data centers require message passing
- Utilizing more than one CPU core requires thread-level parallelism
- One of the biggest software challenges: exploiting parallelism
  - Need LOTS of (mostly independent) tasks to keep cores/HW threads busy and overlap waiting times (cache misses, I/O accesses)
  - All application areas, not only traditional HPC
    - General-purpose, data mining, graphics, games, embedded, DSP, ...
  - Affects HW/SW system architecture, programming languages, algorithms, data structures, ...
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies)
    - And thus more expensive and time-consuming

# Can't the compiler fix it for us?

Automatic parallelization?

- at compile time:
  - Requires static analysis
    - not effective for pointer-based languages
    - inherently limited: missing runtime information
    - needs programmer hints / rewriting ...
  - ok only for few benign special cases:
    - loop vectorization
    - extraction of instruction-level parallelism
- at run time (e.g. speculative multithreading)
  - High overheads, not scalable

### Insight

- Design of efficient / scalable parallel algorithms is, in general, a creative task that is not automatizable
- But some good recipes exist
  - Parallel algorithmic design patterns

























### Divide&Conquer Parallel Sum Algorithm in the PRAM / Circuit (DAG) cost model

Given $n$ numbers $x_0, x_1, ..., x_{n-1}$ stored in an array.

The global sum $\sum_{i=0}^{n-1} x_i$ can be computed in $\lceil \log_2 n \rceil$ time steps on an EREW PRAM with $n$ processors.
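The step count can be checked with a sequential simulation of the tree-shaped summation. This is only a sketch: the inner loop stands in for the processors that would work simultaneously in one PRAM time step, and the function name `pram_parallel_sum` is made up for this example.

```python
import math


def pram_parallel_sum(values):
    """Simulate the tree-shaped parallel sum; return (total, parallel_steps)."""
    x = list(values)  # work on a copy of the input array
    n = len(x)
    steps = 0
    stride = 1
    while stride < n:
        # One parallel time step: all additions below touch disjoint cells,
        # so they could run simultaneously (exclusive read, exclusive write).
        for i in range(0, n - stride, 2 * stride):
            x[i] += x[i + stride]
        stride *= 2
        steps += 1
    return (x[0] if n > 0 else 0), steps


if __name__ == "__main__":
    total, steps = pram_parallel_sum([1, 2, 3, 4, 5])
    print(total, steps, math.ceil(math.log2(5)))
```

In each step the distance between combined elements doubles, so after $\lceil \log_2 n \rceil$ steps the global sum is in `x[0]`.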







## For a fixed number of processors ...?

Usually, p << n

Requires scheduling the work to p processors

(A) manually, at algorithm design time:

- Requires algorithm engineering
- E.g. stop the parallel divide-and-conquer at subproblem size n/p and switch to sequential divide-and-conquer (= task agglomeration)

For parallel sum:

- Step 0. Partition the array of n elements in p slices of n/p elements each (= domain decomposition)
- Step 1. Each processor calculates a local sum for one slice, using the sequential sum algorithm, resulting in p partial sums (intermediate values)
- Step 2. The p processors run the parallel algorithm to sum up the intermediate values to the global sum.
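The three steps for the parallel sum can be sketched as follows. The function name `sliced_parallel_sum` is hypothetical, and a Python thread pool is used only to show the structure; because of the interpreter lock in CPython, no actual speedup should be expected from this sketch. Step 2 is done sequentially here for brevity, which is an assumption that p is small.

```python
from concurrent.futures import ThreadPoolExecutor


def sliced_parallel_sum(x, p):
    """Sum x using p workers; return (global_sum, partial_sums)."""
    n = len(x)
    # Step 0: domain decomposition into p contiguous slices of about n/p elements.
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    # Step 1: each worker computes the local sum of one slice sequentially.
    with ThreadPoolExecutor(max_workers=p) as pool:
        partial = list(pool.map(lambda b: sum(x[b[0]:b[1]]), bounds))
    # Step 2: combine the p partial sums (sequentially in this sketch; the
    # tree-shaped parallel sum could be used instead).
    return sum(partial), partial


if __name__ == "__main__":
    print(sliced_parallel_sum(list(range(100)), 4))
```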





### **Analysis of Parallel Algorithms**

Christoph Kessler, IDA, Linköpings universitet.


Performance metrics of parallel programs

- Parallel execution time
  - Counted from the start time of the earliest task to the finishing time of the latest task
- Work: the total number of performed elementary operations
- Cost: the product of parallel execution time and #processors
- Speed-up
  - the factor by how much faster we can solve a problem with p processors than with 1 processor, usually in range (0...p)
- Parallel efficiency = Speed-up / #processors, usually in (0...1)
- Throughput = #operations finished per second
- Scalability
  - does speedup keep growing well also when #processors grows large?
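A minimal sketch of how speed-up, efficiency and cost follow from measured times. The timings in the usage part are invented for illustration and are not measurements from any real machine; the function name is likewise an assumption.

```python
def parallel_metrics(t1, tp, p):
    """Return (speedup, efficiency, cost) for time t1 on 1 and tp on p processors."""
    speedup = t1 / tp          # how much faster than with 1 processor
    efficiency = speedup / p   # speed-up per processor
    cost = tp * p              # parallel execution time times #processors
    return speedup, efficiency, cost


if __name__ == "__main__":
    # Hypothetical execution times in seconds, indexed by #processors.
    times = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}
    for p, tp in times.items():
        s, e, c = parallel_metrics(times[1], tp, p)
        print(f"p={p}: speedup={s:.2f} efficiency={e:.2f} cost={c:.1f}")
```

With such a table, scalability can be judged by whether the efficiency stays close to 1 as p grows.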





### **Asymptotic Analysis**

- Estimation based on a cost model and algorithm idea (pseudocode operations)
- Discuss behavior for large problem sizes, large #processors

### **Empirical Analysis**

- Implement in a concrete parallel programming language
- Measure time on a concrete parallel computer
  - Vary number of processors used, as far as possible
- More precise
- More work, and fixing bad designs at this stage is expensive





### Parallel work, time, cost

parallel work $w_A(n)$ of algorithm $A$ on an input of size $n$

= max. number of instructions performed by all processors during execution of $A$, where in each (parallel) time step as many processors are available as needed to execute the step in constant time.

parallel time $t_A(n)$ of algorithm $A$ on an input of size $n$

= max. number of parallel time steps required under the same circumstances

$t_A(n)$ is sometimes called the depth of $A$ (cf. circuit model of (parallel) computation)

$p_i(n)$ = number of processors needed in time step $i$, $0 \le i < t_A(n)$, to execute the step in constant time. Then, $w_A(n) = \sum_{i=0}^{t_A(n)-1} p_i(n)$

parallel cost $c_A(n) = t_A(n) \cdot p_A(n)$

where $p_A(n) = \max_i p_i(n)$ = max. number of processors used in a step of $A$

$\rightarrow c_A(n) \ge w_A(n)$

Work, time, cost are thus worst-case measures.
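As a worked example of these definitions, the sketch below evaluates work, time and cost from a list of per-step processor counts $p_i(n)$. The example input is the tree-shaped parallel sum for $n = 16$, assuming $n$ is a power of two so that step $i$ needs $n / 2^{i+1}$ processors; the function name is chosen for this illustration.

```python
import math


def work_time_cost(procs_per_step):
    """Return (work, time, cost) given p_i for each parallel time step i."""
    work = sum(procs_per_step)    # w_A = sum over all steps of p_i
    time = len(procs_per_step)    # t_A = number of parallel time steps
    cost = time * max(procs_per_step) if procs_per_step else 0  # c_A = t_A * p_A
    return work, time, cost


if __name__ == "__main__":
    n = 16  # assumed to be a power of two
    procs = [n // 2 ** (i + 1) for i in range(int(math.log2(n)))]
    print(procs, work_time_cost(procs))
```

For the parallel sum the work is $n - 1$ additions, while the cost is $(n/2) \log_2 n$, which illustrates $c_A(n) \ge w_A(n)$.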

### Speedup

Consider problem  $\mathcal{P}$ , parallel algorithm A for  $\mathcal{P}$ 

$T_s$ = time to execute the best serial algorithm for $\mathcal{P}$ on one processor of the parallel machine

$T(1)$ = time to execute parallel algorithm $A$ on 1 processor

$T(p)$ = time to execute parallel algorithm $A$ on $p$ processors

Absolute speedup $S_{abs} = \frac{T_s}{T(p)}$

Relative speedup $S_{rel} = \frac{T(1)}{T(p)}$

$S_{abs} \leq S_{rel}$, because $T_s \le T(1)$


Speedup S(p) with p processors is usually in the range (0...p)


### Amdahl's Law: Upper bound on Speedup

Consider execution (trace) of parallel algorithm $A$:

- sequential part $A^s$ where only 1 processor is active
- parallel part $A^p$ that can be sped up perfectly by $p$ processors

$\rightarrow$ total work $w_A(n) = w_{A^s}(n) + w_{A^p}(n)$, time $T = T_{A^s} + \frac{T_{A^p}}{p}$

Amdahl's Law:

If the sequential part of $A$ is a fixed fraction of the total work irrespective of the problem size $n$, that is, if there is a constant $\beta$ with $\beta = \frac{w_{A^s}(n)}{w_A(n)} \le 1$, the relative speedup of $A$ with $p$ processors is limited by

$\frac{p}{\beta p + (1-\beta)} \; < \; 1/\beta$
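The bound can be evaluated numerically. Normalizing $T(1)$ to 1, the time on $p$ processors is $\beta + (1-\beta)/p$, and dividing gives the expression above. The sketch below assumes $0 \le \beta \le 1$; the function name and the sample values are illustrative only.

```python
def amdahl_speedup(beta, p):
    """Upper bound on relative speedup for sequential fraction beta, p processors."""
    return p / (beta * p + (1.0 - beta))


if __name__ == "__main__":
    beta = 0.1  # assumed sequential fraction of the total work
    for p in (1, 10, 100, 1000):
        print(f"p={p}: speedup <= {amdahl_speedup(beta, p):.2f}")
    print(f"limit for p -> infinity: {1 / beta:.1f}")
```

Even with a sequential fraction of only 10%, the speedup stays below 10 no matter how many processors are added.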











### **Data Locality**

Memory hierarchy rationale: Try to amortize the high access cost of lower levels (DRAM, disk, ...) by caching data in higher levels for faster subsequent accesses

- Cache miss: stall the computation, fetch the block of data containing the accessed address from the next lower level, then resume
- More reuse of cached data (cache hits) → better performance
- Working set = the set of memory addresses accessed together in a period of computation
- **Data locality** = property of a computation: keeping the working set small during a computation
  - Temporal locality: re-access the same data element multiple times within a short time interval
  - Spatial locality: access neighboring memory addresses within a short time interval
- High latency favors larger transfer block sizes (cache lines, memory pages, file blocks, messages) for amortization over many subsequent accesses
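The effect of spatial locality can be made visible with a toy cache model. The sketch below is not a model of any real processor: it assumes a small fully associative cache with LRU replacement, a row-major matrix layout, and sizes (16x16 matrix, 4 words per line, 8 lines) chosen only to keep the numbers small.

```python
from collections import OrderedDict


def count_misses(addresses, line_size, num_lines):
    """Count misses of a fully associative LRU cache over an address trace."""
    cache = OrderedDict()  # cache line number -> True, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: fetch the whole line
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def row_major_trace(n):
    """Addresses of an n x n row-major matrix, traversed row by row."""
    return [i * n + j for i in range(n) for j in range(n)]


def column_major_trace(n):
    """Same matrix and layout, but traversed column by column."""
    return [i * n + j for j in range(n) for i in range(n)]


if __name__ == "__main__":
    n, line_size, num_lines = 16, 4, 8
    print(count_misses(row_major_trace(n), line_size, num_lines))
    print(count_misses(column_major_trace(n), line_size, num_lines))
```

The row-wise traversal uses every word of a fetched line before moving on, so it misses once per line; the column-wise traversal touches a different line on every access and, in this configuration, finds each line evicted again before it is reused.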

### Memory-bound vs. CPU-bound computation

- Arithmetic intensity of a computation
  - = #arithmetic instructions (computational work) executed per accessed element of data in memory (after cache miss)
- A computation is CPU-bound if its arithmetic intensity is >> 1.
  - The performance bottleneck is the CPU's arithmetic throughput
- A computation is memory-access bound otherwise.
  - The performance bottleneck is memory accesses, CPU is not fully utilized
- Examples:
  - Matrix-matrix multiply (if properly implemented) is CPU-bound.
  - Array global sum is memory-bound on most architectures.
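A rough back-of-the-envelope calculation illustrates the two examples. It assumes the idealized case in which every data element is loaded from memory exactly once (perfect cache reuse), and it counts 2n³ arithmetic operations for the classical n x n matrix multiply; both are simplifying assumptions, and the function names are hypothetical.

```python
def intensity_matmul(n):
    """Arithmetic operations per accessed element for n x n matrix multiply."""
    operations = 2 * n ** 3   # n^3 multiplications and n^3 additions
    elements = 3 * n ** 2     # matrices A, B and C, each touched once (ideal)
    return operations / elements


def intensity_sum(n):
    """Arithmetic operations per accessed element for summing n numbers."""
    return (n - 1) / n        # n - 1 additions on n elements


if __name__ == "__main__":
    print(intensity_matmul(300))
    print(intensity_sum(1000))
```

Under these assumptions the intensity of matrix multiply grows linearly with n, whereas that of the global sum stays below 1 for any n.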



### Data Parallelism

Given:

- One (or several) data containers x, z, ... with n elements each, e.g. array(s) $\mathbf{x} = (x_1, ..., x_n), \mathbf{z} = (z_1, ..., z_n), ...$
- An operation f on individual elements of x, z, ... (e.g. incr, sqrt, mult, ...)

Compute: $\mathbf{y} = f(\mathbf{x}) = (f(x_1), ..., f(x_n))$

Parallelizability:

- Each data element defines a task
  - Fine-grained parallelism
  - Easily partitioned into independent tasks, fits very well on all parallel architectures

Notation with higher-order function:

- y = map(f, x)

(Figure: map(f, a, b) applies f elementwise to the inputs a = (a1, a2, ..., an) and b, yielding f(a1, b1), f(a2, b2), ...)
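The map notation can be sketched with Python's standard library. The helper `parallel_map` is a name invented for this example; it uses a thread pool only to show that the element-wise tasks are independent, and it is not meant as a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(f, *containers, workers=4):
    """Apply f elementwise to one or more containers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each element (or tuple of corresponding elements) is one independent task.
        return list(pool.map(f, *containers))


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    z = [10, 20, 30, 40]
    print(parallel_map(lambda a: a + 1, x))       # incr on one container
    print(parallel_map(lambda a, b: a * b, x, z)) # mult on two containers
```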

### **Data-parallel Reduction**

Given:

- A data container x with n elements, e.g. array $\mathbf{x} = (x_1, ..., x_n)$
- A binary, associative operation op on individual elements of x (e.g. add, max, bitwise-or, ...)

Compute: $y = \mathrm{OP}_{i=1}^{n}\, x_i = x_1 \text{ op } x_2 \text{ op ... op } x_n$

Parallelizability: Exploit associativity of op

Notation with higher-order function:

- y = reduce(op, x)
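To show what exploiting associativity means, the sketch below regroups the operands as a balanced tree instead of a left-to-right chain. The function `tree_reduce` is a hypothetical helper written sequentially; its two recursive calls are independent of each other and could be evaluated in parallel. Only associativity is required, so the operand order is preserved.

```python
import operator
from functools import reduce


def tree_reduce(op, x):
    """Reduce a non-empty sequence x with an associative binary operation op."""
    if len(x) == 0:
        raise ValueError("tree_reduce needs at least one element")
    if len(x) == 1:
        return x[0]
    mid = len(x) // 2
    left = tree_reduce(op, x[:mid])    # independent subproblem
    right = tree_reduce(op, x[mid:])   # independent subproblem
    return op(left, right)             # combine, keeping left-to-right order


if __name__ == "__main__":
    words = ["a", "b", "c", "d", "e"]
    # String concatenation is associative but not commutative.
    print(tree_reduce(operator.add, words), reduce(operator.add, words))
    print(tree_reduce(max, [3, 9, 2, 7]))
```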























# Further Reading

- C. Kessler, J. Keller: Models for Parallel Computing: Review and Perspectives. PARS-Mitteilungen 24, Gesellschaft für Informatik, Dec. 2007, ISSN 0177-0454

On PRAM model and Design and Analysis of Parallel Algorithms

- J. Keller, C. Kessler, J. Träff: Practical PRAM Programming. Wiley Interscience, New York, 2001.
- J. JaJa: An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
- T. Cormen, C. Leiserson, R. Rivest: Introduction to Algorithms, Chapter 30. MIT Press, 1989, or a later edition.
- H. Jordan, G. Alaghband: Fundamentals of Parallel Processing. Prentice Hall, 2003.
- A. Grama, G. Karypis, V. Kumar, A. Gupta: Introduction to Parallel Computing, 2nd Edition. Addison-Wesley, 2003.

On skeleton programming, see e.g. our publications on SkePU: http://www.ida.liu.se/labs/pelab/skepu

