#### Seminar 2

- **1. Instruction Pipelining**
- 2. Superscalars
- **3. Parallel Architectures**

A nonpipelined processor has a clock rate of 2.5 GHz. An upgrade to the processor introduces a five-stage pipeline. However, due to internal pipeline delays, the clock rate of the new processor has to be reduced to 2 GHz. What is the speedup achieved for a sequence of 100 instructions?

A nonpipelined processor has a clock rate of 2.5 GHz. An upgrade to the processor introduces a five-stage pipeline. However, due to internal pipeline delays, the clock rate of the new processor has to be reduced to 2 GHz. What is the speedup achieved for a sequence of 100 instructions?

#### **Remember from Lecture 4-5**:

- $\Box$   $\tau$ : duration of one cycle
- □ *n*: number of instructions to execute
- □ *k*: number of pipeline stages
- $\Box$   $T_{k,n}$ : total time to execute *n* instructions on a pipeline with *k* stages
- □  $S_{k,n}$ : (theoretical) speedup produced by a pipeline with *k* stages when executing *n* instructions

 $T_{k,n} = [k + (n-1)] \times \tau$ 

On a non-pipelined processor each instruction takes  $k \times \tau$ , and *n* instructions take  $T_n = n \times k \times \tau$ 

A nonpipelined processor has a clock rate of 2.5 GHz. An upgrade to the processor introduces a five-stage pipeline. However, due to internal pipeline delays, the clock rate of the new processor has to be reduced to 2 GHz. What is the speedup achieved for a sequence of 100 instructions?

#### **Remember from Lecture 4-5**:

- $\Box$   $\tau$ : duration of one cycle
- □ *n*: number of instructions to execute
- □ *k*: number of pipeline stages
- $\Box$   $T_{k,n}$ : total time to execute *n* instructions on a pipeline with *k* stages
- □  $S_{k,n}$ : (theoretical) speedup produced by a pipeline with *k* stages when executing *n* instructions

$$T_{k,n} = [k + (n-1)] \times \tau$$

On a non-pipelined processor each instruction takes  $k \times \tau$ , and *n* instructions take  $T_n = n \times k \times \tau$ 

$$S_{k,n} = \frac{T_n}{T_{k,n}} = \frac{n \times k \times \tau}{[k + (n-1)] \times \tau} = \frac{n \times k}{k + (n-1)}$$

A nonpipelined processor has a clock rate of 2.5 GHz. An upgrade to the processor introduces a five-stage pipeline. However, due to internal pipeline delays, the clock rate of the new processor has to be reduced to 2 GHz. What is the speedup achieved for a sequence of 100 instructions?

#### **Solution**

$$S_{5,100} = \frac{100 \times 5}{5 + (100 - 1)} = 4.8$$

We would have a speedup of 4.8 if the pipelined processor would work at the same clock rate as the initial one!

But the clock rate of the pipelined processor is reduced by a factor of 2/2.5 = 0.8



**Consider the following assembly language program:** 

- 1: Move R3, R7  $R3 \leftarrow R7$
- 2: Load R8, (R3)  $R8 \leftarrow (R3)$
- 3: Add R3, R3, 4  $R3 \leftarrow R3 + 4$
- 4: Load R9, (R3) R9 ← (R3)
- 5: BLE R8, R9, L3 Branch if R9 > R8

**Consider the following assembly language program:** 

 $\begin{array}{cccc} 1: \mbox{ Move R3, R7} & \mbox{ R3} \leftarrow \mbox{ R7} \\ 2: \mbox{ Load R8, (R3)} & \mbox{ R8} \leftarrow (\mbox{ R3}) \\ 3: \mbox{ Add R3, R3, 4} & \mbox{ R3} \leftarrow \mbox{ R3} + 4 \\ 4: \mbox{ Load R9, (R3)} & \mbox{ R9} \leftarrow (\mbox{ R3}) \\ 5: \mbox{ BLE R8, R9, L3} & \mbox{ Branch if R9} > \mbox{ R8} \end{array}$ 

**Consider the following assembly language program:** 

**Consider the following assembly language program:** 



**Consider the following assembly language program:** 



**Consider the following assembly language program:** 



**Consider the following assembly language program:** 



Show the dependencies.

True data dependency (RAW): 1 - 2, 1 - 3, 2 - 5, 3 - 4, 4 - 5 Output dependency (WAW): 1 - 3 Antidependency (WAR): 2 - 3

Datorarkitektur Se 2

- a. Identify the dependencies in the following code:
  - 1:
      $R1 \leftarrow 100$  

     2:
      $R5 \leftarrow R1 + R2$  

     3:
      $R7 \leftarrow R5 + 1$  

     4:
      $R1 \leftarrow R2 + R4$  

     5:
      $R5 \leftarrow 0$  

     6:
      $R2 \leftarrow R4 25$  

     7:
      $R3 \leftarrow R7 2$  

     8:
      $R4 \leftarrow R1 + R3$  

     9:
      $R10 \leftarrow 0$  

     10:
      $R1 \leftarrow R1 + 30$
- b. Rename the registers in the above sequence to prevent, where possible, dependency problems.
- c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
  - c1) in order execution;
  - c2) out of order execution before renaming;
  - c3) out of order execution after renaming.

a. Identify the dependencies in the following code:

1:
 
$$R1 \leftarrow 100$$

 2:
  $R5 \leftarrow R1 + R2$ 

 3:
  $R7 \leftarrow R5 + 1$ 

 4:
  $R1 \leftarrow R2 + R4$ 

 5:
  $R5 \leftarrow 0$ 

 6:
  $R2 \leftarrow R4 - 25$ 

 7:
  $R3 \leftarrow R7 - 2$ 

 8:
  $R4 \leftarrow R1 + R3$ 

 9:
  $R10 \leftarrow 0$ 

 10:
  $R1 \leftarrow R1 + 30$ 

a. Identify the dependencies in the following code:

a. Identify the dependencies in the following code:

a. Identify the dependencies in the following code:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8.

a. Identify the dependencies in the following code:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8.

a. Identify the dependencies in the following code:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8.

a. Identify the dependencies in the following code:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R1 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R1 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R1 + 30$ 

a. Identify the dependencies in the following code:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R1 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R1 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R1 + 30$ 

a. Identify the dependencies in the following code:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R1 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R1 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R1 + 30$ 

a. Identify the dependencies in the following code:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R1 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R1 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R1 + 30$ 

a. Identify the dependencies in the following code:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R1 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R1 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R1 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 10, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10

b. Rename registers to prevent output - and antidependencies:



True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 10, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10

b. Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1 - 10, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10

b. Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R1 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1 - 10, 2 - 5, 4. Antidependency (WAR): 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8.

b. Rename registers to prevent output - and antidependencies:

**R1** ← 100 **∕R5 ← R1 + R2 R7** ← **R5** + 1 \_R11 ← R2 + R4 5: **^R5 ← 0**  $R2 \leftarrow R4 - 25$ **6**:  $R3 \leftarrow R7 - 2$  $R4 \leftarrow R11 + R3$ **R10** ← **0** 9: 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1 - 10, 2 - 5, 4 0. Antidependency (WAR): 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 0

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1, 0, 2 - 5, 4, 0. Antidependency (WAR): 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8, 0

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1, 0, 2 - 5, 4. Antidependency (WAR): 4, 2, 0, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8.

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R5 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 0. Antidependency (WAR): 2 - 4, 2 - 0, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 0

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R51 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1, 1, 0, 2, 5, 4, 10. Antidependency (WAR): 2, 4, 2, 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8, 10

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R51 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1, 1, 0, 2, 5, 4, 10. Antidependency (WAR): 2, 4, 2, 10, 2 - 6, 3, 5, 4 - 6, 4 - 8, 6 - 8, 8, 10

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R51 \leftarrow 0$   
6:  $R2 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 4, 1, 1, 0, 2, 5, 4, 10. Antidependency (WAR): 2, 4, 2, 10, 2 - 6, 3, 5, 4 - 6, 4 - 8, 6 - 8, 8, 10

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R51 \leftarrow 0$   
6:  $R21 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10

**b.** Rename registers to prevent output - and antidependencies:

1: 
$$R1 \leftarrow 100$$
  
2:  $R5 \leftarrow R1 + R2$   
3:  $R7 \leftarrow R5 + 1$   
4:  $R11 \leftarrow R2 + R4$   
5:  $R51 \leftarrow 0$   
6:  $R21 \leftarrow R4 - 25$   
7:  $R3 \leftarrow R7 - 2$   
8:  $R4 \leftarrow R11 + R3$   
9:  $R10 \leftarrow 0$   
10:  $R12 \leftarrow R11 + 30$ 

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10

**b.** Rename registers to prevent output - and antidependencies:

**b.** Rename registers to prevent output - and antidependencies:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 5, 4 - 10.

**b.** Rename registers to prevent output - and antidependencies:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 5, 4 - 10.

**b.** Rename registers to prevent output - and antidependencies:

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 0, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 5, 4 - 10.

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   |     |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   |     |
| Cycle 2 | 2   |     |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   |     |
| Cycle 2 | 2   |     |
| Cycle 3 | 3   | 4   |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;

| <b>1</b> :         | <b>R1</b> ←     | 100         | //        |                   |
|--------------------|-----------------|-------------|-----------|-------------------|
| 2:                 | <b>R5</b> ←     | R1 +        | <b>R2</b> | $\mathbf{i}$      |
| <b>3:</b>          | ( <b>/ R7</b> ← | R5 +        | 1)        | $\langle \rangle$ |
| 4:                 | <b>▶</b> R1 ←   | R2 +        | R4        |                   |
| 5:                 | <b>(R5</b> ←    | 0           | × `       | $\setminus$       |
| /  \ <b>6:</b>   ` | <b>₹R</b> 2 ←   | <b>R4</b> – | 25        |                   |
| 7:                 | <b>R3</b> ←     | <b>R7</b> – | 2         |                   |
| 8:                 | <b>R4</b> ←     | R1 +        | <b>R3</b> |                   |
| 9:                 | \( <b>R10</b> ← | - 0         |           |                   |
| 10:                | <b>₹R</b> 1 ←   | R1 +        | 30 🎽      |                   |

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   |     |
| Cycle 2 | 2   |     |
| Cycle 3 | 3   | 4   |
| Cycle 5 | 5   | 6   |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;

| $\land$ (1: R1 ← 100 $\searrow$                        |     |  |
|--------------------------------------------------------|-----|--|
| 2: $R5 \leftarrow R1 + R2$                             | ALU |  |
| $3: (R7 \leftarrow R5 + 1) $ Cycle 1                   | 1   |  |
| 4: $R1 \leftarrow R2 + R4$<br>Cycle 2                  | 2   |  |
| $ \begin{array}{c ccccccccccccccccccccccccccccccccccc$ | 3   |  |
| $7: (R3 \leftarrow R7 - 2) / Cycle 4$                  | 5   |  |
| $\mathbf{R} : \mathbf{R} + \mathbf{R} + \mathbf{R} $   | 7   |  |
| 9: $(R10 \leftarrow 0)$<br>10: $R1 \leftarrow R1 + 30$ |     |  |
|                                                        |     |  |

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8. Output dependency (WAW): 1 - 4, 1 - 10, 2 - 5, 4 - 10. Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10 Datorarkitektur Se 2

ALU

4

6

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;

| $\begin{array}{ccc} 1: & R1 \leftarrow 100 \\ 2: & R5 \leftarrow R1 + R2 \end{array}$                  |         | ALU | ALU |
|--------------------------------------------------------------------------------------------------------|---------|-----|-----|
| <b>3</b> : (( <b>R7</b> ← <b>R5</b> + 1 ))                                                             | Cycle 1 | 1   |     |
| 4: $R1 \leftarrow R2 + R4$<br>5: $R5 \leftarrow 0$<br>6: $R2 \leftarrow R4 - 25$                       |         | 2   |     |
|                                                                                                        |         | 3   | 4   |
| $7: \left( \begin{array}{c} R3 \leftarrow R7 - 2 \\ R4 \leftarrow R1 + R3 \end{array} \right) \right)$ | Cycle 4 | 5   | 6   |
|                                                                                                        | Cycle 5 | 7   |     |
| 9: $(R10 \leftarrow 0)$<br>10: $R1 \leftarrow R1 + 30$                                                 |         | 8   | 9   |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c1) in order execution;

| 1: $R1 \leftarrow 100$<br>2: $R5 \leftarrow R1 + R2$                                                                         |         | ALU | ALU |
|------------------------------------------------------------------------------------------------------------------------------|---------|-----|-----|
| 3: (( R7 ← R5 + 1 )                                                                                                          | Cycle 1 | 1   |     |
| $\begin{array}{c c} 4: & R1 \leftarrow R2 + R4 \\ 5: & R5 \leftarrow 0 \\ 6: & R2 \leftarrow R4 - 25 \end{array}$            |         | 2   |     |
|                                                                                                                              |         | 3   | 4   |
| $\textbf{7:} (\textbf{R3} \leftarrow \textbf{R7} - \textbf{2}) )$                                                            | Cycle 4 | 5   | 6   |
| $8: \mathbf{R}4 \leftarrow \mathbf{R}1 + \mathbf{R}3 $                                                                       | Cycle 5 | 7   |     |
| 9: $(R10 \leftarrow 0)$<br>10: $R1 \leftarrow R1 + 30$                                                                       |         | 8   | 9   |
| $\mathbf{V} \mathbf{U} \cdot \mathbf{X} \mathbf{V} \leftarrow \mathbf{X} \mathbf{U} + \mathbf{J} \mathbf{U}^{\prime \prime}$ | Cycle 7 | 10  |     |

True data dependency (RAW): 1 - 2, 2 - 3, 3 - 7, 4 - 8, 4 - 10, 7 - 8.

Output dependency (WAW): 1 - 4, 1 - 10, 2 - 5, 4 - 10.

Antidependency (WAR): 2 - 4, 2 - 10, 2 - 6, 3 - 5, 4 - 6, 4 - 8, 6 - 8, 8 - 10 Datorarkitektur. Se 2

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 9   |

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 9   |
| Cycle 2 | 2   |     |

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 9   |
| Cycle 2 | 2   |     |
| Cycle 3 | 3   | 4   |

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 9   |
| Cycle 2 | 2   |     |
| Cycle 3 | 3   | 4   |
| Cycle 4 | 5   | 6   |

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;

| <u>(</u> 1: | <b>R1</b> ← 100             | $\mathcal{N}$ |
|-------------|-----------------------------|---------------|
| 2:          | $R5 \leftarrow R1 + R$      | 2             |
| 3:          | (// <b>R7 ← R5 + 1</b>      |               |
| 4:          | <mark>}</mark> R1 ← R2 + R  | 4             |
| 5:          | ( <b>►R5</b> ← 0            |               |
| 6:          | <b>≩</b> R2 ← R4 – 2        | 5 \           |
| 7:          | ( <b>R3</b> ← <b>R7</b> − 2 |               |
| <b>8:</b>   | $R4 \leftarrow R1 + R$      |               |
| 9:          | \                           |               |
| 10:         | R1 ← R1 + 3                 | 0 🚧           |

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 9   |
| Cycle 2 | 2   |     |
| Cycle 3 | 3   | 4   |
| Cycle 4 | 5   | 6   |
| Cycle 5 | 7   |     |

c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c2) out-of-order order execution, without renaming;

| 1: $R1 \leftarrow 100$<br>2: $R5 \leftarrow R1 + R2$                         |         | ALU | ALU |
|------------------------------------------------------------------------------|---------|-----|-----|
| <b>3</b> : (( <b>R</b> 7 ← <b>R</b> 5 + 1 ) \                                | Cycle 1 | 1   | 9   |
| 4: $R1 \leftarrow R2 + R4$                                                   | Cycle 2 | 2   |     |
| $5: \mathbb{R}5 \leftarrow 0$ $6: \mathbb{R}2 \leftarrow \mathbb{R}4 - 25$   | Cycle 3 | 3   | 4   |
| $7: \left( R3 \leftarrow R7 - 2 \right) \right)$                             | Cycle 4 | 5   | 6   |
| $8: \mathbf{R} 4 \leftarrow \mathbf{R} 1 + \mathbf{R} 3$                     | Cycle 5 | 7   |     |
| 9: $R10 \leftarrow 0$<br>10: $R1 \leftarrow R1 + 30$                         | Cycle 6 | 8   | 10! |
| $10: \mathbf{K} \to \mathbf{K} \to \mathbf{J} \to \mathbf{J} \to \mathbf{J}$ |         |     | LJ  |

 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c3) out-of-order order execution, with renaming;

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 4   |



 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c3) out-of-order order execution, with renaming;

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 4   |
| Cycle 2 | 2   | 5   |



 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c3) out-of-order order execution, with renaming;

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 4   |
| Cycle 2 | 2   | 5   |
| Cycle 3 | 3   | 6   |



 c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c3) out-of-order order execution, with renaming;



|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 4   |
| Cycle 2 | 2   | 5   |
| Cycle 3 | 3   | 6   |
| Cycle 4 | 7   | 9   |



c. Consider a superscalar computer on which the execution of each instruction takes one cycle; the computer has two arithmetic units. Show how instructions are executed in consecutive cycles with:
 c3) out-of-order order execution, with renaming;

| <b>(1:</b>  | <b>R1</b> ← 100                    |
|-------------|------------------------------------|
| 2:          | <b>R5</b> ← <b>R1</b> + <b>R2</b>  |
| 3:          | <b>R7</b> ← <b>R5</b> + 1          |
| <b>4</b> :  | <b>R11</b> ← <b>R2</b> + <b>R4</b> |
| 5:          | <b>R51 ← 0</b>                     |
| 6:          | <b>R21</b> ← <b>R4</b> − <b>25</b> |
| 7:          | R3 ← R7 – 2                        |
| 8:          | R41 ← R11 + R3                     |
| 9:          | <b>R10</b> ← <b>0</b>              |
| <b>10</b> : | R12 ← R11 + 30                     |

|         | ALU | ALU |
|---------|-----|-----|
| Cycle 1 | 1   | 4   |
| Cycle 2 | 2   | 5   |
| Cycle 3 | 3   | 6   |
| Cycle 4 | 7   | 9   |
| Cycle 5 | 8   | 10  |



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.

- a. Calculate the speedup under the aformentioned conditions (relative to execution on a single processor).
- b. Suppose that we are able to effectively use 17 processors rather than 9 on the parallelized portion of the code. Calculate the speedup (relative to execution on a single processor) that is achieved.

An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.

| <br>                           |  |
|--------------------------------|--|
| <br>Contraction and the second |  |
| <br>                           |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |
|                                |  |

An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.

b. Suppose that we are able to effectively use 17 processors rather than 9 on the parallelized portion of the code. Calculate the speedup (relative to execution on a single processor) that is achieved.



An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.

b. Suppose that we are able to effectively use 17 processors rather than 9 on the parallelized portion of the code. Calculate the speedup (relative to execution on a single processor) that is achieved.



0.25\*T\*9

An application program is executed on a nine-processor cluster. The program took time *T* on this cluster. Further, it was found that 25% of T was time in which the application was running simultaneously on all nine processors. The remaining time, the application had to run on a single processor.

b. Suppose that we are able to effectively use 17 processors rather than 9 on the parallelized portion of the code. Calculate the speedup (relative to execution on a single processor) that is achieved.

