Worst-case Execution Time Analysis and Optimizing the Worst-case

Sudipta Chattopadhyay
(Guest lecture: Advanced Compiler Construction course)

Real-time Applications

- Real-time applications have temporal constraints
  - Hard real-time systems
  - Soft real-time systems
- System-level analysis for hard real-time systems
  - Application contains many tasks/programs
  - Check whether application meets deadline
- Task-level analysis
  - Worst-case execution time (WCET) of a task over all inputs
  - Analysis to obtain WCET

Obtaining WCET

- How do we calculate WCET of a program?
  - Exhaustively enumerating all inputs?
  - Consider an image processing application

Obtaining WCET

- How do we calculate WCET of a program?
  - Exhaustively enumerating all inputs?
    - Until the end of the world :-)
    - Consider an image processing application
  - Systematically generating inputs?
    - Can we give hard guarantees?

Obtaining WCET

- But wait a minute.....
  - But do we need all inputs?
  - Why don't we just test 2 paths and take the maximum of them?

Obtaining WCET

- But wait a minute.....
  - But do we need all inputs?
  - Why don't we just test 2 paths and take the maximum of them?
Obtaining WCET

• But wait a minute.....
  - But do we need all inputs?

![Flowchart]

- Why don’t we just test 2 paths and take the maximum of them?
- Different inputs may access different memory blocks, which might incur different latency to fetch from memory subsystem

Obtaining WCET

• Besides.....
  - What you see is not what you execute (WYSIW getattr. Balakrishnan et al. ACM TOPLAS 2010)

![Flowchart]

Source code
Many possible executions of the binary code

Obtaining WCET

• And.....
  - We still have exponential blow-up

![Flowchart]

Source code with exponential number of paths

Obtaining WCET

• Static analysis
  - Does not require inputs
  - Estimates an upper bound on execution time without executing a program.....
    - Sounds too good to be true?
    - How expensive?
    - How accurate?
    - What are the limitations?

Obtaining WCET

WCET

![Graph]

WCET = Worst-case Execution Time
BCET = Best-case Execution Time

WCET calculation

• Control flow graph (CFG) representation of a program
WCET calculation

- Control flow graph (CFG) representation of a program

Source code
```c
int a, b(10);
a = b[x] + u
```

Basic block
Loop

Is it appropriate?

WCET calculation

- Control flow graph (CFG) representation of a program

Source code
```c
int a, b(10);
a = b[x] + u
```

Basic block
Loop

Is it appropriate? NO

WCET calculation

- Control flow graph (CFG) representation of a program

Source code
```c
int a;
a = x + u
```

Intermediate code
```c
i = b + x
j = i + 1 + u
a = j
```

Binary code
`0001_1...

Compiler transformations / optimizations

Timing is target dependent. Any doubt?

WCET calculation

- Control flow graph (CFG) representation of a program

Source code
```c
int a, b(10);
a = b[x] + u
```

Basic block
Loop

Is it appropriate? YES, but....

Typical WCET Derivation

- Derive WCET of basic blocks (Micro architectural modeling)
- Use WCET of basic blocks to derive the longest path in the CFG (calculation)
WCET Calculation

Find the path that maximizes the sum of Ti
Hmmm, but isn’t this undecidable?

WCET Calculation – not so fast

Longest path with all loops having bounded length
How do we eliminate infeasible paths?

Loop bounds and infeasible paths

Finding loop bounds
- Static analysis
- Simulations (might be unsafe)
- Explicitly provided by programmer

Finding infeasible paths
- Model checking, symbolic execution
- Explicitly provided by programmer

Integrating loop bounds and infeasible path information into WCET calculation

Loop bounds and infeasible paths

Predicate: P1

assert(~P1 V P2)

Predicate: P2

assert(~P1 V ~P2)

Bound values on iter (value analysis)
Flow facts (Ermedahl et al., RTSS 2000)

- A generic way to specify program flow information
  - Loop bounds and infeasible paths
- Flow fact structure
- Derivation of flow facts
- Integration of flow facts into WCET calculation
Flow fact structure

IPET based WCET calculation

- Implicit path enumeration via integer linear programming (ILP)
- Current state-of-the-art in most WCET analyzers
- Still an NP-complete problem, however, efficient solvers exist for ILP, such as CPLEX

IPET based WCET calculation

- Modeling control flows
  - $c = 0$
  - $x \leq 2$
  - $ License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International Answer: License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International

- Modeling loop bounds
  - $c = 0$
  - $x \leq 2$
  - $ License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International Answer: License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International

- Modeling infeasible path
  - $c = 0$
  - $x \leq 2$
  - $ License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International Answer: License: CC BY-NC-SA 4.0 Attribution-NonCommercial-ShareAlike 4.0 International

- Overall modeling
  - Objective: Maximize $T(c=x)N(c=0) + T(x<2)N(x<2)$
  - Subject to: $N(Sx.m1) + N(Sx.m2) = 100$ (loop bound)
  - Similar control flow constraints...
**IPET based WCET calculation**

- Advantages and disadvantages
  - Accuracy
    - Good as long as the constraints are modeled appropriately
  - How difficult is it to integrate flow facts?
    - Most of the flow facts could be integrated as ILP constraints
    - However, approximated sometimes
  - Scalability
    - NP complete, however efficient solvers exist

**Why?**

- Why do not we annotate maximum execution time of each instruction

**Micro-architectural Modeling**

- Need timing models of underlying micro-architecture
  - Caches
  - Pipeline
  - Branch predictors
  - ?

**WCET Calculation**

- Find the path that maximizes the sum of $T_i$ (So far)
- Getting the value of $T_i$ (now)

**Why?**

- Because....

**Modeling caches**

- Why model caches?
  - Caches play a significant role in improving performance
  - Caches in modern processors (ARM, X86) are several magnitudes faster than main memory (DRAM)
  - Not modeling caches does not mean caches don't perform at runtime
    - Imprecision in the analysis
Modeling Instruction Caches

- Consider cache as a matrix

Instruction Cache Modeling

- Cache sets are modeled independently
- Depends on the replacement policy (we consider LRU for now)

Instruction Cache Modeling with AI

- Abstract interpretation (AI)
  - A static analysis framework (already in lectures)
  - Abstract and concrete states of programs
  - We shall talk about cache states within a program
- Predicts cache hits and cache misses
  - AH – always a cache hit (called *must analysis*)
  - AM – always a cache miss (called *may analysis*)
  - NC – unknown (remember it is a static analysis)

Instruction Cache Modeling with AI

- We need to formulate how the cache states are transformed by accessing each memory block (transfer operation in AI)
Instruction Cache Modeling with AI

LRU age

m4
m3
m2
m1

Cache state

We need to formulate how the cache states are transformed by accessing each memory block (transfer operation in AI)

LRU age

m2
m5
m4
m3
m1

Cache state

Memory blocks within a basic block

Instruction Cache Modeling with AI

• How to handle control flow merge points?

Depends on the type of analysis
- Must analysis (to predict definite cache hits)
- May analysis (to predict definite cache misses)

CFG

Must Analysis

• How to handle control flow merge points?

LRU age

m2
m4
m3
m1

Cache state

Set intersection

LRU age

m2
m5
m4
m3
m1

Cache state

LRU age

m5
m2
m4
m3
m1

Cache state

Set intersection

Must join

Must Analysis

• How to handle control flow merge points?

LRU age

m1
m4
m3
m5

Cache state

LRU age

m2
m4
m3
m5

Cache state

LRU age

m5
m2
m4
m3

Cache state

Set intersection + maximum age

Must join

LRU age

T
T
T
m3
(m1,m2)

LRU age

T
T
m3
(m1,m2)

LRU age

T
T
m3

Must join

LRU age

T
T
m3

Abstract state

What are the concrete states?
**Must Analysis**

- Abstract interpretation

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>T</td>
<td>T</td>
<td>T</td>
<td>m3</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>T</td>
<td>T</td>
<td>m5</td>
<td>m1</td>
</tr>
</tbody>
</table>

Abstract state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Abstract state

- In general, any concrete state with
  - Age of m3 ≤ 3
  - Age of m1 ≤ 4
  - Age of m2 ≤ 4

**May Analysis (Predicting cache misses)**

- How to handle control flow merge points?

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>T</td>
<td>T</td>
<td>m5</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

- Set Union
- May join

**Instruction Cache Modeling with AI**

- How to deal with loops?

- Loops are special type of merge points
- Solutions are repeated
- Fixed point is obtained on the cache states over all points
- Not loop unrolling

**Categorizing hits and misses**

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

LRU age

<table>
<thead>
<tr>
<th>m2</th>
<th>m4</th>
<th>m3</th>
<th>m1</th>
</tr>
</thead>
<tbody>
<tr>
<td>m2</td>
<td>m4</td>
<td>m3</td>
<td>m1</td>
</tr>
</tbody>
</table>

Cache state

- m3 is an "all-hit"
- m3 is an "all-miss"
Data Cache Modeling

- What is the main difference with instruction cache modeling?
  - Instruction address is program counter value
  - Data address need to be tracked

Data Cache Modeling

- Difficulties
  - Needs separate analysis to know the set of addresses tracked — value analysis
  - Value analysis at binary code is difficult
    - Absence of variable information
    - Register values can be faithfully tracked
    - But need a model of memory, to capture the value

Data cache modeling

- r1 ← r2+r3
- r5 ← Load (r1 + 100)
- r4 ← r5 + 6
- r7 ← Load (r4 + 1)
  //requires value at memory location r1+100

Data Cache Modeling

- Difficulties
  - Not all addresses are accessed in all iterations
  - Instruction addresses do not change based on loop iterations

Data Cache Modeling

- Difficulties
  - Modeling write policies
    - Write through
    - Write back

Data Cache Modeling

- Difficulties
  - Value analysis may compute an over-approximation on the set of addresses accessed by each load/store

Data Cache Modeling (Summary)

- Accurate data cache modeling is a very difficult problem
  - Most results so far are highly pessimistic
  - Difficulty due to address analysis at binary level
  - Difficulty in distinguishing memory blocks at different loop iterations
  - Difficulty in modeling write-back policies
  - Data caches with non-LRU replacement policy is almost unexplored
- A different solution is to avoid data cache and use software controlled scratchpad memory (next lecture)

Pipeline delay computation

- Instruction cache
  - Accessed at “IF stage” of the pipeline
- Instruction All-hit (AH)
  - $T(\text{IF}) = T(\text{IF}) + [1,1]$ 
- Instruction All-miss (AM)
  - $T(\text{IF}) = T(\text{IF}) + [100,100] /100$ cycle miss penalty
- Instruction unclassified (NC)
  - $T(\text{IF}) = T(\text{IF}) + [100,100]$? NO

Integration of Cache and Pipeline Model

- Instruction cache
  - Accessed at “IF stage” of the pipeline
- Instruction All-hit (AH)
  - $T(\text{IF}) = T(\text{IF}) + [1,1]$ 
- Instruction All-miss (AM)
  - $T(\text{IF}) = T(\text{IF}) + [100,100] /100$ cycle miss penalty
- Instruction unclassified (NC)
  - $T(\text{IF}) = T(\text{IF}) + [100,100]$? NO

Timing Anomaly

<table>
<thead>
<tr>
<th>Stage</th>
<th>ID</th>
<th>EX</th>
<th>WB</th>
<th>CM</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td></td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
</tbody>
</table>

Timing Anomaly

<table>
<thead>
<tr>
<th>Stage</th>
<th>ID</th>
<th>EX</th>
<th>WB</th>
<th>CM</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
</tbody>
</table>

Timing Anomaly

<table>
<thead>
<tr>
<th>Stage</th>
<th>ID</th>
<th>EX</th>
<th>WB</th>
<th>CM</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
</tbody>
</table>

Timing Anomaly

<table>
<thead>
<tr>
<th>Stage</th>
<th>ID</th>
<th>EX</th>
<th>WB</th>
<th>CM</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
</tbody>
</table>

Timing Anomaly

<table>
<thead>
<tr>
<th>Stage</th>
<th>ID</th>
<th>EX</th>
<th>WB</th>
<th>CM</th>
</tr>
</thead>
<tbody>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
<tr>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>WB</td>
<td>CM</td>
</tr>
</tbody>
</table>
1. **Dependency among analyses**

- Analyses of different micro-architectural entities are not independent.
- Compositional analyses
- At root of accuracy

2. **Current Trend of Research**

- A fairly active research area
  - WCET analysis in multi-core era
    - Resource sharing
  - WCET analysis for multi-threaded programs
    - Synchronizations
    - Coherence
  - Modeling timing delays for multi-tasking systems
    - Effect of preemption
    - Effect due to the presence of supervisory software (operating systems)
  - Systematic input generation to obtain near worst-case performance
    - Soft real-time applications
    - Eliminates the need to statically model hardware
  - ...

3. **Flow facts in Compilation**

- Flow facts are easier to derive on source/intermediate code
  - Variable information is present

- Programmers are more likely to specify constraints on source code
  - Annotating object code is cumbersome

- Compiler transformations may destroy flow facts....

4. **Recall flow facts**

- Flow fact derivation (derived primarily on same intermediate code)
- Target independent
- Compiler transformations / optimizations
- Timing information (micro-architectural modeling)

5. **Flow Facts**

- Scope 1 (source code)
- Loop bound: Scope 1: $8 \cdot XH <= 100$
  - (object code: loop is unrolled 3 times)
Flow Facts

- Scope 1:
  - $x < 10$
  - $y < 10$
  - $z < 10$
- Scope 1.1:
  - $x = y + z$
- Scope 1.2:
  - $x = y + z$
- Scope 1.3:
  - $x = y + z$

Loop bound: Scope 1: $x < 10$

$\leq 57$ for each iteration

Other Code Annotations

- Some code annotations could be provided to capture the range of values
  - e.g. `Variable "c" takes values between 1 and 10`
    ```
    func(c) {
        ... = a[c]; // access address a[1,10]
    }
    ```
  - Value annotations may greatly simplify data cache modeling!!
- However, error-prone and labor intensive to perform such annotations in object code

Code Annotations for WCET Analysis

- Flow facts
  - Loop bounds
  - Infeasible paths
- Range of values for variables
  - Helps in data cache modeling
  - Finding infeasible paths
- Addresses of indirect jumps
  - Difficult to obtain at binary level
  - Used for CFG reconstruction
  - Examples: function pointers

Compilation and Code Annotations

- Why a compiler can help?
  - Compiler knows different levels of code representations
    - Source code
    - Intermediate code
    - Object code
  - Compiler knows which transformations are applied
  - A compiler can translate annotations after each transformation
  - In practice, most code annotations can be translated by a compiler

Flow Fact Translation

Source code
```
int a;
a = x + u;
```

Intermediate code
```
t = b + x;
t_1 = t + u;
a_1 = t;
```

Binary code
```
0001_1...
```

Optimizing the worst-case

- WCET analysis can benefit from a compiler
  - Translating code annotations
- A compiler can optimize the worst-case performance of a program
  - Improves the WCET bound
  - Traditional compiler optimizations aim to optimize average-case performance
Worst-case Optimizations

- Fundamental differences with other optimizations

```
// Some C code
if (i < k) {
    // do some work
}
```

- Worst-case optimizations aim to improve WCET bound
  - Common optimizations
    - Scratchpad allocation
    - Cache locking
    - Procedure positioning
    - Procedure cloning

- Fundamental differences with other optimizations
  - Sometimes, does not really improve performance
  - However, makes the WCET analysis simpler and the analyzer more accurate

```
// More C code
```
Worst-case Optimizations

```c
f(A, B, C)
{
  A[i] = B[i] + C[i]
}
```

```
\text{Typically, from the address, it can be determined whether the memory block is fetched from scratchpad.}
```

Scratchpad Memory

- What is it?
  - A software controlled memory (typically, SRAM)
  - Visible to the application programmer
  - Mapped in the same virtual address space of the processor
- Many names
  - Local memory in Cell processor
  - Shared memory in GPGPU

Why Scratchpad?

- Software controlled
  - Every memory access is predictable
  - Improves accuracy of WCET analysis
- A promising approach to replace data cache in embedded systems
  - Practically eliminates the need of complex value/address analysis
- Energy efficient compared to caches
  - Does not require tag comparison logic. \textit{WHY?}

Why not scratchpad?

- Programmers may not use scratchpad efficiently
  - Puts burden on the programmer
  - Error-prone if used manually
- Requires compiler support
  - To allocate appropriate blocks in scratchpad
  - Average case performance optimization
  - Worst case performance optimization

Scratchpad Allocation

- For average case optimization
  - Typically done via profiling
  - From profiling
    - frequency of memory accesses
    - the expected gain to allocate each memory block in scratchpad
  - An optimization problem
    - Minimize execution time
    - With constraints on scratchpad size
**Scratchpad Allocation**

- **Worst-case optimization**
  - Complex, requires input from WCET analysis
  - The worst-case path can be obtained from a WCET analyzer
  - Not very useful to perform on a single worst-case path
    - Worst-case path switching

- **Greedy strategy**
  - Perform WCET analysis
  - Obtain worst-case path
  - Find the variable "v" facing maximum memory latency
  - Allocate "v" into scratchpad
  - Repeat WCET analysis and repeat the process
- **Advantages**
  - Simple and efficient implementation possible
  - Works good in practice

- **WCET calculation based on IPET**
  - State-of-the-art
  - An ILP-based optimization problem which can be integrated with IPET formulation

**Scratchpad Allocation**

- **Greedy strategy**
  - Not optimal, may not obtain the best WCET
  - Fits well with path-based WCET calculation
    - How about flow facts?
      - If flow facts are not considered, the optimization may concentrate on an infeasible path
        - Optimization effort is lost
Scratchpad Allocation

- Greedy allocation
  - Fast, but not optimal
  - Works well in practice
  - Relatively difficult to integrate flow facts
- ILP based model
  - Optimal
  - Integrates well with IPET based WCET calculation
  - Easy to integrate flow facts
  - Relatively slow, compiler optimizations need to be fast

Scratchpad Allocation

The discussed allocation does not consider overlay

- H2 and H can share scratchpad space
  - Similar to register allocation in compiler
  - Variable live range can be used to improve static scratchpad allocation
  - H2 is not accessed

Scratchpad Allocation

W(H1) = worst case cast for the acyclic path starting at H1 and ending at G (after allocation)

- X(H1)=1, if H1 is allocated to scratchpad
- X(H1)=0, otherwise

ILP constraint:

\[ W(H1) = W(Scope1) + (T(H1) - X(H1) \cdot Gain(H1)) \]

\[ = W(H2) + (T(H1) - X(H1) \cdot Gain(H1)) \]

Final objective:

\[ \text{Minimize } W(H1) \cdot E(H1) \]

Scope 2: \( XH1 \leq N \)

Scratchpad Allocation

- We have discussed static allocations
  - Allocations decisions are fixed at compile time
  - Allocation decisions do not change at runtime
- Most programs have phases
  - Spatial and temporal locality
  - Automatically handled by caches
- Dynamic scratchpad allocation
  - Allocations may change at runtime

Scratchpad Allocation

- Scratchpads are usually small
- Assume that the scratchpad can hold variables of only one inner loop
- Static allocation will not be very useful as one of the loops will face larger memory latencies
- How about caches?
**Scratchpad Allocation**

- Program phases
  - S:
  - H1:
  - A:
  - S1:
  - H2:
  - S2:
  - E:

Change allocation decision
Dynamic allocation
- Allocate H1, A, B, S1
- Flush Scratchpad
- Allocate H2, D, E, S2
- Flush Scratchpad

Require more analysis to decide appropriate points where allocations will be switched.

**Register Allocation**

- One of the most important compiler optimizations
  - Extensively studied in the past
  - Important, potentially in the past execution platforms
  - Register accesses substantially reduce overall memory access penalty

- Limited registers
  - Not all variables can be allocated in registers
  - Which variables should be allocated registers?

**Register Allocation**

- Classic algorithms
  - Based on graph coloring
    - Chaitin et al.
    - Briggs et al.

(load b,f), s1
add s1, s4, s2
store s2, 4(f)
sub s1, s2, s3
store s3, 12(f)
mul s1, s2, s4
store s4, 8(f)

Interference Graph with virtual registers

**Register Allocation**

- Spill decision lacks appropriate timing model
  - Current register allocation strategies may substantially affect the worst-case performance

- WCET analysis is performed on executable
  - Code with virtual registers is not executable
  - And, WCET estimates decide the virtual to physical register mapping
  - Cyclic dependency

- Helke Folk, WCET-aware Register Allocation based on Graph Coloring, DAC 2009

**Register Allocation**

- Compiler
  - Register allocation
  - WCET estimates
  - Binary

Cyclic dependency
WCET analyzer

- Take a pessimistic approach
- Initially assume all virtual registers are kept in memory (traditional register allocators are usually optimistic)
- During register allocations, virtual registers are moved to physical registers
- Intermediate code is always executable, therefore WCET analysis can be performed
Register Allocation

- Find the worst-case execution path (WCEP)
- Find basic block "b" leading to the highest execution of spill code along the WCEP
- "b" is chosen to be allocated in registers
- Sort all virtual registers in "b" by their number of occurrences
- Potential spill is the register with least occurrences
- Run WCET analysis again with the modified code

WCC Compiler

- WCET-aware C Compiler
  - Developed and maintained at TU, Dortmund
  - Extends traditional compiler passes to integrate a WCET analyzer
  - Specification of memory hierarchies to optimize memory performance
    - Scratchpad allocation
    - Cache locking
  - WCET-aware code generation
    - Preserving flow facts
    - Back annotations (timing information obtained from binary/level mapped to intermediate level)
  - WCET-aware compiler optimizations
WCET Analyzers

- Commercial
  - aif
    - http://www.ai4int.com/aif (WCET analyzer based on static analysis)
  - RapTime
    - http://www.raptime.com/products/raptime (measurement-based WCET analysis tool)

- Academic research
  - SWET (Maastricht University)
  - Chronus (National University of Singapore)
  - Ottawa (RIT, University of Toulouse, France)

Fun Stuff

- Using the same abstract domain of LRU instruction cache analysis, can you design an abstract interpretation-based framework for FIFO caches? In particular, try to formulate the update (after visiting each memory block) and join operations (at control flow merge point) for must analysis and reason about their correctness.

Lecture Materials

- Jakob Engblom and Andreas Ermedahl, Modeling Complex Flows for Worst Case Execution Time Analysis. RTSS 2000 (Flow fact/language)
- Vy Shernho, Takika Mitra, Abhinoy Roychoudhury, Ting Chen: WCET Centric Data Allocation to Scratchpad Memory. RTSS 2005 (Scratchpad allocation for WCET optimizations)

Questions

Thank you