Assignment 4: VLIW

Objective
Time allocation
Theoretical background
Assignments
What to Hand-In
Resources

Purpose

The purpose of this assignment is to get partial insight on which are the typical problems when developing a VLIW compiler and which are the factors to consider when deciding on a VLIW architecture.

Time Allocation

4 hours (2 lab sessions) are allocated for this lab.

Theoretical Background

You should review the folowing resources before you start working on this lab:

Notes from the 6th and 7th course lectures

Assignments

A typical compiler takes a program written in some high-level language and translates it to a machine language. As the development of such a compiler is a task beyond the scope of this course, you will perform some tasks of a simpler compiler, that translates a program written in a machine language of a non-VLIW processor P to a program written in the machine language of a VLIW processor V.

The data path of V is characterised by alu_no arithmetic-logic units (ALUs), mul_no multipliers-dividers (MULs), fpu_no floating point units (FPUs), and bau_no Bus Access Units (BAUs).

Consequently, one very long instruction word will consist of alu_no arithmetic/logic instructions of P (the non-VLIW processor), mul_no multiply/divide instructions of P, fpu_no floating point instructions of P, and bau_no load/store instructions of P.

One of the tasks of our imagined VLIW compiler is to pack the instructions of a program written for P in VLIW instructions to be run on V. The packing has to be done considering the data dependencies between the instructions and the resource constrains (for example, we are not allowed to pack more than alu_no arithmetic/logic instructions into one VLIW instruction).

You will

Select one arbitrary benchmark program p from ˜TDTS55/spec95-big subdirectory.
Select one basic block bb from p such that bb contains at least 15 instructions.
Define the architecture of the VLIW processor V by choosing the number of ALUs, alu_no, the number of MULs, mul_no, the number of FPUs, fpu_no, and the number of BAUs, bau_no.
Pack the instructions of bb in VLIW instructions such that the data dependencies and resource constraints are satisfied.
Explore the trade-off between cost (the more units you add to the VLIW, the higher the cost) and performance (the more units you add, the higher the chance that the basic block will run faster).

You will be supported in your work by the ˜TDTS55/bin/vliwc program.

Details:

In order to select the basic block bb of p you run vliwc ˜TDTS55/spec95-big/chosen_benchmark. It will produce a listing, where the basic block index, its start address, its end address and its number of instructions is listed. You can sort the list, for example, in decreasing order of instructions, by running vliwc ˜TDTS55/spec95-big/chosen_benchmark | sort -nr +3 -4 | less. Then you choose a basic block. Write down its index, start address and end address.

For example, I have chosen ˜TDTS55/spec95-big/ijpeg.ss. I run vliwc ˜TDTS55/spec95-big/ijpeg.ss. Then I see that basic block 9512 contains 10 instructions and I choose it. I write down its start address 0x00460480 and its end address 0x004604d0.

In order to see the contents of the basic block, you will have to disassemble the program. You do this by running ssbig-na-sstrix-objdump -d ˜TDTS55/spec95-big/chosen_benchmark | less. In less, if you type / (slash) you will be prompted for a pattern to be searched in the listing of less. Introduce the start address of the chosen basic block. If you do not reach that address from the first match, you type n for the next match, until you reach the start address of the chosen basic block.

For my example, I run ssbig-na-sstrix-objdump -d ˜TDTS55/spec95-big/ijpeg.ss and part of what I get is pasted below:

00460680 <__udivmoddi4+570> mflo $a2[6]
00460688 <__udivmoddi4+578> mfhi $v1[3]
00460690 <__udivmoddi4+580> andi $a3[7],$t1[9],65535
00460698 <__udivmoddi4+588> mult $a2[6],$a3[7]
004606a0 <__udivmoddi4+590> sll $v0[2],$v1[3],0x10
004606a8 <__udivmoddi4+598> mflo $a1[5]
004606b0 <__udivmoddi4+5a0> srl $v1[3],$t3[11],0x10
004606b8 <__udivmoddi4+5a8> or $v1[3],$v0[2],$v1[3]
004606c0 <__udivmoddi4+5b0> sltu $v0[2],$v1[3],$a1[5]
004606c8 <__udivmoddi4+5b8> beq $v0[2],$zero[0],00460710 <__udivmoddi4+600>

In order to pack the instructions, you need to determine the data dependencies. We neglect artificial dependencies (write-after-read (WAR) and write-after-write (WAW)) and pretend that a preprocessing phase eliminated them. You should focus only on the read-after-write (RAW) dependencies.

In my example, instruction 00460698 <__udivmoddi4+588> mult $a2[6],$a3[7] depends on the instructions 00460680 <__udivmoddi4+570> mflo $a2[6] and 00460690 <__udivmoddi4+580> andi $a3[7],$t1[9],65535 because the mult instruction reads from $a2[6] and from $a3[7] while the mflo instruction writes to $a2[6] and the andi instruction writes to $a3[7].

The instructions and their semantics are listed in Appendix A of The SimpleScalar Tool Set, Version 2.0, D. Burger, T. M. Austin.

Note that some of those "instructions" listed in the appendix are just macros which expand to a couple of instructions. For example, div is such a macro. It expands to first checking whether we divide by 0 and only after that it performs the division. If you find an instruction in the output of ssbig-na-sstrix-objdump which is not listed in Appendix A, then please ignore that instruction. Consider it as if it did not depend on any other instruction.

It is relatively easy to detect RAW dependencies between instructions which do not access the memory. It is slightly trickier for load/store instructions because we do not know the target address at compile time as the target address very often is the value of some register. Therefore, in order to ensure the program correctness, a load instruction should be executed only after all the store instructions which precede it in program order have executed. Additionally, a store instruction has to be executed only after all the load and all the store instruction which precede it in program order have executed. Make sure that you capture these aspects in your dependency graph.

For my example, I obtain the following dependency graph:

You can correct your assignment yourself, by using the vliwc program. For that, you'll have to write two files: the graph_file and the vliw_file. The graph_file specifies the graph of data dependencies. The vliw_file specifies the way the instructions are packed in VLIW instructions.

Each line in the graph_file contains one or two instruction addresses separated by blanks (space or tab). If it contains two addresses, then it means that the instruction at the second address depends on the instruction at the first address. If an instruction does not depend on any other instruction, then its address appears alone on a line.

For the dependency graph I obtained and depicted in the figure above, I can write the following dependency graph:

0x00460690 0x00460698
0x00460680 0x00460698
0x00460698 0x004606a8
0x004606a8 0x004606c0
0x004606c0 0x004606c8
0x00460688 0x004606a0
0x004606a0 0x004606b8
0x004606b0 0x004606b8
0x004606b8 0x004606c0

In order to pack the instructions such that they fulfill the resource constraints, label each graph node with the resource it uses (ALU, MUL, FPU, BAU, or none). You can deduce the used resources from the same Appendix The SimpleScalar Tool Set, Version 2.0

For my example, I obtain the following labels:

mflo $a2[6]				none
mfhi $v1[3]				none
andi $a3[7],$t1[9],65535		ALU
mult $a2[6],$a3[7]			MUL
sll $v0[2],$v1[3],0x10			ALU
mflo $a1[5]				none
srl $v1[3],$t3[11],0x10			ALU
or $v1[3],$v0[2],$v1[3]			ALU
sltu $v0[2],$v1[3],$a1[5]		ALU
beq $v0[2],$zero[0],00460710		ALU

You specify your VLIW architecture and instruction packing by means of the vliw_file. The first line in the file consists of 4 integer numbers separated by blanks (space or tab). The four numbers represent alu_no, mul_no, fpu_no, and bau_no in this order. Each of the subsequent lines consists of alu_no + mul_no + fpu_no + bau_no atoms, where an atom can be an instruction address or nop. The first alu_no atoms can be occupied only by instructions labelled with ALU or none. The next mul_no atoms can be occupied only by instructions labelled with MUL or none and so on.

In my example, I chose 2 ALUs, 1 MUL, 1 FPU, 1 and BAU. My vliw_file looks like below:

2                     1          1          1
0x00460690 0x004606b0 0x00460680 0x00460688 nop
0x004606a0 nop        0x00460698 nop        nop
0x004606b8 0x004606a8 nop        nop        nop
0x004606c0 nop        nop        nop        nop
0x004606c8 nop        nop        nop        nop

Once you created the graph_file and the vliw_file you can check your solution by running vliwc -b chosen_basic_block_number -d graph_file -v vliw_file ˜TDTS55/spec95-big/chosen_benchmark. The program will inform you if the dependency graph and the packing are correct, the cost of your solution and it will print out the dependency graph in the xvcg format. You can view your dependency graph by piping the output of vliwc to the input of xvcg.

For my example, I run vliwc -b 9512 -d sorma.graph -v sorma.vliw ˜TDTS55/spec95-big/ijpeg.ss | xvcg -

You will have to try different VLIW architectures (choose different values for alu_no, mul_no, fpu_no, and bau_no) in order to optimise (minimise) the cost-performance indicator.

What to Hand-In

Indicate which benchmark and which basic block you chose. The basic block has to contain at least 15 instructions.
The graph_file and the vliw_file. Please name them in the form login1_login2.graph and login1_login2.vliw. For example, if your group is formed by Erik Johansson the 123rd (erijo123@student.liu.se) and Johan Eriksson the 321st (joher321@student.liu.se) then your files are called erijo123_joher321.graph and erijo123_joher321.vliw.
Some comments on the chosen architecture: which other architectures you've tried and what cost-performance ratios did you obtain.
If vliwc says that everything is fine, then believe it. If it says that something is wrong, but you still think that your solution is correct, then please add an explanation of your solution.
Report any bugs in vliwc

Resources

vliwc, ssbig-na-sstrix-objdump, xvcg are in ˜TDTS55/bin. It's a good idea to add ˜TDTS55/bin to your PATH:
```
setenv PATH /home/TDTS55/bin:"$PATH"
rehash
```
The SimpleScalar Tool Set, Version 2.0, D. Burger, T. M. Austin.