Lab Assignment 2: VLIW Processors

Objective
Theoretical background
Assignments
Details
What to report
Resources

Objective

The purpose of this assignment is to get partial insight on which are the typical problems when developing a VLIW compiler and which are the factors to consider when deciding on a VLIW architecture.

Preparations

For this lab assignment you will need to use the SimpleScalar toolset. Instructions for setting up the toolset are given in the following link.

Assignments

A typical compiler takes a program written in some high-level language, and translates it to a machine language. As the development of such a compiler is a task beyond the scope of this course, you will perform some tasks of a simpler compiler, that translates a program written in a machine language of a non-VLIW processor, denoted with P, to a program written in the machine language of a VLIW processor, denoted with V.

The data path of V is characterised by the following units.

alu_no arithmetic-logic units (ALUs)
mul_no multipliers-dividers (MULs)
fpu_no floating point units (FPUs)
bau_no Bus Access Units (BAUs)

Consequently, one very long instruction word will consist of the following instructions of P (the non-VLIW processor).

alu_no arithmetic/logic instructions of P
mul_no multiply/divide instructions of P
fpu_no floating-point instructions of P
bau_no load/store instructions of P

One of the tasks of our imagined VLIW compiler is to pack the instructions of a program written for P into VLIW instructions to be run on V. The packing has to be done considering the data dependencies between the instructions and the resource constrains (e.g. we are not allowed to pack more than alu_no arithmetic/logic instructions into one VLIW instruction).

Here are the assignments you should work on:

Select 1 arbitrary benchmark program, denoted with p, from the ˜TDTS10/spec95-big subdirectory.
Select 1 basic block, denoted with bb, from p, such that bb contains at least 15 instructions.
A help on the delimitation of basic blocks can be found on this web page.
Define the architecture of the VLIW processor V by choosing the following numbers of units.
- the number of ALUs, denoted with alu_no
- the number of MULs, denoted with mul_no
- the number of FPUs, denoted with fpu_no
- the number of BAUs, denoted with bau_no
Pack the instructions of bb into VLIW instructions, such that the data dependencies and resource constraints are satisfied.
Explore the trade-off between cost (the more units you add to the VLIW, the higher the cost) and performance (the more units you add, the higher the chance that the basic block will run faster).

ATTENTION: You will be supported by using the ˜TDTS10/bin/vliwc program for the assignments.

Details

In order to select a basic block of a benchmark program, issue the following command line.

vliwc ˜TDTS10/spec95-big/<chosen_benchmark>

where <chosen_benchmark> is the name of the chosen benchmark program.

A list will be produced, It will produce a list of information of basic blocks, including

the index number of a basic block
the start address of the basic block
the end address of the basic block
the number of instructions contained in the basic block

You can sort the list, for example, in decreasing order of the instruction indices, using the following command line.

vliwc ˜TDTS10/spec95-big/<chosen_benchmark> | sort -nr +3 -4 | less

where <chosen_benchmark> is the name of the chosen benchmark program. Then you can choose an arbitrary basic block from the list. Remember to write down its index, start address and end address.

For example, you can choose the ˜TDTS10/spec95-big/ijpeg.ss benchmark program. Then you should issue the following command line to get the list of basic blocks of the benchmark program.

vliwc ˜TDTS10/spec95-big/ijpeg.ss

From the list, you can see that basic block 9512 contains 10 instructions, The basic block starts from address 0x00460480 and ends at address 0x004604d0.

In order to see the contents of the basic block, you will have to disassemble the benchmark program. This can be done by using the following command line.

ssbig-na-sstrix-objdump -d ˜TDTS10/spec95-big/<chosen_benchmark> | less

where <chosen_benchmark> is the name of the chosen benchmark program. In less, if you type / (slash), you will be prompted for a pattern to be searched in the listing of less. While being prompted, you should type in the start address of the chosen basic block. If you do not reach that address from the first match, you can type n for the next match, until you reach the start address of the chosen basic block.

For example, you can issue the following command line to obtain the basic block starting from address 0x00460480.

ssbig-na-sstrix-objdump -d ˜TDTS10/spec95-big/ijpeg.ss

Here is a part of the obtained information.

00460680 <__udivmoddi4+570> mflo $a2[6]
00460688 <__udivmoddi4+578> mfhi $v1[3]
00460690 <__udivmoddi4+580> andi $a3[7],$t1[9],65535
00460698 <__udivmoddi4+588> mult $a2[6],$a3[7]
004606a0 <__udivmoddi4+590> sll $v0[2],$v1[3],0x10
004606a8 <__udivmoddi4+598> mflo $a1[5]
004606b0 <__udivmoddi4+5a0> srl $v1[3],$t3[11],0x10
004606b8 <__udivmoddi4+5a8> or $v1[3],$v0[2],$v1[3]
004606c0 <__udivmoddi4+5b0> sltu $v0[2],$v1[3],$a1[5]
004606c8 <__udivmoddi4+5b8> beq $v0[2],$zero[0],00460710 <__udivmoddi4+600>

In order to pack the instructions, you need to determine the data dependencies. We neglect artificial dependencies, namely write-after-read (WAR) and write-after-write (WAW) dependencies, as we assume that a preprocessing can eliminate them. Therefore, you should focus ONLY on the read-after-write (RAW) dependencies. You can find descriptions on data dependencies from this web page.

In the presented example, instruction 00460698 <__udivmoddi4+588> mult $a2[6],$a3[7] depends on the instructions 00460680 <__udivmoddi4+570> mflo $a2[6] and 00460690 <__udivmoddi4+580> andi $a3[7],$t1[9],65535. This is because the mult instruction reads from $a2[6] and from $a3[7], while the mflo instruction writes to $a2[6] and the andi instruction writes to $a3[7].

You can find the instructions and their semantics in the following resources.

Appendix A of The SimpleScalar Tool Set.

Note that some of those "instructions" listed in the appendix are just macros which expand to a couple of instructions. For example, div is such a macro, which first checks if divided-by-zero occurs, and only if not, performs the division thereafter.
ATTENTION: If you find an instruction in the output of ssbig-na-sstrix-objdump but it is not listed in the appendix, please ignore that instruction and consider it as if it did not depend on any other instructions.

It is relatively easy to detect RAW dependencies between instructions which do NOT access the memory.
It is slightly trickier to detect RAW dependencies between load/store instructions which access the memory, because, at compile time, we do not know the target address, which is very often the value of a register. Therefore, in order to ensure the program correctness, you have to capture the following two types of RAW dependencies between load/store instructions.

A load instruction should be executed only after all the store instructions which precede it in the program order have been executed.
A store instruction has to be executed only after all the load and store instructions which precede it in the program order have been executed.

ATTENTION: Make sure that you capture all the RAW dependencies in the dependency graph.

For the presented example, you can obtain the following dependency graph.

You can correct your assignment yourself, by using the vliwc program. For that, you'll have to write two files, the <graph_file> and the <vliw_file>, described as follows.

The <graph_file> specifies the graph of data dependencies.
The <vliw_file> specifies the way the instructions are packed into VLIW instructions.

Each line in the <graph_file> contains one or two instruction addresses separated by blanks (space or tab). If a line contains two addresses, this means that the instruction at the second address depends on the instruction at the first address. If an instruction does NOT depend on any other instructions, then its address appears alone on a line.

The dependency graph depicted in the above figure can be described as lines of the <graph_file> as follows.

0x00460690 0x00460698
0x00460680 0x00460698
0x00460698 0x004606a8
0x004606a8 0x004606c0
0x004606c0 0x004606c8
0x00460688 0x004606a0
0x004606a0 0x004606b8
0x004606b0 0x004606b8
0x004606b8 0x004606c0

In order to pack the instructions such that they fulfill the resource constraints, you should label each graph node with the type of resources it uses, such as ALU, MUL, FPU, BAU, or none. You can deduce the used types of resources from the following resources.

Appendix A of The SimpleScalar Tool Set.

For the presented example, you can obtain the following labels:

mflo $a2[6]				none
mfhi $v1[3]				none
andi $a3[7],$t1[9],65535		ALU
mult $a2[6],$a3[7]			MUL
sll $v0[2],$v1[3],0x10			ALU
mflo $a1[5]				none
srl $v1[3],$t3[11],0x10			ALU
or $v1[3],$v0[2],$v1[3]			ALU
sltu $v0[2],$v1[3],$a1[5]		ALU
beq $v0[2],$zero[0],00460710		ALU

You can specify your VLIW architecture and instruction packing by means of the <vliw_file>. The first line in the <vliw_file> consists of 4 integer numbers separated by blanks (space or tab). The 4 numbers represent the alu_no, mul_no, fpu_no, and bau_no in this order. Each of the subsequent lines consists of alu_no + mul_no + fpu_no + bau_no atoms. Each atom can be the address of an instruction or a nop. The first alu_no atoms can be occupied only by the addresses of the instructions labelled with ALU or none. The next mul_no atoms can be occupied only by the addresses of the instructions labelled with MUL or none, and so on.

In the presented example, you may chose 2 ALUs, 1 MUL, 1 FPU, 1 BAU. The <vliw_file> looks as follows.

2                     1          1          1
0x00460690 0x004606b0 0x00460680 0x00460688 nop
0x004606a0 nop        0x00460698 nop        nop
0x004606b8 0x004606a8 nop        nop        nop
0x004606c0 nop        nop        nop        nop
0x004606c8 nop        nop        nop        nop

Once you have created the <graph_file> and the <vliw_file>, you can check your solution by running the vliwc program using the following command line.

vliwc -b <chosen_basic_block_number> 
	-d <graph_file> -v <vliw_file> 
	˜TDTS10/spec95-big/<chosen_benchmark>

where <chosen_basic_block_index> is the index number of the chosen basic block, <graph_file> is the name of the graph file, <vliw_file> is the name of the VLIW file, and <chosen_benchmark> is the name of the chosen benchmark program. The vliwc program will inform you whether the dependency graph as well as the packing of instructions are correct. It will also give the cost of your solution and print out the dependency graph in the xvcg format. You can view your dependency graph by piping the output of vliwc to the input of the xvcg command. For the presented example, you can issue the following command line.

vliwc -b 9512 -d example_ijpeg.graph -v example_ijpeg.vliw ˜TDTS10/spec95-big/ijpeg.ss | xvcg -

You need to explore different VLIW architectures (choose different values for alu_no, mul_no, fpu_no, and bau_no) in order to optimise (minimise) the cost-performance indicator.

What to report

Indicate the name of the benchmark program and the index number of the basic block you chose. The basic block has to contain AT LEAST 15 instructions.
The <graph_file> and the <vliw_file>.
Some comments on the chosen architecture: which other architectures you have tried and what cost-performance ratios you have obtained.
If vliwc says that everything is fine, then believe it. If it says that something is wrong, but you still think that your solution is correct, then please add an explanation of your solution.
Report any bugs in the vliwc program

Resources

vliwc, ssbig-na-sstrix-objdump, and xvcg located in the directory ˜TDTS10/bin.
D. Burger, and T. M. Austin, The SimpleScalar Tool Set, Version 2.0, 1997.
T. M. Austin, A User's and Hacker's Guide to SimpleScalar Architectural Tool Set, 1997.
William Stallings, Computer Organization and Architecture, 7th edition, Prentice Hall International, 2006.

Page responsible: Zebo Peng
Last updated: 2012-11-20

IDA - Department of Computer and Information Science