Andrzej Bednarski
PELAB
Linköping University
andbe@ida.liu.se
Digital Signal Processors (DSPs) became in the last decades the processor of choice for embedded systems in the public market. The high volume market imposes requirements for high performance of the system at low cost. In order to achieve high code quality, developers still write applications in assembly language. This is time consuming, and maintenance and updating are difficult. Traditional compiler optimizations for high-level languages still produce poor code for DSPs and thus do not meet the requirements [15]. On the one hand, the compiler researchers focused for a long time on general purpose processors. Thus, DSP irregularities, such as dual memory banks and non-homogeneous register sets, have been ignored in the traditional compiler optimization phases. On the other hand, code generation itself is a complex task.
The remainder of this paper is organized as follows. In Section 2, we introduce the basics of code generation and discuss general problems. Section 3 presents a fully integrated code generation method based on integer programming. Section 4 reviews related work, including code generation techniques based on heuristics and various strategies for optimal code generation. Section 5 concludes the paper.
First the program, written in a high level language, is parsed and analyzed by the front-end, which produces a certain Intermediate Representation (IR). Optimization and code generation are later phases which work on the IR.
In most commercial compilers, code generation is a phase based process of the back-end of the compiler. Each phase is performed independently. From the software engineering point of view, this approach is easier, even if it leads to sub-optimal code. In fact, there are strong interdependencies between different phases, known as phase coupling problem.
Code generation consists in mapping the IR of a program to an optimal or almost optimal target machine code, usually machine assembler language. This task is subdivided into three subtasks:
In the rest of the article, a phase means a subtask of code generation where different subtasks are performed consecutively in a special, predefined order.
In the following, we identify several general problems and difficulties in code generation and the weaknesses of phase-based code generation.
All major subtasks in the code generation process have been found to be NP-complete optimization problems. Optimal instruction scheduling for basic blocks 1 is a NP-hard problem, as well for optimizing execution time as for minimizing the number of registers used. Global register allocation is isomorphic to the problem of coloring a life range interference graph with a minimal number of colors, which is NP-complete. Code selection for basic blocks where the data dependences form a Directed Acyclic Graph (DAG), is known to be NP-complete.
The phase coupling problem results from the phase based approach of a compiler. Decisions made in one phase constraint the phases that follow. Thus, an early decision in a previous phase prevents the current phase from producing an optimal or nearly optimal code.
We give here an example (adapted from [14]) that illustrates the interdependence between instruction selection and instruction scheduling. Let us consider an architecture with three functional units: an adder, a multiplier, and a shifter. We assume that one instruction can be issued per clock cycle. For selecting an instruction for the following integer computation
three obvious possibilities are listed in Figure 1. Each of them uses a different functional unit.
![]() |
The code selection phase, if performed first, would choose the operation with
the least occupation time. In our example, the left-shift would be selected,
associating the shifter functional unit with this multiplication operation.
However, in the subsequent scheduling phase, it might be the case that this
left-shift instruction should be scheduled at, say, time
, but the
shift unit is already allocated for computing an other operation at that time.
The left part of Figure 2 shows the resulting unit occupation
schema if performing the multiplication with the shifter; the right part gives
the schedule if using the adder unit instead, which is shorter by one cycle.
First we informally introduce some terms. The formal definitions are given in [11].
We are focussing on scheduling basic blocks where the data dependencies among the instructions form a directed acyclic graph (DAG).
A list schedule, or simply schedule, of the basic block (DAG) is a permutation of the instructions, i.e. the DAG nodes, that is compliant with the partial order defined by the edges of the DAG. A register allocation for a given schedule is a mapping from the DAG nodes to physical registers such that the values represented by the nodes are not overwritten before their last use. For a particular register allocation, its register need is defined as the maximum number of registers that are in use at the same time. A register allocation is optimal for a given schedule if its register need is not higher than that of any other register allocation for that schedule. That register need is referred to as the register need for the given schedule. An optimal register allocation for a given schedule can be computed in linear time in a straightforward way [6].
A space-optimal schedule is a schedule that uses no more registers than any other possible schedule of the DAG.
The unit occupation time of a functional unit is the number of clock cycles that the functional unit is occupied with executing an instruction before a new instruction can be issued to that unit. The latency of a unit is the number of clock cycles taken before the result is available. The execution time of a schedule represents the number of clock cycles required by the execution of the instructions in the order determined by the schedule. Similarly to the definition of a space-optimal schedule, a time-optimal schedule is a schedule that takes not more time that any other schedule of the DAG.
A time-profile represents the occupation status of all units at a given time. It specifies for each functional unit the sequence of instructions that are currently under execution in that unit but not yet completed. In other words, the time-profile gives just that most recent part of a (partial) schedule that may still influence the time slot for the next instruction to be scheduled.
In a first step we focus on instruction scheduling within a basic block. Later on, this will be generalized to global scheduling.
We start with a naive approach that consists in the exhaustive enumeration of all possibilities for topological sorting of the DAG. Topological sorting maintains a set of nodes with indegree zero, which is initialized to the set of DAG leaves. The topological sorting algorithm repeatedly selects a DAG node from the zero-indegree set, appends it to the current schedule, removes it from the DAG and updates the zero-indegree set accordingly (see Figure 3). This process is continued until all DAG nodes have been scheduled. A recursive enumeration of all possibilities for selecting the next node from a zero-indegree list generates all possible schedules of the DAG. Unfortunately, in the worst case, the run time of the enumeration is exponential in the number of nodes of the DAG. Thus, this method cannot be applied for basic blocks larger than 10 to 15 instructions (the number of schedules depends on the structure of the DAG).
![]() |
However, the exhaustive enumeration of topological sorts implicitly builds a tree-like representation of all schedules of the DAG, called the selection tree (see Figure 4). Each node of the selection tree corresponds to an instance of a zero-indegree set of DAG nodes that occurs during topological sorting. A node in the selection tree is connected by directed edges to the selection nodes corresponding to all zero-indegree set instances that are produced by performing one step in the topological sort algorithm.
By looking closer to the selection tree, we can see that several
instances of the same zero-indegree set may occur.
Therefore, we aim at transforming the
selection tree into a selection DAG by merging all multiple instances of the
same
zero-indegree set into a single selection node, which results in a selection
DAG (see Figure 5).
For this purpose, we try to group schedules that are
comparable in a single selection node and keep only one of these for further
enumeration.
If optimizing for minimum register space,
two selection nodes are comparable if
the same set of nodes of the DAG has been scheduled, as
the same set of scheduled DAG nodes resides in registers.
Additionally,
if optimizing the execution time, two schedules must have the same
time-profile
in order to be comparable.
Keßler [10,11] proved that for determining
a space-optimal schedule, it is sufficient to keep just one locally
space-optimal schedule per selection node.
For determining a time-optimal schedule, we must consider the
combination of zero-indegree sets and time-profiles, as it is not sufficient
here to judge the time-optimality of a partial schedule only by its execution
time if it is to be used as a prefix schedule in subsequent steps of the
enumeration process.
Therefore, we extend the definition of a selection node to an extended
selection node that consists of a zero-indegree set, a time profile and the
reference time of the time profile.
Thus, it is sufficient if we
keep, in an extended selection node among all schedules with equal
time-profiles, one schedule with the shortest execution
time. The Figure 6 illustrates the resulting extended selection
DAG for the example of Figure 4, applied to a single-issue target
processor with two functional units, one with latency of 1, and the second with
latency of 2. Hence, the time profiles have a single entry. Node
and
are to be exectued on unit with latency 2, the other on the unit with latency
1.
An important optimization of this algorithm consists in the structuring of the space of all (extended) selection nodes as a grid, with one dimension for the number of instructions scheduled, one for the register space consumed, and one for the elapsed execution time of the schedule stored in that (extended) selection node. We construct the space of selection nodes in an order that is (1) compliant with the precedence relations among the selection nodes, and (2) most favorable with respect to the desired optimization goal, so that the most promising selection nodes are considered first.
Figure 7 represents our current framework. The front-end delivers DAGs of basic blocks that are the input for the optimizer.
Keßler's [11] current implementation of simultaneous optimization of register space and execution time does not yet use the time-profile extension in the selection nodes. It simply keeps all possibly relevant schedules, and thus the prototype runs fast out of space for DAGs with more than 20 instructions.
We are currently implementing a new prototype that works with time profiles, and expect a significant improvement in the size of the DAGs that can be handled optimally. We use the LEDA library [12] for the implementation of the most important data structures, such as DAGs, selection node lists, and zero-indegree sets. The new prototype is intended to become the technical basis for a future system for integrated code generation.
The current implementation integrates instruction selection and instruction scheduling.
Thus, the space of extended selection nodes is structured as a two dimensional grid of
exteneded selection nodes list
, with one dimension for the number
of IR
operations scheduled (i.e., the length of the schedule), and one dimension for
the execution time
.
contains only those extended selection nodes in level
whose associated partial schedules have execution time
.
Let
be the maximum latency. Appending a node
to an existing schedule
will increase the execution time of any target schedule
derived from
by at
most
cycles. Hence, if we perform a selection starting from an extended selection
node in
, the resulting extended selection node will be stored in exactly one
of
.
For each extended selection node we store the schedule
as an attribute of
the node.
The pseudo code of the currently implemented algorithm is given in the Figure 8.
![]() |
First, we plan to implement the time optimization algorithm using the time-profile extension for the selection nodes as discussed above. Incrementally, we plan to add register allocation and instruction selection. Currently we still target architectures with a homogeneous register set but aim at generalizing this in the near future. For that purpose, the time-profiles will be extended to time-space-profiles [11], and we will finally get a complete integrated system.
Further, to be able to retarget our system to different architectures, we will develop or adopt a hardware description language. Finally, the optimizations should be extended from the basic block scheduling scope to global scheduling. The time-space-profiles appear to be a good description for the interconnection of subsequent basic blocks in a control flow graph.
Table 1 classifies heuristic methods and optimal methods into isolated and integrated approaches. This classification is the basis for the following review of related work on code generation.
In the literature, various heuristic methods for code generation are proposed.
For instruction selection, methods are generally based on tree pattern matching and dynamic programming. The tree pattern matching is performed on IR expression trees, where the nodes are operators and the leaves are operands. For short, a tree pattern is an expression that describes a part of an IR-tree. The tree patterns correspond to an equivalent sequence of one or more target instructions. Also, each tree pattern has an associated cost. The IR-tree is said to be covered if all nodes in the tree are part of exactly one matching tree pattern. Then, a coverage with least cost is chosen, which selects the set of target instructions. This technique [7] is still employed in modern compiler frameworks.
Graph coloring methods are still the state of art for global register
allocation. In order to
allocate registers using the graph coloring method, the register allocation phase
is usually performed after instruction scheduling, because the life range of
the program values have to
be known in order to be able to build the life range interference graph.
In the interference graph, nodes correspond to life ranges.
Two nodes have an edge between them if they overlap in time and thus
cannot be put into the
same register. Thus, register allocation consists in coloring the graph with
colors, where
is the number of registers. A graph coloring is a decoration of
the nodes with colors such that connected nodes do not have the same color.
Numerous coloring
heuristics for register allocation have been mentioned in the literature.
The critical path list scheduling algorithm is the most popular one for local instruction scheduling [13]. List scheduling determines a schedule by topological sorting, as described above. The edges of the DAG are annotated by weights which correspond to latencies. In the critical path list scheduling, the priority for selecting a node from the zero-indegree set is based on the maximum-length path from that node to any leaf node. The node with the highest priority is chosen first.
However, in case of heuristics, even if generally their results are quite good, they may produce a sub-optimal result in some cases. On the DAG in Figure 9, the critical path algorithm can return arbitrarily any permutation of the leaf nodes A, B, C as order of these nodes. (In the figure the nodes are annotated with their respective priorities.) But, for the orders A, C, B, and C, A, B, the schedule is sub-optimal, since the next cycle will be a NOP due to the latency of B. A NOP (No Operation) instruction that does not perform any computation. It is inserted in order to respect the data dependency (all children have to be computed before their parent can begin).
![]() |
There exist extensions to list scheduling which take more parameters into account, data locality for example.
Global instruction scheduling methods, such as trace scheduling [5] and region scheduling [8] can move instructions across basic blocks to produce more efficient code.
Mutation scheduling [14] is an integrated, heuristic based method, which integrates code selection, register allocation and instruction scheduling into a unified framework. This method improves significantly the results of code generation.
In general, heuristic methods can produce effective results within time and space limitation. But, they do not guarantee optimality.
Integer Linear Programming (ILP) based methods are widely used today in code generation. The difficult part is to specify the linear program which is solved usually by third party linear solvers. Once the problem is specified as an ILP instance, the connection to the DAG is lost -- the solver does not access extra information about the original program to solve the problem.
Wilken et al. [16] present a set of extensive transformations on the DAG, that help to derive an advanced integer linear programming formulation for the instruction scheduling phase. First, they show that the basic formulation of the ILP of a given problem leads to unacceptable compile time. Then, providing transformations on the DAG, they decrease the complexity of the ILP, such that it can cope with basic blocks up to 1000 instructions in acceptable time. The resulting schedule is optimal. However, the register allocation problem is ignored during the optimization.
For instance, Wilken et al. show that an hourglass-shaped DAG, represented on Figure 10 can be scheduled using a divide and conquer method: An optimal schedule for such hourglass DAGs can be determined as a concatenation of the solutions for the two subDAGs below and above the articulation node.
Note that explicit application of this simplification is not necessary in our approach, as such structural properties are automatically exploited by our algorithm. This saves us from looking manually directly on DAG for such a structure. As a consequence, our algorithm creates for an hourglass-shaped DAG also an hourglass-shaped selection DAG, that is, the optimization is decomposed into two separate parts. By proceeding independently on the two parts of the selection DAG a lot of space is saved.
Further research on coupling instruction scheduling and register allocation is needed. Bradlee et al. [2] show that separating register allocation and instruction scheduling produces inefficient code.
An interesting approach, from a technical and economical point of view, consists in performing post-pass optimization. In the PROPAN framework [9], Kästner implemented a phase coupled optimizer generator. The generator reads in a processor specification described in a Target Description Language (TDL) and generates a phase coupled optimizer. The phase coupled generated optimizer is specified as an integer linear program which takes into account restrictions and features of the target processor. An exact and optimal solution is produced, or a heuristic based, if the time limit is exeeded. In this framework, the full phase integration is not possible, as the time complexity is too high.
Aho and Johnson [1] show that, by dynamic programming, an optimal schedule can be produced for homogeneous register set architectures in linear time.
Space-optimal schedules for DAGs
are generated by the
algorithm [10].
The worst case complexity for this algorithm is
exponential. Despite this, based upon the experimental results, optimizing compilers can
use
for medium size basic blocks. Furthermore, possible parallelization of
the algorithm has been pointed out in the paper, which should improve
significantly the presented figures.
Yang et al. [17] present an optimal scheduling algorithm for a special architecture where all instructions take one time unit, and all functional units are identical. Generating an optimal schedule, even for such an architecture is still NP-complete.
Chou and Chung [3] enumerate all possible schedules to find an optimal one. They propose methods to prune the enumeration tree based on structural properties of the DAG. The algorithm is suitable for small and medium size basic blocks, up to 30 instructions per block. However, the size of basic blocks in common programs rarely exceeds 30 instructions.
Ertl and Krall [4] generate an optimal schedule by specifying the instruction scheduling problem as a constraint logic program. A time-optimal schedule is achieved for small and medium size basic blocks.
Several integrated methods based on heuristics, especially integrating register allocation and instruction scheduling, can be found in the literature. The heuristic methods may improve generated code, but do not indicate how far the produced code is from being optimal.
There are few approaches which give, in acceptable space and time, an integrated optimal
solution. Using trivial enumeration of complexity
excludes from processing
basic blocks whose size is larger than 20 instructions.
Also, numerous optimal methods are mentioned in the literature for computing an optimal solution to an isolated phase (either register allocation or instruction scheduling). The NP-complete problems are attacked by different ways: ILP, constraint logic programming, etc. However, dealing with only one phase at a time produces (in the general case) non-optimal code.
Therefore, in ongoing research, we attack the problem of fully integrated code generation. The register allocation phase should not be separated from instruction scheduling. A dynamic programming approach seems to be interesting from the optimality point of view. There is no other solution known for the moment to cope efficiently with the time and size problems in integrated code generation.