OPTIMIST logo

Code Generation: Isolated versus Integrated Approaches

Andrzej Bednarski
PELAB
Linköping University
andbe@ida.liu.se

Abstract
This paper examines different techniques currently used in code generation. It shows how an isolated method affects the quality of the result. We present an idea for a fully integrated code generation method based on dynamic programming. We compare our approach to related work by elaborating a classification into isolated and integrated methods.

1 Introduction

Digital Signal Processors (DSPs) became in the last decades the processor of choice for embedded systems in the public market. The high volume market imposes requirements for high performance of the system at low cost. In order to achieve high code quality, developers still write applications in assembly language. This is time consuming, and maintenance and updating are difficult. Traditional compiler optimizations for high-level languages still produce poor code for DSPs and thus do not meet the requirements [15]. On the one hand, the compiler researchers focused for a long time on general purpose processors. Thus, DSP irregularities, such as dual memory banks and non-homogeneous register sets, have been ignored in the traditional compiler optimization phases. On the other hand, code generation itself is a complex task.

The remainder of this paper is organized as follows. In Section 2, we introduce the basics of code generation and discuss general problems. Section 3 presents a fully integrated code generation method based on integer programming. Section 4 reviews related work, including code generation techniques based on heuristics and various strategies for optimal code generation. Section 5 concludes the paper.

2 Code generation

First the program, written in a high level language, is parsed and analyzed by the front-end, which produces a certain Intermediate Representation (IR). Optimization and code generation are later phases which work on the IR.

In most commercial compilers, code generation is a phase based process of the back-end of the compiler. Each phase is performed independently. From the software engineering point of view, this approach is easier, even if it leads to sub-optimal code. In fact, there are strong interdependencies between different phases, known as phase coupling problem.

Code generation consists in mapping the IR of a program to an optimal or almost optimal target machine code, usually machine assembler language. This task is subdivided into three subtasks:

In the rest of the article, a phase means a subtask of code generation where different subtasks are performed consecutively in a special, predefined order.

In the following, we identify several general problems and difficulties in code generation and the weaknesses of phase-based code generation.

2.1 NP-hard problems

All major subtasks in the code generation process have been found to be NP-complete optimization problems. Optimal instruction scheduling for basic blocks 1 is a NP-hard problem, as well for optimizing execution time as for minimizing the number of registers used. Global register allocation is isomorphic to the problem of coloring a life range interference graph with a minimal number of colors, which is NP-complete. Code selection for basic blocks where the data dependences form a Directed Acyclic Graph (DAG), is known to be NP-complete.

2.2 Phase coupling

The phase coupling problem results from the phase based approach of a compiler. Decisions made in one phase constraint the phases that follow. Thus, an early decision in a previous phase prevents the current phase from producing an optimal or nearly optimal code.

We give here an example (adapted from [14]) that illustrates the interdependence between instruction selection and instruction scheduling. Let us consider an architecture with three functional units: an adder, a multiplier, and a shifter. We assume that one instruction can be issued per clock cycle. For selecting an instruction for the following integer computation


\begin{displaymath}b \leftarrow a \times 2\end{displaymath}

three obvious possibilities are listed in Figure 1. Each of them uses a different functional unit.

Figure 1: Possible operations to perform the integer computation $b \leftarrow a \times 2$, with their occupation time in clock cycles.
\begin{figure}\begin{center}
\begin{tabular}{\vert l\vert l\vert c\vert} \hline...
...(b, a, 1) & Shifter & 3 cycles \\ \hline
\end{tabular} \end{center}\end{figure}

The code selection phase, if performed first, would choose the operation with the least occupation time. In our example, the left-shift would be selected, associating the shifter functional unit with this multiplication operation. However, in the subsequent scheduling phase, it might be the case that this left-shift instruction should be scheduled at, say, time $t_{i+1}$, but the shift unit is already allocated for computing an other operation at that time. The left part of Figure 2 shows the resulting unit occupation schema if performing the multiplication with the shifter; the right part gives the schedule if using the adder unit instead, which is shorter by one cycle.

Figure 2: Using an adder results in a shorter execution time than using a shifter.
\begin{figure}\begin{center}
\begin{tabular}{llp{0.5cm}ll}
$t_i:$\ & lshift(y,...
...+4}:$ & NOP \\
$t_{i+5}:$ & NOP & & &
\end{tabular} \end{center} \end{figure}

3 An integrated approach

3.1 Terminology

First we informally introduce some terms. The formal definitions are given in [11].

We are focussing on scheduling basic blocks where the data dependencies among the instructions form a directed acyclic graph (DAG).

A list schedule, or simply schedule, of the basic block (DAG) is a permutation of the instructions, i.e. the DAG nodes, that is compliant with the partial order defined by the edges of the DAG. A register allocation for a given schedule is a mapping from the DAG nodes to physical registers such that the values represented by the nodes are not overwritten before their last use. For a particular register allocation, its register need is defined as the maximum number of registers that are in use at the same time. A register allocation is optimal for a given schedule if its register need is not higher than that of any other register allocation for that schedule. That register need is referred to as the register need for the given schedule. An optimal register allocation for a given schedule can be computed in linear time in a straightforward way [6].

A space-optimal schedule is a schedule that uses no more registers than any other possible schedule of the DAG.

The unit occupation time of a functional unit is the number of clock cycles that the functional unit is occupied with executing an instruction before a new instruction can be issued to that unit. The latency of a unit is the number of clock cycles taken before the result is available. The execution time of a schedule represents the number of clock cycles required by the execution of the instructions in the order determined by the schedule. Similarly to the definition of a space-optimal schedule, a time-optimal schedule is a schedule that takes not more time that any other schedule of the DAG.

A time-profile represents the occupation status of all units at a given time. It specifies for each functional unit the sequence of instructions that are currently under execution in that unit but not yet completed. In other words, the time-profile gives just that most recent part of a (partial) schedule that may still influence the time slot for the next instruction to be scheduled.

3.2 Our dynamic programming approach

In a first step we focus on instruction scheduling within a basic block. Later on, this will be generalized to global scheduling.

Figure 3: A snapshot of topological sorting for a DAG.
\begin{figure}\begin{center}
\epsfxsize =0.4\textwidth
\strut\epsfbox{topsort.eps}\end{center}\end{figure}

We start with a naive approach that consists in the exhaustive enumeration of all possibilities for topological sorting of the DAG. Topological sorting maintains a set of nodes with indegree zero, which is initialized to the set of DAG leaves. The topological sorting algorithm repeatedly selects a DAG node from the zero-indegree set, appends it to the current schedule, removes it from the DAG and updates the zero-indegree set accordingly (see Figure 3). This process is continued until all DAG nodes have been scheduled. A recursive enumeration of all possibilities for selecting the next node from a zero-indegree list generates all possible schedules of the DAG. Unfortunately, in the worst case, the run time of the enumeration is exponential in the number of nodes of the DAG. Thus, this method cannot be applied for basic blocks larger than 10 to 15 instructions (the number of schedules depends on the structure of the DAG).

Figure 4: An example DAG (left hand side) and the resulting selection tree (first three levels, right hand side).
\begin{figure*}\begin{center}
\epsfxsize =0.9\textwidth
\strut\epsfbox{nccex.eps}\end{center}\end{figure*}

However, the exhaustive enumeration of topological sorts implicitly builds a tree-like representation of all schedules of the DAG, called the selection tree (see Figure 4). Each node of the selection tree corresponds to an instance of a zero-indegree set of DAG nodes that occurs during topological sorting. A node in the selection tree is connected by directed edges to the selection nodes corresponding to all zero-indegree set instances that are produced by performing one step in the topological sort algorithm.

Figure 5: The selection DAG of the example DAG of Figure 4
\begin{figure}\begin{center}
\epsfxsize =0.4\textwidth
\strut\epsfbox{nceex.eps}\end{center}\end{figure}

By looking closer to the selection tree, we can see that several instances of the same zero-indegree set may occur. Therefore, we aim at transforming the selection tree into a selection DAG by merging all multiple instances of the same zero-indegree set into a single selection node, which results in a selection DAG (see Figure 5). For this purpose, we try to group schedules that are comparable in a single selection node and keep only one of these for further enumeration. If optimizing for minimum register space, two selection nodes are comparable if the same set of nodes of the DAG has been scheduled, as the same set of scheduled DAG nodes resides in registers. Additionally, if optimizing the execution time, two schedules must have the same time-profile in order to be comparable. Keßler [10,11] proved that for determining a space-optimal schedule, it is sufficient to keep just one locally space-optimal schedule per selection node. For determining a time-optimal schedule, we must consider the combination of zero-indegree sets and time-profiles, as it is not sufficient here to judge the time-optimality of a partial schedule only by its execution time if it is to be used as a prefix schedule in subsequent steps of the enumeration process. Therefore, we extend the definition of a selection node to an extended selection node that consists of a zero-indegree set, a time profile and the reference time of the time profile. Thus, it is sufficient if we keep, in an extended selection node among all schedules with equal time-profiles, one schedule with the shortest execution time. The Figure 6 illustrates the resulting extended selection DAG for the example of Figure 4, applied to a single-issue target processor with two functional units, one with latency of 1, and the second with latency of 2. Hence, the time profiles have a single entry. Node $b$ and $e$ are to be exectued on unit with latency 2, the other on the unit with latency 1.

Figure 6: Extended selection DAG
\begin{figure*}\begin{center}
\epsfxsize =0.7\textwidth
\strut\epsfbox{tprofex.eps} \end{center}\end{figure*}

An important optimization of this algorithm consists in the structuring of the space of all (extended) selection nodes as a grid, with one dimension for the number of instructions scheduled, one for the register space consumed, and one for the elapsed execution time of the schedule stored in that (extended) selection node. We construct the space of selection nodes in an order that is (1) compliant with the precedence relations among the selection nodes, and (2) most favorable with respect to the desired optimization goal, so that the most promising selection nodes are considered first.

3.3 Prototype implementation

Figure 7: Framework
\includegraphics[]{framework.eps}

Figure 7 represents our current framework. The front-end delivers DAGs of basic blocks that are the input for the optimizer.

Keßler's [11] current implementation of simultaneous optimization of register space and execution time does not yet use the time-profile extension in the selection nodes. It simply keeps all possibly relevant schedules, and thus the prototype runs fast out of space for DAGs with more than 20 instructions.

We are currently implementing a new prototype that works with time profiles, and expect a significant improvement in the size of the DAGs that can be handled optimally. We use the LEDA library [12] for the implementation of the most important data structures, such as DAGs, selection node lists, and zero-indegree sets. The new prototype is intended to become the technical basis for a future system for integrated code generation.

The current implementation integrates instruction selection and instruction scheduling. Thus, the space of extended selection nodes is structured as a two dimensional grid of exteneded selection nodes list $L_l^k$, with one dimension for the number $l$ of IR operations scheduled (i.e., the length of the schedule), and one dimension for the execution time $k$. $L_l^k$ contains only those extended selection nodes in level $l$ whose associated partial schedules have execution time $k$.

Let $D$ be the maximum latency. Appending a node $v$ to an existing schedule $S$ will increase the execution time of any target schedule $s$ derived from $S$ by at most $D$ cycles. Hence, if we perform a selection starting from an extended selection node in $L_i^k$, the resulting extended selection node will be stored in exactly one of $L_{l+1}^k,\dots,L_{l+1}^{k+D}$.

For each extended selection node we store the schedule $s$ as an attribute of the node.

The pseudo code of the currently implemented algorithm is given in the Figure 8.

Figure: The algorithm for determining a time-optimal schedule, taking instruction selection into account.
\begin{figure}\begin{boxedminipage}{0.47\textwidth}
\begin{small}
\begin{flushle...
... end} {\it timeopt}\\
\end{flushleft}\end{small}\end{boxedminipage}\end{figure}

3.4 Future extensions

First, we plan to implement the time optimization algorithm using the time-profile extension for the selection nodes as discussed above. Incrementally, we plan to add register allocation and instruction selection. Currently we still target architectures with a homogeneous register set but aim at generalizing this in the near future. For that purpose, the time-profiles will be extended to time-space-profiles [11], and we will finally get a complete integrated system.

Further, to be able to retarget our system to different architectures, we will develop or adopt a hardware description language. Finally, the optimizations should be extended from the basic block scheduling scope to global scheduling. The time-space-profiles appear to be a good description for the interconnection of subsequent basic blocks in a control flow graph.

4 Related work

Table 1 classifies heuristic methods and optimal methods into isolated and integrated approaches. This classification is the basis for the following review of related work on code generation.


Table 1: Code Generation: Classification of related work.
    Optimal
  Heuristics ILP BB DP
Isolated List Sched. [16] [17] [7]
Integrated [14],[9] [9]   [10


4.1 Heuristics

In the literature, various heuristic methods for code generation are proposed.

For instruction selection, methods are generally based on tree pattern matching and dynamic programming. The tree pattern matching is performed on IR expression trees, where the nodes are operators and the leaves are operands. For short, a tree pattern is an expression that describes a part of an IR-tree. The tree patterns correspond to an equivalent sequence of one or more target instructions. Also, each tree pattern has an associated cost. The IR-tree is said to be covered if all nodes in the tree are part of exactly one matching tree pattern. Then, a coverage with least cost is chosen, which selects the set of target instructions. This technique [7] is still employed in modern compiler frameworks.

Graph coloring methods are still the state of art for global register allocation. In order to allocate registers using the graph coloring method, the register allocation phase is usually performed after instruction scheduling, because the life range of the program values have to be known in order to be able to build the life range interference graph. In the interference graph, nodes correspond to life ranges. Two nodes have an edge between them if they overlap in time and thus cannot be put into the same register. Thus, register allocation consists in coloring the graph with $N$ colors, where $N$ is the number of registers. A graph coloring is a decoration of the nodes with colors such that connected nodes do not have the same color. Numerous coloring heuristics for register allocation have been mentioned in the literature.

The critical path list scheduling algorithm is the most popular one for local instruction scheduling [13]. List scheduling determines a schedule by topological sorting, as described above. The edges of the DAG are annotated by weights which correspond to latencies. In the critical path list scheduling, the priority for selecting a node from the zero-indegree set is based on the maximum-length path from that node to any leaf node. The node with the highest priority is chosen first.

However, in case of heuristics, even if generally their results are quite good, they may produce a sub-optimal result in some cases. On the DAG in Figure 9, the critical path algorithm can return arbitrarily any permutation of the leaf nodes A, B, C as order of these nodes. (In the figure the nodes are annotated with their respective priorities.) But, for the orders A, C, B, and C, A, B, the schedule is sub-optimal, since the next cycle will be a NOP due to the latency of B. A NOP (No Operation) instruction that does not perform any computation. It is inserted in order to respect the data dependency (all children have to be computed before their parent can begin).

Figure 9: Example where critical-path list scheduling can produce a sub-optimal schedule. The weights on the edges represent the latencies. The numbers that annotate the nodes are the priorities for the algorithm.
\begin{figure}\begin{center}
\epsfxsize =0.4\textwidth
\strut\epsfbox{critic.eps}\end{center}\end{figure}

There exist extensions to list scheduling which take more parameters into account, data locality for example.

Global instruction scheduling methods, such as trace scheduling [5] and region scheduling [8] can move instructions across basic blocks to produce more efficient code.

Mutation scheduling [14] is an integrated, heuristic based method, which integrates code selection, register allocation and instruction scheduling into a unified framework. This method improves significantly the results of code generation.

In general, heuristic methods can produce effective results within time and space limitation. But, they do not guarantee optimality.

4.2 Optimal approaches

4.2.1 Integer linear programming

Integer Linear Programming (ILP) based methods are widely used today in code generation. The difficult part is to specify the linear program which is solved usually by third party linear solvers. Once the problem is specified as an ILP instance, the connection to the DAG is lost -- the solver does not access extra information about the original program to solve the problem.

Wilken et al. [16] present a set of extensive transformations on the DAG, that help to derive an advanced integer linear programming formulation for the instruction scheduling phase. First, they show that the basic formulation of the ILP of a given problem leads to unacceptable compile time. Then, providing transformations on the DAG, they decrease the complexity of the ILP, such that it can cope with basic blocks up to 1000 instructions in acceptable time. The resulting schedule is optimal. However, the register allocation problem is ignored during the optimization.

For instance, Wilken et al. show that an hourglass-shaped DAG, represented on Figure 10 can be scheduled using a divide and conquer method: An optimal schedule for such hourglass DAGs can be determined as a concatenation of the solutions for the two subDAGs below and above the articulation node.

Note that explicit application of this simplification is not necessary in our approach, as such structural properties are automatically exploited by our algorithm. This saves us from looking manually directly on DAG for such a structure. As a consequence, our algorithm creates for an hourglass-shaped DAG also an hourglass-shaped selection DAG, that is, the optimization is decomposed into two separate parts. By proceeding independently on the two parts of the selection DAG a lot of space is saved.

Figure 10: An hourglass-shaped DAG.
\includegraphics[]{hourglass_dag.eps}

Further research on coupling instruction scheduling and register allocation is needed. Bradlee et al. [2] show that separating register allocation and instruction scheduling produces inefficient code.

An interesting approach, from a technical and economical point of view, consists in performing post-pass optimization. In the PROPAN framework [9], Kästner implemented a phase coupled optimizer generator. The generator reads in a processor specification described in a Target Description Language (TDL) and generates a phase coupled optimizer. The phase coupled generated optimizer is specified as an integer linear program which takes into account restrictions and features of the target processor. An exact and optimal solution is produced, or a heuristic based, if the time limit is exeeded. In this framework, the full phase integration is not possible, as the time complexity is too high.

4.2.2 Dynamic programming

Aho and Johnson [1] show that, by dynamic programming, an optimal schedule can be produced for homogeneous register set architectures in linear time.

Space-optimal schedules for DAGs are generated by the $ncv$ algorithm [10]. The worst case complexity for this algorithm is exponential. Despite this, based upon the experimental results, optimizing compilers can use $ncv$ for medium size basic blocks. Furthermore, possible parallelization of the algorithm has been pointed out in the paper, which should improve significantly the presented figures.

4.3.3 Branch-and-bound

Yang et al. [17] present an optimal scheduling algorithm for a special architecture where all instructions take one time unit, and all functional units are identical. Generating an optimal schedule, even for such an architecture is still NP-complete.

4.2.4 Enumeration

Chou and Chung [3] enumerate all possible schedules to find an optimal one. They propose methods to prune the enumeration tree based on structural properties of the DAG. The algorithm is suitable for small and medium size basic blocks, up to 30 instructions per block. However, the size of basic blocks in common programs rarely exceeds 30 instructions.

4.2.5 Constraint logic programming

Ertl and Krall [4] generate an optimal schedule by specifying the instruction scheduling problem as a constraint logic program. A time-optimal schedule is achieved for small and medium size basic blocks.

5 Summary and conclusions

Several integrated methods based on heuristics, especially integrating register allocation and instruction scheduling, can be found in the literature. The heuristic methods may improve generated code, but do not indicate how far the produced code is from being optimal.

There are few approaches which give, in acceptable space and time, an integrated optimal solution. Using trivial enumeration of complexity $O(n!)$ excludes from processing basic blocks whose size is larger than 20 instructions.

Also, numerous optimal methods are mentioned in the literature for computing an optimal solution to an isolated phase (either register allocation or instruction scheduling). The NP-complete problems are attacked by different ways: ILP, constraint logic programming, etc. However, dealing with only one phase at a time produces (in the general case) non-optimal code.

Therefore, in ongoing research, we attack the problem of fully integrated code generation. The register allocation phase should not be separated from instruction scheduling. A dynamic programming approach seems to be interesting from the optimality point of view. There is no other solution known for the moment to cope efficiently with the time and size problems in integrated code generation.

Bibliography

1
A. V. Aho and S. C. Johnson.
Optimal Code Generation for Expression Trees.
Journal of the ACM, 23(3):488-501, July 1976.

2
David G. Bradlee, Susan J. Eggers, and Robert R. Henry.
Integrating Register Allocation and Instruction Scheduling for RISCs.
ACM SIGPLAN Notices, 26(4):122-131, April 1991.

3
Hong-Chich Chou and Chung-Ping Chung.
An Optimal Instruction Scheduler for Superscalar Processors.
IEEE Transactions on Parallel and Distributed Systems, 6(3):303-313, 1995.

4
M. Anton Ertl and Andreas Krall.
Optimal Instruction Scheduling Using Constraint Logic Programming.
pages 75-86.

5
J. Fisher.
Trace Scheduling: A General Technique for Global Microcode Compaction.
IEEE Transactions on Computers, 30(7):478-490, July 1981.

6
R.A. Freiburghouse.
Register Allocation via Usage Counts.
Comm. ACM, 17(11), 1974.

7
R.S. Glanville and S.L. Graham.
A New Method for Compiler Code Generation.
In Proc. ACM SIGPLAN Symp. Principles of Programming Languages, pages 231-240, January 1978.

8
Rajiv Gupta and Mary Lou Soffa.
Region scheduling: An approach for detecting and redistributing parallelism.
16(4):421-431, April 1990.

9
Daniel Kästner.
PROPAN: A Retargetable System for Postpass Optimisations and Analyses.
In ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, June 2000.

10
Christoph W. Keßler.
Scheduling Expression DAGs for Minimal Register Need.
Computer Languages, 24(1):33-53, September 1998.

11
Christoph W. Keßler.
Parallelism and compilers.
Habilitation thesis, FB IV - Informatik, University of Trier, December 2000.
submitted.

12
Kurt Mehlhorn, Stefan Näher, and Christian Uhrig.
Library for Efficient Datastructures and Algorithms.
Max Planck Institute for Computer Science, Saarbrücken, 1997.
http://www.mpi-sb.mpg.de/LEDA/leda.html.

13
S. Muchnick.
Advanced Compiler Design and Implementation.
Morgan Kaufmann Publishers, 1997.

14
S. Novack and A. Nicolau.
Mutation Scheduling: A Unified Approach to Compiling for Fine-Grain Parallelism.
Lecture Notes in Computer Science, 892:16-30, 1995.

15
Mazen A. R. Saghir, Paul Chow, and Corinna G. Lee.
Exploiting Dual Data-Memory Banks in Digital Signal Processors.
ACM SIGPLAN Notices, 31(9):234-243, September 1996.
Co-published as SIGOPS Operating Systems Review 30(5), December 1996, and as SIGARCH Computer Architecture News, 24(special issue), October 1996.

16
Kent Wilken, Jack Liu, and Mark Heffernan.
Optimal Instruction Scheduling Using Integer Programming.
ACM SIGPLAN Notices, 35(5):121-133, May 2000.

17
Cheng-I. Yang, Jia-Shung Wang, and Richard C. T. Lee.
A Branch-and-Bound Algorithm to Solve the Equal-Execution Time Job Scheduling Problem with Precedence Constraints and Profile.
Computers Operations Research, 16(3):257-269, 1989.



Footnotes

... blocks 1
Maximal set of statments in which the flow of contol enters at the begining and leaves at the end. Statments are executed exactly once each time the block is entered.


Andrzej Bednarski (andbe@ida.liu.se)
Last update: Fri Jun 08 16:52:10 MET DST 2001