Christoph W. Keßler:

Optimal instruction scheduling and register allocation for basic blocks

The problem: We consider the NP-complete problem of scheduling the instructions in a basic block (where the precedence constraints among the instructions form a DAG) with the goal of optimizing either the execution time on a RISC, VLIW or superscalar processor, or minimizing the number of registers needed, or applying a combined optimization criterion for time and space consumption of the resulting schedule.

Main approach: Our main approach is based on enumeration strategies with pruning methods that take domain-specific constraints into account. The exponentially-sized solution space is traversed in a way such that the optimum is encountered as early as possible, in order to postpone the combinatorial explosion of space and time requirements.

Main results: For the most general problem formulation, we can compute a time-optimal schedule within reasonable time for basic blocks with up to 22 or 23 instructions. If only register space is to be minimized, our algorithm copes with basic blocks of up to 40 to 45 instructions. For even larger basic blocks (which are quite rare in real-world programs), we can compute a space-optimal contiguous schedule for basic blocks with up to 80 instructions. For all other cases, linear-time heuristics have been developed.

A first section of our work focuses on contiguous schedules which are generated by postorder traversal of the expression DAG. We give a linear-time heuristic and provide a simple exhaustive search algorithm and two faster algorithms that generate an optimal contiguous evaluation for a given DAG. The first is a modification of the complete search algorithm that omits the generation of redundant evaluations. The second algorithm generates only the most promising evaluations by splitting the DAG into trees with import and export nodes and evaluating the trees with a modified labeling scheme. Experiments with randomly generated DAGs and large DAGs from real application programs confirm that the new algorithms generate optimal contiguous evaluations quite fast. Optimal contiguous schedules are mostly good (but not necessarily globally optimal) schedules with regard to the register need, but they are less suitable with respect to the execution time of the schedule on a pipelined or superscalar target processor.

For this reason, later work relaxes the limitation to contiguous schedules and considers general schedules that are generated by topological sort traversals of the DAG. Also for this case, we give a clever exhaustive search algorithm based on dynamic programming that gives the optimal schedule for DAGs with up to 40 nodes in acceptable time. This method was originally developed for space-optimal scheduling but can easily be generalized to scheduling for target processors with delayed instructions and multiple functional units.

Peripheral work: Scheduling vector basic blocks. Using fewer registers is essential if the target machine has not enough registers to evaluate an expression without storing intermediate results in the main memory (spilling). This is especially important for vector processors that are often used as node processors in parallel computers. Many vector processors have a register file that can be partitioned into a number of vector registers of a certain length. A vector operation is evaluated by splitting it into stripes that have the length of the vector registers, and by computing the stripes one after another. If the register file is partitioned into a small number of vector registers, each of them can hold more elements and the vector operation to be evaluated is split into fewer stripes. We exploit the trade-off between spilling and strip-mining to minimize the overall execution time.

Publications:


Related projects and some links:
As far as I know, my approach to optimal instruction scheduling for basic blocks is the only one that is based on enumeration of schedules. There are, though, several approaches based on integer linear programming. As far as I can see from the results reported in their papers (if any), my method is competitive with theirs. Last modified: March 24, 2000

This page by Christoph W. Kessler (kessler \at ida.liu.se)