Optimizer for global operand transfer fusion in CUDA programs.
(c) 2018, 2019 by C. Kessler, Linkoping University.
christoph.kessler@liu.se
Warning: This is a research prototype implementation.
The code is ugly and may contain bugs.
No guarantees are given, use it at your own risk.
See LICENSE.txt for license issues.
I used CUDA 8 on an NVIDIA Kepler GPU for the experiments;
the code might not work with other CUDA versions without modifications.
1. Main directory:
Files:
glob.h global declarations, included from trace.c
trace.c all C source code (no time to clean up and refactor, sorry).
I tried to comment at least the main data structures and
some of the functions in the code.
Please read the paper for more information.
Makefile for building the optimizer for all example DAGs.
Commands:
make builds the optimizer binaries for all example DAGs (called t....) in this directory.
t.... runs the optimizer for the example DAG ....,
creates 2 dot files (for DAG and affinity graph) called blubber.dot and aff.dot,
and creates corresponding source files in cuda/ subdirectory
draw generates PNG files from the DAG and affinity graph dot files,
and calls eog for visualization.
Settings (#defines -D... to be set in Makefile):
N: number of kernels
INIT_K: number of vectors live on entry (internally numbered 0...INIT_K-1)
(live-on-exit vectors are marked up in the code with the LIVEONEXIT() macro).
The overall number MAX_K of vectors is INIT_K + Sum over all outdegrees of all kernels.
MIN_INDEG, MAX_INDEG, MIN_OUTDEG, MAX_OUTDEG: control randomly-generated kernel graph
topology properties, also used by the non-random examples e.g. to reduce memory allocation.
SPECIALDAG: select any predefined special DAG topology (numbered 1...12) in trace.c.
If not specified, the default is a randomly generated DAG, where
the naming scheme t.... of the generated binary contains the degrees, N and INIT_K.
TIGHT: use the tight variant of the optimizer instead (heuristically aiming for shorter
on-device vector live ranges, possibly at the cost of fewer realized fusion opportunities).
SAMECOLOR: must always be switched on; the variant with this setting undefined is not used.
It means: use the trivial identity coloring (naming) of vectors,
so that the optimizer works on the original SSA structure of the given DAG.
(An earlier plan to optionally run a register assignment algorithm
for space compaction before the optimization was never completed.)
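The relation between INIT_K, the kernel outdegrees and MAX_K given above can be sketched in C as follows.
This is only an illustration, not code taken from trace.c: the function name count_vectors, the outdeg array
and the fallback values for N and INIT_K are hypothetical; in the real build, N and INIT_K come from the -D
settings in the Makefile.

```c
#include <stddef.h>

/* Hypothetical fallback values, used only if no -D flag sets them. */
#ifndef N
#define N 4        /* number of kernels */
#endif
#ifndef INIT_K
#define INIT_K 2   /* number of vectors live on entry */
#endif

/* Overall number of vectors (MAX_K in the text):
   init_k live-on-entry vectors plus the sum of all kernel outdegrees.
   outdeg[i] is the outdegree of kernel i (hypothetical representation). */
int count_vectors(const int outdeg[], size_t n_kernels, int init_k)
{
    int total = init_k;
    for (size_t i = 0; i < n_kernels; i++)
        total += outdeg[i];
    return total;
}
```

For example, with INIT_K = 2 and four kernels with outdegrees 1, 2, 1, 1, this gives 2 + 5 = 7 vectors.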
2. CUDA experiments: cd cuda
Files:
cuda/kernels.cu: Source code of the CPU and CUDA kernels used by and linked with the test programs.
cuda/t.....c: source code for the transfer-optimized code version of each test program .....
Memory allocation time is not included in the measurements.
cuda/nt.....c: source code for the baseline (non-transfer-optimized) code version of each test program.
Memory allocation time is not included in the measurements.
cuda/at.....c: source code for the transfer- and allocation-optimized code version of each test program .....
Memory allocation time is included in the measurements.
cuda/ant.....c: source code for the baseline (non-transfer-optimized, non-allocation-optimized)
code version of each test program.
Memory allocation time is included in the measurements.
In all of these file names, the problem size used is appended as a suffix (default = 1M elements).
cuda/Makefile for compiling all example .cu source files
Commands:
make builds the example binaries t...., nt...., at.... and ant.... for each example DAG ....
t...., at...., nt...., ant.... run the generated CUDA code variant for a specific example DAG for different problem sizes
runall batch script for running and measuring all generated example programs
for 6 different problem sizes: 4K, 16K, 64K, 256K, 1M, 4M.
(NB: with the standard setting for the number of runs per problem size, used to produce averaged times, this can take many hours.)
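The t/nt and at/ant variants listed above differ in whether memory allocation falls inside the timed region.
The following host-only C sketch illustrates that distinction. It is not the measurement code of the
generated programs: the function timed_run, the use of calloc and clock(), and the host loop standing in
for a kernel are all assumptions made for illustration. Whether deallocation is also timed in the at/ant
variants is not stated here; the sketch leaves it outside the timed region.

```c
#include <stdlib.h>
#include <time.h>

/* Receives one result element so the compiler cannot drop the computation. */
volatile float sink;

/* Hypothetical helper: times a ternary add d = a + b + c on n elements.
   include_alloc != 0: allocation is inside the timed region (like at/ant).
   include_alloc == 0: allocation is outside the timed region (like t/nt).
   Returns CPU seconds as measured by clock(), or -1.0 if allocation fails. */
double timed_run(size_t n, int include_alloc)
{
    clock_t start = clock();                 /* timer start when allocation is included */
    float *buf = calloc(4 * n, sizeof *buf); /* a, b, c, d in one zero-initialized block */
    if (buf == NULL)
        return -1.0;
    if (!include_alloc)
        start = clock();                     /* restart the timer after the allocation */

    float *a = buf, *b = buf + n, *c = buf + 2 * n, *d = buf + 3 * n;
    for (size_t i = 0; i < n; i++)           /* host stand-in for the kernel */
        d[i] = a[i] + b[i] + c[i];

    clock_t stop = clock();
    sink = n ? d[n - 1] : 0.0f;
    free(buf);                               /* deallocation is not timed in this sketch */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Note that clock() measures CPU time with coarse resolution; real GPU measurements would need a suitable wall-clock or device timer.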
3. Microbenchmarking: cd test
test/cuda.cu: CUDA source code used for microbenchmarking the effect of transfer fusion and allocation fusion
for ternary ADD and for different vector sizes.
NB: allocation fusion exhibits the anomaly described in the paper,
which mostly occurs for smaller operand sizes, but sometimes for large ones too.
Reference:
Christoph Kessler:
Global optimization of operand transfer fusion in heterogeneous computing.
Proc. 22nd International Workshop on Software and Compilers for Embedded Systems
(SCOPES-2019), St. Goar, Germany, May 2019. ACM.
DOI: 10.1145/3323439.3323981