Index of /~chrke55/transferfusion

Name                                            Last modified      Size

C-sources-transferfusion-CKessler-Apr2019.zip   2019-05-19 13:20   29K
LICENSE.txt                                     2019-05-19 13:00   760
README.txt                                      2019-05-19 13:08   4.9K

Optimizer for global operand transfer fusion in CUDA programs.

(c) 2018, 2019 by C. Kessler, Linkoping University.

Warning: This is a research prototype implementation.
         The code is ugly and may contain bugs.
         No guarantees are given; use it at your own risk.
         See LICENSE.txt for the license terms.
         I used CUDA 8 on an NVIDIA Kepler GPU for the experiments;
         it might not work with other CUDA versions without code modifications.

1. Main directory:


 glob.h  global declarations, included from trace.c 

 trace.c all C source code (no time to clean up and refactor, sorry).
         I tried to comment at least the main data structures and
         some of the functions in the code.
         Please read the paper for more information.

 Makefile for building the optimizer for all example DAGs.


 make   builds the optimizer binaries for all example DAGs (called t....) in this directory.

 t....  runs the optimizer for the example DAG ....,
         creates 2 dot files (one for the DAG, one for the affinity graph),
         and creates the corresponding source files in the cuda/ subdirectory.

 draw   generates png files from the DAG and affinity graph dot files
         and calls eog for visualization.

Settings (#defines -D... to be set in Makefile):

 N: number of kernels

 INIT_K: number of vectors live on entry (internally numbered 0...INIT_K-1)
         (live-on-exit vectors are marked up in the code with the LIVEONEXIT() macro).
         The overall number MAX_K of vectors is INIT_K + Sum over all outdegrees of all kernels.
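
 The MAX_K formula above can be sketched in C as follows. This is only an
 illustration: the function name max_k and the outdeg array are hypothetical
 and are not taken from trace.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from trace.c: overall number of vectors,
   computed as init_k (vectors live on entry) plus the sum of the
   outdegrees of all n kernels. */
int max_k(int init_k, const int *outdeg, size_t n)
{
    int total = init_k;
    assert(init_k >= 0);
    for (size_t i = 0; i < n; i++) {
        assert(outdeg[i] >= 0);
        total += outdeg[i];
    }
    return total;
}
```

 For example, with INIT_K = 2 and three kernels with outdegrees 1, 2 and 1,
 the sketch yields 2 + 4 = 6 vectors.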

 MIN_INDEG, MAX_INDEG, MIN_OUTDEG, MAX_OUTDEG: control randomly-generated kernel graph
        topology properties, also used by the non-random examples e.g. to reduce memory allocation.

 SPECIALDAG: select any predefined special DAG topology (numbered 1...12) in trace.c.
             If not specified, the default is a randomly generated DAG, where
             the naming scheme t.... of the generated binary contains the degrees, N and INIT_K.

 TIGHT: use the tight variant of the optimizer instead (it heuristically aims for shorter
         on-device vector live ranges, possibly reducing the realized fusion opportunities)

 SAMECOLOR: this setting must always be switched on; the code is not used with it undefined.
            It means: use trivial identity coloring (naming) of vectors,
            thus the optimizer works on the original SSA structure of the given DAG.
            (an earlier plan to optionally run a register assignment algorithm
             for space compaction before the optimization was never completed)

2. CUDA experiments: cd cuda


 cuda/ Source code of the CPU and CUDA kernels used by and linked with the test programs.

 cuda/t.....c: source code for the transfer-optimized code version of each test program .....
                Memory allocation time is not included in the measurements.

 cuda/nt.....c: source code for the baseline (non-transfer-optimized) code version of each test program.
                Memory allocation time is not included in the measurements.

 cuda/at.....c: source code for the transfer- and allocation-optimized code version of each test program .....
                Memory allocation time is included in the measurements.

 cuda/ant.....c: source code for the baseline (non-transfer-optimized, non-allocation-optimized)
                code version of each test program.
                Memory allocation time is included in the measurements.
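
 The naming scheme of the four code variants above can be summarized by the
 following sketch. The helper variant_prefix and its two flags are
 hypothetical and only illustrate the scheme; they are not part of the
 distributed sources.

```c
/* Hypothetical helper: map the two properties described above to the
   file-name prefix of a code variant.
   transfer_optimized: nonzero for the optimized version, 0 for the baseline.
   alloc_timed: nonzero if memory allocation time is included in the
   measurements (the "a" variants). */
const char *variant_prefix(int transfer_optimized, int alloc_timed)
{
    if (alloc_timed)
        return transfer_optimized ? "at" : "ant";
    return transfer_optimized ? "t" : "nt";
}
```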

 In all of these, the problem size used is appended as a suffix (default: 1M elements).

 cuda/Makefile for compiling all example .cu source files


 make    builds the example binaries t...., nt...., at.... and ant.... for each example DAG ....

 t...., at...., nt...., ant....   run the generated CUDA code variant for a specific example DAG for different problem sizes

 runall  batch script for running and measuring all generated example programs
          for 6 different problem sizes 4K, 16K, 64K, 256K, 1M, 4M.
         (NB - this can take many hours with the standard settings for #runs per problem size to produce averaged times)
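
 The six problem sizes form a geometric sequence with factor 4. A minimal
 sketch that enumerates them is given below; it assumes that K means 1024
 and M means 1024*1024 elements, and the names problem_sizes and NUM_SIZES
 are hypothetical (they do not come from the runall script).

```c
#include <stddef.h>

#define NUM_SIZES 6

/* Fill sizes[] with the problem sizes 4K, 16K, 64K, 256K, 1M, 4M
   (in elements). Assumption: K = 1024, M = 1024 * 1024.
   Each size is four times the previous one. */
void problem_sizes(size_t sizes[NUM_SIZES])
{
    size_t n = (size_t)4 * 1024;   /* start at 4K elements */
    for (int i = 0; i < NUM_SIZES; i++) {
        sizes[i] = n;
        n *= 4;
    }
}
```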

3. Microbenchmarking: cd test

 test/  CUDA source code used for microbenchmarking the effect of transfer fusion and allocation fusion
                for ternary ADD and for different vector sizes.
                NB: allocation fusion exhibits the anomaly described in the paper,
                which mostly occurs for smaller operand sizes but sometimes for large ones too.
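
 To illustrate the idea behind the ternary ADD microbenchmark, here is a
 host-side sketch in plain C (not the CUDA code in test/). It assumes that
 transfer fusion means moving several operands together in one larger
 transfer instead of one transfer per operand; the packing layout and the
 function names pack_operands and ternary_add_packed are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Pack three operand vectors of n floats each into one contiguous
   staging buffer of 3*n floats, so that a single transfer could move
   all three operands (assumed layout: a, then b, then c). */
void pack_operands(float *staging, const float *a, const float *b,
                   const float *c, size_t n)
{
    memcpy(staging,         a, n * sizeof(float));
    memcpy(staging + n,     b, n * sizeof(float));
    memcpy(staging + 2 * n, c, n * sizeof(float));
}

/* Ternary ADD that reads its three operands from the packed buffer. */
void ternary_add_packed(float *out, const float *packed, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = packed[i] + packed[n + i] + packed[2 * n + i];
}
```

 In a CUDA setting, the packed buffer would be copied to the device with one
 transfer call instead of three, which saves two transfer startup overheads.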


Christoph Kessler:
Global optimization of operand transfer fusion in heterogeneous computing.
Proc. 22nd International Workshop on Software and Compilers for Embedded Systems
 (SCOPES-2019), St. Goar, Germany, May 2019. ACM. 
 DOI: 10.1145/3323439.3323981