A tool for detecting memory performance bottlenecks in GPU programs
GRAB is the tool developed as part of the following paper:
Adrian Horga, Sudipta Chattopadhyay, Petru Eles, Zebo Peng,
Journal of systems architecture, ISSN 1383-7621, E-ISSN 1873-6165, 1-14 p.
Download the system image for Linux X86-64 systems.
Download the file, uncompress it, then follow the instructions in the README.txt file inside.
This image was created using the CDE virtualization tool.
The source code for GRAB can be downloaded from the Bitbucket repository.
Description of GRAB
GRAB is a debugging tool built on top of GPGPU-Sim. GPGPU-Sim is an open-source, cycle-accurate GPGPU simulator written in C++ that runs CUDA or OpenCL programs.
GRAB is currently available for debugging CUDA programs. These must be run using the version of GPGPU-Sim that includes GRAB.
GRAB captures and analyzes memory accesses to the GPU global memory and the L1 cache, and presents a report of the accesses that might reduce the performance of the GPU program. The novelty of GRAB is that it does not merely save the slow memory accesses but also computes root causes for them and reports these to the programmer. The root causes are fewer than the problematic memory locations, so the user can focus on a smaller number of locations to improve performance.
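To make concrete the kind of global-memory access that tends to be slow, the sketch below counts how many aligned memory segments a single warp touches for a given access stride. It is an illustration only and does not reproduce GRAB's analysis or its report format. The 128-byte segment size, the 4-byte element size and the function name `segments_touched` are assumptions chosen for the example.

```python
# Illustrative sketch only: this is NOT GRAB's algorithm or output.
# Assumed values (not taken from the GRAB documentation):
SEGMENT_BYTES = 128  # assumed size of one aligned memory segment
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
ELEM_BYTES = 4       # assumed element size (one 32-bit value per thread)


def segments_touched(stride, base=0):
    """Count the distinct aligned segments touched by one warp.

    Thread t accesses the element at index base + t * stride.
    More segments means more memory traffic for the same useful data.
    """
    addresses = [(base + t * stride) * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride {stride:2d}: {segments_touched(stride)} segment(s)")
```

Under these assumptions, a unit-stride access falls into a single segment, while a stride of 32 elements spreads the warp over 32 segments. A single source line with such a stride can therefore account for many slow memory locations, which is why a small set of root causes can cover a much larger set of reported locations.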
GRAB was developed to help programmers find bottlenecks in their GPU programs easily. More specifically, GRAB is intended for developers who are not necessarily experts in GPGPU programming.
Source code of tested programs
Our tested code is based on the samples from NVIDIA - CUDA code samples. We have modified the code so that it can run on the simulator and have improved it using the information given by GRAB.
- Bitonic Sort
- Matrix Multiplication
The basic GPU implementation of the Lattice Boltzmann Model is earlier work supported by a grant of the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, Project No. PN-II-ID-PCE-2011-3-0516.
Authors: Tonino Biciusca, Adrian Horga and Victor Sofonea
The following two implementations (LBM_V1 and LBM_V2) have been obtained step by step by using information given by GRAB:
To assess the improvement obtained via GRAB, we compare against an implementation written by an experienced GPU programmer. This implementation was obtained from the master's thesis of Adrian Horga ("Fluid Dynamics Simulation With Lattice Boltzmann Models Using Cuda Enabled GPGPUs", UPT, Romania, 2013). The thesis work started from LBM_basic and arrived at a version that uses registers to improve performance, together with modifications that differ from those in LBM_V1 and LBM_V2. The version obtained from the master's thesis is faster than LBM_V1 but slower than LBM_V2 when run on the simulator.
The implementation obtained from the master's thesis was part of the following paper: T. Biciusca, A. Horga, V. Sofonea, "Simulation of liquid-vapour phase separation on GPUs using Lattice Boltzmann models with off-lattice velocity sets", Comptes Rendus Mecanique (Elsevier), Volume 343, Issues 10-11, October-November 2015, Pages 580-588.
The code for the Susan image processing case study is taken from the Susan site.
We have modified the "susan_principle_small" function by transforming it into a GPU kernel (see Susan_basic.cu). From there, we have obtained the other version with the help of GRAB.