SkePU  1.2
 All Classes Namespaces Files Functions Variables Enumerations Friends Macros Groups Pages
Classes | Enumerations | Functions
skepu Namespace Reference

The main namespace for SkePU library. More...

Classes

class  Generate
 A class representing the Generate skeleton. More...
 
class  Map
 A class representing the Map skeleton. More...
 
class  MapArray
 A class representing the MapArray skeleton. More...
 
class  MapOverlap
 A class representing the MapOverlap skeleton. More...
 
class  MapReduce
 A class representing the MapReduce skeleton. More...
 
class  Matrix
 A matrix container class (2D matrix), internally uses 1D container (std::vector) to store elements in a contiguous memory allocations. More...
 
class  Reduce
 A class representing the Reduce skeleton both for 1D and 2D reduce operation for 1D Vector, 2D Dense Matrix/Sparse matrices. More...
 
class  Reduce< ReduceFunc, ReduceFunc >
 A specilalization of above class, used for 1D Reduce operation. Please note that the class name is same. The only difference is how you instantiate it either by passing 1 user function (i.e. 1D reduction) or 2 user function (i.e. 2D reduction). See code examples for more information. More...
 
class  Scan
 A class representing the Scan skeleton. More...
 
class  SparseMatrix
 A sparse matrix container class that mainly stores its data in CSR format. More...
 
class  MapOverlap2D
 A class representing the MapOverlap skeleton for 2D overlap for Matrix operands (useful for convolution and stencil computation). More...
 
class  DataCollector2D
 A class that can be used to collect 2D data. More...
 
class  DeviceAllocations_CU
 
class  Device_CL
 A class representing an OpenCL device. More...
 
class  Device_CU
 A class representing a CUDA device. More...
 
class  DeviceMemPointer_CL
 A class representing an OpenCL device memory allocation for container. More...
 
struct  UpdateInf
 
class  DeviceMemPointer_CU
 A class representing a CUDA device memory allocation for container. More...
 
struct  openclGenProp
 
struct  openclDeviceProp
 
class  EnvironmentDestroyer
 A class that is used to properly deallocate singelton object of Environment class. More...
 
class  Environment
 A class representing a execution environment. More...
 
struct  BackEndParams
 Can be used to specify properties for a backend. More...
 
class  ExecPlan
 A class that describes an execution plan. More...
 
class  TimerLinux_GTOD
 A class that can be used measure time on Linux systems. More...
 
struct  ExtraData
 Any extra information that User want to pass to the function wrapper for implementations can be specified here.... More...
 
struct  Point
 
struct  Node
 
class  Trainer
 end Node class... More...
 
class  TuneData
 A class that can be used to collect tuning data. More...
 
class  DeviceMemPointer_Matrix_CL
 A class representing an OpenCL device memory allocation for Matrix container. Not used much right now. More...
 
class  DeviceMemPointer_Matrix_CU
 A class representing a CUDA 2D device memory allocation for Matrix container. More...
 
class  Threads
 
class  ThreadPool
 ThreadPool class manages all the ThreadPool related activities. This includes keeping track of idle threads and synchronizations between all threads. More...
 
struct  Tuner
 Tuner class: generic definition.... Multiple class specializations are defined for this class, one for each skeleton type. It allows to avoid possible compiler errors considering differences in function arguments for different skeleton types. More...
 
struct  Tuner< StructType, MAPREDUCE, StructType2 >
 Tuner class specilization for MapReduce skeleton. More...
 
struct  Tuner< StructType, MAP, StructType >
 Tuner class specilization for Map skeleton. More...
 
struct  Tuner< StructType, REDUCE, StructType >
 Tuner class specilization for Reduce skeleton. More...
 
struct  Tuner< StructType, MAPARRAY, StructType >
 Tuner class specilization for MapArray skeleton. More...
 
struct  Tuner< StructType, MAPOVERLAP, StructType >
 Tuner class specilization for MapOverlap skeleton. More...
 
class  Vector
 A vector container class, implemented as a wrapper for std::vector. More...
 

Enumerations

enum  OverlapPolicy
 
enum  AccessType
 Can be used to specify whether the access is row-wise or column-wise. More...
 
enum  ReducePolicy
 Can be used to specify the direction of reduce for 2D containers. More...
 
enum  ScanType
 
enum  SparseFileFormat
 Can be used to specify the input format for a sparse matrix that is supplied in constructor. More...
 
enum  BackEnd
 Can be used to specify which backend to use.
 

Functions

double testHostToDeviceTransfer (unsigned int memSize, memoryMode memMode, bool wc)
 test the bandwidth of a host to device memcopy of a specific size
 
double testDeviceToDeviceTransfer (unsigned int memSize)
 test the bandwidth of a device to device memcopy of a specific size
 
DevTimingStruct measureOrLoadCUDABandwidth (int gpuId, bool pinnedMemory=false)
 
template<typename T >
std::string getDataTypeCL ()
 
template<>
std::string getDataTypeCL< int > ()
 
template<>
std::string getDataTypeCL< unsigned int > ()
 
template<>
std::string getDataTypeCL< long > ()
 
template<>
std::string getDataTypeCL< float > ()
 
template<>
std::string getDataTypeCL< double > ()
 
bool cudaPeerToPeerMemAccess (int gpuId1, int gpuId2)
 
static std::string TransposeKernelNoBankConflicts_CL ("__kernel void transposeNoBankConflicts(__global TYPE* odata, __global TYPE* idata, int width, int height, __local TYPE* sdata)\n""{\n"" int xIndex = get_group_id(0) * TILEDIM + get_local_id(0);\n"" int yIndex = get_group_id(1) * TILEDIM + get_local_id(1);\n"" int index_in = xIndex + (yIndex)*width;\n"" if(xIndex<width && yIndex<height)\n"" sdata[get_local_id(1)*TILEDIM+get_local_id(0)] = idata[index_in];\n"" xIndex = get_group_id(1) * TILEDIM + get_local_id(0);\n"" yIndex = get_group_id(0) * TILEDIM + get_local_id(1);\n"" int index_out = xIndex + (yIndex)*height;\n"" \n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(xIndex<height && yIndex<width)\n"" odata[index_out] = sdata[get_local_id(0)*TILEDIM+get_local_id(1)];\n""}\n")
 
static std::string GenerateKernel_CL ("__kernel void GenerateKernel_KERNELNAME(__global TYPE* output, size_t numElements, size_t indexOffset, CONST_TYPE const1)\n""{\n"" size_t i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < numElements)\n"" {\n"" output[i] = FUNCTIONNAME(i+indexOffset, const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string GenerateKernel_CL_Matrix ("__kernel void GenerateKernel_Matrix_KERNELNAME(__global TYPE* output, size_t numElements, size_t xsize, size_t ysize, size_t yoffset, CONST_TYPE const1)\n""{\n"" size_t xindex = get_global_id(0);\n"" size_t yindex = get_global_id(1);\n"" size_t i = yindex*xsize + xindex; \n"" if(i < numElements && xindex<xsize && yindex <ysize)\n"" {\n"" output[i] = FUNCTIONNAME(xindex, yindex+yoffset, const1);\n"" }\n""}\n")
 
template<typename T , typename GenerateFunc >
__global__ void GenerateKernel_CU (GenerateFunc generateFunc, T *output, size_t numElements, size_t indexOffset)
 
template<typename T , typename GenerateFunc >
__global__ void GenerateKernel_CU_Matrix (GenerateFunc generateFunc, T *output, size_t numElements, size_t xsize, size_t ysize, size_t yoffset)
 
static const std::string trimSpaces (const std::string &pString, const std::string &pWhitespace=" \t")
 
template<typename T >
get_random_number (T min, T max)
 
static std::string read_file_into_string (const std::string &filename)
 
static void toLowerCase (std::string &str)
 
static void toUpperCase (std::string &str)
 
static bool startsWith (const std::string &main, const std::string &prefix)
 
template<typename T >
void allocateHostMemory (T *&data, const size_t numElems)
 
template<typename T >
void deallocateHostMemory (T *data)
 
static std::string UnaryMapKernel_CL ("__kernel void UnaryMapKernel_KERNELNAME(__global TYPE* input, __global TYPE* output, size_t numElements, CONST_TYPE const1)\n""{\n"" int i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < numElements)\n"" {\n"" output[i] = FUNCTIONNAME(input[i], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string BinaryMapKernel_CL ("__kernel void BinaryMapKernel_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* output, size_t n, CONST_TYPE const1)\n""{\n"" int i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < n)\n"" {\n"" output[i] = FUNCTIONNAME(input1[i], input2[i], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string TrinaryMapKernel_CL ("__kernel void TrinaryMapKernel_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* input3, __global TYPE* output, size_t n, CONST_TYPE const1)\n""{\n"" int i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < n)\n"" {\n"" output[i] = FUNCTIONNAME(input1[i], input2[i], input3[i], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
template<typename T , typename UnaryFunc >
__global__ void MapKernelUnary_CU (UnaryFunc mapFunc, T *input, T *output, size_t n)
 
template<typename T , typename BinaryFunc >
__global__ void MapKernelBinary_CU (BinaryFunc mapFunc, T *input1, T *input2, T *output, size_t n)
 
template<typename T , typename TrinaryFunc >
__global__ void MapKernelTrinary_CU (TrinaryFunc mapFunc, T *input1, T *input2, T *input3, T *output, size_t n)
 
static std::string MapArrayKernel_CL ("__kernel void MapArrayKernel_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* output, size_t n, CONST_TYPE const1)\n""{\n"" size_t i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < n)\n"" {\n"" output[i] = FUNCTIONNAME(&input1[0], input2[i], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string MapArrayKernel_CL_Matrix_Blockwise ("__kernel void MapArrayKernel_Matrix_Blockwise_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* output, size_t outSize, size_t p2BlockSize, CONST_TYPE const1)\n""{\n"" size_t i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" if(i < outSize)\n"" {\n"" output[i] = FUNCTIONNAME(&input1[0], &input2[i*p2BlockSize], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string MapArrayKernel_CL_Sparse_Matrix_Blockwise ("__kernel void MapArrayKernel_Sparse_Matrix_Blockwise_KERNELNAME(__global TYPE* input1, __global TYPE* in2_values, __global size_t *in2_row_offsets, __global size_t *in2_col_indices, __global TYPE* output, size_t outSize, size_t indexOffset, CONST_TYPE const1)\n""{\n"" size_t i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" if(i < outSize)\n"" {\n"" size_t rowId = in2_row_offsets[i] - indexOffset;\n"" size_t row2Id = in2_row_offsets[i+1] - indexOffset;\n"" output[i] = FUNCTIONNAME(&input1[0], &in2_values[rowId], (row2Id-rowId), &in2_col_indices[rowId], const1);\n"" i += gridSize;\n"" }\n""}\n")
 
static std::string MapArrayKernel_CL_Matrix ("__kernel void MapArrayKernel_Matrix_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* output, size_t n, size_t xsize, size_t ysize, size_t yoffset, CONST_TYPE const1)\n""{\n"" size_t xindex = get_global_id(0);\n"" size_t yindex = get_global_id(1);\n"" size_t i = yindex*xsize + xindex; \n"" if(i < n && xindex<xsize && yindex <ysize)\n"" {\n"" output[i] = FUNCTIONNAME(&input1[0], input2[i], xindex, yindex+yoffset, const1);\n"" }\n""}\n")
 
template<typename T , typename ArrayFunc >
__global__ void MapArrayKernel_CU (ArrayFunc mapArrayFunc, T *input1, T *input2, T *output, size_t n)
 
template<typename T , typename ArrayFunc >
__global__ void MapArrayKernel_CU_Matrix_Blockwise (ArrayFunc mapArrayFunc, T *input1, T *input2, T *output, size_t outSize, size_t p2BlockSize)
 
template<typename T , typename ArrayFunc >
__global__ void MapArrayKernel_CU_Sparse_Matrix_Blockwise (ArrayFunc mapArrayFunc, T *input1, T *in2_values, size_t *in2_row_offsets, size_t *in2_col_indices, T *output, size_t outSize, size_t indexOffset)
 
template<typename T , typename ArrayFunc >
__global__ void MapArrayKernel_CU_Matrix (ArrayFunc mapArrayFunc, T *input1, T *input2, T *output, size_t n, size_t xsize, size_t ysize, size_t yoffset)
 
static std::string MatrixConvolSharedFilter_CL ("__kernel void conv_opencl_shared_filter_KERNELNAME(__global TYPE* input, __global TYPE* output, __constant TYPE* filter, size_t in_rows, size_t in_cols, size_t out_rows, size_t out_cols, size_t filter_rows, size_t filter_cols, size_t in_pitch, size_t out_pitch, size_t sharedRows, size_t sharedCols, __local TYPE* sdata)\n""{\n"" size_t xx = ( (size_t)(get_global_id(0)/get_local_size(0))) * get_local_size(0);\n"" size_t yy = ( (size_t)(get_global_id(1)/get_local_size(1))) * get_local_size(1);\n"" size_t x = get_global_id(0);\n"" size_t y = get_global_id(1);\n"" if(x<(out_cols+filter_cols-1) && y<(out_rows+filter_rows-1))\n"" {\n"" size_t sharedIdx = get_local_id(1) * sharedCols + get_local_id(0);\n"" sdata[sharedIdx]= input[y*in_pitch + x];\n"" size_t shared_x= get_local_id(0)+get_local_size(0);\n"" size_t shared_y= get_local_id(1);\n"" while(shared_y<sharedRows)\n"" {\n"" while(shared_x<sharedCols)\n"" {\n"" sharedIdx = shared_y * sharedCols + shared_x; \n"" sdata[sharedIdx]= input[(yy+shared_y) * in_pitch + xx + shared_x];\n"" shared_x = shared_x + get_local_size(0);\n"" }\n"" shared_x = get_local_id(0);\n"" shared_y = shared_y + get_local_size(1);\n"" } \n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(x<out_cols && y<out_rows)\n"" {\n"" TYPE sum=0;\n"" for(size_t j=0;j<filter_rows;j++) \n"" {\n"" for(size_t i=0;i<filter_cols;i++) \n"" {\n"" sum += sdata[(get_local_id(1)+j) * sharedCols + (get_local_id(0)+i) ] * filter[j*filter_cols+i];\n"" }\n"" }\n"" output[y*out_pitch+x] = sum / (filter_rows * filter_cols);\n"" }\n""}")
 
static std::string MatrixConvol2D_CL ("__kernel void conv_opencl_2D_KERNELNAME(__global TYPE* input, __global TYPE* output, size_t out_rows, size_t out_cols, size_t filter_rows, size_t filter_cols, size_t in_pitch, size_t out_pitch, size_t stride, size_t sharedRows, size_t sharedCols, __local TYPE* sdata)\n""{\n"" size_t xx = ( (size_t)(get_global_id(0)/get_local_size(0))) * get_local_size(0);\n"" size_t yy = ( (size_t)(get_global_id(1)/get_local_size(1))) * get_local_size(1);\n"" size_t x = get_global_id(0);\n"" size_t y = get_global_id(1);\n"" if(x<(out_cols+filter_cols-1) && y<(out_rows+filter_rows-1))\n"" {\n"" size_t sharedIdx = get_local_id(1) * sharedCols + get_local_id(0);\n"" sdata[sharedIdx]= input[y*in_pitch + x];\n"" size_t shared_x= get_local_id(0)+get_local_size(0);\n"" size_t shared_y= get_local_id(1);\n"" while(shared_y<sharedRows)\n"" {\n"" while(shared_x<sharedCols)\n"" {\n"" sharedIdx = shared_y * sharedCols + shared_x; \n"" sdata[sharedIdx]= input[(yy+shared_y) * in_pitch + xx + shared_x];\n"" shared_x = shared_x + get_local_size(0);\n"" }\n"" shared_x = get_local_id(0);\n"" shared_y = shared_y + get_local_size(1);\n"" } \n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(x<out_cols && y<out_rows)\n"" {\n"" output[y*out_pitch+x] = FUNCTIONNAME(&(sdata[(get_local_id(1)+(filter_rows/2)) * sharedCols + (get_local_id(0)+(filter_cols/2))]), stride);\n"" }\n""}")
 
static std::string MatrixConvolShared_CL ("__kernel void conv_opencl_shared_KERNELNAME(__global TYPE* input, __global TYPE* output, size_t in_rows, size_t in_cols, size_t out_rows, size_t out_cols, size_t filter_rows, size_t filter_cols, size_t in_pitch, size_t out_pitch, size_t sharedRows, size_t sharedCols, __local TYPE* sdata)\n""{\n"" size_t xx = ( (size_t)(get_global_id(0)/get_local_size(0))) * get_local_size(0);\n"" size_t yy = ( (size_t)(get_global_id(1)/get_local_size(1))) * get_local_size(1);\n"" size_t x = get_global_id(0);\n"" size_t y = get_global_id(1);\n"" if(x<(out_cols+filter_cols-1) && y<(out_rows+filter_rows-1))\n"" {\n"" size_t sharedIdx = get_local_id(1) * sharedCols + get_local_id(0);\n"" sdata[sharedIdx]= input[y*in_pitch + x];\n"" size_t shared_x= get_local_id(0)+get_local_size(0);\n"" size_t shared_y= get_local_id(1);\n"" while(shared_y<sharedRows)\n"" {\n"" while(shared_x<sharedCols)\n"" {\n"" sharedIdx = shared_y * sharedCols + shared_x; \n"" sdata[sharedIdx]= input[(yy+shared_y) * in_pitch + xx + shared_x];\n"" shared_x = shared_x + get_local_size(0);\n"" }\n"" shared_x = get_local_id(0);\n"" shared_y = shared_y + get_local_size(1);\n"" } \n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(x<out_cols && y<out_rows)\n"" {\n"" TYPE sum=0;\n"" for(size_t j=0;j<filter_rows;j++) \n"" {\n"" for(size_t i=0;i<filter_cols;i++) \n"" {\n"" sum += sdata[(get_local_id(1)+j) * sharedCols + (get_local_id(0)+i) ];\n"" }\n"" }\n"" output[y*out_pitch+x] = sum / (filter_rows * filter_cols);\n"" }\n""}")
 
template<typename T >
max (T a, T b)
 
template<typename T >
min (T a, T b)
 
template<typename T >
size_t calculateTiling (size_t regCountPerThread, size_t filterSizeX, size_t filterSizeY, size_t inputSizeX, bool maximizeTiling=false)
 
template<typename T , typename OverlapFunc >
__global__ void conv_cuda_2D_kernel (OverlapFunc mapOverlapFunc, T *input, T *output, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_kernel (T *input, T *output, const size_t in_rows, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_kernel (T *input, T *output, const size_t numTiles, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_2_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_4_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_6_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_8_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_10_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_12_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_14_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
template<bool useFilter, typename T >
__global__ void conv_cuda_shared_tiling_16_kernel (T *input, T *output, const size_t in_cols, const size_t out_rows, const size_t out_cols, const size_t filter_rows, const size_t filter_cols, size_t in_pitch, size_t out_pitch, const size_t sharedRows, const size_t sharedCols)
 
static std::string MapOverlapKernel_CL ("__kernel void MapOverlapKernel_KERNELNAME(__global TYPE* input, __global TYPE* output, __global TYPE* wrap, size_t n, size_t overlap, size_t out_offset, size_t out_numelements, int poly, TYPE pad, __local TYPE* sdata)\n""{\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0) * get_local_size(0) + get_local_id(0);\n"" if(poly == 0)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[i] : pad;\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (get_group_id(0) == 0) ? pad : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != get_num_groups(0)-1 && i+overlap < n) ? input[i+overlap] : pad;\n"" }\n"" }\n"" else if(poly == 1)\n"" {\n"" if(i < n)\n"" {\n"" sdata[overlap+tid] = input[i];\n"" }\n"" else if(i-n < overlap)\n"" {\n"" sdata[overlap+tid] = wrap[overlap+(i-n)];\n"" }\n"" else\n"" {\n"" sdata[overlap+tid] = pad;\n"" }\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (get_group_id(0) == 0) ? wrap[tid] : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != get_num_groups(0)-1 && i+overlap < n) ? input[i+overlap] : wrap[overlap+(i+overlap-n)];\n"" }\n"" }\n"" else if(poly == 2)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[i] : input[n-1];\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (get_group_id(0) == 0) ? input[0] : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != get_num_groups(0)-1 && i+overlap < n) ? input[i+overlap] : input[n-1];\n"" }\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if( (i >= out_offset) && (i < out_offset+out_numelements) )\n"" {\n"" output[i-out_offset] = FUNCTIONNAME(&(sdata[tid+overlap]));\n"" }\n""}\n")
 
static std::string MapOverlapKernel_CL_Matrix_Row ("__kernel void MapOverlapKernel_MatRowWise_KERNELNAME(__global TYPE* input, __global TYPE* output, __global TYPE* wrap, size_t n, size_t overlap, size_t out_offset, size_t out_numelements, int poly, TYPE pad, size_t blocksPerRow, size_t rowWidth, __local TYPE* sdata)\n""{\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0) * get_local_size(0) + get_local_id(0);\n"" size_t wrapIndex= 2 * overlap * (int)(get_group_id(0)/blocksPerRow);\n"" size_t tmp= (get_group_id(0) % blocksPerRow);\n"" size_t tmp2= (get_group_id(0) / blocksPerRow);\n"" if(poly == 0)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[i] : pad;\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? pad : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (i+overlap < n) && tmp!=(blocksPerRow-1)) ? input[i+overlap] : pad;\n"" }\n"" }\n"" else if(poly == 1)\n"" {\n"" if(i < n)\n"" {\n"" sdata[overlap+tid] = input[i];\n"" }\n"" else if(i-n < overlap)\n"" {\n"" sdata[overlap+tid] = wrap[(overlap+(i-n))+ wrapIndex];\n"" }\n"" else\n"" {\n"" sdata[overlap+tid] = pad;\n"" }\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? wrap[tid+wrapIndex] : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && i+overlap < n && tmp!=(blocksPerRow-1)) ? input[i+overlap] : wrap[overlap+wrapIndex+(tid+overlap-get_local_size(0))];\n"" }\n"" }\n"" else if(poly == 2)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[i] : input[n-1];\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? input[tmp2*rowWidth] : input[i-overlap];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (i+overlap < n) && (tmp!=(blocksPerRow-1))) ? input[i+overlap] : input[(tmp2+1)*rowWidth-1];\n"" }\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if( (i >= out_offset) && (i < out_offset+out_numelements) )\n"" {\n"" output[i-out_offset] = FUNCTIONNAME(&(sdata[tid+overlap]));\n"" }\n""}\n")
 
static std::string MapOverlapKernel_CL_Matrix_Col ("__kernel void MapOverlapKernel_MatColWise_KERNELNAME(__global TYPE* input, __global TYPE* output, __global TYPE* wrap, size_t n, size_t overlap, size_t out_offset, size_t out_numelements, int poly, TYPE pad, size_t blocksPerCol, size_t rowWidth, size_t colWidth, __local TYPE* sdata)\n""{\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0) * get_local_size(0) + get_local_id(0);\n"" size_t wrapIndex= 2 * overlap * (int)(get_group_id(0)/blocksPerCol);\n"" size_t tmp= (get_group_id(0) % blocksPerCol);\n"" size_t tmp2= (get_group_id(0) / blocksPerCol);\n"" size_t arrInd = (tid + tmp*get_local_size(0))*rowWidth + tmp2;\n"" if(poly == 0)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[arrInd] : pad;\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? pad : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+(overlap*rowWidth))] : pad;\n"" }\n"" }\n"" else if(poly == 1)\n"" {\n"" if(i < n)\n"" {\n"" sdata[overlap+tid] = input[arrInd];\n"" }\n"" else if(i-n < overlap)\n"" {\n"" sdata[overlap+tid] = wrap[(overlap+(i-n))+ wrapIndex];\n"" }\n"" else\n"" {\n"" sdata[overlap+tid] = pad;\n"" }\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? wrap[tid+wrapIndex] : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+(overlap*rowWidth))] : wrap[overlap+wrapIndex+(tid+overlap-get_local_size(0))];\n"" }\n"" }\n"" else if(poly == 2)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[arrInd] : input[n-1];\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? input[tmp2] : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+(overlap*rowWidth))] : input[tmp2+(colWidth-1)*rowWidth];\n"" }\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if( (arrInd >= out_offset) && (arrInd < out_offset+out_numelements) )\n"" {\n"" output[arrInd-out_offset] = FUNCTIONNAME(&(sdata[tid+overlap]));\n"" }\n""}\n")
 
static std::string MapOverlapKernel_CL_Matrix_ColMulti ("__kernel void MapOverlapKernel_MatColWiseMulti_KERNELNAME(__global TYPE* input, __global TYPE* output, __global TYPE* wrap, size_t n, size_t overlap, size_t in_offset, size_t out_numelements, int poly, int deviceType, TYPE pad, size_t blocksPerCol, size_t rowWidth, size_t colWidth, __local TYPE* sdata)\n""{\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0) * get_local_size(0) + get_local_id(0);\n"" size_t wrapIndex= 2 * overlap * (int)(get_group_id(0)/blocksPerCol);\n"" size_t tmp= (get_group_id(0) % blocksPerCol);\n"" size_t tmp2= (get_group_id(0) / blocksPerCol);\n"" size_t arrInd = (tid + tmp*get_local_size(0))*rowWidth + tmp2;\n"" if(poly == 0)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[arrInd+in_offset] : pad;\n"" if(deviceType == -1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? pad : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" \n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 0) \n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+in_offset+(overlap*rowWidth))] : pad;\n"" }\n"" }\n"" }\n"" else if(poly == 1)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[arrInd+in_offset] : ((i-n < overlap) ? wrap[(i-n)+ (overlap * tmp2)] : pad);\n"" if(deviceType == -1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? wrap[tid+(overlap * tmp2)] : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 0)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+in_offset+(overlap*rowWidth))] : wrap[(overlap * tmp2)+(tid+overlap-get_local_size(0))];\n"" }\n"" }\n"" }\n"" else if(poly == 2)\n"" {\n"" sdata[overlap+tid] = (i < n) ? input[arrInd+in_offset] : input[n+in_offset-1];\n"" if(deviceType == -1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = (tmp==0) ? input[tmp2] : input[(arrInd-(overlap*rowWidth))];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 0)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = input[(arrInd+in_offset+(overlap*rowWidth))];\n"" }\n"" }\n"" else if(deviceType == 1)\n"" {\n"" if(tid < overlap)\n"" {\n"" sdata[tid] = input[arrInd];\n"" }\n"" if(tid >= (get_local_size(0)-overlap))\n"" {\n"" sdata[tid+2*overlap] = (get_group_id(0) != (get_num_groups(0)-1) && (arrInd+(overlap*rowWidth)) < n && (tmp!=(blocksPerCol-1))) ? input[(arrInd+in_offset+(overlap*rowWidth))] : input[tmp2+in_offset+(colWidth-1)*rowWidth];\n"" }\n"" }\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if( arrInd < out_numelements )\n"" {\n"" output[arrInd] = FUNCTIONNAME(&(sdata[tid+overlap]));\n"" }\n""}\n")
 
template<typename T >
__global__ void transpose (T *odata, T *idata, size_t width, size_t height)
 
template<int poly, typename T , typename OverlapFunc >
__global__ void MapOverlapKernel_CU (OverlapFunc mapOverlapFunc, T *input, T *output, T *wrap, size_t n, size_t out_offset, size_t out_numelements, T pad)
 
template<int poly, typename T , typename OverlapFunc >
__global__ void MapOverlapKernel_CU_Matrix_Row (OverlapFunc mapOverlapFunc, T *input, T *output, T *wrap, size_t n, size_t out_offset, size_t out_numelements, T pad, size_t blocksPerRow, size_t rowWidth)
 
template<int poly, typename T , typename OverlapFunc >
__global__ void MapOverlapKernel_CU_Matrix_Col (OverlapFunc mapOverlapFunc, T *input, T *output, T *wrap, size_t n, size_t out_offset, size_t out_numelements, T pad, size_t blocksPerCol, size_t rowWidth, size_t colWidth)
 
template<int poly, int deviceType, typename T , typename OverlapFunc >
__global__ void MapOverlapKernel_CU_Matrix_ColMulti (OverlapFunc mapOverlapFunc, T *input, T *output, T *wrap, size_t n, size_t in_offset, size_t out_numelements, T pad, size_t blocksPerCol, size_t rowWidth, size_t colWidth)
 
static std::string UnaryMapReduceKernel_CL ("__kernel void UnaryMapReduceKernel_KERNELNAME(__global TYPE* input, __global TYPE* output, size_t n, __local TYPE* sdata, CONST_TYPE const1)\n""{\n"" size_t blockSize = get_local_size(0);\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0)*blockSize + get_local_id(0);\n"" size_t gridSize = blockSize*get_num_groups(0);\n"" TYPE result = 0;\n"" if(i < n)\n"" {\n"" result = FUNCTIONNAME_MAP(input[i], const1);\n"" i += gridSize;\n"" }\n"" while(i < n)\n"" {\n"" TYPE tempMap;\n"" tempMap = FUNCTIONNAME_MAP(input[i], const1);\n"" result = FUNCTIONNAME_REDUCE(result, tempMap, (TYPE)0);\n"" i += gridSize;\n"" }\n"" sdata[tid] = result;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(blockSize >= 512) { if (tid < 256 && tid + 256 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 256], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 256) { if (tid < 128 && tid + 128 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 128], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 128) { if (tid < 64 && tid + 64 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 64], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 64) { if (tid < 32 && tid + 32 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 32], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 32) { if (tid < 16 && tid + 16 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 16], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 16) { if (tid < 8 && tid + 8 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 8], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 8) { if (tid < 4 && tid + 4 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 4], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 4) { if (tid < 2 && tid + 2 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 2], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 2) { if (tid < 1 && tid + 1 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 1], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(tid == 0)\n"" {\n"" output[get_group_id(0)] = sdata[tid];\n"" }\n""}\n")
 
static std::string BinaryMapReduceKernel_CL ("__kernel void BinaryMapReduceKernel_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* output, size_t n, __local TYPE* sdata, CONST_TYPE const1)\n""{\n"" size_t blockSize = get_local_size(0);\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0)*blockSize + get_local_id(0);\n"" size_t gridSize = blockSize*get_num_groups(0);\n"" TYPE result = 0;\n"" if(i < n)\n"" {\n"" result = FUNCTIONNAME_MAP(input1[i], input2[i], const1);\n"" i += gridSize;\n"" }\n"" while(i < n)\n"" {\n"" TYPE tempMap;\n"" tempMap = FUNCTIONNAME_MAP(input1[i], input2[i], const1);\n"" result = FUNCTIONNAME_REDUCE(result, tempMap, (TYPE)0);\n"" i += gridSize;\n"" }\n"" sdata[tid] = result;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(blockSize >= 512) { if (tid < 256 && tid + 256 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 256], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 256) { if (tid < 128 && tid + 128 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 128], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 128) { if (tid < 64 && tid + 64 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 64], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 64) { if (tid < 32 && tid + 32 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 32], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 32) { if (tid < 16 && tid + 16 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 16], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 16) { if (tid < 8 && tid + 8 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 8], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 8) { if (tid < 4 && tid + 4 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 4], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 4) { if (tid < 2 && tid + 2 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 2], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 2) { if (tid < 1 && tid + 1 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 1], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(tid == 0)\n"" {\n"" output[get_group_id(0)] = sdata[tid];\n"" }\n""}\n")
 
static std::string TrinaryMapReduceKernel_CL ("__kernel void TrinaryMapReduceKernel_KERNELNAME(__global TYPE* input1, __global TYPE* input2, __global TYPE* input3, __global TYPE* output, size_t n, __local TYPE* sdata, CONST_TYPE const1)\n""{\n"" size_t blockSize = get_local_size(0);\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0)*blockSize + get_local_id(0);\n"" size_t gridSize = blockSize*get_num_groups(0);\n"" TYPE result = 0;\n"" if(i < n)\n"" {\n"" result = FUNCTIONNAME_MAP(input1[i], input2[i], input3[i], const1);\n"" i += gridSize;\n"" }\n"" while(i < n)\n"" {\n"" TYPE tempMap;\n"" tempMap = FUNCTIONNAME_MAP(input1[i], input2[i], input3[i], const1);\n"" result = FUNCTIONNAME_REDUCE(result, tempMap, (TYPE)0);\n"" i += gridSize;\n"" }\n"" sdata[tid] = result;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(blockSize >= 512) { if (tid < 256 && tid + 256 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 256], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 256) { if (tid < 128 && tid + 128 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 128], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 128) { if (tid < 64 && tid + 64 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 64], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 64) { if (tid < 32 && tid + 32 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 32], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 32) { if (tid < 16 && tid + 16 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 16], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 16) { if (tid < 8 && tid + 8 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 8], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 8) { if (tid < 4 && tid + 4 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 4], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 4) { if (tid < 2 && tid + 2 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 2], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 2) { if (tid < 1 && tid + 1 < n) { sdata[tid] = FUNCTIONNAME_REDUCE(sdata[tid], sdata[tid + 1], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(tid == 0)\n"" {\n"" output[get_group_id(0)] = sdata[tid];\n"" }\n""}\n")
 
template<typename T , typename UnaryFunc , typename BinaryFunc >
__global__ void MapReduceKernel1_CU (UnaryFunc mapFunc, BinaryFunc reduceFunc, T *input, T *output, size_t n)
 
template<typename T , typename BinaryFunc1 , typename BinaryFunc2 >
__global__ void MapReduceKernel2_CU (BinaryFunc1 mapFunc, BinaryFunc2 reduceFunc, T *input1, T *input2, T *output, size_t n)
 
template<typename T , typename TrinaryFunc , typename BinaryFunc >
__global__ void MapReduceKernel3_CU (TrinaryFunc mapFunc, BinaryFunc reduceFunc, T *input1, T *input2, T *input3, T *output, size_t n)
 
template<typename T >
__global__ void transposeNaive (T *odata, T *idata, size_t width, size_t height)
 A näive CUDA kernel to take Matrix transpose.
 
template<typename T >
__global__ void transposeNoBankConflicts (T *odata, T *idata, size_t width, size_t height)
 An optimized CUDA kernel to take Matrix transpose.
 
template<typename ReduceFunc , typename T >
void ompRegularWorkload (ReduceFunc *reduceFunc, SparseMatrix< T > &input, T *result_array, const unsigned int &numThreads)
 
template<typename ReduceFunc , typename T >
void ompIrregularWorkload (ReduceFunc *reduceFunc, SparseMatrix< T > &input, T *result_array)
 
static std::string ReduceKernel_CL ("__kernel void ReduceKernel_KERNELNAME(__global TYPE* input, __global TYPE* output, size_t n, __local TYPE* sdata)\n""{\n"" size_t blockSize = get_local_size(0);\n"" size_t tid = get_local_id(0);\n"" size_t i = get_group_id(0)*blockSize + get_local_id(0);\n"" size_t gridSize = blockSize*get_num_groups(0);\n"" TYPE result = 0;\n"" if(i < n)\n"" {\n"" result = input[i];\n"" i += gridSize;\n"" }\n"" while(i < n)\n"" {\n"" result = FUNCTIONNAME(result, input[i], (TYPE)0);\n"" i += gridSize;\n"" }\n"" sdata[tid] = result;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(blockSize >= 512) { if (tid < 256 && tid + 256 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 256], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 256) { if (tid < 128 && tid + 128 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 128], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 128) { if (tid < 64 && tid + 64 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 64], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 64) { if (tid < 32 && tid + 32 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 32], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 32) { if (tid < 16 && tid + 16 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 16], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 16) { if (tid < 8 && tid + 8 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 8], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 8) { if (tid < 4 && tid + 4 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 4], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 4) { if (tid < 2 && tid + 2 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 2], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(blockSize >= 2) { if (tid < 1 && tid + 1 < n) { sdata[tid] = FUNCTIONNAME(sdata[tid], sdata[tid + 1], (TYPE)0); } barrier(CLK_LOCAL_MEM_FENCE); }\n"" if(tid == 0)\n"" {\n"" output[get_group_id(0)] = sdata[tid];\n"" }\n""}\n")
 
size_t nextPow2 (size_t x)
 A helper to return a value that is nearest value that is power of 2. More...
 
void getNumBlocksAndThreads (size_t n, size_t maxBlocks, size_t maxThreads, size_t &blocks, size_t &threads)
 
template<typename T , typename BinaryFunc >
__global__ void ReduceKernel_CU_oldAndIncorrect (BinaryFunc reduceFunc, T *input, T *output, size_t n)
 
template<typename T , typename BinaryFunc , size_t blockSize, bool nIsPow2>
__global__ void ReduceKernel_CU (BinaryFunc reduceFunc, T *input, T *output, size_t n)
 
bool isPow2 (size_t x)
 A small helper to determine whether the number is a power of 2. More...
 
template<typename ReduceFunc , typename T >
void CallReduceKernel (ReduceFunc *reduceFunc, size_t size, size_t numThreads, size_t numBlocks, T *d_idata, T *d_odata, bool enableIsPow2=true)
 
template<typename ReduceFunc , typename T >
void ExecuteReduceOnADevice (ReduceFunc *reduceFunc, size_t n, size_t numThreads, size_t numBlocks, size_t maxThreads, size_t maxBlocks, T *d_idata, T *d_odata, unsigned int deviceID, bool enableIsPow2=true)
 
template<typename BinaryFunc , typename T >
static T scanLargeVectorRecursivelyM_CU (DeviceMemPointer_CU< T > *input, DeviceMemPointer_CU< T > *output, std::vector< DeviceMemPointer_CU< T > * > &blockSums, size_t numElements, unsigned int level, ScanType type, T init, Device_CU *device, BinaryFunc scanFunc)
 
static std::string ScanKernel_CL ("__kernel void ScanKernel_KERNELNAME(__global TYPE* input, __global TYPE* output, __global TYPE* blockSums, size_t n, size_t numElements, __local TYPE* sdata)\n""{\n"" size_t threadIdx = get_local_id(0);\n"" size_t blockDim = get_local_size(0);\n"" size_t blockIdx = get_group_id(0);\n"" size_t gridDim = get_num_groups(0);\n"" size_t thid = threadIdx;\n"" unsigned int pout = 0;\n"" unsigned int pin = 1;\n"" size_t mem = get_global_id(0);\n"" size_t blockNr = blockIdx;\n"" size_t gridSize = blockDim*gridDim;\n"" size_t numBlocks = numElements/(blockDim) + (numElements%(blockDim) == 0 ? 0:1);\n"" size_t offset;\n"" while(blockNr < numBlocks)\n"" {\n"" sdata[pout*n+thid] = (mem < numElements) ? input[mem] : 0;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" for(offset = 1; offset < n; offset *=2)\n"" {\n"" pout = 1-pout;\n"" pin = 1-pout;\n"" if(thid >= offset)\n"" sdata[pout*n+thid] = FUNCTIONNAME(sdata[pin*n+thid], sdata[pin*n+thid-offset], (TYPE)0);\n"" else\n"" sdata[pout*n+thid] = sdata[pin*n+thid];\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" }\n"" if(thid == blockDim - 1)\n"" blockSums[blockNr] = sdata[pout*n+blockDim-1];\n"" if(mem < numElements)\n"" output[mem] = sdata[pout*n+thid];\n"" mem += gridSize;\n"" blockNr += gridDim;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" }\n""}\n")
 
static std::string ScanUpdate_CL ("__kernel void ScanUpdate_KERNELNAME(__global TYPE* data, __global TYPE* sums, int isInclusive, TYPE init, size_t n, __global TYPE* ret, __local TYPE* sdata)\n""{\n"" __local TYPE offset;\n"" __local TYPE inc_offset;\n"" size_t threadIdx = get_local_id(0);\n"" size_t blockDim = get_local_size(0);\n"" size_t blockIdx = get_group_id(0);\n"" size_t gridDim = get_num_groups(0);\n"" size_t thid = threadIdx;\n"" size_t blockNr = blockIdx;\n"" size_t gridSize = blockDim*gridDim;\n"" size_t mem = get_global_id(0);\n"" size_t numBlocks = n/(blockDim) + (n%(blockDim) == 0 ? 0:1);\n"" while(blockNr < numBlocks)\n"" {\n"" if(thid == 0)\n"" {\n"" if(isInclusive == 0)\n"" {\n"" offset = init;\n"" if(blockNr > 0)\n"" {\n"" offset = FUNCTIONNAME(offset, sums[blockNr-1], (TYPE)0);\n"" inc_offset = sums[blockNr-1];\n"" }\n"" }\n"" else\n"" {\n"" if(blockNr > 0)\n"" offset = sums[blockNr-1];\n"" }\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(isInclusive == 1)\n"" {\n"" if(blockNr > 0)\n"" sdata[thid] = (mem < n) ? FUNCTIONNAME(offset, data[mem], (TYPE)0) : 0;\n"" else\n"" sdata[thid] = (mem < n) ? data[mem] : 0;\n"" if(mem == n-1)\n"" *ret = sdata[thid];\n"" }\n"" else\n"" {\n"" if(mem == n-1)\n"" *ret = FUNCTIONNAME(inc_offset, data[mem], (TYPE)0);\n"" if(thid == 0)\n"" sdata[thid] = offset;\n"" else\n"" sdata[thid] = (mem-1 < n) ? FUNCTIONNAME(offset, data[mem-1], (TYPE)0) : 0;\n"" }\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(mem < n)\n"" data[mem] = sdata[thid];\n"" mem += gridSize;\n"" blockNr += gridDim;\n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" }\n""}\n")
 
static std::string ScanAdd_CL ("__kernel void ScanAdd_KERNELNAME(__global TYPE* data, TYPE sum, size_t n)\n""{\n"" size_t i = get_global_id(0);\n"" size_t gridSize = get_local_size(0)*get_num_groups(0);\n"" while(i < n)\n"" {\n"" data[i] = FUNCTIONNAME(data[i], sum, (TYPE)0);\n"" i += gridSize;\n"" }\n""}\n")
 
template<typename T , typename BinaryFunc >
__global__ void ScanKernel_CU (BinaryFunc scanFunc, T *input, T *output, T *blockSums, size_t n, size_t numElements)
 
template<typename T , typename BinaryFunc >
__global__ void ScanUpdate_CU (BinaryFunc scanFunc, T *data, T *sums, int isInclusive, T init, size_t n, T *ret)
 
template<typename T , typename BinaryFunc >
__global__ void ScanAdd_CU (BinaryFunc scanFunc, T *data, T sum, size_t n)
 
template<typename T >
void ExecuteReduceOnADevice (size_t n, const size_t &numThreads, const size_t &numBlocks, _cl_mem *&in_p, _cl_mem *&out_p, cl_kernel &kernel, Device_CL *device)
 
void replaceTextInString (std::string &text, std::string find, std::string replace)
 
void printCLError (cl_int Err, std::string s=std::string())
 
std::ostream & operator<< (std::ostream &os, ExecPlanNew< 1 > &plan)
 
std::ostream & operator<< (std::ostream &os, ExecPlanNew< 2 > &plan)
 
std::ostream & operator<< (std::ostream &os, const std::vector< size_t > &vec)
 
template<typename StructType , typename StructType2 >
void cpu_tune_wrapper_map (void *arg)
 Do training execution for a single performance context for Map skeleton and sequential CPU implementation. More...
 
template<typename StructType , typename StructType2 >
void cpu_tune_wrapper_reduce (void *arg)
 Do training execution for a single performance context for Reduce skeleton and sequential CPU implementation. More...
 
template<typename StructType , typename StructType2 >
void cpu_tune_wrapper_mapoverlap (void *arg)
 Do training execution for a single performance context for MapOverlap skeleton and sequential CPU implementation. More...
 
template<typename StructType , typename StructType2 >
void cpu_tune_wrapper_maparray (void *arg)
 Do training execution for a single performance context for MapArray skeleton and sequential CPU implementation. More...
 
template<typename StructType , typename StructType2 >
void cpu_tune_wrapper_mapreduce (void *arg)
 Do training execution for a single performance context for MapReduce skeleton and sequential CPU implementation. More...
 
template<typename StructType , typename StructType2 >
void omp_tune_wrapper_map (void *arg)
 the following section contains function that can train OpenMP implementations. Only enabled when OpenMP is enabled in SkePU library More...
 
template<typename StructType , typename StructType2 >
void omp_tune_wrapper_reduce (void *arg)
 Do training execution for a single performance context for Reduce skeleton and parallel OpenMP implementation. More...
 
template<typename StructType , typename StructType2 >
void omp_tune_wrapper_mapoverlap (void *arg)
 Do training execution for a single performance context for MapOverlap skeleton and parallel OpenMP implementation. More...
 
template<typename StructType , typename StructType2 >
void omp_tune_wrapper_maparray (void *arg)
 Do training execution for a single performance context for MapArray skeleton and parallel OpenMP implementation. More...
 
template<typename StructType , typename StructType2 >
void omp_tune_wrapper_mapreduce (void *arg)
 Do training execution for a single performance context for MapReduce skeleton and parallel OpenMP implementation. More...
 
template<typename StructType , typename StructType2 >
void cuda_tune_wrapper_map (void *arg)
 the following functions train for CUDA implementations for different skeletons... More...
 
template<typename StructType , typename StructType2 >
void cuda_tune_wrapper_reduce (void *arg)
 Do training execution for a single performance context for Reduce skeleton and CUDA implementation. More...
 
template<typename StructType , typename StructType2 >
void cuda_tune_wrapper_mapoverlap (void *arg)
 Do training execution for a single performance context for MapOverlap skeleton and CUDA implementation. More...
 
template<typename StructType , typename StructType2 >
void cuda_tune_wrapper_maparray (void *arg)
 Do training execution for a single performance context for MapArray skeleton and CUDA implementation. More...
 
template<typename StructType , typename StructType2 >
void cuda_tune_wrapper_mapreduce (void *arg)
 Do training execution for a single performance context for MapReduce skeleton and CUDA implementation. More...
 
void createDefaultConfiguration (BackEndParams &bp)
 A helper function that creates the default configuration. More...
 
bool loadExecPlan (std::string id, ExecPlan &plan)
 Loads an execution plan for a file into the structure passes as argument. More...
 
bool storeExecPlan (std::string id, const ExecPlan &plan)
 Stores an execution plan for the structure passed as argument to a file. More...
 
bool loadExecPlanArray (std::string id, ExecPlan *planArray)
 Loads execution plans for a file into the structure passes as argument. More...
 
bool storeExecPlanArray (std::string id, const ExecPlan *planArray, unsigned int nImpls)
 Stores execution plans for the structure passed as argument to a file. More...
 

Detailed Description

The main namespace for SkePU library.

All classes and functions in the SkePU library are in this namespace.

Enumeration Type Documentation

Can be used to specify whether the access is row-wise or column-wise.

Used in some cases to mention type of access required in a certain operation.

Enumeration of the different edge policies (what happens when a read outside the vector is performed) that the map overlap skeletons support.

Can be used to specify the direction of reduce for 2D containers.

Used in reduction operations for 2D containers.

Enumeration of the two types of Scan that can be performed: Inclusive and Exclusive.

Can be used to specify the input format for a sparse matrix that is supplied in constructor.

Used to load sparse matrix from existing files.

Function Documentation

template<typename T >
void skepu::allocateHostMemory ( T *&  data,
const size_t  numElems 
)

Method to allocate host memory of a given size. Can do pinned memory allocation if enabled.

bool skepu::cudaPeerToPeerMemAccess ( int  gpuId1,
int  gpuId2 
)

A helper function to check if peer-to-peer mem transfers can be done bwteen 2 gpus.. if yes, it enables it and return true, otherwise false....

template<typename T >
void skepu::deallocateHostMemory ( T *  data)

Method to deallocate host memory.

template<typename T >
void skepu::ExecuteReduceOnADevice ( size_t  n,
const size_t &  numThreads,
const size_t &  numBlocks,
_cl_mem *&  in_p,
_cl_mem *&  out_p,
cl_kernel &  kernel,
Device_CL *  device 
)

A helper function that is used to call the actual kernel for reduction. Used by other functions to call the actual kernel Internally, it just calls 2 kernels by setting their arguments. No synchronization is enforced.

Parameters
nsize of the input array to be reduced.
numThreadsNumber of threads to be used for kernel execution.
numBlocksNumber of blocks to be used for kernel execution.
in_pOpenCL memory pointer to input array.
out_pOpenCL memory pointer to output array.
kernelOpenCL kernel handle.
deviceOpenCL device handle.

References skepu::Device_CL::getQueue().

Here is the call graph for this function:

template<typename T >
T skepu::get_random_number ( min,
max 
)
inline

Method to get a random number between a given range.

References min().

Referenced by skepu::SparseMatrix< T >::SparseMatrix().

Here is the call graph for this function:

template<typename T >
std::string skepu::getDataTypeCL ( )

helper to return data type in a string format using template specialication technique.

template<>
std::string skepu::getDataTypeCL< double > ( )

helper to return data type in a string format using template specialication technique.

template<>
std::string skepu::getDataTypeCL< float > ( )

helper to return data type in a string format using template specialication technique.

template<>
std::string skepu::getDataTypeCL< int > ( )

helper to return data type in a string format using template specialication technique.

template<>
std::string skepu::getDataTypeCL< long > ( )

helper to return data type in a string format using template specialication technique.

template<>
std::string skepu::getDataTypeCL< unsigned int > ( )

helper to return data type in a string format using template specialication technique.

DevTimingStruct skepu::measureOrLoadCUDABandwidth ( int  gpuId,
bool  pinnedMemory = false 
)

This function is called to measure or load bandwidth for a GPU

Parameters
gpuIdID of the GPU for which to measure bandwidth
pinnedMemoryflag specifying whether to measure via pinned memory or not
Returns
A structure containing bandwidth measurement information.

Referenced by skepu::Environment< T >::Environment().

template<typename ReduceFunc , typename T >
void skepu::ompIrregularWorkload ( ReduceFunc *  reduceFunc,
SparseMatrix< T > &  input,
T *  result_array 
)

A function to do 1D reduction on OpenMP considering ir-regular work-load per row. Useful for un-structured sparse matrices.

References skepu::SparseMatrix< T >::begin(), and skepu::SparseMatrix< T >::total_rows().

Here is the call graph for this function:

template<typename ReduceFunc , typename T >
void skepu::ompRegularWorkload ( ReduceFunc *  reduceFunc,
SparseMatrix< T > &  input,
T *  result_array,
const unsigned int &  numThreads 
)

A function to do 1D reduction on OpenMP considering regular work-load per row. Useful for 2D dense matrix as well as structured sparse matrices.

References skepu::SparseMatrix< T >::begin(), and skepu::SparseMatrix< T >::total_rows().

Here is the call graph for this function:

std::ostream& skepu::operator<< ( std::ostream &  os,
const std::vector< size_t > &  vec 
)

To display vector contents...

void skepu::printCLError ( cl_int  Err,
std::string  s = std::string() 
)

A helper function for OpenCL backends. It takes an OpenCL error code and prints the corresponding error message

Parameters
ErrOpenCL error
sOptional text string that may give more information on the error source

Referenced by skepu::MapOverlap2D< MapOverlap2DFunc >::CL(), skepu::Generate< GenerateFunc >::CL(), skepu::MapArray< MapArrayFunc >::CL(), and skepu::Reduce< ReduceFunc, ReduceFunc >::CL().

static std::string skepu::read_file_into_string ( const std::string &  filename)
static

Method to read text file data into a string.

Referenced by skepu::Environment< T >::createOpenCLProgramForMatrixTranspose().

void skepu::replaceTextInString ( std::string &  text,
std::string  find,
std::string  replace 
)

A helper function used by createOpenCLProgram(). It finds all instances of a string in another string and replaces it with a third string.

Parameters
textA std::string which is searched.
findThe std::string which is searched for and replaced.
replaceThe relpacement std::string.

Referenced by skepu::Environment< T >::createOpenCLProgramForMatrixTranspose().

template<typename BinaryFunc , typename T >
static T skepu::scanLargeVectorRecursivelyM_CU ( DeviceMemPointer_CU< T > *  input,
DeviceMemPointer_CU< T > *  output,
std::vector< DeviceMemPointer_CU< T > * > &  blockSums,
size_t  numElements,
unsigned int  level,
ScanType  type,
init,
Device_CU *  device,
BinaryFunc  scanFunc 
)
static

Scans a Vector using the same recursive algorithm as NVIDIA SDK. First the vector is scanned producing partial results for each block. Then the function is called recursively to scan these partial results, which in turn can produce partial results and so on. This continues until only one block with partial results is left. Used by multi-GPU CUDA implementation.

Parameters
inputPointer to the device memory where the input vector resides.
outputPointer to the device memory where the output vector resides.
blockSumsA Vector of device memory pointers where the partial results for each level is stored.
numElementsThe number of elements to scan.
levelThe current recursion level.
typeThe scan type, either INCLUSIVE or EXCLUSIVE.
initThe initialization value for exclusive scans.
devicePointer to the device that will be used for the scan.
scanFuncThe user function used in the scan.

References skepu::DeviceMemPointer_CU< T >::getDeviceDataPointer(), skepu::Device_CU::getDeviceID(), skepu::Device_CU::getMaxBlocks(), skepu::Device_CU::getMaxThreads(), and min().

Referenced by skepu::Scan< ScanFunc >::CU().

Here is the call graph for this function:

static bool skepu::startsWith ( const std::string &  main,
const std::string &  prefix 
)
static

Method to check whether a string starts with a given pattern.

static void skepu::toLowerCase ( std::string &  str)
static

Method to convert to lower case.

static void skepu::toUpperCase ( std::string &  str)
static

Method to convert to upper case.

static std::string skepu::TransposeKernelNoBankConflicts_CL ( "__kernel void transposeNoBankConflicts(__global TYPE* odata, __global TYPE* idata, int width, int height, __local TYPE* sdata)\n""{\n"" int xIndex = get_group_id(0) * TILEDIM + get_local_id(0);\n"" int yIndex = get_group_id(1) * TILEDIM + get_local_id(1);\n"" int index_in = xIndex + (yIndex)*width;\n"" if(xIndex<width && yIndex<height)\n"" sdata[get_local_id(1)*TILEDIM+get_local_id(0)] = idata[index_in];\n"" xIndex = get_group_id(1) * TILEDIM + get_local_id(0);\n"" yIndex = get_group_id(0) * TILEDIM + get_local_id(1);\n"" int index_out = xIndex + (yIndex)*height;\n"" \n"" barrier(CLK_LOCAL_MEM_FENCE);\n"" if(xIndex<height && yIndex<width)\n"" odata[index_out] = sdata[get_local_id(0)*TILEDIM+get_local_id(1)];\n""}\n"  )
static

OpenCL Transpose kernel. Modified the transpose kernel provided by NVIDIA to make it work for any problem size rather than just perfect size such as 1024X1024.

Referenced by skepu::Environment< T >::createOpenCLProgramForMatrixTranspose().

static const std::string skepu::trimSpaces ( const std::string &  pString,
const std::string &  pWhitespace = " \t" 
)
static

Method to remove leading and trailing spaces from a string.

Referenced by loadExecPlan(), and loadExecPlanArray().