6. GPU Manager Library

GPU Manager is a task offload and management library for efficient use of CUDA-enabled GPUs in Charm++ applications. Compared to direct use of CUDA (issuing kernel invocation and GPU data transfer calls in user code), GPU Manager provides the following advantages:
  1. Automatic management and synchronization of tasks
  2. Automatic overlap of data transfer and kernel invocation for concurrent tasks
  3. A simplified work flow mechanism using CkCallback to return to user code after completion of each work request
  4. Reduced synchronization overhead through centralized management of all GPU tasks

6.1 Building GPU Manager

GPU Manager is not included by default when building Charm++. In order to use GPU Manager, the user must build Charm++ with the cuda option, e.g.

 ./build charm++ netlrts-linux-x86_64 cuda -j8

Building GPU Manager requires an installation of the CUDA toolkit on the system.

6.2 Overview and Work Flow

GPUs are throughput-oriented devices whose peak computational capabilities greatly surpass those of equivalent-generation CPUs, but whose limited control logic constrains them to use as accelerator devices controlled by code executing on the CPU.

The GPU's dependence on the CPU for dispatch and synchronization of coarse-grained data transfer and kernel execution has traditionally required programmers to either (a) halt the execution of work on the CPU whenever issuing GPU work to simplify synchronization or (b) issue GPU work asynchronously and carefully manage and synchronize concurrent GPU work in order to ensure satisfactory progress and good performance. Further, the latter option becomes significantly more difficult in the context of a parallel program with numerous concurrent objects that all issue kernel and data transfer calls to the same GPU.

The Charm++ GPU Manager is a library designed to address this issue by automating the management of GPUs. Users of GPU Manager define work requests that specify the GPU kernel and any data transfer operations required before and after completion of the kernel. The system controls the execution of the work requests submitted by all the chares on a particular processor. This allows it to effectively manage execution of work requests and overlap CPU-GPU data transfer with kernel execution. In steady-state operation, GPU Manager overlaps kernel execution of one work request with data transfer out of GPU memory for a preceding work request and the data transfer into GPU memory for a subsequent work request. This approach avoids blocking the CUDA DMA engine by only submitting data transfers when they are ready to execute. When using GPU Manager, the user does not need to poll for completion of GPU operations. The system manages execution of a work request throughout its life cycle and returns control to the user upon completion of a work request through a CkCallback object specified by the user per work request. Another advantage of using GPU Manager is that the system polls only for a handful of currently executing operations, which avoids the problem of multiple chares all polling the GPU when using CUDA streams directly. GPU Manager has options for recording profiling data for kernel execution and data transfer which can be visualized using the Charm++ Projections profiler.

6.2.1 Execution Model and Progress Engine

Like any Charm++ application, programs using GPU Manager typically consist of a large number of concurrently executing objects. Each object executes code in response to active messages received from some object within the parallel run, during which it can send its own active messages or issue one or more work requests to the GPU Manager for asynchronous execution. Work requests are always submitted to the local GPU Manager instance at the processing element where the call is issued. Incoming GPU work requests are simply copied into the GPU Manager's scheduling queue, at which point the library returns and the caller can continue with other work.

Charm++ employs a message-driven programming model. This includes a runtime system scheduler that is triggered every time a method finishes execution. Under typical CPU-only execution, the scheduler examines the queue of incoming messages and selects one based on priority and location in the queue. In a CUDA build of Charm++, the scheduler also periodically invokes the GPU Manager progress engine.

GPU Manager contains a queue of all pending work requests. When its progress function is called, GPU Manager determines whether pending GPU work has completed since the last time the progress function was called, and whether additional work requests can begin executing. A workRequest undergoes the following stages during its execution:

  1. Device memory allocation and data transfer from host to device
  2. Kernel execution
  3. Data transfer back to host from device
  4. Invocation of a callback function (specified in the workRequest )

Based on the instructions contained in each work request, the GPU Manager allocates the required buffers in GPU global memory and issues the asynchronous CUDA data transfer operations directly. In order to execute kernels, the GPU Manager calls the runKernel function that must be defined by the user. This function specifies the CUDA kernel call for the work request.

Under steady state execution with multiple concurrent work requests, as one workRequest progresses to the execution stage, GPU Manager will initiate the data transfer for the second workRequest in the queue, and so on.

In a typical application, the work request definition, kernel run functions, CUDA kernel definitions, and code for submission of work requests all go in a .cu file that is compiled with nvcc separately from the other files (e.g. .C, .ci) in the Charm++ application. A function such as createWorkRequest is called from a .C file to create and enqueue the workRequest, as sketched below. The resulting object files of the application are then linked together into the final executable.
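For illustration, a minimal sketch of the .C side is shown below. The chare name VectorChare, its entry methods, and its data members (vectorSize, h_A, h_B, h_C) are hypothetical; createWorkRequest is the function defined in the .cu file (see the example in Section 6.3.1.2).

 /* In the .C file (sketch): issue a work request from a chare array element. */
void VectorChare::startAddition()
{
    /* Callback that the GPU Manager invokes once the work request completes;
       additionDone() is a hypothetical entry method of this chare. */
    CkCallback *cb = new CkCallback(CkIndex_VectorChare::additionDone(),
                                    thisProxy[thisIndex]);

    /* Defined in the .cu file; creates and enqueues the workRequest. */
    createWorkRequest(vectorSize, h_A, h_B, &h_C, thisIndex, cb);

    /* The call returns immediately; this chare is free to process other
       messages while the GPU work proceeds asynchronously. */
}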

6.3 API

Using GPU Manager involves:

  1. Defining CUDA kernels as in a regular CUDA application
  2. Defining work requests and their callback functions
  3. Defining kernel run functions of the form void runMyKernel(workRequest *wr, cudaStream_t kernelStream, void **deviceBuffers), which the GPU Manager calls to issue the kernel launch through the runKernel function pointer set in the work request
  4. Submitting work requests to the GPU Manager

6.3.1 Work Request

workRequest is a simple structure which contains the necessary parameters for CUDA kernel execution along with some additional members for automating data transfer between the host and the device. A work request consists of the following data members:

dim3 dimGrid
- a triple which defines the grid structure for the kernel; in the example below dimGrid.x specifies the number of blocks. dimGrid.y and dimGrid.z are unused.

dim3 dimBlock
- a triple defining each block's structure; specifies the number of threads in up to three dimensions.

int smemSize
- the number of bytes in shared memory to be dynamically allocated per block for kernel execution.

int nBuffers
- number of buffers used by the work request.

dataInfo *bufferInfo
- array of dataInfo structs containing buffer information for the execution of the work request. This array must be of size nBuffers , e.g.


 wr->bufferInfo = (dataInfo *) malloc(wr->nBuffers * sizeof(dataInfo));

We will explain the contents of the dataInfo struct below.

void *callbackFn
- a pointer to a CkCallback object specified by the user; executed after the kernel and memory transfers have finished.

const char *traceName
- A short identifier used for tracing and logging.

function runKernel
- A user-defined host function that launches the kernel. The GPU Manager passes this function three parameters:
  workRequest
  - the work request being run.
  kernelStream
  - the CUDA stream in which to launch the kernel.
  deviceBuffers
  - an array of device pointers, indexed by bufferID.

int state
- the stage of a workRequest 's execution, set and used internally by the GPU Manager

void *userData
- may be used to pass scalar values to kernel calls, such as the size of an array.
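Since userData is a single void pointer, passing more than one scalar to the kernel run function typically means packing the values into a small struct. Below is a minimal sketch; the VecAddParams struct and setVecAddParams function are illustrative and not part of the GPU Manager API.

 #include "wr.h"

/* Hypothetical parameter struct; the kernel run function casts
   wr->userData back to VecAddParams* to recover the values. */
struct VecAddParams {
    int   n;      /* number of elements */
    float scale;  /* scaling factor applied by the kernel */
};

void setVecAddParams(workRequest *wr, int vectorSize, float scale)
{
    VecAddParams *params = new VecAddParams;
    params->n = vectorSize;
    params->scale = scale;
    wr->userData = params;
}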

6.3.1.1 dataInfo

int bufferID
- the ID of a buffer in the runtime system's buffer table. May be specified by the user if direct control over the buffer space is desired. Otherwise, if it is set to a negative value, the GPU Manager will assign a valid buffer ID.

int transferToDevice, transferFromDevice
- flags indicating whether the buffer should be transferred to the device prior to kernel execution and/or transferred back to the host after the kernel completes.

int freeBuffer
- a flag to indicate if the device buffer memory should be freed after execution of workRequest .

void *hostBuffer
- pointer to host data buffer. In order to allow asynchronous memory transfer and data computation on device this buffer must be allocated from page-locked memory.


 void *hostBuffer = hapi_poolMalloc(size);

This returns a buffer of the required size from the GPU Manager's pool of pinned memory on the host. Direct allocation of pinned memory (e.g. using cudaMallocHost) is discouraged, as it blocks the CPU until pending GPU work has finished executing. The user must add the -DGPU_MEMPOOL flag when compiling CUDA files; this is required to enable fetching of page-locked memory from the GPU Manager's pool. The flag can be added to NVCC_FLAGS, as in the example makefile below.

size_t size
- size of buffer in bytes.

6.3.1.2 Work Request Example

Here is an example function that creates a workRequest for the addition of two vectors A and B.


 #include "wr.h"
#define BLOCK_SIZE 256

void createWorkRequest(int vectorSize, float *h_A, float *h_B, float **h_C, int myIndex, CkCallback *cb)
{
    dataInfo *info;
    workRequest *vecAdd = new workRequest;
    int size = vectorSize * sizeof(float);

    vecAdd->dimGrid.x = (vectorSize - 1) / BLOCK_SIZE + 1;
    vecAdd->dimBlock.x = BLOCK_SIZE;
    vecAdd->smemSize = 0;
    vecAdd->nBuffers = 3;
    vecAdd->bufferInfo = new dataInfo[vecAdd->nBuffers];

    /* Buffer A */
    info = &(vecAdd->bufferInfo[0]);

    /* The Buffer ID will be given by the API,
       or it can be given by the user. */
    info->bufferID = -1;

    info->transferToDevice = YES;
    info->transferFromDevice = NO;
    info->freeBuffer = YES;

    /* This fetches the pinned host memory already allocated by API,
       required for asynchronous data transfers. */
    info->hostBuffer = hapi_poolMalloc(size);

    /* Copy the data to the workRequest's buffer. */
    memcpy(info->hostBuffer, h_A, size);

    info->size = size;

    /* Buffer B is set up the same way as A, using h_B. */
    info = &(vecAdd->bufferInfo[1]);
    info->bufferID = -1;
    info->transferToDevice = YES;
    info->transferFromDevice = NO;
    info->freeBuffer = YES;
    info->hostBuffer = hapi_poolMalloc(size);
    memcpy(info->hostBuffer, h_B, size);
    info->size = size;

    /* Buffer C holds the result: it is not transferred to the device,
       but is transferred back to the host after the kernel runs. */
    info = &(vecAdd->bufferInfo[2]);
    info->bufferID = -1;
    info->transferToDevice = NO;
    info->transferFromDevice = YES;
    info->freeBuffer = YES;
    info->hostBuffer = hapi_poolMalloc(size);
    info->size = size;

    /* We change the address to the address returned by the API
       so that the caller can read the copied result. */
    *h_C = (float *)info->hostBuffer;

    /* a CkCallback pointer */
    vecAdd->callbackFn = cb;

    vecAdd->traceName = "add";

    /* kernel run function */
    vecAdd->runKernel = run_add;

    vecAdd->userData = new int;
    memcpy(vecAdd->userData, &vectorSize, sizeof(int));

    /* enqueue the workRequest in the workRequestQueue.
    wrQueue is declared by our API during the init phase for every processor. */
    enqueue(wrQueue, vecAdd);
}
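Once this work request finishes, the GPU Manager invokes the CkCallback passed in as cb. The sketch below assumes the callback targets a hypothetical entry method additionDone() on the calling chare; by the time it runs, the kernel has executed and the result has been copied back into the pinned host buffer that h_C points to.

 /* In the .C file (sketch): entry method targeted by the CkCallback above. */
void VectorChare::additionDone()
{
    /* h_C now holds A + B. */
    CkPrintf("[%d] C[0] = %f\n", thisIndex, h_C[0]);
    /* ... continue the application, e.g. contribute to a reduction ... */
}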

6.3.2 Writing Kernels

Writing a kernel is unchanged from normal CUDA programs. Kernels are written in one (or more) .cu files. Here is an example kernel from vectorAdd.cu. The full example can be found in examples/charm++/cuda/vectorAdd/.


 __global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

6.3.3 Launching Kernels

Kernel launches are identical to those in normal CUDA programs, and are issued from a small dedicated kernel run function.


 void run_add(workRequest *wr, cudaStream_t kernelStream, void **deviceBuffers)
{
    printf("Add KERNEL \n");
/*
 *  devBuffers is declared by our API during the init phase on every processor.
 *  It jumps to the correct array index with the help of bufferID, 
 *  which is supplied by the API or user.
 */
    vecAdd<<<wr->dimGrid, wr->dimBlock, wr->smemSize, kernelStream>>>
        ((float *) deviceBuffers[wr->bufferInfo[0].bufferID],
         (float *) deviceBuffers[wr->bufferInfo[1].bufferID],
         (float *) deviceBuffers[wr->bufferInfo[2].bufferID],
         *((int *) wr->userData));
}

6.4 Compiling

As mentioned earlier, there are no changes to the .ci and .C files. Therefore there are no changes in compiling them. CUDA code, however, must be compiled using nvcc . You can use the following example makefile to compile a .cu file. More example codes can be found in the examples/charm++/cuda directory.


 CUDA_LEVEL = 35

NVCC = /usr/local/cuda/bin/nvcc
NVCC_FLAGS = -O3 -c -use_fast_math -DGPU_MEMPOOL
NVCC_FLAGS += -arch=compute_$(CUDA_LEVEL) -code=sm_$(CUDA_LEVEL)
NVCC_INC = -I/usr/local/cuda/include

CHARMINC = -I${CHARMDIR}/include
LD_LIBS = -lcublas

all: vectorAddCU.o

vectorAddCU.o: vectorAdd.cu
	$(NVCC) $(NVCC_FLAGS) $(NVCC_INC) $(CHARMINC) -o vectorAddCU.o vectorAdd.cu

GPU Manager also supports cuBLAS and other GPU libraries in exactly the same way: call cuBLAS (or the other library) directly from a kernel run function; creating the workRequest works the same as for any other kernel.
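As an illustration, below is a sketch of a kernel run function that calls cuBLAS instead of a hand-written kernel; the run_gemm name, the square-matrix assumption, and the buffer layout are assumptions for this example. The key point is forwarding the stream supplied by the GPU Manager to cuBLAS via cublasSetStream, so that the library call is overlapped and tracked like any other work request. Note that the example makefile above already links against cuBLAS via -lcublas.

 #include <cublas_v2.h>
#include "wr.h"

void run_gemm(workRequest *wr, cudaStream_t kernelStream, void **deviceBuffers)
{
    /* Assume square n x n matrices, with n passed through userData. */
    int n = *((int *) wr->userData);
    float alpha = 1.0f, beta = 0.0f;

    /* For brevity, the handle is created once and reused; a real application
       might instead create one handle per PE during initialization. */
    static cublasHandle_t handle = NULL;
    if (handle == NULL)
        cublasCreate(&handle);

    /* Run the cuBLAS call in the stream provided by the GPU Manager. */
    cublasSetStream(handle, kernelStream);

    /* C = A * B, using the device buffers staged by the work request. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                (const float *) deviceBuffers[wr->bufferInfo[0].bufferID], n,
                (const float *) deviceBuffers[wr->bufferInfo[1].bufferID], n,
                &beta,
                (float *) deviceBuffers[wr->bufferInfo[2].bufferID], n);
}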

6.5 Debugging

A few useful things for debugging:

  1. Compiling with the GPU_MEMPOOL_DEBUG flag (using -DGPU_MEMPOOL_DEBUG) causes debug statements to be printed during execution, including when buffers are allocated and freed.

  2. When using ++debug, also pass the debugging flags -g and -G to nvcc during compilation, e.g. by appending them to NVCC_FLAGS in the makefile above.