Redesign of Hybrid API (GPU Manager) to support concurrent kernel execution
The original design of GPUManager had two data transfer streams and one kernel stream per GPUManager instance, which made use of the GPU's memory copy engines but failed to exploit concurrent kernel execution.
More importantly, it relied on an inefficient polling scheme: the scheduler periodically invoked a function that handled the workRequests sitting in a queue and blocked the CPU until all work for the workRequest at the head of the queue was complete. While this did allow data transfers to overlap with kernel execution when multiple workRequests were in the queue, it was limited by the single kernel stream and by the synchronization imposed by the polling scheme.
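For illustration, a minimal sketch of such a blocking, polling-style progress function (the names WorkRequest, workRequestQueue, and gpuProgressFn are hypothetical stand-ins, not the actual GPUManager code):

```cpp
#include <cuda_runtime.h>
#include <queue>

// Hypothetical sketch of the old polling scheme described above.
struct WorkRequest {
  cudaStream_t stream;          // the single kernel stream in the old design
  void (*callbackFn)(void *);   // user callback to run on completion
  void *callbackData;
};

static std::queue<WorkRequest *> workRequestQueue;  // hypothetical queue

// Invoked periodically by the scheduler.
void gpuProgressFn() {
  if (workRequestQueue.empty()) return;
  WorkRequest *wr = workRequestQueue.front();
  // Blocks the CPU until all transfers and kernels for the head request
  // finish; this is the synchronization bottleneck described above.
  cudaStreamSynchronize(wr->stream);
  wr->callbackFn(wr->callbackData);
  workRequestQueue.pop();
}
```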
The new design uses multiple kernel streams per GPUManager instance to allow concurrent kernel execution (multiple kernels may execute simultaneously, as long as no single kernel uses all of the GPU's resources), and stream callbacks (supported since CUDA 5.0) to remove the CPU blocking caused by the current mechanism. From the user's point of view nothing changes: the user creates workRequests, calls enqueue(), and does other useful work on the CPU until the callback function is invoked, just as before.
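A minimal sketch of the callback-based mechanism, assuming a hypothetical kernel myKernel and a pool of four kernel streams (the pool size and names are illustrative; cudaStreamAddCallback is the CUDA 5.0+ API mentioned above):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

#define NUM_KERNEL_STREAMS 4  // assumed pool size, not from the source

__global__ void myKernel(float *d, int n) {  // hypothetical kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

// Host-side callback: the CUDA runtime fires it once all preceding work
// in the stream has completed, so the CPU never has to block.
void CUDART_CB workDone(cudaStream_t stream, cudaError_t status, void *userData) {
  printf("workRequest %ld finished (status %d)\n", (long)userData, (int)status);
}

int main() {
  cudaStream_t kernelStreams[NUM_KERNEL_STREAMS];
  for (int i = 0; i < NUM_KERNEL_STREAMS; ++i)
    cudaStreamCreate(&kernelStreams[i]);

  const int n = 1 << 20;
  float *d;
  cudaMalloc(&d, n * sizeof(float));

  // Kernels on different streams may run concurrently as long as no
  // single kernel occupies all of the GPU's resources.
  for (long i = 0; i < NUM_KERNEL_STREAMS; ++i) {
    myKernel<<<(n + 255) / 256, 256, 0, kernelStreams[i]>>>(d, n);
    cudaStreamAddCallback(kernelStreams[i], workDone, (void *)i, 0);
  }

  cudaDeviceSynchronize();  // only so this toy example exits cleanly
  cudaFree(d);
  for (int i = 0; i < NUM_KERNEL_STREAMS; ++i)
    cudaStreamDestroy(kernelStreams[i]);
  return 0;
}
```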
#5 Updated by Jaemin Choi almost 2 years ago
- Subject changed from "Redesign of GPUManager to utilize concurrent kernel execution and stream callbacks" to "Redesign of Hybrid API (GPU Manager) to support concurrent kernel execution"
The new design includes two schemes, CUDA event-based and CUDA callback-based; the event-based scheme is the default due to its better performance.
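A sketch of how the event-based scheme might track completion without blocking, using cudaEventQuery to poll a per-request event (the structure and names are assumptions, not the actual patch):

```cpp
#include <cuda_runtime.h>

// Hypothetical per-request record for the event-based scheme: an event is
// recorded after the request's GPU work, and the scheduler checks it
// non-blockingly instead of registering a runtime callback.
struct WorkRequestEvent {
  cudaEvent_t event;
  void (*callbackFn)(void *);
  void *callbackData;
};

// Record an event after the request's kernels/transfers on its stream.
void markRequestEnd(WorkRequestEvent *wre, cudaStream_t stream) {
  cudaEventCreateWithFlags(&wre->event, cudaEventDisableTiming);
  cudaEventRecord(wre->event, stream);
}

// Called periodically by the scheduler: fires the user callback and
// returns true once the GPU work has finished; never blocks the CPU.
bool pollRequest(WorkRequestEvent *wre) {
  if (cudaEventQuery(wre->event) != cudaSuccess) return false;  // not ready
  wre->callbackFn(wre->callbackData);
  cudaEventDestroy(wre->event);
  return true;
}
```

Event queries of this form avoid the restriction that stream callbacks may not make CUDA API calls and typically carry less runtime overhead, which is consistent with the event-based scheme being the default.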
A new Gerrit patch will be up soon.