
Improving Scalability with GPU-Aware Asynchronous Tasks
| Jaemin Choi | David Richards | Laxmikant Kale
International Workshop on High-Level Parallel Programming Models and Supportive Environments at IPDPS (HIPS) 2022
Publication Type: Paper
Asynchronous tasks, when created with overdecomposition, enable automatic computation-communication overlap, which can substantially improve performance and scalability. This applies not only to traditional CPU-based systems but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using Jacobi3D, a proxy application that performs the Jacobi iterative method on GPUs. In addition to optimizations for minimizing host-device synchronization and increasing the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to combat fine-grained overheads at scale.
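To illustrate the CUDA Graphs technique mentioned in the abstract, the sketch below (not the authors' Jacobi3D code; the kernel, grid dimensions, and function names are assumptions for illustration) captures one iteration's kernel launches into a graph, then replays the instantiated graph each iteration so that per-launch overhead is paid once rather than on every fine-grained kernel:

```cuda
#include <cuda_runtime.h>

// Hypothetical 7-point stencil kernel standing in for Jacobi3D's update step.
__global__ void jacobi_update(const double* in, double* out,
                              int nx, int ny, int nz) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  int k = blockIdx.z * blockDim.z + threadIdx.z;
  if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1 && k > 0 && k < nz - 1) {
    int idx = (k * ny + j) * nx + i;
    out[idx] = (in[idx - 1] + in[idx + 1] +
                in[idx - nx] + in[idx + nx] +
                in[idx - nx * ny] + in[idx + nx * ny]) / 6.0;
  }
}

void run_with_graph(double* a, double* b, int nx, int ny, int nz, int iters) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  dim3 block(8, 8, 8);
  dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);

  // Capture one double-buffered iteration (a->b, then b->a) into a graph.
  cudaGraph_t graph;
  cudaGraphExec_t exec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  jacobi_update<<<grid, block, 0, stream>>>(a, b, nx, ny, nz);
  jacobi_update<<<grid, block, 0, stream>>>(b, a, nx, ny, nz);
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

  // Replaying the instantiated graph amortizes kernel-launch overhead,
  // which matters most when strong scaling shrinks per-kernel work.
  for (int it = 0; it < iters; ++it)
    cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}
```

In the same spirit, the kernel fusion mentioned above would merge small kernels (e.g. separate boundary-packing launches) into one, trading launch count for a larger single kernel; both techniques target the fine-grained overheads the abstract identifies.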