Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q
IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2013
Publication Type: Paper
Abstract
IBM Blue Gene/Q is the next-generation Blue Gene machine, which can
scale to tens of petaflops with 16 cores and 64 hardware threads per
node. However, significant effort is required to fully exploit its
capacity under various programming models. In this paper, we focus on
Charm++, an asynchronous message-driven parallel programming model.
Because its asynchronous behavior differs substantially from that of
MPI, porting it efficiently to BG/Q is a challenge. On the other hand,
it also provides an opportunity to exploit
BG/Q resources. We describe various novel fine-grained threading
techniques in Charm++ to exploit the hardware features of the BG/Q
compute chip. These include the use of L2 atomics to implement
lock-less producer-consumer queues that accelerate communication between
threads, fast memory allocators, and hardware communication threads that
are awakened via low-overhead interrupts from the BG/Q wakeup
unit. Bursts of short messages are processed using the ManytoMany
interface to reduce runtime overhead. We also present techniques to
optimize NAMD computation via Quad Processing Unit (QPX) vector
instructions and to accelerate the message rate via communication
threads in order to optimize the Particle Mesh Ewald (PME) computation. We
demonstrate the benefits of our techniques via two micro-benchmarks,
3D Fast Fourier Transform, and the molecular dynamics application
NAMD. For the 92,000-atom ApoA1 molecule, we achieved $794\,\mu\mathrm{s}$/step
with PME every 4 steps and $896\,\mu\mathrm{s}$/step with PME every step.
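
The lock-less producer-consumer queue mentioned in the abstract can be illustrated with a minimal C++ sketch. This is not the paper's implementation: portable `std::atomic` operations stand in for the BG/Q L2 atomics, a compare-and-swap loop plays the role that a hardware atomic increment would play, and the names `MpscQueue` and `sum_through_queue` are hypothetical. The sketch assumes many producer threads and exactly one consumer thread, and uses the default sequentially consistent ordering for simplicity rather than tuned memory orderings.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical bounded queue: many producer threads, one consumer thread.
// Items are non-null pointers; a null slot means "nothing published here yet".
template <typename T>
class MpscQueue {
public:
    explicit MpscQueue(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
        for (auto& s : slots_) s.store(nullptr);
    }

    // Producer side: atomically reserve the next slot index, then publish
    // the item into that slot. Returns false if the queue looked full.
    bool enqueue(T* item) {
        for (;;) {
            std::uint64_t h = head_.load();
            std::uint64_t t = tail_.load();
            if (t - h >= slots_.size()) return false;  // no free slot
            if (tail_.compare_exchange_weak(t, t + 1)) {
                slots_[t % slots_.size()].store(item);
                return true;
            }
            // Another producer took index t (or the CAS failed spuriously); retry.
        }
    }

    // Consumer side (single thread only): take the oldest published item,
    // or return nullptr if none is available yet.
    T* dequeue() {
        std::uint64_t h = head_.load();
        std::atomic<T*>& slot = slots_[h % slots_.size()];
        T* item = slot.load();
        if (item == nullptr) return nullptr;
        slot.store(nullptr);   // mark the slot free before advancing head
        head_.store(h + 1);
        return item;
    }

private:
    std::vector<std::atomic<T*>> slots_;
    std::atomic<std::uint64_t> head_{0};  // next index to consume
    std::atomic<std::uint64_t> tail_{0};  // next index to reserve
};

// Hypothetical driver: producers push pointers to the values 1..N through
// the queue while the calling thread consumes them and returns their sum.
std::uint64_t sum_through_queue(int producers, int per_producer) {
    const std::size_t total = static_cast<std::size_t>(producers) * per_producer;
    std::vector<std::uint64_t> values(total);
    for (std::size_t i = 0; i < total; ++i) values[i] = i + 1;

    MpscQueue<std::uint64_t> queue(64);
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&, p] {
            for (int k = 0; k < per_producer; ++k) {
                std::uint64_t* item =
                    &values[static_cast<std::size_t>(p) * per_producer + k];
                while (!queue.enqueue(item)) std::this_thread::yield();
            }
        });
    }

    std::uint64_t sum = 0;
    for (std::size_t received = 0; received < total;) {
        if (std::uint64_t* item = queue.dequeue()) {
            sum += *item;
            ++received;
        } else {
            std::this_thread::yield();
        }
    }
    for (auto& th : threads) th.join();
    return sum;
}
```

With 4 producers sending 1000 values each, `sum_through_queue(4, 1000)` returns 8002000, the sum of 1 through 4000, which confirms that every message crossed the queue exactly once without any lock being taken.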