Optimizing Communication for Massively Parallel Processing
Thesis 2005
Publication Type: PhD Thesis
Repository URL: sameer-thesis
Abstract
The current trends in high performance computing show that large
machines with tens of thousands of processors will soon be readily
available. The IBM Blue Gene/L machine with 128k processors (which
is currently being deployed) is an important step in this
direction. In this scenario, scaling applications by hand will be a
significant burden for the programmer. This task of
scaling involves addressing issues like load-imbalance and
communication overhead. In this thesis, we explore several
communication optimizations that help parallel applications scale
easily to a large number of processors. We also present automatic
runtime techniques that relieve the programmer of the burden of
optimizing communication in applications. This thesis explores
processor virtualization to improve communication performance in
applications. With processor virtualization, the computation is
mapped to virtual processors (VPs). After one VP has finished
computation and is waiting for responses to its messages, another
VP can compute, thus overlapping communication with computation.
This overlap is only effective if the processor overhead of the
communication operation is a small fraction of the total
communication time. Fortunately, with network interfaces having
co-processors, this happens to be true and processor virtualization
has a natural advantage on such interconnects. The communication
optimizations we present in this thesis are motivated by
applications such as NAMD (a classical molecular dynamics
application) and CPAIMD (a quantum chemistry application).
Applications like NAMD and CPAIMD consume a fair share of the time
available on supercomputers. So, improving their performance would
be of great value. We have successfully scaled NAMD to 1TF of peak
performance on 3000 processors of PSC Lemieux, using the techniques
presented in this thesis. We study both point-to-point
communication and collective communication (specifically all-to-all
communication). On a large number of processors, all-to-all
communication can take several milliseconds to finish. With
synchronous collectives defined in MPI, the processor idles while
the collective messages are in flight. Therefore, we demonstrate an
asynchronous collective communication framework that lets the CPU
compute while the all-to-all messages are in flight. We also show
that the best strategy for all-to-all communication depends on the
message size, number of processors and other dynamic parameters.
This suggests that these parameters can be observed at runtime and
used to choose the optimal strategy for all-to-all communication.
In this thesis, we demonstrate adaptive strategy switching for
all-to-all communication. The communication optimization framework
presented in this thesis has been designed to optimize
communication in the context of processor virtualization and
dynamically migrating objects. We present the streaming strategy to
optimize fine-grained object-to-object communication. In this
thesis, we motivate the need for hardware collectives, as
processor-based collectives can be delayed by intermediate
processors that are busy with computation. We explore a next
generation interconnect
that supports collectives in the switching hardware. We show the
performance gains of hardware collectives through synthetic
benchmarks.
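
The abstract's point that the best all-to-all strategy depends on message size and processor count can be illustrated with a small sketch. This is not the thesis's implementation: the three strategies, the function names, and the simple latency/bandwidth cost model (alpha seconds per message, beta seconds per byte) are textbook assumptions chosen for illustration, and the parameter values in the usage note are hypothetical.

```python
import math

# Illustrative cost models (assumptions, not taken from the thesis).
# p = number of processors, m = bytes sent to each destination,
# alpha = per-message overhead in seconds, beta = seconds per byte.

def direct_cost(p, m, alpha, beta):
    # Every processor sends p-1 separate messages of m bytes each.
    return (p - 1) * (alpha + m * beta)

def mesh2d_cost(p, m, alpha, beta):
    # Virtual sqrt(p) x sqrt(p) mesh: two phases (rows, then columns),
    # each sending sqrt(p)-1 combined messages of sqrt(p)*m bytes.
    side = math.isqrt(p)
    if side * side != p:
        return math.inf  # model only defined for square processor counts
    return 2 * (side - 1) * (alpha + side * m * beta)

def hypercube_cost(p, m, alpha, beta):
    # Virtual hypercube: log2(p) phases, each sending one combined
    # message of (p/2)*m bytes.
    dims = p.bit_length() - 1
    if p < 2 or (1 << dims) != p:
        return math.inf  # model only defined for powers of two
    return dims * (alpha + (p // 2) * m * beta)

STRATEGIES = {
    "direct": direct_cost,
    "mesh2d": mesh2d_cost,
    "hypercube": hypercube_cost,
}

def choose_strategy(p, m, alpha, beta):
    # Pick the strategy with the lowest predicted completion time.
    return min(STRATEGIES, key=lambda name: STRATEGIES[name](p, m, alpha, beta))
```

With the hypothetical values alpha = 5e-6 and beta = 1e-9 on 1024 processors, this model picks the hypercube for 8-byte messages, the 2D mesh for 1000-byte messages, and direct sends for 1 MB messages: message combining pays off while per-message overhead dominates, and stops paying off once the extra bytes forwarded through intermediate processors dominate. A runtime that observes m and p could re-run such a selection periodically, which is the general idea behind switching strategies adaptively.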
TextRef
Sameer Kumar, "Optimizing Communication for Massively Parallel Processing",
University of Illinois at Urbana-Champaign, May 2005.