Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications
Thesis 2013
Publication Type: PhD Thesis
Abstract
An important set of challenges emerges as the High Performance Computing (HPC)
community aims to reach extreme scale. Resilience and energy consumption are two
of those challenges. Extreme-scale machines are expected to have a high failure
frequency. This is an inevitable consequence of the mismatch between two trends:
the number of components assembled in supercomputers grows exponentially, while
the reliability of each individual component improves much more slowly. At the
same time, the vast number of components in a single machine will
consume a non-trivial amount of energy. To keep a supercomputer within
operational margins, HPC systems have to be both reliable and energy-aware. For
an application to be able to run and make progress in spite of constant
interruptions, it has to incorporate some form of fault tolerance.
Rollback-recovery techniques provide a framework to overcome crashes in the
system by periodically saving the state of the application and rolling back to
checkpoints in case of failures. Two well-known rollback-recovery techniques are
checkpoint/restart and message-logging. The former is easier to implement and
has become the de facto standard for making applications fault tolerant. It
incurs, however, a high performance and energy cost during recovery.
Message-logging, on the other hand, makes it possible to recover faster from a
failure and to consume less energy. The downside of message-logging is the
overhead it exhibits in the failure-free scenario. Memory and performance
overheads may offset its advantages. This thesis focuses on techniques to
alleviate the downsides of message-logging. It presents a mechanism based on
high-level programming language constructs to decrease the performance overhead
of message-logging. It also introduces two strategies to reduce the memory
overhead created by the message log. Additionally, it addresses important
architectural constraints of modern supercomputers. Based on large-scale
experimental results and projections from an analytical model, we conclude that
message-logging is a promising strategy to provide fault tolerance at a low
energy cost for extreme-scale machines.