FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
IEEE International Conference on Cluster Computing (Cluster) 2004
Publication Type: Paper
Repository URL: inmemchpt
Abstract
As high performance clusters continue to grow in size, the mean
time between failures shrinks. Fault tolerance and reliability are
therefore becoming major challenges for application scalability.
The traditional disk-based method of
dealing with faults is to checkpoint the state of the entire
application periodically to reliable storage and restart from the
most recent checkpoint. Recovering the application from a fault
involves (often manually) restarting the application on all
processors and having it read its data from disk on every processor.
The restart can therefore take minutes after it has been initiated.
Such a strategy requires that the failed processor be replaced, so
that the number of processors is the same at checkpoint time and
at recovery time. We present FTC-Charm++, a
fault-tolerant runtime based on a scheme for fast and scalable
in-memory checkpoint and restart. At restart, the program can
continue to run on the remaining processors without performance
penalty due to load imbalance. The recovery time is reduced to
seconds without actual "down time". The method is useful for
applications whose memory footprint is small at the checkpoint
state, while a variation of this scheme, in-disk
checkpoint/restart, can be applied to applications with a large
memory footprint. The scheme does not require any individual component to
be fault-free. We have implemented this scheme for Charm++ and AMPI
(an adaptive version of MPI). This paper describes the scheme
and presents performance data on a cluster using 128 processors.
TextRef
Gengbin Zheng, Lixia Shi, and Laxmikant V. Kale, "FTC-Charm++: An In-Memory
Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI", 2004 IEEE
International Conference on Cluster Computing, San Diego, CA, September 2004,
pp. 93-103.