Proactive Fault Tolerance in MPI Applications via Task Migration
IEEE International Conference on High Performance Computing (HiPC) 2006
Publication Type: Paper
Repository URL: fault-avoidance-sc
Abstract
Failures are likely to be more frequent in systems with thousands
of processors. Therefore, schemes for dealing with faults become
increasingly important. In this paper, we present a fault tolerance
solution for parallel applications that proactively migrates
execution from processors where failure is imminent. Our approach
assumes that some failures are predictable, and leverages the
features in current hardware devices supporting early indication of
faults. We use the concepts of processor virtualization and dynamic
task migration, provided by Charm++ and Adaptive MPI (AMPI), to
implement a mechanism that migrates tasks away from processors
which are expected to fail. To demonstrate the feasibility of our
approach, we present performance data from experiments with
existing MPI applications. Our results show that proactive task
migration is an effective technique to tolerate faults in MPI
applications.
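The evacuation idea described in the abstract can be illustrated with a minimal sketch. It is a plain Python simulation, not Charm++ or AMPI code, and it is not taken from the paper: the function name `evacuate`, the task-to-processor dictionary, and the least-loaded placement policy are all assumptions chosen for illustration. Given a failure warning for one processor, every task hosted there is reassigned to a healthy processor.

```python
def evacuate(placement, processors, failing):
    """Return a new task -> processor mapping with no task left on `failing`.

    placement:  dict mapping task id -> processor id (hypothetical model)
    processors: iterable of all processor ids in the system
    failing:    processor id for which a failure warning was received
    """
    survivors = sorted(p for p in processors if p != failing)
    if not survivors:
        raise ValueError("no healthy processor left to receive tasks")

    # Count the tasks currently hosted on each surviving processor.
    load = {p: 0 for p in survivors}
    for proc in placement.values():
        if proc != failing:
            load[proc] += 1

    new_placement = dict(placement)
    # Move each task off the failing processor, one at a time, to the
    # least-loaded survivor (ties broken by lowest processor id).
    # This policy is an assumption, not the paper's load balancer.
    for task in sorted(t for t, p in placement.items() if p == failing):
        target = min(survivors, key=lambda p: (load[p], p))
        new_placement[task] = target
        load[target] += 1
    return new_placement


# Eight tasks spread round-robin over four processors.
placement = {task: task % 4 for task in range(8)}
after = evacuate(placement, processors=range(4), failing=2)
```

In this toy run, the two tasks hosted on processor 2 are moved to other processors while every other task stays where it was. Having more tasks than processors is what lets the evacuated work be spread across several survivors rather than doubling the load on a single one.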
TextRef
Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kale,
"Proactive Fault Tolerance in MPI Applications via Task Migration," HiPC,
Publ: Springer, Lecture Notes in Computer Science, vol. 4297, pp. 485-496, 2006.