System Support for Checkpoint/Restart of Charm++ and AMPI Applications
As modern supercomputers and new-generation scientific computing applications grow in size and complexity, the probability of system failure rises commensurately. Making parallel computing fault tolerant has therefore become increasingly important. A checkpoint/restart mechanism provides fault tolerance as well as other benefits for parallel programmers. This thesis describes the On-Disk Checkpoint/Restart Mechanism for the Charm++ and Adaptive MPI (AMPI) programming frameworks: its motivation, design, and implementation. The mechanism has proven useful in practice and can be extended to implement other fault-tolerance techniques.
Chao Huang, "System Support for Checkpoint and Restart of Charm++ and AMPI Applications", Dept. of Computer Science, University of Illinois, 2004.
