21. Checkpoint/Restart-Based Fault Tolerance
Charm++ offers two checkpoint/restart mechanisms, each targeting
a specific need in parallel programming. Both are built on the
same infrastructure.
Traditional chare-array-based Charm++ applications, including AMPI
applications, can be checkpointed to storage buffers (either files or
memory regions) and be restarted later from those buffers. The basic
idea behind this is straightforward: checkpointing an application is
like migrating its parallel objects from the processors onto buffers,
and restarting is the reverse. Thanks to migration utilities such as
PUP methods (Section 6), users can decide what data to
save in checkpoints and how to save it. However, unlike migration
(where certain objects do not need a PUP method), checkpointing requires
every object to implement a PUP method.
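The idea of checkpointing as "migrating an object onto a buffer" can be illustrated with a minimal, self-contained sketch. It does not use the Charm++ PUP framework: ToyPupper, Particle, and roundtrip_demo are hypothetical names invented for this illustration. The sketch assumes a single pup routine that serves both directions, and it shows how an object can choose which fields to save by leaving a recomputable field out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Toy stand-in for a PUP-style serializer. This is NOT the Charm++ PUP::er
// class; it only mimics the "one routine packs and unpacks" idea.
class ToyPupper {
public:
    enum class Mode { Packing, Unpacking };

    // Packing mode: start with an empty buffer to fill.
    ToyPupper() : mode_(Mode::Packing), pos_(0) {}
    // Unpacking mode: read back from a previously filled buffer.
    explicit ToyPupper(std::vector<char> buf)
        : mode_(Mode::Unpacking), buf_(std::move(buf)), pos_(0) {}

    bool isUnpacking() const { return mode_ == Mode::Unpacking; }
    const std::vector<char>& buffer() const { return buf_; }

    // Copy n raw bytes into the buffer (packing) or out of it (unpacking).
    void bytes(void* data, std::size_t n) {
        if (n == 0) return;
        if (mode_ == Mode::Packing) {
            const char* src = static_cast<const char*>(data);
            buf_.insert(buf_.end(), src, src + n);
        } else {
            assert(pos_ + n <= buf_.size());
            std::memcpy(data, buf_.data() + pos_, n);
            pos_ += n;
        }
    }

private:
    Mode mode_;
    std::vector<char> buf_;
    std::size_t pos_;
};

// Hypothetical application object with one routine for both directions.
struct Particle {
    int id = 0;
    std::vector<double> coords;
    double scratch = 0.0;  // recomputable, so deliberately not saved

    void pup(ToyPupper& p) {
        p.bytes(&id, sizeof id);
        std::size_t n = coords.size();
        p.bytes(&n, sizeof n);
        if (p.isUnpacking()) coords.resize(n);  // make room before reading
        p.bytes(coords.data(), n * sizeof(double));
    }
};

// "Checkpoint" an object into a buffer, then "restart" a fresh object from it.
bool roundtrip_demo() {
    Particle a;
    a.id = 7;
    a.coords = {1.0, 2.5, -3.0};
    a.scratch = 99.0;

    ToyPupper packer;
    a.pup(packer);                        // object -> buffer

    ToyPupper unpacker(packer.buffer());
    Particle b;
    b.pup(unpacker);                      // buffer -> object

    // Saved fields are restored; the unsaved field keeps its default value.
    return b.id == 7 && b.coords == a.coords && b.scratch == 0.0;
}
```

Calling roundtrip_demo() from a main function exercises the full save-and-restore cycle. In a real application the buffer would be written to a file or kept in a memory region rather than passed directly to the unpacking step.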
The two checkpoint/restart schemes implemented are:
- Shared filesystem: provides support for split execution, where the
execution of an application is interrupted and later resumed.
- Double local-storage: offers an online fault-tolerance
mechanism for applications running on unreliable machines.