A Semi-Blocking Checkpoint Protocol to Minimize Checkpoint Overhead
Thesis 2012
Publication Type: MS Thesis
Abstract
The increasing number of cores on current supercomputers will quickly decrease the mean time to failure (MTTF) of the system. With such high failure rates, long-running applications have little chance of completing successfully unless they use a fault tolerance strategy. Double in-memory/disk checkpointing is a production fault tolerance strategy in the Charm++ runtime system. Each node stores one copy of its checkpoint in its own memory or disk as a local checkpoint and another copy in another node’s memory or disk as a global checkpoint. This method takes advantage of network bandwidth that is high relative to I/O bandwidth, and it can store a checkpoint faster than traditional NFS-based checkpoint/restart. However, as the core count on each node keeps increasing, the large checkpoint size of a node will quickly saturate the limited network bandwidth. In this thesis, we introduce a semi-blocking checkpoint/restart protocol that hides checkpoint overhead by overlapping the global checkpoint with application execution. To analyze the benefits of the semi-blocking checkpoint protocol in the presence of failures, we extend Daly’s model and show the usefulness of the protocol for different kinds of applications. A solid-state disk (SSD) is used in the semi-blocking checkpoint protocol when there is no space to store the checkpoint in memory. We present two strategies for choosing which data to store on the SSD based on the memory usage of the application. We show the scalability and benefits of the semi-blocking checkpoint protocol: it improves performance by 20% compared to blocking checkpointing, and its overhead can be as low as 1.6% when both the checkpoint dumping time and the extra time to recover applications from failures are taken into account.
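As background for the model mentioned in the abstract, the sketch below computes the optimum checkpoint interval from Young's first-order approximation and from Daly's published higher-order estimate, given a checkpoint cost delta and an MTTF M. It is not the extended model developed in the thesis, which the abstract does not spell out. The function names, the sample numbers, and the simplification that a semi-blocking protocol only blocks the application for the local-checkpoint portion of the cost are all assumptions made for illustration.

```python
import math


def young_interval(delta, mttf):
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mttf)


def daly_interval(delta, mttf):
    """Daly's higher-order estimate of the optimum checkpoint interval.

    tau = sqrt(2*delta*M) * [1 + (1/3)*sqrt(delta/(2M)) + (1/9)*(delta/(2M))] - delta
    for delta < 2M, and tau = M otherwise.
    """
    if delta >= 2.0 * mttf:
        return mttf
    ratio = delta / (2.0 * mttf)
    correction = 1.0 + math.sqrt(ratio) / 3.0 + ratio / 9.0
    return math.sqrt(2.0 * delta * mttf) * correction - delta


def failure_free_overhead(blocking_cost, interval):
    """Fraction of extra run time spent blocked in checkpoints, ignoring failures."""
    return blocking_cost / interval


if __name__ == "__main__":
    # Hypothetical numbers, in seconds; none of them come from the thesis.
    mttf = 3600.0
    local_cost = 2.0    # assumed time to write the local checkpoint
    remote_cost = 8.0   # assumed time to ship the copy to another node

    # Blocking protocol: the application waits for both copies.
    blocking = local_cost + remote_cost
    # Semi-blocking (simplified assumption): only the local copy blocks,
    # and the remote transfer overlaps with application execution.
    semi_blocking = local_cost

    for name, cost in (("blocking", blocking), ("semi-blocking", semi_blocking)):
        tau = daly_interval(cost, mttf)
        overhead = failure_free_overhead(cost, tau)
        print(f"{name}: interval = {tau:.1f} s, overhead = {overhead:.2%}")
```

In this simplified picture, a smaller blocking cost both shortens the optimum interval and lowers the per-interval overhead. A real semi-blocking protocol also pays for interference between the overlapped transfer and the application, which this sketch leaves out.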