Efficient, Language-Based Checkpointing for Massively Parallel Programs
PPL Technical Report 1995
Publication Type: Paper
Abstract
Checkpointing and restart is an approach to ensuring forward
progress of a program in spite of system failures or planned
interruptions. We investigate checkpointing and restart of programs
running on massively parallel processor (MPP) computers. We identify
a new set of issues that must be considered on the MPP platform and,
guided by them, design an approach that resides in the language and
run-time system. As a result, our checkpointing facility can be used
portably on virtually any parallel machine,
irrespective of whether the operating system supports
checkpointing. We present methods to make checkpointing and restart
space- and time-efficient, including object-specific functions that
save the state of an object. We present techniques to automatically
generate checkpointing code for parallel objects, without
programmer intervention. We also present mechanisms that let the
programmer selectively incorporate application-specific knowledge
to make checkpointing more efficient. The
techniques developed here have been implemented in the Charm++
parallel object-oriented programming language and run-time system.
Performance results are presented for the checkpointing overhead of
programs running on parallel machines.
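
To illustrate what an object-specific state-saving function might look like, the following is a minimal C++ sketch. It is not taken from the report and does not use the Charm++ API; the names CheckpointBuffer, Grid, checkpoint, and restore are hypothetical and chosen only for illustration. It assumes an in-memory byte buffer in place of a real checkpoint file, and it shows one way application-specific knowledge can shrink a checkpoint: a cached value that can be recomputed is left out and rebuilt on restart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical byte buffer standing in for a checkpoint file.
class CheckpointBuffer {
public:
    // Append the raw bytes of a trivially copyable value.
    template <typename T>
    void write(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types are supported");
        const unsigned char* p =
            reinterpret_cast<const unsigned char*>(&value);
        bytes_.insert(bytes_.end(), p, p + sizeof(T));
    }

    // Read back the next value in the order it was written.
    template <typename T>
    T read() {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types are supported");
        assert(cursor_ + sizeof(T) <= bytes_.size());
        T value;
        std::memcpy(&value, bytes_.data() + cursor_, sizeof(T));
        cursor_ += sizeof(T);
        return value;
    }

    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
    std::size_t cursor_ = 0;
};

// Hypothetical parallel object with essential state (cells_) and
// derived state (cachedSum_) that can be recomputed after restart.
class Grid {
public:
    explicit Grid(std::vector<double> cells) : cells_(std::move(cells)) {
        rebuildCache();
    }

    // Object-specific save: write only the essential state.
    void checkpoint(CheckpointBuffer& out) const {
        out.write<std::size_t>(cells_.size());
        for (double c : cells_) out.write<double>(c);
    }

    // Object-specific restore: read the essential state; the
    // constructor recomputes the derived state.
    static Grid restore(CheckpointBuffer& in) {
        const std::size_t n = in.read<std::size_t>();
        std::vector<double> cells(n);
        for (double& c : cells) c = in.read<double>();
        return Grid(std::move(cells));
    }

    double sum() const { return cachedSum_; }
    const std::vector<double>& cells() const { return cells_; }

private:
    void rebuildCache() {
        cachedSum_ = 0.0;
        for (double c : cells_) cachedSum_ += c;
    }

    std::vector<double> cells_;
    double cachedSum_ = 0.0;  // not checkpointed
};
```

In this sketch the checkpoint holds only the element count and the cell values, so its size is independent of any derived data the object keeps. A restart calls Grid::restore on a buffer previously filled by checkpoint.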
TextRef
Sanjeev Krishnan and L. V. Kale, "Efficient, Language-Based Checkpointing for
Massively Parallel Programs", Parallel Programming Laboratory, Department of
Computer Science, University of Illinois, Urbana-Champaign, January 1995.