Clustering Parallel Applications to Enhance Message Logging Protocols
PPL Talk (PPL Talk) 2010
Publication Type: Talk
Summary
Computing systems will grow significantly larger in the near future to
satisfy the needs of computational scientists in areas like climate
modeling, biophysics and cosmology. Supercomputers being installed in
the next few years will comprise millions of cores, hundreds of
thousands of processor chips and millions of physical components.
However, failures are expected to become more prevalent in those
machines, to the point where 10% of an Exascale system will be wasted
just recovering from failures. Further, with such large numbers of
cores, fine-grained and dynamic load balancing will become increasingly
critical for maintaining good system utilization. This talk addresses
both fault tolerance and load balancing by presenting a novel
extension of traditional message logging protocols based on team
checkpointing. Message logging makes it possible to recover from
localized failures by rolling back just the failed processing
elements. Because logging all communication incurs a high memory
overhead, we reduce this cost by organizing processing elements into
teams and logging only the messages that cross team boundaries. Further, we show
how to dynamically partition the application into teams to
simultaneously minimize the cost of fault tolerance and balance
application load. We experimentally show that this scheme has low
overhead and can dramatically reduce the memory cost of message
logging.
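
The memory saving from logging only inter-team messages can be illustrated with a minimal Python sketch. It is not the implementation presented in the talk. The message format, the `team_of` mapping, the function name `logged_bytes`, and the traffic pattern are all hypothetical and chosen only for illustration. The sketch compares the log volume when every processing element (PE) is its own team, which corresponds to plain message logging, with the volume when PEs are grouped into two teams.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical message record: (sender PE, receiver PE, payload size in bytes).
Message = Tuple[int, int, int]


def logged_bytes(messages: Iterable[Message], team_of: Dict[int, int]) -> int:
    """Sum the payload bytes of messages whose endpoints are in different teams."""
    total = 0
    for src, dst, size in messages:
        # Messages that stay inside a team are not logged in this sketch.
        if team_of[src] != team_of[dst]:
            total += size
    return total


# Assumed traffic among four PEs: heavy within pairs, light across pairs.
messages = [(0, 1, 100), (1, 0, 100), (2, 3, 100), (1, 2, 40)]

# Plain message logging: each PE is its own team, so every message is logged.
singleton_teams = {pe: pe for pe in range(4)}

# Team-based logging: PEs 0-1 form one team and PEs 2-3 form another.
two_teams = {0: 0, 1: 0, 2: 1, 3: 1}

full_cost = logged_bytes(messages, singleton_teams)
team_cost = logged_bytes(messages, two_teams)
print(f"plain logging: {full_cost} bytes, team-based logging: {team_cost} bytes")
```

In this toy setting the log shrinks because the heaviest communication stays inside teams. The trade-off, as the term team checkpointing suggests, is presumably that the members of a team recover together rather than individually, which is why the choice of team partition matters.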