Towards Realizing the Potential of Malleable Jobs
IEEE International Conference on High Performance Computing (HiPC) 2014
Publication Type: Paper
Abstract
Malleable jobs are those which can dynamically
shrink or expand the number of processors on which they
are executing at runtime in response to an external command.
Malleable jobs can significantly improve system utilization and
reduce average response time, compared to traditional jobs.
To realize these benefits, three components are critical: an
adaptive job scheduler, an adaptive resource manager, and an
adaptive parallel runtime system. In this paper, we present a
novel mechanism for enabling shrink/expand capability in the
parallel runtime system using task migration and dynamic load
balancing, checkpoint-restart, and Linux shared memory. Our
technique performs true shrink/expand, eliminating the need for
any residual processes, requires little application programmer
effort, and is fast. Further, we establish a bidirectional communication
channel between the resource manager and the parallel
runtime, and present an asynchronous split-phase mechanism
for executing adaptive scheduling decisions. Performance results
using Charm++ on the Stampede supercomputer show the efficacy,
scalability, and benefits of our approach. Shrinking from 2k to
1k cores takes 16 s, while expanding from 1k to 2k takes 40 s. Also,
we demonstrate the utility of our runtime in traditional as well
as emerging scenarios, e.g., proactive fault tolerance and clouds.