Continuous Performance Monitoring for Large-Scale Parallel Applications
IEEE International Conference on High Performance Computing (HiPC) 2009
Publication Type: Paper
Repository URL: 2009_CCS_Projections
Abstract
Traditional performance analysis is carried out after a
parallel program has completed. In this paper, we describe an
online method for continuously monitoring the performance of a
parallel program, specifically the fraction of the time spent in
various activities as the program executes. We describe our
implementation of both a visualization client and the parallel
performance framework that gathers utilization data. The data gathering
uses a scalable and asynchronous reduction with an appropriate
lossless compressed data format. The overheads in the initial
system are low, even when run on thousands of processors. The data
gathering occurs over an out-of-band communication mechanism,
interleaving itself transparently with the execution of the
parallel application by leveraging a message-driven runtime system.
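
The data-gathering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the activity names, the record layout, and the helpers encode, decode, merge, tree_reduce, and fractions are all hypothetical, and the reduction runs synchronously in one process rather than asynchronously inside a message-driven runtime. It only shows how per-processor time counts can be merged pairwise in a tree, carried in a losslessly compressed form, and turned into fractions of time per activity.

```python
import struct
import zlib

# Hypothetical activity categories; the paper's actual categories may differ.
ACTIVITIES = ("compute", "communicate", "idle")


def encode(counts, nprocs):
    """Pack a processor count and per-activity microsecond totals, then compress.

    Integer totals keep the merge exact, and zlib compression is lossless.
    """
    raw = struct.pack(f"<I{len(counts)}Q", nprocs, *counts)
    return zlib.compress(raw)


def decode(blob):
    """Inverse of encode: return (per-activity totals, processor count)."""
    raw = zlib.decompress(blob)
    n = (len(raw) - 4) // 8  # 4-byte count header, then 8 bytes per total
    nprocs, *counts = struct.unpack(f"<I{n}Q", raw)
    return list(counts), nprocs


def merge(blob_a, blob_b):
    """Combine two partial results by summing totals and processor counts."""
    a, na = decode(blob_a)
    b, nb = decode(blob_b)
    return encode([x + y for x, y in zip(a, b)], na + nb)


def tree_reduce(blobs):
    """Pairwise (tree-shaped) reduction over a non-empty list of partial results."""
    blobs = list(blobs)
    while len(blobs) > 1:
        paired = [merge(blobs[i], blobs[i + 1])
                  for i in range(0, len(blobs) - 1, 2)]
        if len(blobs) % 2:  # an unpaired result moves up a level unchanged
            paired.append(blobs[-1])
        blobs = paired
    return blobs[0]


def fractions(blob, interval_us):
    """Fraction of total processor time spent in each activity."""
    counts, nprocs = decode(blob)
    total = nprocs * interval_us
    return {name: c / total for name, c in zip(ACTIVITIES, counts)}


if __name__ == "__main__":
    # Four simulated processors, each reporting one 1000-microsecond interval.
    samples = [[600, 300, 100], [500, 400, 100], [700, 200, 100], [200, 300, 500]]
    root = tree_reduce(encode(s, 1) for s in samples)
    print(fractions(root, 1000))
```

Summing integer totals makes the merge associative and exact, so the result does not depend on the shape of the reduction tree. That property is what would allow partial results to be combined in whatever order they happen to arrive.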