Runtime Systems and Tools:
Performance-analysis-based Introspective Control System to Tune Applications and Runtime
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although researchers in academia and industry have made significant progress, attaining high parallel efficiency across a range of applications on large supercomputers with millions of cores remains challenging. Performance tuning has therefore become both more important and more difficult than ever. Instead of full automation or completely manual tuning, we take the approach of cooperative automation. In this approach, application developers provide knobs (control points) that reconfigure the application in ways whose effects on performance are known. The runtime system then adjusts the configuration automatically, based on runtime observations and its understanding of the effects of each control point.

We have designed and developed PICS (Performance-analysis-based Introspective Control System), which is used to tune parallel programs. PICS provides a generic interface for describing control points and their effects; this is how application-specific knowledge is exposed to the runtime system. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the performance data and a decision tree that encodes expert knowledge, program characteristics are extracted to guide the search for optimal configurations of the control points.
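The idea of a control point annotated with its effect can be illustrated with a minimal Python sketch. This is not the PICS interface: the names `ControlPoint`, `increase_reduces`, and `tune_step`, as well as the metric names and thresholds, are hypothetical and chosen only for illustration. A single threshold rule stands in for the decision tree described above.

```python
from dataclasses import dataclass


@dataclass
class ControlPoint:
    """Hypothetical description of a tunable knob (not the PICS API)."""
    name: str
    lo: int                  # smallest allowed value
    hi: int                  # largest allowed value
    value: int               # current setting
    increase_reduces: str    # metric expected to drop when the value is raised


def tune_step(points, metrics, thresholds):
    """Perform one tuning step.

    For every control point whose associated metric exceeds its threshold,
    raise the control point by one (staying within its range). Returns the
    names of the control points that were changed.
    """
    changed = []
    for cp in points:
        metric = cp.increase_reduces
        observed = metrics.get(metric, 0.0)
        limit = thresholds.get(metric, float("inf"))
        if observed > limit and cp.value < cp.hi:
            cp.value += 1
            changed.append(cp.name)
    return changed


# Illustrative values only; these are not measurements.
points = [
    ControlPoint("replicas", lo=1, hi=4, value=1, increase_reduces="comm_imbalance"),
    ControlPoint("grain_size", lo=1, hi=8, value=2, increase_reduces="overhead"),
]
observed = {"comm_imbalance": 0.6, "overhead": 0.1}
limits = {"comm_imbalance": 0.3, "overhead": 0.2}
changed = tune_step(points, observed, limits)
```

In this sketch the application supplies the knobs and their expected effects, and the tuning step picks which knob to move based on what was observed, which is the division of labor that cooperative automation relies on.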

We have demonstrated how PICS can be applied to both the runtime system and applications to optimize application performance. For example, our results show its effectiveness for ChaNGa, a full-fledged cosmology application, on 16,384 cores. Mirroring is a technique that reduces communication bottlenecks by replicating data on multiple processors and forwarding communication requests to the replicas. Figure 1 compares the time cost of calculating gravity without mirroring and with mirroring using different numbers of replicas. The top red curve is the time cost without mirroring, while the bottom green curve shows the cost when the number of replicas is chosen adaptively. The runtime converges on using 2 replicas.
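Why mirroring relieves a communication hotspot can be seen in a small, self-contained Python sketch. It is a toy model, not the ChaNGa or PICS implementation: the function `max_request_load` is hypothetical, and it assumes that requests for one heavily used piece of data are spread round-robin over the owner and its mirrors.

```python
from collections import Counter


def max_request_load(num_requests, owner, mirrors):
    """Return the largest number of requests served by any single processor.

    Toy model: request i is served by holder i mod k, where the k holders
    are the owner followed by its mirrors.
    """
    holders = [owner] + list(mirrors)
    load = Counter()
    for i in range(num_requests):
        load[holders[i % len(holders)]] += 1
    return max(load.values())


# Processor ids and request counts are illustrative only.
no_mirror = max_request_load(1000, owner=0, mirrors=[])
one_mirror = max_request_load(1000, owner=0, mirrors=[5])
two_mirrors = max_request_load(1000, owner=0, mirrors=[5, 9])
```

The model ignores the cost of creating and updating replicas. That cost is one reason the best replica count is not simply "as many as possible" and is worth tuning at run time.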

[PhD Thesis]
Software Topological Message Aggregation Techniques For Large-scale Parallel Systems [Thesis 2015]
PICS: A Performance-Analysis-Based Introspective Control System to Steer Parallel Applications [ROSS 2014]
[PhD Thesis]
Intelligent Runtime Tuning of Parallel Applications With Control Points [Thesis 2010]
Control Points for Adaptive Parallel Performance Tuning [PPL Technical Report 2008]