
Power, Reliability, and Performance: One System to Rule Them All
In a design based on the Charm++ parallel programming framework, an adaptive runtime system dynamically interacts with a datacenter's resource manager to control power by intelligently scheduling jobs, reallocating resources, and reconfiguring hardware. It simultaneously manages reliability by cooling the system to the running application's optimal level, and maintains performance through load balancing. For more details on this work, please read our article published in the October 2016 issue of Computer.
FlipBack: Automatic Targeted Protection Against Silent Data Corruption
With and without FlipBack protection

The decreasing size of transistors has been critical to the increase in capacity of supercomputers. However, the smaller the transistors are, the less energy is required to flip a bit, and thus silent data corruptions (SDCs) become more common. In this paper, we present FlipBack, an automatic software-based approach that protects applications from SDCs. FlipBack provides targeted protection for different types of data and calculations based on their characteristics. It leverages compile-time analysis and program markup to identify data critical for control flow, and enables selective replication of computation by the runtime system. We evaluate FlipBack with various HPC mini-applications that capture the behavior of real scientific applications and show that FlipBack is able to protect applications from most silent data corruptions with only 6-20% performance degradation.
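The selective-replication idea can be sketched in a few lines (an illustrative Python sketch, not FlipBack's actual implementation; the `replicated` wrapper, the example functions, and the injected fault are all hypothetical):

```python
def replicated(compute):
    """Run the wrapped computation twice; disagreement signals a likely SDC."""
    def guarded(*args):
        first = compute(*args)
        second = compute(*args)
        if first != second:
            raise RuntimeError("silent data corruption detected")
        return first
    return guarded

@replicated
def critical_sum(values):
    # marked critical (e.g., it feeds control flow), so it is replicated
    return sum(values)

def bulk_update(values):
    # non-critical bulk computation is left unprotected, keeping overhead low
    return [v * 2 for v in values]

# Fault injection: the first invocation is corrupted, the re-execution is not,
# so the wrapper detects the mismatch.
_calls = []

@replicated
def faulty_inc(x):
    _calls.append(1)
    return x + (1 if len(_calls) == 1 else 0)

try:
    faulty_inc(41)
    detected = False
except RuntimeError:
    detected = True
```

Protecting only the critical computations is what keeps the overhead in the reported 6-20% range rather than the 100%+ cost of replicating everything.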
OpenAtom: Scalable Ab-Initio Molecular Dynamics
OpenAtom is a parallel implementation of the Car-Parrinello Ab-initio Molecular Dynamics (CPAIMD) and Born-Oppenheimer Molecular Dynamics (BOMD) methods atop Charm++. It is used to simulate interesting problems in atomic and molecular systems based on fundamental quantum chemical principles. OpenAtom exploits the overdecomposition and asynchrony in Charm++ to automatically overlap computation and communication, and scales to thousands of cores with increased performance.

We have extended OpenAtom to support several interesting ab-initio molecular dynamics simulation methods beyond the Car-Parrinello and Born-Oppenheimer methods, including k-points, parallel tempering, and path integrals. Recent technical advances in OpenAtom include the construction of Charm-FFT, a generic library for calculating FFTs in Charm++ with minimal communication; the inclusion of generalized topology-aware mapping schemes; and the ability to run "Uber" or multi-instance methods in a single run.

The following graph highlights recent performance results we obtained on Blue Gene/Q and Cray XE6/XK7 systems using a variety of the methods newly integrated into OpenAtom. These results make use of topology-aware mapping and showcase Charm++'s unique suitability to this problem.
For more information, please see the OpenAtom research page (link), the OpenAtom project page (link), and our recent paper at ISC (link).
Mitigating Process Variation Through Dynamic Load Balancing

The design and manufacture of present-day CPUs causes inherent variation in supercomputer architectures, such as differences in power draw and temperature among chips. This variation also manifests itself as frequency differences among processors under Turbo Boost dynamic overclocking, and it can lead to unpredictable and suboptimal performance in tightly coupled HPC applications. In this study, we use compute-intensive kernels and applications to analyze the variation among processors in four top supercomputers: Edison, Cab, Stampede, and Blue Waters.

As shown in the figures below, we observe an execution time difference of up to 16% among processors on the Turbo Boost-enabled supercomputers: Edison, Cab, and Stampede. There is less than 1% variation on Blue Waters, which does not have a dynamic overclocking feature. We analyze measurements from temperature and power instrumentation and find that intrinsic differences in the chips' power efficiency are the culprit behind the frequency variation.

Moreover, we show that a speed-aware dynamic load balancing algorithm can reduce the negative effects of performance variation, improving performance by up to 18% compared with no load balancing and by 6% over the non-speed-aware counterpart.
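The gain from speed awareness can be illustrated with a greedy earliest-finish assignment (a sketch under invented speeds and loads, not the paper's exact algorithm):

```python
import heapq

def speed_aware_assign(task_loads, speeds):
    """Greedy earliest-finish assignment; returns each processor's finish time."""
    heap = [(0.0, p) for p in range(len(speeds))]   # (projected finish, proc)
    heapq.heapify(heap)
    finish = [0.0] * len(speeds)
    for load in sorted(task_loads, reverse=True):   # place largest tasks first
        t, p = heapq.heappop(heap)
        finish[p] = t + load / speeds[p]            # faster chips finish sooner
        heapq.heappush(heap, (finish[p], p))
    return finish

# Two processors running at 2.0x speed and two at 1.0x, twelve unit tasks:
speeds = [2.0, 2.0, 1.0, 1.0]
speed_aware = max(speed_aware_assign([1.0] * 12, speeds))
# A speed-oblivious even split gives every processor 3 tasks, so the slow
# chips determine the makespan:
oblivious = max(3 * 1.0 / s for s in speeds)
```

Here the speed-aware makespan is 2.0 versus 3.0 for the even split, because the faster chips absorb proportionally more work.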

For more details, please read our ICS'16 paper and our VarSys (IPDPS'16) paper.
Advances in AMPI
We released Charm++ 6.7.1, the latest stable version of Charm++, at the conclusion of our 14th annual Charm++ Workshop last week. While this release is primarily a bug fix release, Adaptive MPI has seen the addition of many new features. Adaptive MPI (AMPI) is an implementation of the MPI standard on top of Charm++. AMPI gives users with pre-existing MPI codes easy access to the high-level features of Charm++ and its runtime system. AMPI provides application-independent features like process virtualization, dynamic load balancing, overlap of communication and computation, and automatic fault tolerance. It does so by implementing MPI ranks as lightweight, migratable user-level threads that are scheduled and can be load balanced by the Charm++ runtime system.

Application developers need only re-compile their code with AMPI's compiler wrappers (ampicc, ampif90, etc.) to begin experimenting with its functionality. The only change needed to pre-existing MPI codes is to privatize mutable global/static variables before running with virtualization. The 6.7.1 release contains many updates to AMPI. We have bumped AMPI's official support to MPI-2.2, and we have implemented the non-blocking collectives and neighborhood collectives from MPI-3.1. We have also renamed AMPI's extensions to the MPI standard to all be prefixed with AMPI_ rather than MPI_, so as to avoid polluting MPI's namespace. To learn more about AMPI, see the Workshop talk on AMPI (link) or consult AMPI's manual (link).
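Why mutable globals must be privatized can be illustrated language-neutrally (a hypothetical Python analogy; in real MPI codes the fix is done in C or Fortran, e.g., by moving mutable globals into a structure owned by each rank):

```python
counter = 0  # a process-wide mutable "global", shared by all virtual ranks

def unsafe_step(rank):
    global counter
    counter = rank      # every virtual rank overwrites the same variable
    return counter

class RankState:
    """Privatized version: each virtual rank carries its own copy of the state."""
    def __init__(self):
        self.counter = 0
    def step(self, rank):
        self.counter = rank
        return self.counter

# Four "virtual ranks" scheduled inside one process, as AMPI threads would be:
for r in range(4):
    unsafe_step(r)
shared_result = counter                      # only the last writer survives

ranks = [RankState() for _ in range(4)]
for r, state in enumerate(ranks):
    state.step(r)
privatized = [state.counter for state in ranks]  # each rank keeps its own value
```

With the shared global, every rank sees whatever the last scheduled rank wrote; with per-rank state, each virtual rank remains correct regardless of scheduling.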
Empowering PDES with an Adaptive Runtime System

Parallel Discrete Event Simulation (PDES) is a type of simulation that differs significantly from many common HPC applications. In PDES, the simulation progresses by processing events that happen at discrete points in virtual time. The distribution of these events in time, as well as how the events affect the state of the simulation, may result in a very irregular pattern of communication that is not known a priori. Due to the dynamic and irregular nature of PDES, we believe that Charm++ makes a good target for implementing a scalable PDES engine. Over the past two years we have collaborated with the ROSS team at RPI to reimplement ROSS on top of Charm++. ROSS is a PDES engine that has been developed and highly optimized on top of MPI. It has shown great scalability when running models with little inter-process communication; however, it has shown some limitations for more communication-intensive models. Because the Charm++ runtime system is able to effectively handle asynchronous communication, overlapping it automatically with computation, the Charm++ version of ROSS is able to overcome some of these limitations. As shown in the plot below, as we increase the amount of inter-process communication, the Charm++ version of ROSS achieves up to 40% higher event rates than the MPI version on up to 2 million cores. More details are in the PADS'16 paper.
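The core of a discrete-event engine can be sketched as follows (a minimal sequential illustration of virtual-time event processing, not ROSS or its Charm++ port; the token-passing model is invented):

```python
import heapq

def simulate(initial_events, handler, end_time):
    """Process (virtual_time, entity, payload) events in virtual-time order."""
    queue = list(initial_events)
    heapq.heapify(queue)
    processed = []
    while queue:
        t, entity, payload = heapq.heappop(queue)
        if t > end_time:
            break
        processed.append((t, entity))
        for event in handler(t, entity, payload):   # handlers schedule new events
            heapq.heappush(queue, event)
    return processed

# Toy model: a token hops around a ring of 4 entities, one virtual time unit per hop.
def hop(t, entity, payload):
    if payload > 0:
        return [(t + 1.0, (entity + 1) % 4, payload - 1)]
    return []

trace = simulate([(0.0, 0, 3)], hop, end_time=10.0)
```

Because which entity receives the next event depends on the payloads computed at runtime, the communication pattern is data-dependent and irregular, which is exactly what makes parallelizing this loop hard.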
Towards Realizing the Potential of Malleable Jobs
Malleable jobs are those that can dynamically shrink or expand the number of processors on which they are executing, at runtime, in response to an external command. They can significantly improve system utilization and reduce average response time compared to traditional jobs. Three components are critical in a malleable system: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. The figures below show the interaction between these components and how an adaptive runtime system shrinks the number of processors.

We add shrink and expand capabilities to the Charm++ adaptive parallel runtime system and show the efficacy, scalability, and benefits of our approach, which is based on task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Shrinking from 2k to 1k cores takes 16 s, while expanding from 1k to 2k cores takes 40 s. The figures below show the performance of rescaling with different problem sizes and different numbers of processors.

In later collaborative work, we present an implementation of an adaptive resource manager, making Torque/Maui malleable with a malleable job-scheduling algorithm, and show its integration with Charm++. For more details on the Charm++ implementation of malleability, see our HiPC'14 paper; for more details on malleability with Torque/Maui, see our IPDPS'15 paper.
Recovering Logical Structure from Charm++ Traces
Asynchrony and non-determinism in Charm++ programs present a significant challenge in analyzing their event traces. In this paper, published at SC’15, we present a new framework to organize event traces of parallel programs written in Charm++. Our reorganization allows one to more easily explore and analyze such traces by providing context through logical structure. We describe several heuristics to compensate for missing dependencies between events that currently cannot be easily recorded. We introduce a new task ordering that recovers logical structure from the non-deterministic execution order. Using the logical structure, we have developed several metrics to help guide developers to performance problems. The approach presented in this paper is demonstrated through two proxy applications written in Charm++: Jacobi2d (a stencil code) and LULESH (a hydrodynamics simulation code).

We use the term logical structure to denote an exact ordering of events reflecting patterns of parallel dependency behavior as designed by the application's developers. Generally, programming such applications requires breaking the problem into smaller pieces, or phases, some of which require parallel interaction. The logical structure separates these phases, which may be interleaved in physical time. Similarly, within each phase, the events are ordered by their dependencies into discrete logical steps rather than by their true physical time order and displacement. This matches how the events were programmed: with logical ordering, rather than known timing information, guiding the design.
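One simple way to assign logical steps (a simplified sketch of the idea; the paper's heuristics for missing dependencies are not shown) is to set each event's step to one past the latest step among its dependencies, ignoring physical timestamps entirely:

```python
def logical_steps(deps):
    """deps: event -> list of events it depends on; returns event -> logical step."""
    step = {}
    def visit(e):
        if e not in step:
            # one past the latest dependency; events with no deps get step 0
            step[e] = 1 + max((visit(d) for d in deps[e]), default=-1)
        return step[e]
    for e in deps:
        visit(e)
    return step

# 'a' and 'b' are independent, so they share a step even if their physical
# timestamps were interleaved; 'c' waits on both, and 'd' waits on 'c'.
steps = logical_steps({"a": [], "b": [], "c": ["a", "b"], "d": ["c"]})
```

Events that land in the same step executed concurrently in the logical sense, no matter how non-deterministically they were interleaved in the physical trace.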

In the figure above, we show the logical and physical timelines of the Jacobi2d application. In the logical view, the reasons for the idle time experienced by the application become more evident: namely, a reduction that is mapped directly to processors.

For more details see the SC'15 paper.
TraceR: Parallel Trace Replay Tool for HPC Network Simulations

TraceR is a trace replay tool built upon the ROSS-based CODES simulation framework. It replays traces of production HPC applications under an optimistic parallel discrete-event paradigm using reversible computing. It can be used to predict network performance and understand network behavior by simulating messaging on interconnection networks. It addresses two major shortcomings of current network simulators: first, it enables fast and scalable simulations of large-scale supercomputer networks; second, it can simulate production HPC applications using BigSim's emulation framework.

The figure below presents the execution time for simulating a 3D stencil on various node counts of a 3D torus. TraceR scales well for all simulated system sizes. For simulations with half a million (512K) nodes, the execution time is 542 seconds using the 3D TorusNet model.

We also perform a validation study with the LeanMD application on a 5D torus, comparing real execution time with simulated execution time in the figure below. We observe that for all node counts (512 to 8192 nodes) the prediction error is less than 9%. In most cases, the predicted time is within 2% of the observed execution time.

For more details:
Improving fine-grained communication performance with topological message aggregation

It is often convenient to use fine-grained messages to implement the interactions between objects in a Charm++ program, because the typically small work and data units that result from object-based decomposition tend to produce small message payloads. While the utility of fine-grained communication is clear, its performance characteristics are poor due to overhead: much of the per-message overhead at the application, RTS, and network levels is independent of message size, and in applications with a large volume of fine-grained messages this overhead can dominate run time. The Charm++ Topological Routing and Aggregation Module (TRAM) is a library that improves fine-grained communication performance by coalescing fine-grained communication units, or data items, into larger messages. TRAM constructs a virtual topology comprising the processes in the parallel run, and defines the peers of a process as all processes that can be reached by traveling an arbitrary number of hops along a single dimension of that topology. Aggregating data items separately for each destination would require substantial memory and would limit the potential for combining concurrent data items that have different destinations but common sub-paths. To address this issue, TRAM aggregates data items at the level of peers: items whose destinations are not among the peers of the source process are routed through one or more intermediate destinations along a minimal route. As a result, the memory footprint of the buffer space normally fits in lower-level cache.
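The peer-based routing can be sketched on a virtual 2D process grid (an illustrative sketch; the grid shape and the canonical dimension order are assumptions, not TRAM's documented behavior):

```python
def tram_route(src, dst):
    """Intermediate destinations visited en route, one dimension fixed per hop."""
    route = []
    cur = list(src)
    for dim in range(len(src)):          # fix coordinates in a canonical order
        if cur[dim] != dst[dim]:
            cur[dim] = dst[dim]          # a peer lies any distance along one dimension
            route.append(tuple(cur))
    return route

# An item from (0, 0) to (2, 1) travels via the intermediate peer (2, 0), where
# it shares a buffer with every other item whose path also passes through (2, 0).
path = tram_route((0, 0), (2, 1))
```

Because a process only ever buffers items per peer rather than per final destination, the number of aggregation buffers grows with the sum of the dimension sizes instead of their product, which is why the buffer space stays small.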

TRAM can dramatically improve performance for applications dominated by fine-grained communication. One example is the contagion simulator EpiSimdemics, where TRAM provides up to a 4x speedup compared to direct use of fine-grained messages. In strong scaling scenarios for EpiSimdemics, TRAM also allows for better scaling than an application-specific point-to-point aggregation approach. As the number of nodes in a run increases, the individual sends are distributed throughout the network, so direct aggregation at the source for each destination fails to fill large enough buffers to sufficiently mitigate communication overhead. By comparison, topological aggregation maintains the advantage of aggregation to much higher node counts.

Link to thesis.
ICPP paper.
SC paper.
Migratable Objects Make Supercomputing More Reliable

It has long been recognized by the HPC community that reliability is one of the major hurdles in reaching extreme scale. The same gigantic number of components that will make supercomputers powerful will make them equally vulnerable to frequent failures. Estimates differ as to how often extreme-scale machines will crash, but there is a consensus that applications will require some form of fault tolerance.
Migratable objects facilitate the task of developing new resilient algorithms, or refining existing ones. The very nature of the programming model, on which Charm++ is based, allows more flexibility in the execution of an application. That same flexibility is crucial in adapting to failures.
In a recent journal publication, we collected our work on fault tolerance strategies using migratable objects. The figure shows the efficiency of those strategies when projected to extreme scale. Traditional checkpoint/restart based on a network file system will not scale far. However, schemes enriched with migratable objects may keep the efficiency of extreme-scale systems above 60%. Among those schemes are checkpoint/restart based on local storage, proactive evacuation, and parallel recovery with message logging. A comprehensive approach combining all these schemes may increase the efficiency of a system to almost 80% at extreme scale.

Link to IEEE Transactions on Parallel and Distributed Systems publication
Link to research area page on Fault Tolerance
Charm++ and MPI: Combining the Best of Both Worlds

This work explores a framework that enables hybrid parallel programming with MPI and Charm++, and allows programmers to develop different modules of a parallel application in these two languages while facilitating smooth interoperation. MPI and Charm++ embody two distinct perspectives for writing parallel programs. While MPI provides a processor-centric, user-driven model for developing parallel codes, Charm++ supports work-centric, overdecomposition-based, system-driven parallel programming. One or the other can be the best or most natural fit for distinct modules that constitute a parallel application. In the published work linked below, we describe the challenges in enabling interoperation between MPI and Charm++, and present techniques for managing the control flow and resource sharing in such scenarios. As the attached figure indicates, interoperation between MPI and Charm++ has been enabled by exposing the Charm++ scheduler to the programmer, and enabling the reuse of Charm++ and MPI applications as external libraries. We also demonstrate the benefits of interoperation between MPI and Charm++ through several case studies that use production applications and libraries, including CHARM/Chombo, EpiSimdemics, NAMD, FFTW, MPI-IO and ParMETIS.

Link to the paper published in IPDPS, 2015.
Link to the presentation at IPDPS, 2015.
Adaptive Techniques for Clustered N-Body Cosmological Simulations
Fig 1

Simulating the process of cosmological structure formation with enough resolution to determine galaxy morphologies requires an enormous dynamic range in space and time. ChaNGa is a parallel n-body+SPH cosmology code implemented using Charm++ for the simulation of astrophysical systems on a wide range of spatial and time scales. Scaling such codes to large processor counts requires overcoming not only spatial resolution challenges, but also large ranges in timescales. In this work, we use an approach called multi-stepping, which uses different time steps for different particles according to their dynamical time scales, leading to an algorithm that is challenging to parallelize effectively. We present the optimizations implemented in ChaNGa that allow it to scale to large numbers of processors and address the challenges brought on by the high dynamic ranges of clustered datasets. We demonstrate strong scaling results for uniform datasets on up to 512K cores of Blue Waters, evolving 12 and 24 billion particles (Fig 1). We also show strong scaling results for the cosmo25 and dwarf datasets, which are more challenging due to their highly clustered nature. In Fig 2, we can see that we obtain good performance on up to 128K cores of Blue Waters, and show up to a 3-fold improvement in time with multi-stepping over single-stepping.
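The multi-stepping idea can be sketched with power-of-two time-step rungs (an illustrative simplification of hierarchical timestepping, not ChaNGa's actual scheduler):

```python
def active_rungs(substep, max_rung):
    """Rungs integrated at a given substep; rung r uses timestep dt / 2**r."""
    return [r for r in range(max_rung + 1)
            if substep % (2 ** (max_rung - r)) == 0]

# With max_rung = 2, one big step consists of 4 substeps: the fastest particles
# (rung 2) move every substep, rung 1 every other substep, and the slow rung 0
# particles only once per big step.
schedule = [active_rungs(i, 2) for i in range(4)]
```

Since only a (possibly tiny, clustered) subset of particles is active at most substeps, the available parallelism fluctuates sharply from substep to substep, which is the load-balancing challenge the paper addresses.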

Link to the paper: PDF
Fig 2
Scalable Asynchronous Contact Mechanics for Cloth Simulation

The Asynchronous Contact Mechanics (ACM) algorithm is a reliable method for simulating flexible materials subject to complex collisions and contact geometries; for example, ACM can be used for cloth simulation in animation. Parallelizing ACM is challenging due to its highly irregular communication pattern, its need for dynamic load balancing, and its extremely fine-grained computations. In our recent work, we utilize Charm++ to address these challenges and show good strong scaling of ACM to 384 cores for problems with fewer than 100k vertices. By comparison, the previously published shared-memory implementation only scales well to about 30 cores for the same examples. We demonstrate the scalability of our implementation through a number of examples which, to the best of our knowledge, are only feasible with the ACM algorithm. In particular, for a simulation of 3 seconds of a cylindrical rod twisting within a cloth sheet, the simulation time is reduced by 12x, from 9 hours on 30 cores to 46 minutes using our implementation on 384 cores of a Cray XC30.
The figure above shows the performance comparison between the Charm++ and TBB implementations of ACM.
For more details, please refer to the IPDPS'2015 publication here.
PARM: Power Aware Resource Management Under Power Budget

Recent advances in processor and memory hardware design have made it possible to control the power consumption of the CPU and memory through software; e.g., the power consumption of Intel's Sandy Bridge family of processors can be user-controlled through the Running Average Power Limit (RAPL) interface. It has been shown that an increase in the power allowed to the processor (and/or memory) does not yield a proportional increase in the application's performance. As a result, for a given power budget, it can be better to run an application on a larger number of nodes with each node capped at lower power than on fewer nodes each running at its TDP; this is called overprovisioning. The optimal resource configuration for an application can be determined by profiling its performance for varying numbers of nodes, CPU power, and memory power, and then selecting the best-performing configuration for the given power budget. In this paper, we propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. Our online power-aware resource manager (PARM) uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. With a power budget of 4.75 MW, PARM can obtain up to a 5.2x improvement in job throughput compared with a power-unaware SLURM scheduling policy. In real experiments on a relatively small cluster, PARM obtained a 1.7x improvement. An adaptive runtime system allows further improvement by letting already-running jobs shrink and expand for optimal resource allocation. Link to the paper: PDF
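Configuration selection under a power budget can be sketched as follows (the profile numbers are invented for illustration, and `best_config` is a hypothetical helper, not PARM's model-driven optimizer):

```python
def best_config(profile, budget):
    """profile: {(nodes, watts_per_node): performance}; pick the fastest
    configuration whose total power fits the budget."""
    feasible = {cfg: perf for cfg, perf in profile.items()
                if cfg[0] * cfg[1] <= budget}
    return max(feasible, key=feasible.get) if feasible else None

# Invented profile: more capped nodes can beat fewer nodes at TDP, but capping
# too aggressively eventually hurts performance.
profile = {
    (4, 100): 1.00,   # 4 nodes at TDP (100 W each), 400 W total
    (5, 80):  1.15,   # overprovisioned: 5 capped nodes, same 400 W total
    (8, 50):  0.90,   # caps set too low starve the processors
}
choice = best_config(profile, budget=400)
```

PARM replaces the exhaustive profile with a performance model so that the same trade-off can be evaluated at any scale without measuring every configuration.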

Reconfiguring Cache Hierarchy with Adaptive HPC RTS
The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.

Figures 1(a) and 1(b) show the best cache configurations and the energy savings achieved by our method for different problem sizes of the HPCCG application. As illustrated, our approach can save significant energy when the problem size is either smaller or larger than the cache sizes.
Maximizing Network Throughput on the Dragonfly Interconnect

This work, accepted to SC'14, analyzes the behavior of the dragonfly network for various routing strategies, job placement policies, and application communication patterns. The study is based on a novel model that predicts traffic on individual links for direct, indirect, and adaptive routing strategies. In the paper, we analyze results for single jobs and some common parallel job workloads. The predictions presented in this paper are for a 100+ Petaflop/s prototype machine with 92,160 high-radix routers and 8.8 million cores.

The general observations are that randomized placement at the granularity of nodes and routers, and/or indirect routing, can help spread messaging traffic over the network and reduce hot-spots. If the communication pattern results in a non-uniform distribution of traffic, adaptive routing may provide significantly better traffic distributions by reducing hot-spots. For parallel job workloads, adaptive hybrid routing is useful for combining the good features of adaptive direct and adaptive indirect routing, and may provide a good traffic distribution with lower maximum traffic. Adaptive routing also improves the traffic distribution significantly in comparison to static routing. We also observed that allowing users to choose a routing scheme for their application can be beneficial in most cases on dragonfly networks; randomized placement at the granularity of nodes and routers is the suggested choice for such scenarios as well.

Paper link on PPL's website
Paper schedule at SC'14
Scalable Fault Tolerance with Partial-Order Dependencies
Winner of the Best Student Paper at CLUSTER'14!
Fault tolerance has become increasingly important as system sizes become larger and the probability of hard or soft errors increases. In this work, we focus on methodologies for tolerating hard errors: those that stem from hardware failure and cause a detectable fault in the system. Many approaches have been applied in this area, ranging from checkpoint/restart, where the application's state is saved to disk, to in-memory checkpointing, where the application incrementally saves its state to partner nodes. Both approaches, while effective, require the entire system to synchronize and roll back to the checkpoint to ensure consistent recovery. Message logging is an effective methodology for saving, in a localized manner, the messages that were sent, along with their order, which is encoded in a determinant. Message logging is also effective for deterministic replay, which is commonly used for discovering bugs.
For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application to maintain generality. However, in many applications, only a partial order must be recorded due to determinism intrinsic in the program, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In a paper published at CLUSTER'14 (to be presented in Madrid, Spain in September), we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different orderings and interleavings. By exploiting this framework, we improve on an existing scalable message-logging fault tolerance scheme that uses a total order. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2x lower overhead.
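The record-and-replay idea behind message logging can be sketched as follows (a toy illustration; real determinants are far more compact, and the partial-order scheme in the paper records even less than this total order):

```python
def record(arrival_order):
    """Forward execution: log the observed arrival order (the determinant)."""
    return list(arrival_order)

def replay(delivered, determinant):
    """Replay: force arrivals back into the recorded order, regardless of the
    order in which the transport happened to deliver them this time."""
    position = {msg: i for i, msg in enumerate(determinant)}
    return sorted(delivered, key=position.__getitem__)

determinant = record(["m2", "m0", "m1"])            # order seen before the failure
replayed = replay(["m0", "m1", "m2"], determinant)  # transport reordered them
```

The paper's contribution is to shrink the determinant: whenever two messages are commutative or their order is already forced by the execution model, no ordering entry needs to be recorded at all.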
Parallel Programming with Migratable Objects: Charm++ in Practice
migratable objects

Science and engineering applications are becoming increasingly complex. Further, we are faced with new challenges due to dynamic variations in the machine itself. In the face of these challenges, what key concepts are essential for parallel applications to take advantage of today's modern supercomputers? In the paper accepted at SC14, we describe key designs and attributes that have been instrumental in enabling efficient parallelization. The paper describes the state of practice of Charm++ using multiple mini-applications as well as some full-fledged applications, and showcases various features enabled by the execution model.

It is becoming increasingly apparent that adaptive runtime systems (RTSs) will play an essential role in addressing the challenges of scaling parallel applications. To support the RTS, however, the programming model must possess certain key attributes, including overdecomposition, asynchronous message-driven execution, and migratability. Overdecomposition, along with support for migratability, can be exploited by the runtime system to enable critical features and perform adaptive, intelligent strategies. Overdecomposition allows the runtime system to map and remap objects to processors at runtime, which can be exploited to enable features such as adaptive load balancing, automatic checkpoint and restart, thermal and power management, adaptive job scheduling, and communication optimizations. With migratability, the runtime system can perform dynamic load balancing by migrating objects from overloaded processors to underloaded ones, adapting to the application's characteristics. It can also automatically checkpoint the objects and restart them on a different number of processors. The shrink-and-expand feature also uses this capability to improve cluster utilization by adaptively changing the number of processors assigned to a job and migrating the corresponding objects. DVFS can be combined with dynamic load balancing for temperature control and to decrease cooling energy. The runtime system can also monitor the communication pattern and, with knowledge of the underlying topology, perform various communication optimizations. We show the benefits of these attributes and features on various mini-apps and full-scale applications run on modern-day supercomputers.
An Adaptive Parallel Sparse Triangular Solver
a triangular matrix

Solving sparse triangular linear systems is an important kernel for many numerical linear algebra problems in science and engineering simulations, e.g., linear systems and least-squares problems. It is used extensively in direct methods following a triangular factorization to solve the system (possibly with many right-hand sides), or to improve an approximate solution iteratively. It is also a fundamental kernel in many iterative methods (such as the Gauss-Seidel method) and in many preconditioners for other iterative methods (such as the Incomplete-Cholesky preconditioner for Conjugate Gradient). Unfortunately, the performance of parallel algorithms for triangular solution is notoriously poor, resulting in performance bottlenecks for many of these methods.

We present a novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods. By analyzing and reordering operations, our algorithm can often extract parallelism even for cases where most of the nonzero matrix entries are near the diagonal. Our main parallelism strategies are: (1) identify independent rows, (2) send data earlier to achieve greater overlap, and (3) process dense off-diagonal regions in parallel. We implemented our algorithm in Charm++, and present promising experimental results using numerous sparse matrices from real applications.
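Strategy (1), identifying independent rows, is closely related to classic level scheduling, which can be sketched as follows (an illustrative sketch of the standard technique, not the paper's adaptive heuristics):

```python
def levels(lower_nonzeros, n):
    """lower_nonzeros: {row: [columns j < row with a nonzero]}; returns the
    level of each row. Rows in the same level are mutually independent and
    can be solved in parallel once all earlier levels are done."""
    level = [0] * n
    for i in range(n):                    # increasing row order, so every
        deps = lower_nonzeros.get(i, [])  # dependency is already leveled
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    return level

# A 5x5 lower-triangular structure: rows 0 and 1 depend on nothing (level 0),
# rows 2 and 3 each depend on one of them (level 1), and row 4 on both (level 2).
lv = levels({2: [0], 3: [1], 4: [2, 3]}, n=5)
```

When most nonzeros sit near the diagonal, the dependency chains are long and the levels are thin, which is exactly the regime where the paper's reordering and off-diagonal heuristics extract additional parallelism.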

Using various matrices, the graph below compares the performance of our method with that of HYPRE (blue lines). HYPRE is a commonly used linear algebra package developed at Lawrence Livermore National Laboratory (LLNL). As shown, our method can exploit parallelism on many matrices, whereas HYPRE’s performance is nearly sequential in all cases. The performance of HYPRE is worse than sequential in many cases because of parallel overhead, although there is some improvement for large numbers of processors, probably due to cache effects. Overall, our method is a significant improvement over this existing code and will reduce the solution time for many problems.

The paper on this work has been accepted by Parallel Computing (ParCo), 2014. PhD student Ehsan Totoni received 3rd place in the ACM Student Research Competition (SRC) Grand Finals 2014 based on this research.
Steering Parallel Applications via Introspection
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency for diverse applications on large supercomputers with millions of cores remains challenging. Therefore, performance tuning has become more important, and more challenging, than ever before. Instead of full automation or completely manual tuning, we take the approach of cooperative automation. In this approach, the application developers provide knobs (control points) that reconfigure applications in ways that affect performance in specific ways. We then allow the runtime system to adjust the configuration automatically based on runtime observations and its understanding of the effects of each control point.

We have designed and developed PICS, a Performance-analysis-based Introspective Control System for tuning parallel programs. PICS provides a generic interface for describing control points and their effects, which is how application-specific knowledge is exposed to the runtime system. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the performance data and a decision tree encoding expert knowledge, program characteristics are extracted to guide the search for optimal configurations of the control points.
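The control-point idea can be sketched as a small registry in which the application declares each knob together with the direction of its effect, and the tuner nudges the knob based on an observed metric. This is a hypothetical interface written for illustration only; it is not the actual PICS API:

```cpp
#include <string>
#include <map>
#include <algorithm>

// Hypothetical control-point registry in the spirit of PICS. The
// application registers tunable knobs with their ranges and effects;
// the runtime adjusts each knob from an observed idle fraction.
enum class Effect { IncreasesParallelism, DecreasesOverhead };

struct ControlPoint {
    int value, lo, hi;
    Effect effect;
};

class Tuner {
    std::map<std::string, ControlPoint> points_;
public:
    void registerPoint(const std::string& name, int lo, int hi,
                       int initial, Effect e) {
        points_[name] = {initial, lo, hi, e};
    }
    // High idle time suggests too little parallelism: raise knobs that
    // increase parallelism; otherwise lower them to cut overhead.
    int adjust(const std::string& name, double idleFraction) {
        ControlPoint& cp = points_.at(name);
        int step = (idleFraction > 0.1) ? 1 : -1;
        if (cp.effect == Effect::DecreasesOverhead) step = -step;
        cp.value = std::min(cp.hi, std::max(cp.lo, cp.value + step));
        return cp.value;
    }
};
```

The real system additionally consults a decision tree of expert knowledge and measured program characteristics before moving a knob, rather than reacting to a single metric.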

We have demonstrated how PICS can be applied to both the runtime system and applications to optimize application performance. For example, our results show its effectiveness for ChaNGa, a full-fledged cosmology application, on 16,384 cores. Mirroring is a technique that reduces communication bottlenecks by replicating data on multiple processors and forwarding communication requests to the replicas. Figure 1 compares the cost of calculating gravity without mirroring and with mirroring using different numbers of replicas. The top red curve is the cost without mirroring, while the bottom green curve shows the cost of using replicas adaptively; the runtime converges on using 2 replicas.

The paper on this work was accepted at ROSS '14 in Munich, Germany.
An Epidemic Algorithm for Dynamic Load Balancing
Fig 1
Load imbalance is an insidious factor that can significantly reduce the performance of a parallel application. For many applications, the computational load varies over time; such applications require dynamic load balancing to improve performance. Many applications employ centralized load balancing strategies, in which load information is collected on a single processor and the decision algorithm runs sequentially. Such strategies have been shown to be effective up to a few hundred or a thousand processors, but beyond that they present a clear performance bottleneck. An alternative to centralized strategies are distributed strategies, e.g., diffusion-based schemes, in which each processor makes autonomous decisions based on its local view of the system, typically the load of its neighboring processors. Such strategies are scalable, but tend to yield poor load balance due to the limited local information.

We propose a fully distributed strategy, GrapevineLB, designed to overcome this drawback by obtaining a partial representation of the global state of the system and basing load balancing decisions on it. We use a light-weight information propagation scheme based on an epidemic algorithm (also known as a gossip protocol) to spread information about the underloaded processors in the system to the overloaded processors, in the same fashion that gossip spreads through the grapevine in a society. Based on this information, GrapevineLB probabilistically transfers work units to obtain a good load distribution. The proposed algorithm is scalable and can be tuned to optimize for either cost or performance. We demonstrated its effectiveness in comparison to several other load balancing strategies for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of Blue Gene/Q.
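The epidemic propagation step can be modeled in a few lines: initially only a few processors know which processors are underloaded, and in each round every informed processor forwards that information to a small number of random peers, so coverage grows roughly exponentially. This is a toy simulation of the spreading behavior, not the GrapevineLB implementation:

```cpp
#include <vector>
#include <random>

// Simulate epidemic (gossip) information spread: informed processors
// forward the load information to `fanout` random peers each round.
// Returns the number of rounds until every processor is informed.
int gossipRounds(int numProcs, std::vector<bool> informed, int fanout,
                 unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, numProcs - 1);
    int rounds = 0;
    auto allInformed = [&] {
        for (bool b : informed) if (!b) return false;
        return true;
    };
    while (!allInformed()) {
        std::vector<int> senders;
        for (int p = 0; p < numProcs; ++p)
            if (informed[p]) senders.push_back(p);
        for (int s : senders)
            for (int k = 0; k < fanout; ++k)
                informed[pick(rng)] = true;  // forward to a random peer
        (void)senders;
        ++rounds;
    }
    return rounds;
}
```

With a constant fanout, the expected number of rounds grows only logarithmically with the processor count, which is what makes the scheme light-weight at scale.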

Figure 1 shows the total time for the AMR application, which includes the time for load balancing as well as the application time per step. Since AMR requires frequent load balancing, the overhead of the centralized and hierarchical strategies, AmrLB and HierarchicalLB, exceeds the benefit of load balancing. DiffusionLB gives a marginal benefit. GrapevineLB provides a large performance gain by achieving a better load balance while incurring less overhead.

The paper on this work was accepted at SC13 and was one of the finalists for the Best Student Paper award.
Large-scale Parallelization of Stochastic Integer Programs

In many real world optimization problems, such as resource allocation, the parameters that correspond to future outcomes are subject to enormous uncertainty. In stochastic optimization, we assume a probabilistic distribution over these unknown parameters and optimize the resource allocation across a collection of possibilities, known as scenarios. Applications of stochastic optimization span a diverse set of areas, including aircraft scheduling, production planning, financial planning, energy grid optimization, supply chain management, and many more.

The novelty of this work lies in combining high performance computing and stochastic optimization to solve stochastic integer optimization problems of a size not tackled before. We use a branch-and-bound based approach, the accepted norm for solving integer programs. Stochastic integer programs have characteristics that differ from the challenges of typical iterative scientific applications, and these characteristics make a scalable design very difficult to realize: coarse and unpredictable grain sizes, large memory footprints, a variable amount of available computation, and perturbable branch-and-bound trees. Additionally, in this application, better utilization of processors does not necessarily mean a faster time to the optimal solution.

We propose a nested parallelism approach that exploits the inherent scenario parallelism in stochastic optimization along with parallel exploration of the branch-and-bound tree vertices. State sharing across tree vertices allows interleaved vertex evaluations on the same linear program solver instance, which reduces the memory footprint as well as the total time to solution. We show strong scaling up to hundreds of cores: the application achieves a speedup of up to 32x while scaling from 3 to 480 cores, and attains an efficiency greater than 0.4 at 120 cores in 100% of the trials.
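The bound-and-prune pattern at the heart of branch-and-bound is easiest to see on a toy problem. The sketch below applies it to a 0/1 knapsack: an optimistic bound is computed at each tree vertex, and the subtree is pruned whenever that bound cannot beat the incumbent. This is a generic illustration of the technique, not the stochastic integer solver from the paper (which evaluates a linear program relaxation per vertex across scenarios):

```cpp
#include <vector>
#include <functional>

// Generic branch-and-bound on a 0/1 knapsack: recurse over take/skip
// decisions, pruning any subtree whose optimistic bound (current value
// plus all remaining item values) cannot exceed the best solution found.
struct Item { int weight, value; };

int knapsackBB(const std::vector<Item>& items, int capacity) {
    int best = 0;
    std::function<void(size_t, int, int)> branch =
        [&](size_t i, int usedW, int val) {
        if (val > best) best = val;
        if (i == items.size()) return;
        int optimistic = val;
        for (size_t j = i; j < items.size(); ++j) optimistic += items[j].value;
        if (optimistic <= best) return;  // prune: cannot beat incumbent
        if (usedW + items[i].weight <= capacity)      // take item i
            branch(i + 1, usedW + items[i].weight, val + items[i].value);
        branch(i + 1, usedW, val);                    // skip item i
    };
    branch(0, 0, 0);
    return best;
}
```

In the parallel setting, the two recursive calls become independent work units; how aggressively pruning propagates across workers is one reason better processor utilization does not automatically mean a faster time to the optimal solution.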

The paper on this work won the Best Paper award at the HiPC 2013 conference. The work is supported by the MITRE Corporation and the US Air Mobility Command.
Predicting Application Performance using Supervised Learning on Communication Features

Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or average number of hops per byte for messages in an application. These metrics make simplified assumptions about the cause of network congestion, and do not provide accurate correlation with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. Recently, we have attempted to model the performance of an application using communication data, such as the communication graph and network hardware counters. We have used supervised learning algorithms to correlate performance with prior and new metrics. We have also proposed new hybrid metrics that provide high correlation with application performance, and may be useful for accurate performance prediction.

The focus of the associated SC '13 publication is the use of supervised learning algorithms, such as forests of randomized decision trees, to correlate individual metrics and their combinations with application performance. In addition to using prior metrics for predicting performance, the work presents several new metrics. Some of these can be obtained by analyzing the communication graph, the network graph, and the mapping; others require an experimental or simulated application run. For the results in the paper, the latter are obtained from real experiments; as future work, we plan to develop simulation techniques to obtain them offline. Maximum dilation is an example of an analytically derived metric; maximum bytes on a link, in contrast, is an experimentally obtained one. In addition to these new metrics, we also use derived metrics that use information only from outlier nodes or links. In the paper, we present performance predictions using the randomized forests ensemble method for three different communication kernels: a two-dimensional five-point halo exchange, a three-dimensional 15-point halo exchange, and an all-to-all benchmark over sub-communicators. We show almost perfect correlation for runs on 16,384 and 65,536 cores of Blue Gene/Q. We also show predictions for a production application, pF3D, and for combining samples from different benchmarks into a single training and testing set.
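Two of the metrics mentioned above are straightforward to compute from a communication graph and a placement. The sketch below does so for tasks mapped onto a 2D mesh with Manhattan routing; it is an illustration of the metric definitions only (the paper's metrics also cover torus wraparound and experimentally measured per-link byte counts):

```cpp
#include <vector>
#include <cstdlib>
#include <algorithm>

// Maximum dilation and average hops-per-byte for a task mapping on a
// 2D mesh, assuming dimension-ordered (Manhattan) routing.
struct Edge { int src, dst; long bytes; };
struct Coord { int x, y; };

int hops(Coord a, Coord b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

int maxDilation(const std::vector<Edge>& comm,
                const std::vector<Coord>& placement) {
    int m = 0;
    for (const Edge& e : comm)
        m = std::max(m, hops(placement[e.src], placement[e.dst]));
    return m;
}

double avgHopsPerByte(const std::vector<Edge>& comm,
                      const std::vector<Coord>& placement) {
    double hopBytes = 0, total = 0;
    for (const Edge& e : comm) {
        hopBytes += double(hops(placement[e.src], placement[e.dst])) * e.bytes;
        total += e.bytes;
    }
    return total > 0 ? hopBytes / total : 0.0;
}
```

Both quantities are purely analytical, which is exactly why, on their own, they correlate imperfectly with execution time and benefit from being combined with experimentally obtained features in a learned model.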

Link to SC'13 paper associated with this work
A ‘Cool’ Way of Improving the Reliability of HPC Machines
(This work will be presented at SC13)

In order to move to the next generation of supercomputers, the HPC community will have to cope with two important challenges: increasing energy consumption and decreasing reliability. Studies predict that at exascale the overall reliability could be so precarious that failures become the norm. HPC researchers are working on improving present fault tolerance protocols to address this issue, while hardware researchers are working on improving the reliability of each component of the machine. In this paper, we combine hardware and software aspects to improve the reliability of HPC machines. It is well known that the fault rate doubles for every 10°C increase in core temperature. We exploit this fact to demonstrate how constraining core temperatures, combined with load balancing, can improve the reliability of parallel machines while also reducing the total execution time required by the application. We formulate a model that relates the total execution time of an application to reliability and to the slowdown associated with the temperature restraint, and we validate the model by implementing our approach and comparing against actual experimental results. For 200,000 sockets, our scheme can reduce the execution time of all three applications considered in our work by more than 35% while improving machine reliability by a factor of 1.48.
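The rule of thumb cited above, that the fault rate doubles for every 10°C rise in core temperature, implies a simple exponential model for the effect of a temperature cap. This helper computes the resulting fault-rate factor (a simple sketch of that one relationship; the paper's full model additionally accounts for the execution-time slowdown the cap induces):

```cpp
#include <cmath>

// Factor by which the fault rate changes when cores run at
// `constrainedTemp` instead of `baselineTemp` (degrees C), under the
// "fault rate doubles per 10 C" rule: factor = 2^((Tc - Tb) / 10).
double faultRateFactor(double baselineTemp, double constrainedTemp) {
    return std::pow(2.0, (constrainedTemp - baselineTemp) / 10.0);
}
```

For example, holding cores 10°C cooler halves the fault rate; the net reliability gain reported in the paper reflects this benefit traded off against the slowdown of the temperature restraint.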
Partitioning in Charm++
Support for system-driven partitioning has been added to Charm++ in the latest stable version 6.5.0.

In many scenarios, simultaneous execution of multiple instances of similar jobs is required or beneficial. For example, in NAMD, multiple instances of the same molecular system with similar but different initial conditions can be used to reduce the time to solution and to gain insight into the behavior of the molecular system. Alternatively, one may want to execute multiple small jobs together to obtain a better topology, or to insulate performance comparisons from differences in system noise. A simple but inefficient way to achieve this is to run the jobs independently (one by one or simultaneously) and use the file system for the necessary information exchange. In MPI programs, an application-specific solution based on communicators can be used. Both options cost productivity and leave a lot to be desired.

Charm++ offers a runtime-system-based solution that addresses most of the issues mentioned above. Based on user input - the number and size of partitions - the runtime system divides the set of allocated cores into partitions that are independent Charm++ instances. In the basic form, multiple instances of Charm++ programs, agnostic to partitions, execute to completion in their partitions, with the RTS ensuring synchronization at the beginning and the end. If required, several such programs can be stitched together using a simple wrapper and executed as a single job in their partitions. Communication among partitions is supported through Converse. Like any other message in Charm++, inter-partition messages are active entities that cause method invocation at the receiver. Communication is currently restricted to processes, i.e., an application message is directed to a process in a partition, not to a chare.
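For fixed-size partitions, the division described above amounts to simple arithmetic on ranks: each global process rank maps to a partition number and a rank local to that partition. The sketch below shows only this arithmetic for illustration; the actual division is performed inside the Charm++ RTS:

```cpp
// Map a global process rank to its (partition, local rank) pair when an
// allocation is split into equal partitions of `partSize` processes.
// Illustrative arithmetic only; the Charm++ RTS does this internally.
struct PartitionId { int partition, localRank; };

PartitionId toPartition(int globalRank, int partSize) {
    return { globalRank / partSize, globalRank % partSize };
}
```

Because each Charm++ instance sees only local ranks, programs written without any knowledge of partitions run unchanged inside one.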

NAMD has been augmented to make use of the partitioning framework (currently used by many users via the development version; to be released). Using a Tcl script as a controller, NAMD initiates simulations of similar systems in different partitions. Periodically, these partitions exchange information and help each other progress faster. Another existing use case for partitions is the replica-enhanced checkpoint framework (visit the Charm++ workshop website for more details).

The current stable version of Charm++ only supports partitioning of cores into fixed-size Charm++ instances. Many other features, including variable-sized partitions and topology-aware partitioning, are available in the development version of Charm++. Manual entry on partitioning.
Dynamic Load Balancing for HPC in Cloud
Driven by the benefits of elasticity and the pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, a heterogeneous and dynamic environment, and interference from other virtual machines (VMs) are bottlenecks for efficient HPC in the cloud.

Insufficient network performance is a major bottleneck for HPC in the cloud and has been widely explored. Two less explored challenges are resource heterogeneity and multi-tenancy, which are fundamental artifacts of running in the cloud. Clouds evolve over time, leading to heterogeneous configurations of processors, memory, and network. Multi-tenancy is likewise intrinsic to the cloud, since it enhances the business value of providing one, but it leads to multiple sources of interference through the sharing of CPU, cache, memory, and interconnect. For tightly-coupled HPC applications, heterogeneity and multi-tenancy can result in severe and unpredictable performance degradation, since one slow processor slows down the entire application. Moreover, the distribution of such interference in a cloud is fairly random and unpredictable. Hence, we need a mechanism to adapt to dynamic variation in the execution environment.

With its adaptivity features, Charm++ is naturally suited for deployment on cloud infrastructures. Our primary hypothesis is that the challenges of heterogeneity and of noise arising from multi-tenancy can be handled by an adaptive parallel runtime system. To validate this hypothesis, we explored adapting the Charm++ runtime system to virtualized environments and developed a dynamic load balancer for tightly-coupled iterative HPC applications in the cloud. It infers the static hardware heterogeneity of virtualized environments and also adapts to the dynamic heterogeneity caused by interference due to multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of the task distribution across VMs, our load balancer adapts to dynamic variations in cloud resources. The main idea is the periodic refinement of the task distribution using measured CPU loads, task loads, and idle times.
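The refinement step can be sketched as a greedy loop: given each VM's measured background load (interference seen by that VM) and the loads of its migratable tasks, move a task from the most loaded VM to the least loaded one whenever that lowers the peak load. This is a deliberately simplified illustration of the idea, not the load balancer's actual implementation:

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

// Greedy periodic refinement: migrate the smallest task from the most
// loaded VM to the least loaded VM while that reduces the maximum load.
// `background` models interference and static heterogeneity that cannot
// be migrated; only `tasks` move.
struct VM { double background; std::vector<double> tasks; };

double vmLoad(const VM& v) {
    return v.background +
           std::accumulate(v.tasks.begin(), v.tasks.end(), 0.0);
}

void refine(std::vector<VM>& vms, int maxMoves) {
    for (int m = 0; m < maxMoves; ++m) {
        auto cmp = [](const VM& a, const VM& b) {
            return vmLoad(a) < vmLoad(b);
        };
        auto hi = std::max_element(vms.begin(), vms.end(), cmp);
        auto lo = std::min_element(vms.begin(), vms.end(), cmp);
        if (hi == lo || hi->tasks.empty()) return;
        // Move the smallest task only if it actually lowers the peak.
        auto t = std::min_element(hi->tasks.begin(), hi->tasks.end());
        if (vmLoad(*lo) + *t >= vmLoad(*hi)) return;
        lo->tasks.push_back(*t);
        hi->tasks.erase(t);
    }
}
```

Note that the VM suffering interference ends up hosting fewer tasks: load migrates away from time-shared VMs without the balancer ever knowing the cause of the slowdown, which mirrors the paper's observation that measured idle time suffices in practice.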

In a paper accepted at IEEE/ACM CCGrid 2013, we demonstrate that even without accurate information about the nature and amount of heterogeneity (static and dynamic, but hidden from the user as an artifact of virtualization), the approach of periodically measuring idle time and migrating load away from time-shared VMs works well in practice. In that paper, we perform an experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, and demonstrate performance benefits of up to 45%. The figure shows that, for three different applications and different numbers of VMs, load balancing achieves a significant percentage reduction in execution time compared to the NoLB case in the presence of both effects - the inherent hardware heterogeneity and the heterogeneity introduced by multi-tenancy. Through experimental analysis, we also learned that the choice of load balancing period and computational granularity can have a significant impact on performance, but the optimal values depend on application characteristics, size, and scale. Runtime systems that can automate the selection and dynamic adjustment of such decisions will be increasingly useful in the future.

Our related research on HPC in the cloud includes evaluation of HPC performance in the cloud, identification of performance bottlenecks, and addressing them through HPC-aware cloud schedulers and a cloud-aware parallel runtime system. Related publications can be found on our HPC-in-cloud research page. This research received one of the HP Labs Innovation Research Program (IRP) awards for 2012.
Scalable Algorithms for Adaptive Mesh Refinement
As we move towards the exascale era, we need to reconsider traditional algorithms and programming models. Adaptive Mesh Refinement (AMR) is an efficient methodology for simulating differential equations on very large structured meshes, which arise in many domains, e.g., numerical cosmology, global atmospheric modeling, and mantle convection modeling. Regions of the mesh that need simulation at finer precision are refined to deeper levels, while others are kept at the same level or are coarsened. The mesh adapts itself as the simulation progresses (see Figure below).

In traditional approaches, each process is assigned a set of blocks based on a space-filling curve. This approach has several limitations: 1) it requires O(#blocks) memory per process to store the mesh structure, 2) it takes O(log P) time to locate neighboring blocks, 3) mesh restructuring requires O(d) rounds of global reductions (where d is the maximum depth of the mesh), and 4) load balancing takes O(#blocks) time. This approach therefore has a memory bottleneck and large time overheads for mesh restructuring and load balancing.

We developed a new design in which each mesh block acts as a first-class entity - a virtual processor. This offers several algorithmic as well as implementation benefits: 1) a block acts as a virtual processor, which allows overlap of its computation with the communication of other blocks on the same physical process; 2) a block can be uniquely identified by its location in the tree; 3) a block can be placed dynamically on any physical process, which forms the basis for dynamic load balancing; 4) a block acts as a unit of algorithm expression, which significantly reduces implementation complexity; 5) a block is the end-point of communication, and the underlying runtime system handles communication between arbitrary blocks; 6) only O(#blocks/P) memory per process is required to store the mesh structure (as compared to O(#blocks) in traditional algorithms); and 7) the simulation progresses asynchronously, with no barriers after each iteration, and each block enters the mesh restructuring phase asynchronously. The associated mesh restructuring algorithm requires only 2 system quiescence states (where the complexity of detecting quiescence is O(log P)), as compared to O(d) global reductions, each taking O(log P) time, in traditional approaches (details in the paper).
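Identifying a block by its location in the tree (benefit 2 above) can be sketched with a simple location code: in 2D, each refinement level contributes two bits naming the child quadrant, so a block's index is its path from the root, and the parent is recovered by dropping the last two bits. This is an illustrative encoding; the Charm++ implementation uses customized bit-vector array indices:

```cpp
#include <cstdint>

// Quadtree location code for a 2D AMR block: two bits per level encode
// the child quadrant, so the path from the root uniquely names a block.
struct BlockId { std::uint64_t path; int depth; };

BlockId child(BlockId b, int quadrant) {   // quadrant in [0, 3]
    return { (b.path << 2) | std::uint64_t(quadrant), b.depth + 1 };
}

BlockId parent(BlockId b) {                // drop the last two bits
    return { b.path >> 2, b.depth - 1 };
}
```

Because neighbors and parents are computable from the index alone, a block can message any other block directly through the runtime without consulting a per-process mesh directory, which is what removes the O(#blocks) memory requirement.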

We implemented this design in Charm++. Each block is an element of a dynamic chare array with customized bit-vector indices that identify a block by its location in the mesh. Since blocks are the fundamental unit of work, they need to be distributed equally across processes for load balance. We plugged in Grapevine, the gossip-based distributed load balancer provided by the Charm++ runtime; its performance is competitive with centralized load balancers while incurring negligible load balancing overhead. To empirically test our approach, we benchmarked a finite difference simulation of advection using a first-order upwind method in two-dimensional space (see Fig below). On a mesh with a maximum depth of 15, we achieved a strong scaling efficiency of 76.7% while scaling from 1k to 16k cores of BG/Q (see Fig above). Our new algorithm is fully distributed and highly asynchronous, enabling high performance for much more deeply refined computations than are currently practiced. For more details, please refer to the paper.
Energy Efficiency of Fault Tolerance Protocols
An exascale machine is expected to be delivered in the 2018-2020 time frame. Such a machine will be able to tackle some of the hardest computational problems and extend our understanding of Nature and the universe. To make that a reality, however, the HPC community has to solve a few important challenges. Resilience will become a prominent problem, because an exascale machine will experience frequent failures due to the large number of components it encompasses; some form of fault tolerance has to be incorporated in the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. Power has two dimensions. First, in a power-limited environment, all layers of the system, including the fault tolerance layer, have to adhere to that limitation. Second, power matters because of energy consumption: an exascale installation will have to pay a large energy bill. It is therefore fundamental to increase our understanding of the energy profile of different fault tolerance schemes. We evaluated three fault tolerance approaches: checkpoint/restart, message-logging, and parallel recovery. Using programs from different programming models, we showed that parallel recovery is the most energy-efficient solution for an execution with failures, and at the same time finishes the execution faster than the other approaches. We explored the behavior of these approaches at extreme scales using an analytical model: at large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and its energy consumption by 13% compared to checkpoint/restart. The figure plots the relative improvement in the energy consumption of message-logging and parallel recovery with respect to checkpoint/restart against the number of sockets.
This study was published in the 24th International Symposium on Computer Architecture and High Performance Computing. The paper was awarded the Julio Salek Aude Best Paper Award at that conference.
PPL at Supercomputing 2012
As in years past, the Parallel Programming Laboratory will be well-represented at Supercomputing 2012 in Salt Lake City, Utah. Charm++ will be in the spotlight of a half-day tutorial preceding the conference and a Birds-of-a-Feather (BoF) session. The conference's technical program will include a paper by PPLers and collaborators on tuning NAMD for Cray XK6 systems, such as the upcoming Blue Waters at Illinois' National Center for Supercomputing Applications and Titan at Oak Ridge National Laboratory. Osman Sarood, a fifth-year PhD student in PPL, will give a presentation on his work on reducing energy usage in HPC systems during the conference's Doctoral Showcase.

Professor Kale and Senior Research Programmer Eric Bohm will teach a tutorial titled "Parallel Programming with Migratable Objects for Performance and Productivity." This tutorial will focus on application development following the principles of migratable, message-driven objects that underlie the Charm++ parallel programming environment. Attendees will learn how to construct next-generation parallel applications in this setting to benefit from automatic load balancing, fault tolerance, and easy and efficient parallel composition. The tutorial will run from 8:30 to noon on Sunday, November 11.

PPL will host a Birds-of-a-Feather Session on the Charm++ ecosystem - its runtime, programming tools, and applications. Participants will hear about how these components are improving in performance, features, ease of use, portability, and interoperability. Several application developers will share their recent experiences and the latest advances. This is an ideal opportunity to learn more about how Charm++ is used in practice, see how it's growing, and influence its future direction. The session will take place from 12:15 to 1:15 on Thursday, November 15.

PhD candidate Yanhua Sun will present optimization techniques for the Charm++ runtime system and the molecular dynamics application NAMD on the Cray XK6 architecture, as published in "Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6". For the 100-million-atom STMV benchmark, the NAMD team improved upon the prior 26 ms/step Jaguar XT5 result to 13 ms/step using 298,992 cores of Jaguar XK6. The performance improvement was achieved through careful tuning of Charm++ on the Gemini interconnect, along with improvements in the parallelization of the Particle Mesh Ewald algorithm for long-range force contributions and GPU optimizations in NAMD. This session is scheduled for Wednesday, November 14th, from 2:00PM - 2:30PM in 355-EF.

PPL Director Professor Kale and long-standing PPL collaborator Professor Klaus Schulten will receive the Sidney Fernbach award "for outstanding contributions to the development of widely used parallel software for large biomolecular systems simulation."

Fifth-year PhD candidate Osman Sarood focuses on the challenge of meeting the energy and power requirements of huge exascale machines. Energy costs for data centers fall into two main categories: machine energy and cooling energy consumption. Sarood's thesis investigates reducing energy consumption for HPC data centers in both categories. His recent work shows that our temperature-aware load balancer can reduce cooling energy consumption by up to 63%, and that data centers can reduce machine energy consumption by up to 28% by running different parts of an application at different frequencies. The focus of this work is to gauge the potential for saving both machine and cooling energy separately (along with the associated execution time penalties), and then to devise a scheme that combines the two optimally to reduce the total energy consumption of large HPC data centers. Sarood will present his research in the Doctoral Showcase at 11:15AM - 11:30AM on Wednesday, November 14th.

Last year, a team from PPL was awarded first place in the HPC Challenge competition for the overall execution performance of its Charm++ benchmark implementations. This year, the team will compete again to demonstrate portability to the latest extreme-scale systems, further exploitation of the load balancing and fault tolerance features of Charm++, and improved performance. If the submission is selected as a finalist, it will be presented during the HPC Challenge BoF session from 12:15 to 1:15 on Tuesday, November 13th.

In addition to these events, many PPL members will be in attendance at the conference. We look forward to seeing you there!