Charm++/AMPI Talk in Blue Waters Webinar Series
05/25/2018
Prof. Sanjay Kale will be giving a talk as a part of NCSA's Blue Waters Webinar Series on Wednesday, May 30th at 10am CT. The topic of the webinar will be "What's New in Charm++ and Adaptive MPI".

Charm++ is a machine-independent parallel programming system. Programs written using this system run unchanged on MIMD machines with or without shared memory. It provides high-level mechanisms and strategies to facilitate the task of developing even highly complex parallel applications. Adaptive MPI is an MPI library implemented on top of Charm++, providing the high-level features of Charm++, such as dynamic load balancing and fault tolerance, to MPI applications.

For more information on the webinar, click here.
Multi-level Load Balancing with an Integrated Runtime Approach
02/21/2018
The recent trend of increasing numbers of cores per chip has resulted in vast amounts of on-node parallelism. These high core counts result in hardware variability that introduces imbalance. Applications are also becoming more complex, resulting in dynamic load imbalance. Load imbalance of any kind can result in loss of performance and system utilization. We address the challenge of handling both transient and persistent load imbalances while maintaining locality with low overhead. In this paper, we propose an integrated runtime system that combines the Charm++ distributed programming model with concurrent tasks to mitigate load imbalances within and across shared memory address spaces.
Figure 1. Integrated OpenMP into Charm++
It utilizes a periodic assignment of work to cores based on load measurement, in combination with user-created tasks, to handle load imbalance. We integrate OpenMP with Charm++ to enable the creation of potential tasks via OpenMP's parallel loop construct, as sketched below. This capability is also available to MPI applications through the Adaptive MPI implementation. We demonstrate the benefits of our work on three applications. We show improvements for Lassen of 29.6% on Cori and 46.5% on Theta. We also demonstrate the benefits on a Charm++ application, ChaNGa, by 25.7% on Theta, as well as on an MPI proxy application, Kripke, using Adaptive MPI. This work has been accepted to CCGrid '18 and will be presented in May.
Figure 2. Strong scaling result of Lassen on Haswell (Cori) and KNL (Theta)
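To give a flavor of the loop-level tasking this integration exposes, here is a generic OpenMP sketch. It illustrates only the programming interface; the integrated runtime described in the paper intercepts such loops and schedules the resulting chunks as tasks within Charm++, which this standalone snippet does not do. The function and problem sizes are illustrative.

```cpp
// Generic OpenMP sketch: chunks of a parallel loop become potential tasks.
// Under the integrated runtime, idle cores within a node can pick chunks up
// to smooth out transient imbalance without migrating the enclosing object.
#include <omp.h>
#include <cstdio>
#include <vector>

void compute_forces(std::vector<double>& f, const std::vector<double>& x) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < (long)x.size(); i++) {
        f[i] = 0.5 * x[i] * x[i];  // stand-in for real per-element work
    }
}

int main() {
    std::vector<double> x(1 << 20, 1.0), f(1 << 20);
    compute_forces(f, x);
    std::printf("f[0] = %f\n", f[0]);
    return 0;
}
```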
Prof. Kale selected as an ACM Fellow
12/11/2017
Laxmikant Kale was recognized as an ACM Fellow on December 11, 2017 for development of new parallel programming techniques and their deployment in high performance computing applications. His research on adaptive runtime systems over the past three-plus decades is embodied by the state-of-the-art Charm++ programming system and the successes of its applications.

Prof. Kale was selected, alongside 53 others this year, to join the elite group of computer scientists that represents less than 1% of ACM's overall membership. ACM will formally recognize its 2017 Fellows at the annual Awards Banquet, to be held in San Francisco on June 23, 2018.

An excerpt from the Department of Computer Science news article "Kale, Zhai Named ACM Fellows" by David Mercer follows:

Kalé, who is the Paul and Cynthia Saylor Professor of Computer Science, was honored “for development of new parallel programming techniques and their deployment in high performance computing applications.”

He led pioneering work to develop adaptive runtime systems in parallel computing and incorporated them in the Charm++ parallel programming framework. Kalé and his group are known for collaborative development of several highly scalable computational science and engineering applications, including NAMD (biophysics), ChaNGa (astronomy), and OpenAtom (quantum chemistry).

Kalé also is a Fellow of the Institute of Electrical and Electronics Engineers, a winner of the 2012 IEEE Computer Society Sidney Fernbach Award, and a co-winner of the 2002 Gordon Bell Prize.

Q. What is the new parallel programming technique for HPC that you created?

A. I and my group have pioneered the idea of adaptive runtime systems in parallel computing, or HPC. We not only developed the ideas, but always incorporated them in a practical parallel programming system called Charm++ that we distribute and maintain from Illinois. We co-developed several science and engineering applications using Charm++, which allowed us to validate and improve the adaptive runtime techniques we were developing in our research in the context of full applications. The application codes developed include NAMD (biophysics), OpenAtom (quantum chemistry and materials modeling), ChaNGa (astronomy), Episimdemics (simulation of epidemics), and others.

Q. What HPC systems has it been deployed on, and where do you see it being deployed in the near future?

A. These are highly scalable codes that run from small clusters to supercomputers, including Blue Waters, on hundreds of thousands of processor cores.

Q. What benefit does this bring to the larger HPC community?

A. Our approach allows parallel programmers to write code without worrying about where (on which processor) the code will execute, or which data will be on what processor.

The runtime system continuously watches the program behavior and moves data and code-execution among processors so as to automatically improve performance, via dynamic load balancing. This approach especially helps development of complex or sophisticated parallel algorithms.

Q. Becoming an ACM Fellow is obviously quite an honor. Tell me a bit about your career path and how you believe it led you to this point.

A. Apart from the novelty of automated runtime optimization, my research is characterized by continuous engagement with science applications. This is a costly path for a researcher, and I paid the price for it early in my career. But I believed such application-oriented yet computer science-centered research is the only way to try to make a lasting impact.

The credit for my success and for this award certainly goes to generations of my students who worked on various aspects of adaptive runtime systems.

ACM's announcement can be found here.

Congratulations Prof. Kale!
Power, Reliability, and Performance: One System to Rule Them All
07/20/2017
Figure 1
Our article, which presents a design for a power-efficient, resilient, and high-performance system, was featured in IEEE Computer magazine. In a design based on the Charm++ parallel programming framework, an adaptive runtime system dynamically interacts with a datacenter's resource manager to control power by intelligently scheduling jobs, reallocating resources, and reconfiguring hardware. It simultaneously manages reliability by cooling the system to the running application's optimal level, and maintains performance through load balancing. Figure 1 shows the interacting components in our design based on the Charm++ framework. The two major interacting components are the resource manager and an adaptive runtime system. Our design achieves the objectives of both datacenter users and system administrators by allowing dynamic interaction between the system resource manager or scheduler and the job runtime system to meet system-level constraints such as power caps and hardware configurations. For more details on this work, please read our article in the October 2016 issue of Computer.
FlipBack: Automatic Targeted Protection Against Silent Data Corruption
03/28/2017
With and without Flipback protection

The decreasing size of transistors has been critical to the increase in capacity of supercomputers. The smaller the transistors, the less energy is required to flip a bit, and thus silent data corruptions (SDCs) become more common. In this paper, we present FlipBack, an automatic software-based approach that protects applications from SDCs. FlipBack provides targeted protection for different types of data and calculations based on their characteristics. It leverages compile-time analysis and program markup to identify data critical for control flow, and enables selective replication of computation by the runtime system. We evaluate FlipBack with various HPC mini-applications that capture the behavior of real scientific applications and show that FlipBack is able to protect applications from most silent data corruptions with only a 6-20% performance degradation.
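The core idea of selective replication can be sketched in a few lines. This is a hypothetical illustration only: FlipBack's actual mechanism is driven by compile-time analysis and the runtime system, whereas the `protected_eval` helper below simply re-executes a control-critical computation and compares the results by hand.

```cpp
// Minimal sketch of selective replication for SDC detection (not FlipBack's
// actual implementation). A control-critical computation is run twice and the
// results compared; a mismatch means one execution was silently corrupted.
#include <cstdio>

template <typename F>
auto protected_eval(F&& f) -> decltype(f()) {
    auto a = f();
    auto b = f();          // duplicated execution of the protected region
    if (a == b) return a;  // agreement: accept the value
    // Disagreement: one copy was corrupted; a third run breaks the tie.
    auto c = f();
    return (a == c) ? a : b;
}

int main() {
    // Only values critical for control flow would be protected this way;
    // bulk data gets cheaper, different treatment.
    int n = protected_eval([] { return 40 + 2; });
    std::printf("%d\n", n);
    return 0;
}
```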
OpenAtom: Scalable Ab-Initio Molecular Dynamics
09/29/2016
OpenAtom is a parallel implementation of the Car-Parrinello Ab-initio Molecular Dynamics (CPAIMD) and Born-Oppenheimer Molecular Dynamics (BOMD) methods atop Charm++. It is used to simulate interesting problems in atomic and molecular systems based on fundamental quantum chemical principles. OpenAtom exploits the overdecomposition and asynchrony in Charm++ to automatically overlap computation and communication and scales to thousands of cores with increased performance.

We have extended OpenAtom to support several interesting ab-initio molecular dynamics simulation methods beyond the Car-Parrinello and Born-Oppenheimer methods, including k-points, parallel tempering, and path integrals. Recent technical advances in OpenAtom include Charm-FFT, a generic library for calculating FFTs in Charm++ with minimal communication; generalized topology-aware mapping schemes; and the ability to run "Uber" or multi-instance methods in a single run.

The following graph highlights recent performance results we obtained on Blue Gene/Q and Cray XE6/XK7 systems using a variety of the methods newly integrated into OpenAtom. These runs use topology-aware mapping and showcase Charm++'s unique suitability to this problem.
For more information, please see the OpenAtom research page (link), the OpenAtom project page (link), and our recent paper at ISC (link).
Mitigating Process Variation Through Dynamic Load Balancing
06/14/2016

The design and manufacture of present-day CPUs causes inherent variation among chips in supercomputer architectures, such as variation in power draw and temperature. The variation also manifests itself as frequency differences among processors under Turbo Boost dynamic overclocking. This variation can lead to unpredictable and suboptimal performance in tightly coupled HPC applications. In this study, we use compute-intensive kernels and applications to analyze the variation among processors in four top supercomputers: Edison, Cab, Stampede, and Blue Waters.

As shown in the figures below, we observe an execution time difference of up to 16% among processors on the Turbo Boost-enabled supercomputers: Edison, Cab, and Stampede. There is less than 1% variation on Blue Waters, which does not have a dynamic overclocking feature. We analyze measurements from temperature and power instrumentation and find that intrinsic differences in the chips' power efficiency are the culprit behind the frequency variation.

Moreover, we show that a speed-aware dynamic load balancing algorithm can reduce the negative effects of performance variation, improving performance by up to 18% compared to no load balancing and by 6% over the non-speed-aware counterpart.

For more details, please read our ICS'16 paper: http://charm.cs.illinois.edu/newPapers/16-08/paper.pdf
and our VarSys (IPDPS'16) paper: http://charm.cs.illinois.edu/newPapers/16-03/paper.pdf
Advances in AMPI
04/25/2016
We released Charm++ 6.7.1, the latest stable version of Charm++, at the conclusion of our 14th annual Charm++ Workshop last week. While this release is primarily a bug-fix release, Adaptive MPI has seen the addition of many new features. Adaptive MPI (AMPI) is an implementation of the MPI standard on top of Charm++. AMPI gives users with pre-existing MPI codes easy access to the high-level features of Charm++ and its runtime system. AMPI provides application-independent features like process virtualization, dynamic load balancing, overlap of communication and computation, and automatic fault tolerance. It does so by implementing MPI ranks as lightweight, migratable user-level threads that are scheduled and can be load balanced by the Charm++ runtime system.

Application developers need only re-compile their code with AMPI's compiler wrappers (ampicc, ampif90, etc.) to begin experimenting with its functionality, as sketched below. The only change needed to pre-existing MPI codes is to privatize mutable global/static variables before running with virtualization. The 6.7.1 release contains many updates to AMPI. We have bumped AMPI's official support to MPI-2.2, and we have implemented the non-blocking collectives and neighborhood collectives from MPI-3.1. We have also renamed AMPI's extensions to the MPI standard to be prefixed with AMPI_ rather than MPI_, so as to avoid polluting MPI's namespace. To learn more about AMPI, see the Workshop talk on AMPI (link) or consult AMPI's manual (link).
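As a minimal sketch of that workflow (the file name and process/rank counts are illustrative; the ampicxx wrapper and the +vp option follow AMPI's documented usage):

```cpp
// hello.cpp -- an ordinary MPI program; no AMPI-specific code is needed.
// Any mutable global/static state would need to be privatized before
// running with more virtual ranks than OS processes.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

// Build and run (illustrative):
//   ampicxx -o hello hello.cpp
//   ./charmrun +p4 ./hello +vp16   # 16 virtual MPI ranks on 4 processes
```

With more virtual ranks than processes, the runtime can migrate ranks between processes for load balancing, and communication by one rank can overlap with computation by another on the same core.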
Empowering PDES with an Adaptive Runtime System
03/01/2016

Parallel Discrete Event Simulation (PDES) is a type of simulation that differs significantly from many common HPC applications. In PDES, the simulation progresses forward by processing events that happen at discrete points in virtual time. The distribution of these events in time, as well as how the events affect the state of the simulation, may result in a very irregular pattern of communication that is not known a priori. Due to the dynamic and irregular nature of PDES, we believe that Charm++ makes a good target for implementing a scalable PDES engine. Over the past two years we have collaborated with the ROSS (http://www.cs.rpi.edu/~chrisc/ross.html) team at RPI to reimplement ROSS on top of Charm++. ROSS is a PDES engine that has been developed and highly optimized on top of MPI. ROSS has shown great scalability when running models with a low amount of inter-process communication; however, it has shown some limitations for more communication-intensive models. Because the Charm++ runtime system is able to effectively handle asynchronous communication, overlapped automatically with computation, the Charm++ version of ROSS is able to overcome some of these limitations. As shown in the plot below, as we increase the amount of inter-process communication, the Charm++ version of ROSS achieves up to 40% higher event rates than the MPI version on up to 2 million cores. More details are in the PADS’16 paper (http://charm.cs.illinois.edu/papers/16-01).
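For readers unfamiliar with PDES, the core abstraction can be sketched as a queue of events ordered by virtual time. The sequential sketch below is only the skeleton; optimistic engines such as ROSS execute events speculatively across processors and roll back (via reversible computing) when an event arrives in an already-processed virtual past.

```cpp
// Sequential sketch of the core PDES abstraction: process events in
// virtual-time order. Event fields and values are illustrative.
#include <cstdio>
#include <queue>
#include <vector>

struct Event {
    double vtime;  // virtual time at which the event occurs
    int    lp;     // destination logical process
    bool operator>(const Event& o) const { return vtime > o.vtime; }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pq;
    pq.push({1.0, 0});
    pq.push({0.5, 1});
    while (!pq.empty()) {
        Event e = pq.top(); pq.pop();
        std::printf("t=%.2f: event at LP %d\n", e.vtime, e.lp);
        // Handling an event may schedule new events at later virtual times,
        // e.g. pq.push({e.vtime + delay, target_lp});
    }
    return 0;
}
```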
Towards Realizing the Potential of Malleable Jobs
02/01/2016
Malleable jobs are those that can dynamically shrink or expand the number of processors on which they execute at runtime, in response to an external command. They can significantly improve system utilization and reduce average response time compared to traditional jobs. Three components are critical in a malleable system: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. The figures below show the interaction between these components and how an adaptive runtime system shrinks the number of processors.

We add shrink and expand capabilities to the Charm++ adaptive parallel runtime system and show the efficacy, scalability, and benefits of our approach, which is based on task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. The figures below show the performance of rescaling with different problem sizes and different numbers of processors.


In later collaborative work, we implement an adaptive resource manager by making Torque/Maui malleable with a malleable job-scheduling algorithm, and show its integration with Charm++. For more details on the Charm++ implementation of malleability, see our HiPC'14 paper: http://charm.cs.uiuc.edu/papers/14-29. For more details on malleability with Torque/Maui, see our IPDPS'15 paper: http://charm.cs.uiuc.edu/papers/15-12
Recovering Logical Structure from Charm++ Traces
10/30/2015
Asynchrony and non-determinism in Charm++ programs present a significant challenge in analyzing their event traces. In this paper, published at SC’15, we present a new framework to organize event traces of parallel programs written in Charm++. Our reorganization allows one to more easily explore and analyze such traces by providing context through logical structure. We describe several heuristics to compensate for missing dependencies between events that currently cannot be easily recorded. We introduce a new task ordering that recovers logical structure from the non-deterministic execution order. Using the logical structure, we have developed several metrics to help guide developers to performance problems. The approach presented in this paper is demonstrated through two proxy applications written in Charm++: Jacobi2d (a stencil code) and LULESH (a hydrodynamics simulation code).

We use the term logical structure to denote an exact ordering of events reflecting patterns of parallel dependency behavior as designed by the application’s developers. Generally, programming such applications requires breaking the problem into smaller pieces, or phases, some of which require parallel interaction. The logical structure separates these phases which may be interleaved in physical time. Similarly, within each phase, the events are ordered by their dependencies into discrete logical steps rather than their true physical time order and displacement. This matches how the events were programmed: with logical ordering rather than known timing information guiding the design.
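A small sketch of the layering this implies: ignoring physical timestamps, each event's logical step is one more than the latest step among the events it depends on (a longest-path layering of the dependency DAG). The dependency data here is illustrative, and this omits the paper's heuristics for missing dependencies.

```cpp
// Hedged sketch: assign each event a logical step from its dependencies
// rather than its physical timestamp.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // deps[i] lists the events that event i depends on; events are indexed
    // in an order consistent with the dependencies (every dep index < i).
    std::vector<std::vector<int>> deps = {{}, {0}, {0}, {1, 2}};
    std::vector<int> step(deps.size(), 0);
    for (size_t i = 0; i < deps.size(); i++)
        for (int d : deps[i])
            step[i] = std::max(step[i], step[d] + 1);  // longest-path layer
    for (size_t i = 0; i < step.size(); i++)
        std::printf("event %zu -> logical step %d\n", i, step[i]);
    return 0;
}
```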

In the above figure, we show the logical and physical timeline of the Jacobi2d application. In the logical view, the reasons for the idle time experienced in the application become more evident: namely a reduction that is mapped to processors directly.

For more details see the SC'15 paper.
TraceR: Parallel Trace Replay Tool for HPC Network Simulations
10/01/2015

TraceR is a trace replay tool built upon the ROSS-based CODES simulation framework. It replays traces of production HPC applications under an optimistic parallel discrete-event paradigm that uses reversible computing. It can be used for predicting network performance and understanding network behavior by simulating messaging on interconnection networks. It addresses two major shortcomings of current network simulators. First, it enables fast and scalable simulations of large-scale supercomputer networks. Second, it can simulate production HPC applications using BigSim's emulation framework.

The figure below presents the execution time for simulating 3D Stencil on various node counts of a 3D torus. TraceR scales well for all simulated system sizes. For the simulations with half a million (512K) nodes, the execution time is 542 seconds using the 3D TorusNet model.

We also perform a validation study with the LeanMD application on a 5D torus, comparing real execution time with simulated execution time in the figure below. We observe that for all node counts (512 to 8192 nodes), the error in the prediction is less than 9%. For most cases, the predicted time is within 2% of the observed execution time.

For more details: http://charm.cs.illinois.edu/newPapers/15-14/paper.pdf
Improving fine-grained communication performance with topological message aggregation
09/03/2015

It is often convenient to use fine-grained messages to implement the interactions between objects in a Charm++ program, because the typically small work and data units that result from object-based decomposition tend to produce small message payloads. While the utility of fine-grained communication is clear, its performance characteristics are poor due to overhead. Much of the per-message overhead at the application, RTS, and network levels is independent of message size. In applications with a large fine-grained message volume, this overhead can dominate run time. The Charm++ Topological Routing and Aggregation Module (TRAM) is a library that improves fine-grained communication performance by coalescing fine-grained communication units, or data items, into larger messages. TRAM constructs a virtual topology comprising the processes in the parallel run, and defines peers within this topology as any processes that can be reached by traveling an arbitrary number of hops along a single dimension. Separately aggregating data items for each destination requires substantial memory and limits the potential for aggregating concurrent data items with different destinations but common sub-paths. To address this issue, TRAM aggregates data items at the level of peers. Items whose destinations are not in the set of peers at the source process are routed through one or more intermediate destinations along a minimal route. As a result, the memory footprint of the buffer space normally fits in lower-level cache.
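A toy sketch of the dimension-by-dimension routing idea follows. The 2D grid, the coordinates, and the next_hop helper are illustrative rather than TRAM's API; the point is that every hop stays within the sender's peers, and items with different destinations but a common sub-path share intermediate buffers.

```cpp
// Hedged sketch of dimension-ordered routing in a virtual 2D process grid,
// the kind of scheme peer-level aggregation relies on.
#include <cstdio>

struct Coord { int x, y; };

// Next hop from `cur` toward `dst`: fix dimension x first, then y. Each hop
// targets a peer (a process reachable along a single dimension).
Coord next_hop(Coord cur, Coord dst) {
    if (cur.x != dst.x) return {dst.x, cur.y};  // travel along dimension x
    return {cur.x, dst.y};                      // then along dimension y
}

int main() {
    Coord cur{0, 0}, dst{3, 2};
    while (cur.x != dst.x || cur.y != dst.y) {
        Coord nxt = next_hop(cur, dst);
        std::printf("(%d,%d) -> (%d,%d)\n", cur.x, cur.y, nxt.x, nxt.y);
        cur = nxt;
    }
    return 0;
}
```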

TRAM can dramatically improve performance for applications dominated by fine-grained communication. An example is the simulator of contagion, EpiSimdemics, where TRAM provides up to a 4x speedup compared to direct use of fine-grained messages. In strong scaling scenarios for EpiSimdemics, TRAM also allows for better scaling than an application-specific point-to-point aggregation approach. As the number of nodes in a run is increased, the individual sends are distributed throughout the network, so that direct aggregation at the source for each destination fails to aggregate items into large enough buffers to sufficiently mitigate communication overhead. By comparison, a topological aggregation approach is able to maintain the advantage of aggregation to much higher node counts.


Link to thesis.
ICPP paper.
SC paper.
Migratable Objects Make Supercomputing More Reliable
08/18/2015

It has long been recognized by the HPC community that reliability is one of the major hurdles in reaching extreme scale. The same gigantic number of components that will make supercomputers powerful will also make them equally vulnerable to frequent failures. Estimates differ as to how often extreme-scale machines will crash, but there is a consensus that applications will require some form of fault tolerance.
Migratable objects facilitate the task of developing new resilient algorithms, or refining existing ones. The very nature of the programming model, on which Charm++ is based, allows more flexibility in the execution of an application. That same flexibility is crucial in adapting to failures.
In a recent journal publication, we collected our work on fault tolerance strategies using migratable objects. The figure shows the efficiency of those strategies when projected at extreme scale. The traditional checkpoint/restart based on a network file system will not scale too far. However, schemes enriched with migratable objects may keep the efficiency of extreme-scale systems above 60%. Among those schemes we find checkpoint/restart based on local storage, proactive evacuation, and parallel recovery with message logging. A comprehensive approach, combining all the schemes, may increase the efficiency of a system to almost 80% at extreme scale.

Link to IEEE Transactions on Parallel and Distributed Systems publication
Link to research area page on Fault Tolerance
Charm++ and MPI: Combining the Best of Both Worlds
07/21/2015

This work explores a framework that enables hybrid parallel programming with MPI and Charm++, and allows programmers to develop different modules of a parallel application in these two languages while facilitating smooth interoperation. MPI and Charm++ embody two distinct perspectives for writing parallel programs. While MPI provides a processor-centric, user-driven model for developing parallel codes, Charm++ supports work-centric, overdecomposition-based, system-driven parallel programming. One or the other can be the best or most natural fit for distinct modules that constitute a parallel application. In the published work linked below, we describe the challenges in enabling interoperation between MPI and Charm++, and present techniques for managing the control flow and resource sharing in such scenarios. As the attached figure indicates, interoperation between MPI and Charm++ has been enabled by exposing the Charm++ scheduler to the programmer, and enabling the reuse of Charm++ and MPI applications as external libraries. We also demonstrate the benefits of interoperation between MPI and Charm++ through several case studies that use production applications and libraries, including CHARM/Chombo, EpiSimdemics, NAMD, FFTW, MPI-IO and ParMETIS.

Link to the paper published in IPDPS, 2015.
Link to the presentation at IPDPS, 2015.
Adaptive Techniques for Clustered N-Body Cosmological Simulations
06/17/2015
Fig 1

Simulating the process of cosmological structure formation with enough resolution to determine galaxy morphologies requires an enormous dynamic range in space and time. ChaNGa is a parallel n-body+SPH cosmology code implemented using Charm++ for the simulation of astrophysical systems on a wide range of spatial and time scales. Scaling such codes to large processor counts requires overcoming not only spatial resolution challenges, but also large ranges in timescales. In this work, we use an approach called multi-stepping, which involves using different time steps for different particles in relation to their dynamical time scales (sketched below), leading to an algorithm that is challenging to parallelize effectively. We present the optimizations implemented in ChaNGa that allow it to scale to large numbers of processors, and address the challenges brought on by the high dynamic ranges of clustered datasets. We demonstrate strong scaling results for uniform datasets on up to 512K cores of Blue Waters, evolving 12 and 24 billion particles (Fig 1). We also show strong scaling results for the cosmo25 and dwarf datasets, which are more challenging due to their highly clustered nature. In Fig 2, we can see that we obtain good performance on up to 128K cores of Blue Waters, with up to a 3-fold improvement in time from multi-stepping over single-stepping.
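The multi-stepping idea can be sketched with the rung-based block time stepping commonly used in such codes: each particle is placed on a "rung" so that its time step (base_dt / 2^rung) does not exceed its own dynamical time scale. The helper and constants below are illustrative, not ChaNGa's code.

```cpp
// Hedged sketch of rung-based multi-stepping: rung 0 particles step once
// while rung 2 particles step four times in the same interval.
#include <cmath>
#include <cstdio>

int rung_for(double dyn_time, double base_dt) {
    int r = 0;
    while (base_dt / std::pow(2.0, r) > dyn_time) r++;  // halve until safe
    return r;
}

int main() {
    double base_dt = 1.0;
    for (double t : {1.0, 0.3, 0.04}) {
        int r = rung_for(t, base_dt);
        std::printf("dyn time %.2f -> rung %d (dt = %.4f)\n",
                    t, r, base_dt / std::pow(2.0, r));
    }
    return 0;
}
```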

Link to the paper: PDF
Fig 2
Scalable Asynchronous Contact Mechanics for Cloth Simulation
03/25/2015


The Asynchronous Contact Mechanics (ACM) algorithm is a reliable method for simulating flexible material subject to complex collisions and contact geometries. For example, ACM can be applied to cloth simulation for animation. The parallelization of ACM is challenging due to its highly irregular communication pattern, its need for dynamic load balancing, and its extremely fine-grained computations. In our recent work, we utilize CHARM++ to address these challenges and show good strong scaling of ACM to 384 cores for problems with fewer than 100k vertices. By comparison, the previously published shared-memory implementation only scales well to about 30 cores for the same examples. We demonstrate the scalability of our implementation through a number of examples which, to the best of our knowledge, are only feasible with the ACM algorithm. In particular, for a simulation of 3 seconds of a cylindrical rod twisting within a cloth sheet, the simulation time is reduced by 12×, from 9 hours on 30 cores to 46 minutes using our implementation on 384 cores of a Cray XC30.
The figure above shows the performance comparison between the CHARM++ and TBB implementations of ACM.
For more details, please refer to the IPDPS'2015 publication here.
PARM: Power Aware Resource Management Under Power Budget
01/27/2015

Recent advances in processor and memory hardware designs have made it possible for the user to control the power consumption of the CPU and memory through software; e.g., the power consumption of Intel's Sandy Bridge family of processors can be user-controlled through the Running Average Power Limit (RAPL) library. It has been shown that an increase in the power allowed to the processor (and/or memory) does not yield a proportional increase in the application's performance. As a result, for a given power budget, it can be better to run an application on a larger number of nodes with each node capped at lower power than on fewer nodes each running at its TDP. This is also called overprovisioning. The optimal resource configuration for an application can be determined by profiling the application's performance for varying numbers of nodes, CPU power, and memory power, and then selecting the best-performing configuration for the given power budget. In this paper, we propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. Our online power-aware resource manager (PARM) uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. With a power budget of 4.75 MW, PARM can obtain up to a 5.2X improvement in job throughput when compared with the power-unaware SLURM scheduling policy. In real experiments on a relatively small-scale cluster, PARM obtained a 1.7X improvement. An adaptive runtime system allows further improvement by allowing already-running jobs to shrink and expand for optimal resource allocation. Link to the paper: PDF
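For a concrete feel of software power capping, on Linux this class of control is also exposed through the powercap sysfs tree. The sketch below is heavily hedged: the sysfs path and the 25 W value are illustrative (the exact path varies by machine, writing requires root), and PARM itself works through RAPL rather than through this snippet.

```cpp
// Hedged sketch: setting a package power cap via Linux's powercap sysfs
// interface to RAPL. Path and wattage are illustrative assumptions.
#include <cstdio>
#include <fstream>

int main() {
    const char* limit =
        "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw";
    std::ofstream f(limit);
    if (!f) {
        std::fprintf(stderr, "cannot open %s (need root?)\n", limit);
        return 1;
    }
    f << 25000000;  // cap package 0 at 25 W (the file takes microwatts)
    std::printf("package power cap set to 25 W\n");
    return 0;
}
```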

Reconfiguring Cache Hierarchy with Adaptive HPC RTS
12/16/2014
The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.

Figures 1(a) and 1(b) show the best cache configurations and the energy savings using our method for different problem sizes of the HPCCG application. As illustrated, our approach can save significant energy when the problem size is either smaller or larger than the cache sizes.
Maximizing Network Throughput on the Dragonfly Interconnect
10/07/2014


This work, accepted to SC'14, analyzes the behavior of the dragonfly network for various routing strategies, job placement policies, and application communication patterns. The study is based on a novel model that predicts traffic on individual links for direct, indirect, and adaptive routing strategies. In the paper, we analyze results for single jobs and some common parallel job workloads. The predictions presented in this paper are for a 100+ Petaflop/s prototype machine with 92,160 high-radix routers and 8.8 million cores.

The general observations are that randomized placement at the granularity of nodes and routers and/or indirect routing can help spread the messaging traffic over the network and reduce hot-spots. If the communication pattern results in a non-uniform distribution of traffic, adaptive routing may provide significantly better traffic distributions by reducing hot-spots. For parallel job workloads, adaptive hybrid routing is useful for combining the good features of adaptive direct and adaptive indirect routing, and may provide a good traffic distribution with lower maximum traffic. Adaptive routing also improves the traffic distribution significantly in comparison to static routing. We also observed that allowing users to choose a routing for their application can be beneficial in most cases on dragonfly networks; randomized placement at the granularity of nodes and routers is the suggested choice in such scenarios as well.

Paper link on PPL's website
Paper schedule at SC'14
Scalable Fault Tolerance with Partial-Order Dependencies
08/26/2014
Winner of the Best Student Paper at CLUSTER'14!
Fault tolerance has become increasingly important as system sizes become larger and the probability of hard or soft errors increases. In this work, we focus on methodologies for tolerating hard errors: those that stem from hardware failure and cause a detectable fault in the system. Many approaches have been applied in this area, ranging from checkpoint/restart, where the application's state is saved to disk, to in-memory checkpointing, where the application incrementally saves its state to partner nodes. Both approaches, while effective, require the entire system to synchronize and roll back to the checkpoint to ensure consistent recovery. Message logging is an effective methodology for saving the messages that were sent in a localized manner, along with their order, which is encoded in a determinant. Message logging is also effective for deterministic replay, which is commonly used for discovering bugs.
For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application to maintain generality. However, in many applications, only a partial order must be recorded due to determinism intrinsic in the program, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In a paper published at CLUSTER'14 (to be presented in Madrid, Spain in September), we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different orderings and interleavings. By exploiting this framework, we improve on an existing scalable message-logging fault tolerance scheme that uses a total order. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2x lower overhead.
Parallel Programming with Migratable Objects: Charm++ in Practice
08/01/2014
migratable objects

Science and engineering applications are becoming increasingly complex. Further, we are faced with new challenges due to dynamic variations in the machine itself. In the face of these challenges, what are the key concepts that are essential for parallel applications to take advantage of today’s modern supercomputers? In the paper accepted at SC14, we describe key designs and attributes that have been instrumental in enabling efficient parallelization. The paper describes the state of practice of CHARM++ using multiple mini-applications as well as some full-fledged applications, and showcases various features enabled by the execution model.

It is becoming more and more apparent that adaptive runtime systems (RTS) will play an essential role in addressing various challenges in scaling parallel applications. However, to support the RTS, the programming model must possess certain key attributes, including overdecomposition, asynchronous message-driven execution, and migratability. Overdecomposition, along with support for migratability, can be exploited by the runtime system to enable critical features and perform adaptive, intelligent strategies. Overdecomposition allows the runtime system to map and remap objects to processors at runtime, which can be exploited to enable features such as adaptive load balancing, automatic checkpoint and restart, thermal and power management, adaptive job scheduling, and communication optimizations. With migratability, the runtime system can perform dynamic load balancing by migrating objects from overloaded processors to underloaded ones, adapting to the application's characteristics. It can also be used to automatically checkpoint the objects and restart them on a different number of processors. The shrink-and-expand feature also uses this capability to improve cluster utilization by adaptively changing the number of processors assigned to a job and migrating the corresponding objects. We can use DVFS for temperature control and dynamic load balancing to decrease the cooling energy. The runtime system can also monitor the communication pattern and, with knowledge of the underlying topology, perform various communication optimizations. We show the benefits of these attributes and features on various mini-apps and full-scale applications run on modern-day supercomputers.
An Adaptive Parallel Sparse Triangular Solver
07/02/2014
a triangular matrix

Solving sparse triangular linear systems is an important kernel for many numerical linear algebra problems in science and engineering simulations, e.g., linear systems and least-squares problems. It is used extensively in direct methods following a triangular factorization to solve the system (possibly with many right-hand sides), or to improve an approximate solution iteratively. It is also a fundamental kernel in many iterative methods (such as the Gauss-Seidel method) and in many preconditioners for other iterative methods (such as the incomplete Cholesky preconditioner for conjugate gradient). Unfortunately, the performance of parallel algorithms for triangular solution is notoriously poor, resulting in performance bottlenecks for many of these methods.

We present a novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods. By analyzing and reordering operations, our algorithm can often extract parallelism even for cases where most of the nonzero matrix entries are near the diagonal. Our main parallelism strategies are: (1) identify independent rows, (2) send data earlier to achieve greater overlap, and (3) process dense off-diagonal regions in parallel. We implemented our algorithm in Charm++, and present promising experimental results using numerous sparse matrices from real applications.
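For context, the sequential baseline being parallelized looks roughly as follows (a standard textbook formulation in CSR format, not the paper's implementation). The loop-carried dependence on earlier entries of x is what makes parallelization hard; rows whose off-diagonal columns are already computed are the "independent rows" the first heuristic looks for.

```cpp
// Forward substitution for a sparse lower-triangular system Lx = b in CSR.
#include <cstdio>
#include <vector>

void lower_solve(const std::vector<int>& rowptr, const std::vector<int>& col,
                 const std::vector<double>& val, const std::vector<double>& b,
                 std::vector<double>& x) {
    int n = (int)b.size();
    for (int i = 0; i < n; i++) {
        double s = b[i];
        double diag = 1.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            if (col[k] == i) diag = val[k];  // diagonal entry of row i
            else s -= val[k] * x[col[k]];    // depends on earlier rows
        }
        x[i] = s / diag;
    }
}

int main() {
    // 2x2 example: L = [[2,0],[1,3]], b = [2,5]  ->  x = [1, 4/3]
    std::vector<int> rowptr{0, 1, 3}, col{0, 0, 1};
    std::vector<double> val{2, 1, 3}, b{2, 5}, x(2);
    lower_solve(rowptr, col, val, b, x);
    std::printf("x = [%g, %g]\n", x[0], x[1]);
    return 0;
}
```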

Using various matrices, the graph below compares the performance of our method with that of HYPRE (blue lines). HYPRE is a commonly used linear algebra package developed at Lawrence Livermore National Laboratory (LLNL). As shown, our method can exploit parallelism on many matrices, whereas HYPRE’s performance is nearly sequential in all cases. The performance of HYPRE is worse than sequential in many cases because of parallel overhead, although there is some improvement for large numbers of processors, probably due to cache effects. Overall, our method is a significant improvement over this existing code and will reduce the solution time for many problems.



The paper on this work was accepted by Parallel Computing (ParCo) 2014. PhD student Ehsan Totoni received 3rd place in the ACM Student Research Competition (SRC) Grand Finals 2014 based on this research.
Steering Parallel Applications via Introspection
06/03/2014
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved with the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores for various applications remains challenging. Therefore, performance tuning has become even more important and challenging than ever before. Instead of full automation or completely manual tuning, we take the approach of cooperative automation. In this approach, the application developers provide knobs (control points) that reconfigure applications in specific, performance-relevant ways. We then allow the runtime system to adjust the configuration automatically based on runtime observations and its understanding of the effects of each control point.

We have designed and developed PICS, a Performance-analysis-based Introspective Control System, which is used to tune parallel programs. PICS provides a generic interface for describing control points and their effects; this is how application-specific knowledge is exposed to the runtime system. The application's behaviors are observed, measured, and automatically analyzed by PICS. Based on the performance data and a decision tree encoding expert knowledge, program characteristics are extracted to assist the search for optimal configurations of the control points.

We have demonstrated how PICS can be applied to both the runtime system and applications to optimize application performance. For example, our results show its effectiveness for ChaNGa, a full-fledged cosmology application, on 16,384 cores. Mirroring is a technique to reduce communication bottlenecks by replicating data on multiple processors and forwarding the communication requests. Figure 1 compares the time cost of calculating gravity without mirroring and with mirroring using different numbers of replicas. The top red curve is the time cost without mirroring, while the bottom green curve shows the cost of using replicas adaptively. The runtime converges on using 2 replicas.

The paper on this work was accepted at ROSS '14 in Munich, Germany.
An Epidemic Algorithm for Dynamic Load Balancing
03/06/2014
Fig 1
Load imbalance is an insidious factor that can significantly reduce the performance of a parallel application. For many applications, computation load varies over time; such applications require dynamic load balancing to improve performance. Many applications employ centralized load balancing strategies, where load information is collected onto a single processor and the decision algorithm is run sequentially. Such strategies have been shown to be effective for a few hundred to a few thousand processors, but they present a clear performance bottleneck beyond that scale. An alternative to centralized strategies are distributed strategies that use local information, e.g., diffusion-based schemes. In a distributed strategy, each processor makes autonomous decisions based on its local view of the system, which typically consists of the load of its neighboring processors. Such strategies are scalable, but tend to yield poor load balance due to the limited local information.

We propose a fully distributed strategy, GrapevineLB, designed to overcome this drawback by obtaining a partial representation of the global state of the system and basing load balancing decisions on it. We use a lightweight information propagation algorithm based on an epidemic algorithm (also known as a gossip protocol) to propagate information about the underloaded processors in the system to the overloaded ones, spreading it the same way gossip spreads through the grapevine in a society (sketched below). Based on this information, GrapevineLB probabilistically transfers work units to obtain a good load distribution. The proposed algorithm is scalable and can be tuned to optimize for either cost or performance. We demonstrate its effectiveness in comparison to several other load balancing strategies for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.
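A toy simulation of the propagation pattern follows. The processor count, fanout, and seed are illustrative; the point is that knowledge of an underloaded processor reaches the whole machine in O(log P) rounds with no central coordination.

```cpp
// Hedged sketch of epidemic (gossip) propagation: each round, every
// processor that knows about an underloaded processor forwards that
// knowledge to a few random peers.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int P = 1024, fanout = 2;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, P - 1);
    std::vector<bool> knows(P, false);
    knows[0] = true;  // processor 0 is underloaded and advertises itself
    int round = 0, informed = 1;
    while (informed < P) {
        std::vector<int> senders;
        for (int p = 0; p < P; p++) if (knows[p]) senders.push_back(p);
        for (int p : senders)
            for (int i = 0; i < fanout; i++) {
                int t = pick(rng);  // random peer to gossip to
                if (!knows[t]) { knows[t] = true; informed++; }
            }
        std::printf("round %d: %d of %d informed\n", ++round, informed, P);
    }
    return 0;
}
```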

Figure 1 shows the total time for the AMR application, which includes the time for load balancing as well as the application time per step. Since AMR requires frequent load balancing, the overhead of the centralized and hierarchical strategies, AmrLB and HierarchicalLB, exceeds the benefit of load balancing. DiffusionLB gives a marginal benefit. GrapevineLB provides a large performance gain by achieving better load balance while incurring less overhead.

The paper on this work was accepted at SC13 and was one of the finalists for the best student paper award.
Large-scale Parallelization of Stochastic Integer Programs
01/10/2014


In many real-world optimization problems, such as resource allocation, the parameters that correspond to future outcomes are subject to enormous uncertainty. In stochastic optimization, we assume a probabilistic distribution of these unknown parameters and optimize the resource allocation across a collection of possibilities, known as scenarios. Applications of stochastic optimization span a diverse set of areas including aircraft scheduling, production planning, financial planning, energy grid optimization, supply chain management, and many more.

The novelty of this work is in combining high-performance computing and stochastic optimization to solve large stochastic integer optimization problems that have not been tackled before. We use a branch-and-bound-based approach, which is the accepted norm for solving integer programs. Stochastic integer optimization programs have unique characteristics that differ from the challenges faced in typical iterative scientific applications, and these characteristics make it very difficult to realize a scalable design. Some such characteristics are coarse and unpredictable grain sizes, large memory footprints, a variable amount of available computation, and perturbable branch-and-bound trees. Additionally, in this application, better utilization of processors does not mean faster time to the optimal solution.

We propose a nested parallelism approach that exploits the inherent scenario parallelism in stochastic optimization along with the parallel exploration of branch-and-bound tree vertices. State-sharing across tree vertices allows interleaved vertex evaluations on the same linear program solver instance, which reduces the memory footprint and also the total time to solution. We show strong scaling on up to hundreds of cores. The application achieves a speedup of up to 32x while scaling from 3 to 480 cores, and has an efficiency greater than 0.4 at 120 cores in 100% of the trials.



The paper on this work won the best paper award at the HiPC 2013 conference. The work is supported by the MITRE Corporation and the US Air Mobility Command.
Predicting Application Performance using Supervised Learning on Communication Features
10/10/2013




Task mapping on torus networks has traditionally focused on reducing either the maximum dilation or the average number of hops per byte for messages in an application. These metrics make simplified assumptions about the cause of network congestion and do not correlate accurately with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. Recently, we have attempted to model the performance of an application using communication data, such as the communication graph and network hardware counters. We have used supervised learning algorithms to correlate performance with prior and new metrics. We have also proposed new hybrid metrics that provide high correlation with application performance, and may be useful for accurate performance prediction.

The focus of an associated publication at SC'13 is on the use of supervised learning algorithms, such as forests of randomized decision trees, to correlate individual metrics and their combinations with application performance. In addition to using prior metrics for predicting performance, this work presents several new metrics. Some of these can be obtained by analyzing the communication graph, the network graph, and the mappings, and others can be obtained through an experimental or simulated application run. For the results in the associated paper, the latter are obtained using real experiments; as part of future work, we plan to develop simulation techniques to obtain them offline. Maximum dilation is an example of an analytically derived metric (see the sketch below); in contrast, maximum bytes on a link is an experimentally obtained metric. In addition to these new metrics, we also use derived metrics that use information only from some outliers (nodes or links). In the paper, we present performance predictions using the randomized forests ensemble method for three different communication kernels: a two-dimensional five-point halo exchange, a three-dimensional 15-point halo exchange, and an all-to-all benchmark over sub-communicators. We show almost perfect correlation for runs on 16,384 and 65,536 cores of Blue Gene/Q. We also show predictions for a production application, pF3D, and for combining samples from different benchmarks into a single training set and testing set.
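To make the analytic metrics concrete, here is a small illustration of hop counts on a 3D torus, from which maximum dilation is the maximum over all communicating task pairs. The dimensions and coordinates are illustrative.

```cpp
// Hop count between two nodes on a 3D torus: in each dimension, take the
// shorter way around the ring (the wraparound link).
#include <algorithm>
#include <array>
#include <cstdio>
#include <cstdlib>

using Coord3 = std::array<int, 3>;

int torus_hops(const Coord3& a, const Coord3& b, const Coord3& dims) {
    int hops = 0;
    for (int d = 0; d < 3; d++) {
        int diff = std::abs(a[d] - b[d]);
        hops += std::min(diff, dims[d] - diff);
    }
    return hops;
}

int main() {
    Coord3 dims{8, 8, 8};
    // dim 0: min(7, 1) = 1; dim 1: min(4, 4) = 4; dim 2: 1  ->  6 hops
    std::printf("hops = %d\n", torus_hops({0, 0, 0}, {7, 4, 1}, dims));
    return 0;
}
```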



Link to SC'13 paper associated with this work
A ‘Cool’ Way of Improving the Reliability of HPC Machines
07/10/2013
(This work will be presented at SC13)

In order to move to the next generation of supercomputers, the HPC community will have to cope with the two important challenges of increasing energy consumption and decreasing reliability. Studies predict that at exascale, overall reliability could be so precarious that failures become the norm. HPC researchers are working on improving present fault tolerance protocols to address this issue, while hardware researchers are working on improving the reliability of each component the machine is made of. In this paper, we use both hardware and software aspects to improve the reliability of HPC machines. It is well known that for every 10°C increase in core temperature, the fault rate doubles. We exploit this observation, using constrained core temperatures together with load balancing to improve the reliability of parallel machines, which can also reduce the total execution time required by the application. We formulate a model that relates the total execution time of an application to reliability and to the slowdown associated with temperature restraint. We validate our model by implementing our approach and comparing against actual experimental results. Our scheme can reduce the execution time of all three applications considered in our work by more than 35% for 200,000 sockets, in addition to improving machine reliability by a factor of 1.48.
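The stated rule has a simple closed form: if the fault rate doubles per 10°C, then cooling by ΔT degrees scales the fault rate by 2^(-ΔT/10), i.e. improves the machine's MTBF by 2^(ΔT/10). The ΔT values below are illustrative, not the paper's experimental settings.

```cpp
// Worked form of the "fault rate doubles per 10 degrees C" rule.
#include <cmath>
#include <cstdio>

int main() {
    for (double dT : {5.0, 10.0, 20.0}) {
        double mtbf_gain = std::pow(2.0, dT / 10.0);  // 2^(dT/10)
        std::printf("cool by %4.1f C -> MTBF improves %.2fx\n", dT, mtbf_gain);
    }
    return 0;
}
```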
Partitioning in Charm++
05/06/2013
Support for system-driven partitioning has been added to Charm++ in the latest stable version 6.5.0.



In many scenarios, simultaneous execution of multiple instances of similar jobs is required or beneficial. For example, in NAMD, multiple instances of the same molecular system with similar but different initial conditions can be used to reduce the time to solution and gain insight into the behavior of the molecular system. Alternatively, one may want to execute multiple small jobs together to obtain a better topology, or to insulate performance comparisons from differences in system noise. An inefficient but simple way to achieve this is to run these jobs independently (one by one or simultaneously) and use the file system to perform the necessary information exchange. In MPI programs, an application-specific solution based on communicators can be used; this leads to a loss of productivity and leaves a lot to be desired.



Charm++ offers a runtime-system-based solution that seeks to address most of the above-mentioned issues. Based on user input - the number and size of partitions - the runtime system divides the set of allocated cores into different partitions that are independent Charm++ instances. In the basic form, multiple instances of Charm++ programs, which are agnostic to partitions, execute to completion in their partitions; the RTS ensures synchronization at the beginning and end. If required, several such programs can be stitched together using a simple wrapper to be executed as a single job in their partitions. Communication among partitions is supported using Converse. Like any other message in Charm++, the inter-partition messages are also active entities that cause method invocation at the receiver. Communication, as of now, is restricted to processes, i.e., an application message is directed to a process in a partition (not at chares).
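A hedged sketch of what a partitioned run looks like from the application side follows. The +partitions launch option and the Converse partition queries below reflect the partitioning framework as we understand it from the manual; treat the specifics as assumptions, and see the manual for the inter-partition send calls, which are omitted here.

```cpp
// Launch (illustrative): ./charmrun +p16 ./prog +partitions 4
// -> four independent 4-PE Charm++ instances within one job.
#include "converse.h"  // Converse layer: partition-aware queries

void report_partition() {
    // Each partition has its own PE numbering; these calls tell a process
    // which independent Charm++ instance it belongs to.
    CmiPrintf("PE %d of %d in partition %d of %d\n",
              CmiMyPe(), CmiNumPes(), CmiMyPartition(), CmiNumPartitions());
}
```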



NAMD has been augmented to make use of the partitioning framework (currently used by many users via the development version; to be released). Using a Tcl script as a controller, NAMD initiates simulation of similar systems in different partitions. Periodically, these partitions exchange information and help each other progress faster. Another existing use case for partitions is the replica-enhanced checkpoint framework (visit the Charm++ workshop website for more details).



The current stable version of Charm++ only supports partitioning of cores into fixed-size Charm++ instances. Many other features, including variable-sized partitions and topology-aware partitioning, are available in the development version of Charm++. Manual entry on partitioning.
Dynamic Load Balancing for HPC in Cloud
03/16/2013
Driven by the benefits of elasticity and the pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, a heterogeneous and dynamic environment, and interference from other virtual machines (VMs) are bottlenecks for efficient HPC in the cloud.

Insufficient network performance is a major bottleneck for HPC in the cloud, and has been widely explored. Two less-explored challenges are resource heterogeneity and multi-tenancy, which are fundamental artifacts of running in the cloud. Clouds evolve over time, leading to heterogeneous configurations in processors, memory, and network. Similarly, multi-tenancy is intrinsic to the cloud, enhancing the business value of providing one. Multi-tenancy leads to multiple sources of interference due to sharing of CPU, cache, memory access, and interconnect. For tightly coupled HPC applications, heterogeneity and multi-tenancy can result in severe performance degradation and unpredictable performance, since one slow processor slows down the entire application. Moreover, the distribution of such interference is fairly random and unpredictable in a cloud. Hence, we need a mechanism to adapt to dynamic variation in the execution environment.

With its adaptivity features, Charm++ is naturally suited for deployment on cloud infrastructures. Our primary hypothesis is that the challenges of heterogeneity and noise arising from multi-tenancy can be handled by an adaptive parallel runtime system. To validate this hypothesis, we explored the adaptation of the Charm++ runtime system to virtualized environments. We developed a dynamic load balancer for tightly coupled iterative HPC applications in the cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by interference arising from multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of the task distribution to VMs, our load balancer adapts to dynamic variations in cloud resources. The main idea is periodic refinement of the task distribution using measured CPU loads, task loads, and idle times, as sketched below.
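A toy sketch of that refinement loop follows: estimate each VM's effective speed from recent measurements, then shift work from the VM with the latest predicted finish time toward the one with the earliest, pairwise equalizing their finish times. The speeds, loads, and threshold are illustrative; this is not the actual balancer.

```cpp
// Hedged sketch of speed-aware periodic load refinement across VMs.
#include <cstdio>
#include <vector>

int main() {
    // Measured per-VM effective speeds (relative) and current task loads.
    std::vector<double> speed = {1.0, 1.0, 0.5, 0.9};  // VM 2 is time-shared
    std::vector<double> load  = {4.0, 4.0, 4.0, 4.0};  // equal static split
    auto finish = [&](int v) { return load[v] / speed[v]; };
    for (int iter = 0; iter < 8; iter++) {
        int slow = 0, fast = 0;
        for (int v = 1; v < (int)speed.size(); v++) {
            if (finish(v) > finish(slow)) slow = v;
            if (finish(v) < finish(fast)) fast = v;
        }
        if (finish(slow) - finish(fast) < 0.5) break;  // balanced enough
        // Pairwise equalize predicted finish times by migrating work.
        double t = (load[slow] + load[fast]) / (speed[slow] + speed[fast]);
        load[slow] = t * speed[slow];
        load[fast] = t * speed[fast];
    }
    for (int v = 0; v < (int)load.size(); v++)
        std::printf("VM %d: load %.2f, finish %.2f\n", v, load[v], finish(v));
    return 0;
}
```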

In a paper accepted to IEEE/ACM CCGrid 2013, we demonstrate that even without accurate information on the nature and amount of heterogeneity (static, and dynamic but hidden from the user as an artifact of virtualization), the approach of periodically measuring idle time and migrating load away from time-shared VMs works well in practice. In that paper, we perform an experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, and demonstrate performance benefits of up to 45%. The figure shows that load balancing achieves a significant reduction in execution time compared to the NoLB case, for three different applications and different numbers of VMs, in the presence of both effects: the inherent hardware heterogeneity and the heterogeneity introduced by multi-tenancy. Through experimental analysis, we also learned that the choice of load balancing period and computational granularity can have a significant impact on performance, but the optimal values depend on application characteristics, size, and scale. Runtime systems that can automate the selection and dynamic adjustment of such decisions will be increasingly useful in the future.

Our related research on HPC in the cloud includes the evaluation of HPC performance in the cloud, identification of performance bottlenecks, and addressing them through HPC-aware cloud schedulers and a cloud-aware parallel runtime system. Related publications can be found on our research page on HPC in the cloud. Our research was awarded one of the HP Labs Innovation Research Program (IRP) awards in 2012.