Support for system-driven partitioning has been added to Charm++ in the latest stable version 6.5.0.
In many scenarios, simultaneous execution of multiple instances of similar jobs is required or is beneficial. For example, in NAMD, multiple instances of the same molecular system with similar but different initial conditions can be used to reduce the time to solution, and gain insight into the behavior of the molecular system. Alternatively, one may want to execute multiple small jobs together to get better topology, or sanitize performance comparison against differences in system noise. An inefficient but simple way to achieve the above is by running these jobs independently (one by one or simultaneously), and use file system to perform the necessary information exchange. In MPI programs, application specific solution based on communicators can be used. This leads to productivity loss, and leaves a lot to be desired.
Charm++ offers a runtime system based solution that seeks to address most of the above mentioned issues. Based on user input - the number and size of partitions - the runtime system divides the set of allocated cores into different partitions that are independent Charm++ instances. In the basic form, multiple instances of Charm++ programs, which are agnostic to partitions, execute till the end in their partitions. The RTS ensures synchronization at the beginning and end. If required, several such programs can be stitched together using a simple wrapper to be executed as a single job in their partitions. Communication among partitions is supported using Converse. Like any other message in Charm++, the inter-partition messages are also active entities that cause method invocation at the receiver. Communication, as of now, is restricted to processes, i.e. an application message is directed to a process in a partition (not at chares).
NAMD has been augmented to make use of the partitioning framework (currently used by many users using the development version, to be released). Using Tcl script as a controller, NAMD initiates simulation of similar systems in different partitions. Periodically, these partitions exchange information, and help each other progress faster. Another existing use case for partitions is the replica-enhanced checkpoint framework (visit Charm++ workshop website for more details).
The current stable version of Charm++ only supports partitioning of cores into fixed size Charm++ instances. Many other features including variable sized partitions and topology aware partitioning are available in the developmental version of Charm++. Manual entry on partitioning.
Driven by the benefits of elasticity and pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, heterogeneous and dynamic environment, and interference by other virtual machines (VMs) are some bottlenecks for efficient HPC in cloud.
The insufficient network performance is a major bottleneck for HPC in cloud, and has been widely explored. Two less explored challenges are resource heterogeneity and multi-tenancy - which are fundamental artifacts of running in cloud. Clouds evolve over time, leading to heterogeneous configurations in processors, memory, and network. Similarly, multi-tenancy is also intrinsic of cloud, enhancing the business value of providing a cloud. Multi-tenancy leads to multiple sources of interference due to sharing of CPU, cache, memory access, and interconnect. For tightly-coupled HPC applications, heterogeneity and multi-tenancy can result in severe performance degradation and unpredictable performance, since one slow processor slows down the entire application. Moreover, the distribution of such interference is fairly random and unpredictable in a cloud. Hence, we need a mechanism to adapt to the dynamic variation in the execution environment.
With its adaptivity features, Charm++ is naturally suited for deployment in cloud infrastructures. Our primary hypothesis is that the challenges of heterogeneity and noise arising from multi-tenancy can be handled by an adaptive parallel runtime system. To validate our hypothesis, we explored the adaptation of Charm++ runtime system to virtualized environment. We developed a dynamic load balancer for tightly-coupled iterative HPC applications in cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by the interference arising due to multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of task distribution to VMs, our load balancer adapts to the dynamic variations in cloud resources. The main idea is periodic refinement of task distribution using measured CPU loads, task loads, and idle times.
In an accepted paper for IEEE/ACM CCGrid 2013, we demonstrate that even without the accurate information of the nature and amount of heterogeneity (static and dynamic but hidden from user as an artifact of virtualization), the approach of periodically measuring idle time and migrating load away from time-shared VMs works well in practice. In that paper, we perform experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, and demonstrate performance benefits up to 45%. The figure shows that we achieve significant % reduction in execution time using load balancing compared to the NoLB case, for three different applications and different number of VMs in presence of both the effects – the inherent hardware heterogeneity and the heterogeneity introduced by multi-tenancy. Through experimental analysis, we also learned that choice of load balancing period and computational granularity can have significant impact on performance but the optimal values depend on application characteristics, size, and scale. Runtime systems which can automate the selection and dynamic adjustment of such decisions will be increasingly useful in future.
Our related research on HPC in Cloud include evaluation of HPC performance in cloud, identification of performance bottlenecks, and addressing them through HPC-aware cloud schedulers and cloud-aware parallel runtime system. Related publications can be found on our research page on HPC on cloud. Our research on was awarded one of the HP labs Innovation Research Award (IRP) 2012.
As we move towards the exascale era, we need to reconsider the traditional algorithms and programming models. Adaptive Mesh Refinement (AMR) is an efficient methodology used for simulating differential equations on very large structured meshes that arise in many domains, e.g. numerical cosmology, global atmospheric modeling, mantle convection modeling, etc. The regions in the mesh which needs simulation at finer precision are refined to deeper levels while others are kept at the same level or are coarsened. The mesh adapts itself as the simulation progresses (see Figure below).
In traditional approaches, each process is assigned a set of blocks based on a space-filling curve. This approach has several limitations - 1) It requires O(#blocks) memory per process to store the mesh structure, 2) takes O(logP) time to locate neighboring blocks 3) O(d) rounds of global reductions during mesh restructuring (where d is the maximum depth of the mesh) 4) load balancing takes O(#blocks) time. Therefore this approach has memory bottleneck and has large time overheads for mesh restructuring and load balancing.
We developed a new design in which each mesh block acts as a first class entity - a virtual processor. This offers several algorithmic as well as implementation benefits - 1) a block acts as a virtual processor which allows overlap of its computation with communication of other blocks on the same physical process, 2) a bock can be uniquely identified by its location in the tree, 3) a block can be placed dynamically on any physical process which forms the basis for dynamic load balancing, 4) a block acts as a unit of algorithm expression which significantly reduces the implementation complexity, 4) a block is the end-point of communication and the underlying run-time system handles communication between arbitrary blocks, 6) O(#blocks/P) memory per process is required to store the mesh structure (as compared to O(#blocks) in the traditional algorithms. 7) the simulation progresses asynchronously with no barriers after each iteration and each block enters asynchronously into the mesh restructuring phase. The associated mesh restructuring algorithm takes only 2 system quiescence states (where complexity of detecting a quiescence is O(logP)) as compared to O(d) global reductions taking O(logP) time each, in traditional approaches (details in paper).
We implemented this design in Charm++. Each block is an instance of an element of a dynamic chare array with customized bit-vector indices to identify a block with its location in the mesh. Since blocks are the fundamental unit of work, they need to be equally distributed across processes for load balance. We plugged in Charm++ run-time provided gossip-based distributed load balancer called Grapevine. Performance of Grapevine is competitive with the centralized load balancers while it incurs negligible load balancing overhead. To empirically test our approach, we benchmarked a finite difference simulation of advection using a first-order upwind method in two-dimensional space (see Fig below). We achieved a strong scaling efficiency of 76.7% while scaling from 1k cores to 16k cores of BG/Q (see Fig above), on a mesh with maximum depth of 15. Our new algorithm is fully distributed and highly asynchronous, thus enabling high performance for much more deeply refined computations than are currently practised. For more details please refer to paper.

An exascale machine is expected to be delivered in
the time frame 2018-2020. Such a machine will be able to tackle
some of the hardest computational problems and to extend our
understanding of Nature and the universe. However, to make
that a reality, the HPC community has to solve a few important
challenges. Resilience will become a prominent problem because
an exascale machine will experience frequent failures due to the
large amount of components it will encompass. Some form of fault
tolerance has to be incorporated in the system to maintain the
progress rate of applications as high as possible. In parallel, the
system will have to be more careful about power management.
There are two dimensions of power. First, in a power-limited
environment, all the layers of the system have to adhere to
that limitation (including the fault tolerance layer). Second,
power will be relevant due to energy consumption: an exascale
installation will have to pay a large energy bill. It is fundamental
to increase our understanding of the energy profile of different
fault tolerance schemes.
We evaluated a set of
three different fault tolerance approaches: checkpoint/restart,
message-logging and parallel recovery. Using programs from
different programming models, we showed parallel recovery is the
most energy-efficient solution for an execution with failures. At
the same time, parallel recovery is able to finish the execution
faster than the other approaches. We explored the behavior of
these approaches at extreme scales using an analytical model.
At large scale, parallel recovery is predicted to reduce the total
execution time of an application by 17% and reduce the energy
consumption by 13% when compared to checkpoint/restart.
The figure plots the relative improvement in the energy consumption of message-logging and parallel recovery with respect to checkpoint/restart against the number of sockets.
This study was published in the 24th International Symposium on Computer Architecture and High Performance Computing. The paper was awarded the Julio Salek Aude Best Paper
Award at that conference.
Professor Kale and Senior Research Programmer Eric Bohm will teach a tutorial titled "Parallel Programming with Migratable Objects for Performance and Productivity." This tutorial will focus on application development following the principles of migratable, message-driven objects that underlies the Charm++ parallel programming environment. Attendees of the course will learn how to construct next-generation parallel applications in this setting in order to benefit from automatic load balancing, fault tolerance, and easy and efficient parallel composition. This tutorial will run from 8:30 to noon on Sunday, November 11.
PPL will host a Birds-of-a-Feather Session on the Charm++ ecosystem - its runtime, programming tools, and applications. Participants will hear about how these components are improving in performance, features, ease of use, portability, and interoperability. Several application developers will share their recent experiences and the latest advances. This is an ideal opportunity to learn more about how Charm++ is used in practice, see how it's growing, and influence its future direction. The session will take place from 12:15 to 1:15 on Thursday, November 15.
PhD candidate, Yanhua Sun, will present optimization techniques on Cray XK6 architecture for Charm++ runtime system and the molecular dynamics application -- NAMD, as published in "Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6". For 100-million-atom STMV benchmark, the NAMD team improved upon the prior Jaguar XT5 result of 26 ms/step to 13 ms/step using 298,992 cores on Jaguar XK6. The performance improvement was achieved through careful tuning of Charm++ on Gemini interconnects, along with improvements in the parallelization of the Particle Mesh Ewald algorithm for long range force contributions, as well as GPU optimization for NAMD. This session is scheduled to take place on Wednesday November 4th from 2:00PM - 2:30PM in 355-EF.
PPL Director Professor Kale and long-standing PPL collaborator Professor Klaus Schulten will receive the Sidney Fernbach award "for outstanding contributions to the development of widely used parallel software for large biomolecular systems simulation."
Fifth-year PhD candidate Osman Sarood is focused on the challenge of meeting the energy and power requirements for huge exascale machines. Energy costs for data centers can be divided into two main categories: machine energy and cooling energy consumptions. Sarood's thesis investigates reduction in energy consumption for HPC data centers in both these categories. His recent work on reducing cooling energy consumption shows that we can reduce cooling energy consumption by up to 63% using our temperature aware load balancer. In this work, we also demonstrate that data centers can reduce machine energy consumption by up to 28% by running different parts of the applications at different frequencies. The focus of our work is to gauge the potential for energy saving by reducing both machine and cooling energy consumption separately (and their associated execution time penalty) and then come up with a scheme that combines them optimally to reduce total energy consumption for large HPC data centers. Sarood will present his research in the Dissertation Showcase at 11:15AM - 11:30AM on Wednesday, November 14th.
Last year, a team from PPL was awarded first place in the HPC Challenge Competition for the overall execution performance of the Charm++ benchmark implementations. This year, they will compete again to demonstrate portability to the latest extreme-scale systems, further exploitation of the load balancing and fault tolerance features of Charm++, and improved performance. If their submission is selected as a finalist, there will be a presentation during the HPC Challenge BoF session from 12:15-1:15 on Tuesday, November 13th.
In addition to these events, many PPL members will be in attendance at the conference. We look forward to seeing you there!
For applications written in Charm++, the adaptive run time system provides measurement based dynamic load balancing using principle of persistence. However, performing load balancing incurs a cost which may not be known a priori. Due to the cost of load balancing, it is important to determine if invoking the load balancer is profitable, i.e., whether the overhead due to load balancing is less than the gain obtained after load balancing for a period of time. Typically, application behavior depends on the size of system being simulated and the parallel system being used for simulation. As a result, finding the time steps (or iterations) at which load balancing should be invoked to obtain best performance is a difficult task. Most run time systems (RTS) depend on the application programmers to decide when to balance the load. A common practice is to choose a fixed period to invoke the load balancer; for example every 100 time steps. This, however, prevents the load balancing from adapting to the changing application behavior.
Meta-Balancer framework is a step towards automating load balancing related decision making. Based on the application characteristics observed at runtime and a set of guiding principles, Meta-Balancer relieves the application programmer from the critical task of deciding when the load balancer should be invoked. Meta-Balancer continuously monitors the application and using a linear prediction model on the collected information, predicts the time steps (or iterations) at which load balancing should be performed for optimal performance. In addition, Meta-Balancer monitors for sudden changes in the application behavior and invokes the load balancer if needed.
Meta-Balancer has been implemented on top of Charm++ load balancing framework in order to take advantage of its support for dynamic load balancing. In an accepted paper for IEEE Cluster 2012 [12-29], it has been demonstrated that Meta-Balancer improves application performance by choosing the correct time steps to invoke load balancer for iterative applications. In the paper, we present a comparison of the performance of Meta-Balancer with respect to periodic load balancing using a real world applications such as LeanMD and Fractography3D. We show that Meta-Balancer is able to identify the ideal load balancing period which changes as the application evolves and extracts the best performance for the applications automatically at runtime.
Application Study:
Fractography is used to study fracture surfaces of materials. Figure 2 shows the processor utilization graph generated when Fractography3D is run on 64 cores of Jaguar.
For more details on theory behind working of Meta-Balancer and results on Fractography and LeanMD, please refer to paper
Applications running on large scale machines will have to face a high failure frequency. The traditional way to provide applications with fault tolerance is by using a rollback-recovery mechanism. Checkpoint/restart is the most popular of such mechanisms, mainly because its simplicity. Each task in the application checkpoints its state periodically and in case of a failure all the tasks rollback to the most recent checkpoint and restart from there. The built-in migratability support in Charm++ makes the use of checkpoint/restart even easier. The required methods to save the state of a task are already implemented as part of the load balancing support. Adding fault tolerance support in Charm++ is usually just a matter of adding a function call in the code. The Charm++ runtime system will automatically handle a failure and restart the application. No user involvement is required.
However, checkpoint/restart may suffer from multiple bottlenecks. If checkpoints are sent to disk, the file system may get saturated and collapse. Storing the checkpoints in memory may alleviate this issue, but it will still have to deal with another problem: bottleneck at the network card. For applications with a modest memory footprint, the increase of the checkpoint size will quickly saturate the network. It is estimated that the total checkpoint overhead goes beyond 10% of the total execution time. Furthermore, the increasing checkpoint overhead will make applications hard to make any progress as the mean-time-between-failures (MTBF) is predicted as 2 to 60 minutes for future systems.
Members of PPL have designed a new checkpointing algorithm (called semi-blocking checkpoint/restart) which can significantly hide the checkpoint overhead for applications by asynchronously sending the checkpoints. There are two phases in the algorithm. In the first phase, after reaching a global synchronization point, each node will store its checkpoint in the main memory. After every node safely stores the newest copy of the checkpoint, they will resume to the normal execution. Then, the Charm++ runtime system will take the responsibility to send the checkpoints of each node to another node as a double copy at the appropriate time: when the applications are busy computing and leave the network free. With this algorithm, checkpointing becomes transparent to applications and incur less interference to its normal execution.
In a paper recently accepted at CLUSTER'12 [12-32], we measured our algorithm on Trestles supercomputer at SDSC which has 32 cores per node with a peak performance of 100 teraflops. We ran both weak and strong scaling experiments with two applications: wave2D and ChaNGa. As seen in the figures, the time to perform a single checkpoint is dramatically reduced from more than 50 seconds to less than 10 seconds. Using different MTBF values predicted for future system with an optimum checkpoint interval, applications can reduce their total execution time up to 22% with the semi-blocking checkpoint/restart algorithm.
Current implementations of process groups (subcommunicators) have non-scalable (O(group size)) memory footprints and even worse time complexities for setting up communication. We propose system-ranked process groups, where member ranks are picked by the runtime system, as a cheaper and faster alternative for a subset of collective operations (barrier, broadcast, reduction, allreduce). Letting the runtime system assign ranks to processes enrolling in a group liberates it from having to sort user-supplied keys to identify ranks. Avoiding this O(n logn) computational complexity can result in significant speedups of group creation mechanisms. In keeping with a pay-only-for-use policy, user-supplied ranks can be supported atop system-ranked groups by performing the sort as an additional step. Tailoring the communicators (groups) to subsets of use-cases will permit implementations that are less resource-intensive and faster.
We developed two distributed algorithms, namely, Shrink-and-Balance and Rank-and-Hash, for balanced spanning tree construction over system-ranked process groups obtained by splitting a parent group. Rank-and-Hash has constant memory footprint and O(log n) time complexity but exchanges more messages than Shrink-and-Balance scheme which has O(logn) space and O((logn)^2) time complexity. Our distributed algorithms perform faster than the centralized scheme even at modest number of processors (see figure, p is the probability with which a process participates in the new process group). These algorithms perform orders of magnitude faster than MPICH's MPI_Comm_split implementation(see Table) for smaller number of splits.
These algorithms are an improvement over existing approaches as they support construction of balanced k-ary spanning trees as compared to other existing work on generalized MPI_Comm_split which supports only binary trees.
| Num splits | MPI_Comm_split | Rank-and-Hash |
| 1 | 134.968 | 0.708 |
| 2 | 106.573 | 0.713 |
| 4 | 96.989 | 0.760 |
| 8 | 93.536 | 0.785 |
Power dissipation and energy consumption are becoming increasingly important architectural design constraints in different types of computers, from embedded systems to large-scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets. Intel SCC is an experimental chip intended to represent future many-core architectures. In recent research, we have used Charm++ infrastructure to run and inspect various scalable applications on different architectures including the SCC. Thus, we can quantitatively compare and analyze the performance, power consumption and energy efficiency of different cutting-edge platforms.
We port the Charm++ infrastructure to SCC to be able to run Charm++ and MPI codes without any change. This task does not have any conceptual difficulty as SCC can be thought of as a small cluster. The figure below shows the scaling of different applications on the SCC using Charm++ and Adaptive MPI. As can be seen, most applications (except CG) scale very well up to 48 cores of SCC, using the Charm++ system.
Our results (the figures shown below) show that the GPGPU has outstanding results in performance, power consumption and energy efficiency for many applications, but it requires significant programming effort and is not general enough to show the same level of efficiency for all the applications. The “light-weight” many-core presents an opportunity for better performance per watt over the “heavy-weight” multi-core, although the multi-core is still very effective for some sophisticated applications. In addition, the low-power processor is not necessarily energy-efficient, since the runtime delay effect can be greater than the power savings.
Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. To solve this problem, a parallel tracking package was developed with Charm++ in collaboration with researchers from the Dept. of Agricultural and Biological Engineering [12-02].
Charm’s over-decomposition allows fine-grain control over the accuracy and parallel efficiency of the particle tracking algorithm, enabling researchers to output results in real-time as data is gathered from multiple high FPS cameras or process results off-line for higher accuracy. The tracking software is planned to be an integral part of a portable particle-tracking instrument for measuring particle dispersion in confined spaces, such as aircraft cabins.
We recently implemented a dense LU solver in Charm++ that won a HPCC challenge award for best performance [11-34]. To meet specification (and ensure numerical stability), a dense LU must perform partial pivoting, which classically requires a column-by-column identification of the pivot row, swapping of the identified row across the matrix, and a corresponding update. Pivoting results in a fine decomposition and is likely to incur excessive overhead: however, we demonstrate that these overheads are ameliorated by Charm++’s ability to overlap computation and communication.
Charm++ also allows blocks of the matrix to be arbitrarily mapped to a machine by a user-defined function; this feature decouples mapping from application-specific logic and allows the user to quickly and easily experiment with various mapping schemes. In a paper accepted at IPDPS 2012 [11-51], we use this ability to explore novel mappings of block to processor that substantially increase performance.
In this paper, we build on the traditional block-cyclic mapping to consider an explicit tradeoff between memory contention effects and network locality by mapping blocks to the process grid in a hybrid of row- and column-major order (called striding). We also introduce the notion of rotation in the mapping of blocks to adjust relative degrees of parallelism between panel factorization and triangular solves to match architectural parameters.
Overall, our code is competitive with modern LU implementations.
Gemini is the new interconnect for Cray XE/XK systems; it is characterized by low latency, high bandwidth and strong scalability. A user-level Generic Network Interface (uGNI) is provided as a low-level interface for Gemini, on top of which the Cray MPI is implemented. Gemini-based interconnects have been deployed in two petaflop supercomputers, Cielo at Los Alamos National Lab and Hopper at Lawrence Berkeley National Lab. Jaguar, one of the most powerful supercomputers, is being upgraded to a XK6 system with Gemini interconnect. The newly announced Blue Waters supercomputer, which will deliver sustained performance of 1 petaflop, will also be equipped with a Gemini interconnect. As Cray supercomputers with Gemini interconnects become one of the major platforms for large scale real-world science and engineering applications, it is important to port and scale Charm++ runtime systems and its parallel applications on the new interconnect.
One possible strategy is to implement the message-driven programming model on Cray's MPI implementation, which has been optimized for message passing using the low level features provided by the Gemini hardware. However, in order to avoid the extra overhead caused by the MPI software stack, and and to exploit the available performance as best as possible, we targeted the new port directly on top of the uGNI.
In recent months, several people from PPL have been working intensively on this port, and developed a new machine layer for Gemini on uGNI. Several approaches are explored to optimize the communication to best utilize the Gemini capability. These include reducing memory copy and registration overhead using memory pool; exploiting persistent communication for one-sided communication; and improving intra-node communication via POSIX shared memory and XPMEM kernel module. These optimization techniques leverage the Charm++'s message driven programming model where message buffers are managed by the runtime itself. Our micro-benchmarks show that the uGNI-based Charm++ outperforms the MPI-based implementation by up to 50% on a Cray XE machine. For an irregular program N-Queens, the uGNI-based implementation scales up to 15,360 cores using 70% less execution time than the MPI-based implementation.
For NAMD, a full-fledged molecular dynamics application, the performance is significantly improved by as much as 18% in a fine-grain parallel run. To the best of our knowledge, this uGNI-based Charm++ runtime is the first public third-party runtime ported on uGNI, which achieves scalability up to tens of thousands of cores on a Cray XE machine.
This new uGNI-based machine layer will be included in the coming new Charm++ release. A paper on this work is also accepted for publication by IPDPS 2012 in Shanghai, China.
As the size of supercomputers multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long running applications. Checkpoint-based fault tolerance methods are effective approaches at dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. Leveraging Charm++'s parallel objects for checkpointing, two variations of checkpointing schemes, a disk-based and a double in-memory checkpointing schemes, are incorporated in the production distribution of Charm++. One of the unique features of both schemes is that the program can be restarted on smaller number of available processors as a result of failure. Furthermore, the application will continue to execute on the remaining processors without much performance penalty after automatic load balancing. In particular, compared to the disk-based method, the double in-memory checkpointing scheme takes advantage of the fast memory access for checkpointing to both local memory and remote memory through high speed interconnect.
Recently, members of Parallel Programming Laboratory are working on further minimizing the checkpoint and restart overhead by applying more efficient collectives for barriers. We demonstrated the in-memory checkpointing scheme using MPI on very large scale supercomputers. One obstacle for demonstrating fault tolerance on MPI applications is that the queueing system on supercomputers will kill a job when a process fails. Without the support of the queueing system, we developed a scheme that mimics a failure of a process without actually killing it. This is implemented as a DieNow() function, which users insert at any place in their program to trigger a failure. When DieNow() function is called by the program, the process will hang and stop responding to any communication as if it dies. Charm++ will pick up a spare processor from a pool and restart the application from the recent checkpoint in memory.

To demonstrate the performance and scalability of the newly optimized double in-memory checkpointing scheme, we used two benchmarks, which are leanMD (a molecular dynamic benchmark) and Jacobi (a 7-point stencil benchmark). In the experiments, we measured the overhead of checkpoint and restart of these two benchmarks on a BlueGene/L machine. The results are shown below in the figures.
We could see the checkpoint time scales well in both applications with small and large memory footprint. In particular, the restart time of leanMD simulating an 1-million atom system only increases from 0.08 seconds on 2K cores to 0.38 seconds on 32K cores.
Modern processors are fast reaching the limits of CPU frequency scaling. In
their continued bid to increase performance, chip vendors have now made
multicore processors commonplace. However, the means to program these
increasingly powerful processors have not yet matured. Traditional languages
and frameworks for programming multicore systems either provide very low-level
constructs, or a performance model that is not well-matched with the underlying
hardware.
In recent work we have explored an alternative to these approaches, namely the use of the message-driven execution (MDE) paradigm of Charm++ to write shared memory programs. There are several productivity and performance benefits to doing so:
- users can explicitly control the granularity of parallel tasks
- dependences are specified explicitly through asynchronous messages, so that programs do not use locks
- messages are asynchronous, allowing the adaptive overlap of computation and communication
- since objects share a single address space, the delivery of messages is efficient and does not require copying
- an intelligent runtime system performs automated dynamic optimizations such as load balancing and message coalescing
In this work, we examined the advantages in performance that a message-driven
shared memory program enjoys. We carried out our study in the context of an
application that is challenging to parallelize effectively. Surface- area
heuristic (SAH) based kd-tree construction exhibits a tree-structured
arrangement of parallelism, low intensity of floating point operations, a mix
of task- and data-parallel operations of varying granularity, and extensive
data movement, making it hard to scale its performance with increasing
processor counts.
In writing the Charm++ kd-tree construction program, we were able to employ several optimizations that are enabled by the runtime system, e.g. grain size control, dynamic seed-based load balancing for data parallel tasks, lightweight fork-join synchronization, and a novel chunked array abstraction to reduce synchronization between data parallel tasks. The cumulative impact of these optimizations was significant--a comparison with a leading TBB-based kd-tree construction program, ParKD, showed that the Charm++ version was up to 3.1 times as fast as ParKD. As shown in the accompanying figures, the relative advantage of using the Charm++ version grew as the amount of parallel work (indicated by the depth of the tree constructed) increased.
Future work will explore the shared memory MDE paradigm in the context of other applications. While the performance benefits of using MDE have been studied, we would also like to examine the improvements in programmer productivity that this paradigm engenders. More generally, we would like to create abstractions that enable the portability of shared memory MDE programs across shared- and distributed-memory machines.
During SuperComputing 2011, PPL won first place in the 2011 HPC Challenge contest for its parallel programming framework Charm++. The award recognizes Charm++ as the best performing system in the class 2 (productivity and performance) category of the contest.PPL's submission this year represents its first ever participation in the contest. The submission uses Charm++ to implement five benchmark programs that are representative of relevant computational algorithms. Three of these had to be chosen from a list of recommended kernels supplied by the contest organizers. Chosen kernels were dense LU factorization (the HPL benchmark), one-dimensional FFT and random access. Additionally, PPL also submitted two optional benchmarks; selected because they are popular computational kernels, require non-trivial parallel control flow and exhibit dynamic imbalances. These were molecular dynamics for the simulation of atomic systems and the Barnes-Hut algorithm for simulating N-body systems.
Charm++ is a general-purpose parallel programming framework that offers several practical, productivity-enhancing capabilities. It allowed the submitted benchmarks to remain succinct, yet scale to more than sixty five thousand cores of large supercomputers. Several features like asynchronous non-blocking collectives, topology-aware software routing of messages and automatic, dynamic load balancing allow these benchmarks to achieve high performance. Importantly, the developers writing the benchmarks did not have to incorporate these performance-enhancing features and adaptive capabilities directly into the programs themselves.The HPC class 2 Challenge is an annual event aimed at encouraging the development of high level parallel programming models that offer both productivity and performance. The contest requires the submission of several benchmark programs implemented using the candidate programming model, along with data about the performance achieved by these benchmarks. A panel of expert judges then evaluates the submissions on the performance they achieve and also on metrics like the size of the code.
Details of the benchmarks, performance and code size metrics are available in the submitted document, now a tech report on the PPL website. The document contains a more detailed discussion of the practical productivity benefits that Charm++ offers high-performance computational science applications.
A low-diameter, fast interconnection network is going to be a pre-requisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume (Figure 1). Hence, routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology.This work explores intelligent topology aware mappings of near neighbor communication patterns to the physical topology to identify cases that minimize link utilization. Topology aware mappings, used in this work, are based on clustering communicating tasks onto processors that are physical neighbors. The clustering can be done at different levels of the hierarchy – Blocked Node Mapping (BNM), Blocked Drawer Mapping (BDM) and Blocked Supernode Mapping (BSM). Here, drawers and supernodes refer to a collection of nodes and drawers respectively that are at a distance of one hop. To make use of large number of links emanating from the high radix routers used in two-level direct network, a mapping with clustering at node level (RNM) and drawer level (RDM), that spreads tasks uniformly at random is also considered. In addition, we explore random mapping with clustering at drawer level using indirect routing (RDI).
We use detailed network simulations for up to 307,200 MPI tasks and three different communication patterns – 2D stencil, 4d stencil and an n-targets multicast pattern – to compare various mappings. It is found that if direct routing is used, mapping blocks of MPI tasks to random nodes gives the best performance and evenly distributed usage of second level links. It is also observed that indirect routing can achieve similar performance to an intelligent mapping and obviates the need for mapping, at the cost of increasing overall traffic on the network (Figure 2). This work also highlights the utility of simulation-based predictions to analyze algorithms and make design choices before a parallel machine is installed and available for use. This will become increasingly important as machine sizes grow making it essential to do application and hardware co-design.For more details, please find the full paper at: http://charm.cs.uiuc.edu/papers/11-21
As supercomputers grow in size, their propensity to fail increases substantially. The most optimistic projections estimate Exascale machines will experience a failure every few hours. This is specially problematic for many HPC applications that run for long time and use a large portion of the machine. These applications will have to become resilient in order to finish execution. Our adaptive runtime system provides the scalable fault tolerance support for Charm++ and AMPI programs that is needed to weather frequent failures.Moreover, the unique technology of Charm++ leverages most of the traditional fault tolerance techniques. First, the traditional checkpoint/restart approach can work even in the case where the number of available nodes decreases as failures bring down some of them.
Second, proactive migration of tasks fits perfectly our infrastructure for migratable objects, making natural the connection of a failure predictor. Third, message logging protocols benefit in allowing the recovery to run in parallel, minimizing the time to redo the work lost in a failure.
Recently, we have been investigating another way in which message logging can be improved by allowing a load balancer to provide crucial information about the application. With this information, we vastly reduce the memory overhead associated with storing the messages. We call this approach Dynamic Team-based Message Logging. Nodes are divided into teams and only messages crossing team boundaries are logged. Teams are, however, dynamic. Depending on the runtime conditions, the load balancer may change the teams to reflect the change in the load of different nodes.
The figure shows the results of the Team Load Balancer (TeamLB) that attains two goals: i) provides a good load balance across the computation nodes and, ii) reduces the message logging overhead by grouping highly connected objects in the same team. Although this load balancer has a small overhead penalty, it drastically reduces the memory overhead of message logging.
Currently, we are working on extending our fault tolerance infrastructure to consider the multicore nature of supercomputers.
Meeting power requirements of the huge exascale machines of the future is one major challenge facing the HPC community. As power consumption and power costs rise, the bottom line impact is felt by everyone involved in parallel research. Members of the Parallel Programming Laboratory (PPL) are focusing on ways to minimize cooling power for these machines. We propose a technique that uses a combination of DVFS and temperature aware load balancing to constrain core temperatures as well as save cooling energy. Our scheme is specifically designed to suit parallel applications which are typically tightly coupled. Currently, the temperature control comes at the cost of execution time and we are working to minimize the timing penalty. We have run experiments with three parallel applications, each with a different power utilization profiles, run on a 128-core (32-node) cluster with a dedicated air conditioning unit. As the experiment is running, we calibrate the efficacy of our scheme based on three metrics: ability to control average core temperatures thereby avoiding hot spot occurrence, timing penalty minimization, and cooling energy savings. Our preliminary results show cooling energy savings of up to 57% with timing penalty mostly in the range of 2 to 20%.
To demonstrate the effectiveness of our scheme, we use three applications having different utilization and power profiles. The first is a canonical benchmark, Jacobi2D, that uses 5 point stencil to average values in a 2D grid using 2D decomposition. The second application, Wave2D, uses a finite differencing scheme to calculate pressure information over a discretized 2D grid. The third application, Mol3D, is from molecular dynamics and is a real world application to simulate large bio-molecular systems. The experiment shows that we were able to reduce the timing penalty associated with DVFS by a great margin. Using our load balancing strategy, we were able to reduce the timing penalty to 27% for Jacobi2D. Other than that we also used performance counters in order to relate application characteristics to temperature control.
Looking towards the future, we plan to take the DAG of the application into account in order to reduce the timing penalty even further. We are also looking for ways to save machine energy consumption based on the application characteristics.
One of the requirements for acceptance of the upcoming Blue Waters petascale system is the successful simulation of a 100-million-atom biomolecular system
with NAMD. Simulating this large molecular system presents great challenges,
including I/O, large memory footprint and getting good strong-scaling.
Our group has recently improved the capabilities of NAMD to simulate on existing petascale machines. A new SMP model in
Charm++ was designed to efficiently utilize ubiquitous wide multicore
clusters by extending the Charm++ asynchronous message-driven runtime.
Hierarchical load balancing was further exploited
to scale NAMD to the full extent of Jaguar, a petaflop Cray XT5 (224,076 cores)
at Oak Ridge National Laboratory, both with and without PME full electrostatics,
achieving 93% parallel efficiency (compared to a base-case of 6,720 cores) at
9 ms per step for a simple cutoff calculation.









