An Adaptive Parallel Sparse Triangular Solver
07/02/2014
Figure: a triangular matrix

Solving sparse triangular linear systems is an important kernel for many numerical linear algebra problems in science and engineering simulations, e.g., linear systems and least-squares problems. It is used extensively in direct methods following a triangular factorization to solve the system (possibly with many right-hand sides), or to improve an approximate solution iteratively. It is also a fundamental kernel in many iterative methods (such as the Gauss-Seidel method) and in many preconditioners for other iterative methods (such as the Incomplete-Cholesky preconditioner for Conjugate Gradient). Unfortunately, the performance of parallel algorithms for triangular solution is notoriously poor, creating a bottleneck for many of these methods.

We present a novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods. By analyzing and reordering operations, our algorithm can often extract parallelism even for cases where most of the nonzero matrix entries are near the diagonal. Our main parallelism strategies are: (1) identify independent rows, (2) send data earlier to achieve greater overlap, and (3) process dense off-diagonal regions in parallel. We implemented our algorithm in Charm++, and present promising experimental results using numerous sparse matrices from real applications.
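
As a concrete illustration of strategy (1), the sketch below (ours, not the paper's code) assigns every row of a lower-triangular CSR matrix a level; rows that share a level depend only on rows in earlier levels, so they can be solved in parallel.

    #include <algorithm>
    #include <vector>

    // Assign each row of a sparse lower-triangular matrix (CSR arrays rowPtr,
    // colIdx) a "level"; rows with the same level are mutually independent.
    std::vector<int> rowLevels(int n, const std::vector<int>& rowPtr,
                               const std::vector<int>& colIdx) {
      std::vector<int> level(n, 0);
      for (int i = 0; i < n; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
          if (colIdx[k] < i)  // strictly lower entry: row i depends on row colIdx[k]
            level[i] = std::max(level[i], level[colIdx[k]] + 1);
      return level;  // solve levels 0, 1, 2, ... in order; rows within a level in parallel
    }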

Using various matrices, the graph below compares the performance of our method with that of HYPRE (blue lines). HYPRE is a commonly used linear algebra package developed at Lawrence Livermore National Laboratory (LLNL). As shown, our method can exploit parallelism on many matrices, whereas HYPRE’s performance is nearly sequential in all cases. The performance of HYPRE is worse than sequential in many cases because of parallel overhead, although there is some improvement for large numbers of processors, probably due to cache effects. Overall, our method is a significant improvement over this existing code and will reduce the solution time for many problems.



The paper on this work was accepted by Parallel Computing (ParCo) 2014. PhD student Ehsan Totoni received 3rd place in the ACM Student Research Competition (SRC) Grand Finals 2014 based on this research.
Steering Parallel Applications via Introspection
06/03/2014
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency for various applications on large supercomputers with millions of cores remains challenging. Therefore, performance tuning has become more important and challenging than ever before. Instead of full automation or completely manual tuning, we take the approach of cooperative automation. In this approach, the application developers provide knobs (control points) that reconfigure the application in ways that affect performance predictably. We then allow the runtime system to adjust the configuration automatically based on runtime observations and its understanding of the effects of each control point.

We have designed and developed PICS, a Performance-analysis-based Introspective Control System, which is used to tune parallel programs. PICS provides a generic interface for describing control points and their effects; this is how application-specific knowledge is exposed to the runtime system. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the performance data and a decision tree encoding expert knowledge, program characteristics are extracted to guide the search for optimal configurations of the control points.
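
As a rough illustration of the control-point idea (the names below are ours, not the actual PICS interface), an application might register a knob together with the effect the runtime should expect when turning it:

    #include <map>
    #include <string>

    // Hypothetical, self-contained sketch only; see the PICS paper for the
    // real interface.
    enum class Effect { ReducesGrainOverhead, ReducesCommVolume };

    struct ControlPoint { int lo, hi, value; Effect effect; };
    std::map<std::string, ControlPoint> knobs;

    void registerKnob(const std::string& name, int lo, int hi, Effect e) {
      knobs[name] = {lo, hi, lo, e};  // the runtime later tunes 'value' from observations
    }
    int knobValue(const std::string& name) { return knobs.at(name).value; }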

We have demonstrated how PICS can be applied to both the runtime system and applications to optimize application performance. For example, our results show its effectiveness for ChaNGa, a full-fledged cosmology application, on 16,384 cores. Mirroring is a technique that reduces communication bottlenecks by replicating data on multiple processors and forwarding communication requests to the replicas. Figure 1 compares the time cost of calculating gravity without mirroring and with mirroring using different numbers of replicas. The top red curve is the time cost without mirroring, while the bottom green curve shows the cost of using replicas adaptively. The runtime converges on using 2 replicas.

The paper on this work was accepted at ROSS 2014 in Munich, Germany.
An Epidemic Algorithm for Dynamic Load Balancing
03/06/2014
Fig 1
Load imbalance is an insidious factor that can significantly reduce the performance of a parallel application. For many applications, computation load varies over time; such applications require dynamic load balancing to improve performance. Many applications employ centralized load balancing strategies, where load information is collected on a single processor and the decision algorithm is run sequentially. Such strategies have been shown to be effective up to a few hundred or a few thousand processors; beyond that, they present a clear performance bottleneck. An alternative to centralized strategies is distributed strategies that use local information, e.g., diffusion-based schemes. In a distributed strategy, each processor makes autonomous decisions based on its local view of the system, which typically consists of the load of its neighboring processors. Such strategies are scalable, but tend to yield poor load balance due to the limited local information.

We propose a fully distributed strategy, GrapevineLB, designed to overcome this drawback by obtaining a partial representation of the global state of the system and basing load balancing decisions on it. We use a lightweight information propagation algorithm based on an epidemic algorithm (also known as a gossip protocol) to propagate information about the underloaded processors in the system to the overloaded processors. This spreads the information in the same fashion that gossip spreads through the grapevine in a society. Based on this information, GrapevineLB probabilistically transfers work units to obtain a good load distribution. The proposed algorithm is scalable and can be tuned to optimize for either cost or performance. We demonstrated its effectiveness in comparison to several other load balancing strategies for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of Blue Gene/Q.
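
The heart of the information propagation can be sketched as follows (our simplification, written as a local simulation rather than distributed sends): in each round, every informed processor forwards its set of known underloaded processors to a few random targets, so the information reaches the overloaded processors in O(log P) rounds with high probability.

    #include <random>
    #include <set>
    #include <vector>

    // One gossip round: a processor forwards its view of the underloaded set
    // to 'fanout' random processors (their inboxes, in this local sketch).
    void gossipRound(const std::set<int>& myView, int nProcs, int fanout,
                     std::vector<std::set<int>>& inbox, std::mt19937& rng) {
      std::uniform_int_distribution<int> pick(0, nProcs - 1);
      for (int i = 0; i < fanout; ++i)
        inbox[pick(rng)].insert(myView.begin(), myView.end());
    }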

Figure 1 shows the total time for the AMR application, which includes the time for load balancing as well as the application time per step. Since AMR requires frequent load balancing, the overhead of the centralized and hierarchical strategies, AmrLB and HierarchicalLB, exceeds the benefit of load balancing. DiffusionLB gives marginal benefit. GrapevineLB provides a large performance gain by achieving better load balance while incurring less overhead.

The paper on this work was accepted at SC13 and was one of the finalists for the Best Student Paper award.
Large-scale Parallelization of Stochastic Integer Programs
01/10/2014


In many real-world optimization problems, such as resource allocation, the parameters that correspond to future outcomes are subject to enormous uncertainty. In stochastic optimization, we assume a probabilistic distribution over these unknown parameters and optimize the resource allocation across a collection of possibilities, known as scenarios. Applications of stochastic optimization span a diverse set of areas, including aircraft scheduling, production planning, financial planning, energy grid optimization, supply chain management, and many more.

The novelty of the work is in combining high performance computing and stochastic optimization to solve large stochastic integer optimization problems that have not been tackled before. We use a branch-and-bound-based approach, the accepted norm for solving integer programs. Stochastic integer programs have unique characteristics that differ from the challenges faced in typical iterative scientific applications, and these characteristics make it very difficult to realize a scalable design: coarse and unpredictable grain sizes, large memory footprints, a variable amount of available computation, and perturbable branch-and-bound trees. Additionally, in this application, better utilization of processors does not necessarily mean a faster time to an optimal solution.

We propose a nested parallelism approach that exploits the inherent scenario parallelism in stochastic optimization along with parallel exploration of branch-and-bound tree vertices. State sharing across tree vertices allows interleaved vertex evaluations on the same linear program solver instance, which reduces the memory footprint and the total time to solution. We show strong scaling up to hundreds of cores. The application achieves a speedup of up to 32x while scaling from 3 to 480 cores, and has an efficiency greater than 0.4 at 120 cores in 100% of the trials.
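
The nesting can be seen in how a vertex's bound is computed: the bound decomposes over scenarios, and the per-scenario LP solves are the inner level of parallelism (shown sequentially in this sketch of ours; the implementation in the paper differs).

    #include <functional>
    #include <vector>

    struct Vertex { /* partial fixing of integer variables, etc. */ };

    // Bounding a branch-and-bound vertex: one LP relaxation per scenario,
    // summed (weighted sums work the same way). Each call is independent,
    // which is what the scenario-parallel inner level exploits.
    double bound(const Vertex& v,
                 const std::vector<std::function<double(const Vertex&)>>& scenarioLP) {
      double b = 0.0;
      for (const auto& lp : scenarioLP) b += lp(v);
      return b;
    }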



The paper on this work won the Best Paper award at the HiPC 2013 conference. The work is supported by the MITRE Corporation and the US Air Mobility Command.
Predicting Application Performance using Supervised Learning on Communication Features
10/10/2013




Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or reducing the average number of hops per byte for messages in an application. These metrics make simplified assumptions about the causes of network congestion and do not correlate accurately with execution time. Hence, these metrics cannot be used to reliably predict or compare application performance for different mappings. Recently, we have attempted to model the performance of an application using communication data, such as the communication graph and network hardware counters. We have used supervised learning algorithms to correlate performance with prior and new metrics. We have also proposed new hybrid metrics that correlate strongly with application performance and may be useful for accurate performance prediction.

The focus of an associated publication at SC'13 is the use of supervised learning algorithms, such as forests of randomized decision trees, to correlate individual metrics and their combinations with application performance. In addition to using prior metrics for predicting performance, this work presents several new metrics. Some of these can be obtained by analyzing the communication graph, the network graph, and the mappings; others can be obtained through an experimental or simulated application run. For the results in the associated paper, the latter are obtained using real experiments; as part of future work, we plan to develop simulation techniques to obtain them offline. Maximum dilation is an example of an analytically derived metric, whereas maximum bytes on a link is an experimentally obtained metric. In addition to these new metrics, we also use derived metrics that use information only from some outliers (nodes or links). In the paper, we present performance predictions using the randomized-forests ensemble method for three different communication kernels: a two-dimensional five-point halo exchange, a three-dimensional 15-point halo exchange, and an all-to-all benchmark over sub-communicators. We show almost perfect correlation for runs on 16,384 and 65,536 cores of Blue Gene/Q. We also show predictions for a production application, pF3D, and for combining samples from different benchmarks into a single training set and testing set.



Link to SC'13 paper associated with this work
A ‘Cool’ Way of Improving the Reliability of HPC Machines
07/10/2013
(This work will be presented at SC13)

In order to move to the next generation of supercomputers, the HPC community will have to cope with two important challenges: increasing energy consumption and decreasing reliability. Studies predict that at exascale the overall reliability could be so precarious that failures become the norm. HPC researchers are working on improving present fault tolerance protocols to address this issue, while hardware researchers are working on improving the reliability of each component the machine is made of. In this paper, we use both hardware and software techniques to improve the reliability of HPC machines. It is well known that the fault rate doubles for every 10 °C increase in core temperature. We exploit this fact to demonstrate that constraining core temperatures, combined with load balancing, improves the reliability of parallel machines and can also reduce the total execution time required by the application. We formulate a model that relates the total execution time of an application to reliability and to the slowdown associated with temperature restraint. We validate our model by implementing our approach and comparing the model's predictions to actual experimental results. Our scheme can reduce the execution time of all three applications considered in our work by more than 35% for 200,000 sockets, in addition to improving machine reliability by a factor of 1.48.
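
The doubling rule quoted above can be written directly as a scaling law; this small helper (ours) computes the fault rate at core temperature T given a reference rate lambda0 at temperature T0.

    #include <cmath>

    // Fault rate doubles for every 10 C increase in core temperature:
    // lambda(T) = lambda0 * 2^((T - T0) / 10).
    double faultRate(double lambda0, double T, double T0) {
      return lambda0 * std::pow(2.0, (T - T0) / 10.0);
    }
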
Partitioning in Charm++
05/06/2013
Support for system-driven partitioning has been added to Charm++ in the latest stable version 6.5.0.



In many scenarios, simultaneous execution of multiple instances of similar jobs is required or beneficial. For example, in NAMD, multiple instances of the same molecular system with similar but different initial conditions can be used to reduce the time to solution and to gain insight into the behavior of the molecular system. Alternatively, one may want to execute multiple small jobs together to get a better topology allocation, or to insulate performance comparisons from differences in system noise. An inefficient but simple way to achieve this is to run these jobs independently (one by one or simultaneously) and use the file system to perform the necessary information exchange. In MPI programs, an application-specific solution based on communicators can be used. Both options incur a productivity loss and leave a lot to be desired.



Charm++ offers a runtime-system-based solution that addresses most of the issues mentioned above. Based on user input - the number and size of partitions - the runtime system divides the set of allocated cores into partitions that are independent Charm++ instances. In the basic form, multiple instances of Charm++ programs, agnostic to partitions, execute to completion in their partitions; the RTS ensures synchronization at the beginning and the end. If required, several such programs can be stitched together using a simple wrapper and executed as a single job across the partitions. Communication among partitions is supported using Converse. Like any other message in Charm++, inter-partition messages are active entities that cause method invocation at the receiver. Communication, as of now, is restricted to processes, i.e., an application message is directed to a process in a partition (not to chares).
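
A sketch of inter-partition messaging at the Converse level; we are assuming the partition-aware calls described in the manual (CmiMyPartition, CmiNumPartitions, CmiInterSyncSend), so treat the exact signatures as illustrative and consult the manual entry on partitioning.

    #include "converse.h"

    // Broadcast a Converse message to PE 0 of every other partition.
    // (Inter-partition messages are directed at processes, not chares.)
    void sendToPeerPartitions(char* msg, int size) {
      for (int p = 0; p < CmiNumPartitions(); ++p)
        if (p != CmiMyPartition())
          CmiInterSyncSend(/*destPE=*/0, /*partition=*/p, size, msg);
    }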



NAMD has been augmented to make use of the partitioning framework (currently used by many users via the development version; to be included in a release). Using a Tcl script as a controller, NAMD initiates simulations of similar systems in different partitions. Periodically, these partitions exchange information and help each other progress faster. Another existing use case for partitions is the replica-enhanced checkpoint framework (visit the Charm++ workshop website for more details).



The current stable version of Charm++ only supports partitioning of cores into fixed-size Charm++ instances. Many other features, including variable-sized partitions and topology-aware partitioning, are available in the development version of Charm++. See the manual entry on partitioning for details.
Dynamic Load Balancing for HPC in Cloud
03/16/2013
Driven by the benefits of elasticity and pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, heterogeneous and dynamic environment, and interference by other virtual machines (VMs) are some bottlenecks for efficient HPC in cloud.

Insufficient network performance is a major bottleneck for HPC in the cloud and has been widely explored. Two less explored challenges are resource heterogeneity and multi-tenancy, which are fundamental artifacts of running in the cloud. Clouds evolve over time, leading to heterogeneous configurations of processors, memory, and network. Similarly, multi-tenancy is intrinsic to clouds, since it enhances the business value of providing one. Multi-tenancy leads to multiple sources of interference due to sharing of CPU, cache, memory access, and interconnect. For tightly coupled HPC applications, heterogeneity and multi-tenancy can result in severe performance degradation and unpredictable performance, since one slow processor slows down the entire application. Moreover, the distribution of such interference is fairly random and unpredictable in a cloud. Hence, we need a mechanism to adapt to dynamic variation in the execution environment.

With its adaptivity features, Charm++ is naturally suited for deployment on cloud infrastructures. Our primary hypothesis is that the challenges of heterogeneity and of noise arising from multi-tenancy can be handled by an adaptive parallel runtime system. To validate this hypothesis, we explored adapting the Charm++ runtime system to virtualized environments. We developed a dynamic load balancer for tightly coupled iterative HPC applications in the cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by the interference of multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of the task distribution to VMs, our load balancer adapts to dynamic variations in cloud resources. The main idea is periodic refinement of the task distribution using measured CPU loads, task loads, and idle times.
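
A generic sketch of the refinement step (not the paper's exact algorithm): given per-task loads measured over the last period, keep moving a heavy task from the most loaded VM to the least loaded one for as long as that lowers the maximum predicted load.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // vm[i] holds the measured loads of the tasks currently on VM i.
    void refine(std::vector<std::vector<double>>& vm) {
      auto total = [](const std::vector<double>& t) {
        return std::accumulate(t.begin(), t.end(), 0.0);
      };
      while (true) {
        auto hi = std::max_element(vm.begin(), vm.end(),
            [&](const auto& a, const auto& b) { return total(a) < total(b); });
        auto lo = std::min_element(vm.begin(), vm.end(),
            [&](const auto& a, const auto& b) { return total(a) < total(b); });
        if (hi->empty()) break;
        auto heaviest = std::max_element(hi->begin(), hi->end());
        if (total(*lo) + *heaviest >= total(*hi)) break;  // no improvement left
        lo->push_back(*heaviest);
        hi->erase(heaviest);
      }
    }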

In a paper accepted at IEEE/ACM CCGrid 2013, we demonstrate that even without accurate information about the nature and amount of heterogeneity (static, and dynamic but hidden from the user as an artifact of virtualization), the approach of periodically measuring idle time and migrating load away from time-shared VMs works well in practice. In that paper, we perform an experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, and demonstrate performance benefits of up to 45%. The figure shows that, compared to the NoLB case, load balancing achieves a significant reduction in execution time for three different applications and different numbers of VMs, in the presence of both effects: the inherent hardware heterogeneity and the heterogeneity introduced by multi-tenancy. Through experimental analysis, we also learned that the choice of load balancing period and computational granularity can have a significant impact on performance, but the optimal values depend on application characteristics, size, and scale. Runtime systems that can automate the selection and dynamic adjustment of such decisions will be increasingly useful in the future.

Our related research on HPC in the cloud includes evaluation of HPC performance in the cloud, identification of performance bottlenecks, and addressing them through HPC-aware cloud schedulers and a cloud-aware parallel runtime system. Related publications can be found on our research page on HPC in the cloud. This research was awarded an HP Labs Innovation Research Program (IRP) award in 2012.
Scalable Algorithms for Adaptive Mesh Refinement
02/15/2013
As we move towards the exascale era, we need to reconsider traditional algorithms and programming models. Adaptive Mesh Refinement (AMR) is an efficient methodology for simulating differential equations on very large structured meshes that arise in many domains, e.g., numerical cosmology, global atmospheric modeling, and mantle convection modeling. The regions in the mesh that need simulation at finer precision are refined to deeper levels, while others are kept at the same level or are coarsened. The mesh adapts itself as the simulation progresses (see figure below).

In traditional approaches, each process is assigned a set of blocks based on a space-filling curve. This approach has several limitations: 1) it requires O(#blocks) memory per process to store the mesh structure, 2) it takes O(log P) time to locate neighboring blocks, 3) it needs O(d) rounds of global reductions during mesh restructuring (where d is the maximum depth of the mesh), and 4) load balancing takes O(#blocks) time. This approach therefore has a memory bottleneck and large time overheads for mesh restructuring and load balancing.

We developed a new design in which each mesh block acts as a first-class entity - a virtual processor. This offers several algorithmic as well as implementation benefits: 1) a block acts as a virtual processor, which allows overlap of its computation with the communication of other blocks on the same physical process; 2) a block can be uniquely identified by its location in the tree; 3) a block can be placed dynamically on any physical process, which forms the basis for dynamic load balancing; 4) a block acts as a unit of algorithm expression, which significantly reduces implementation complexity; 5) a block is the end-point of communication, and the underlying runtime system handles communication between arbitrary blocks; 6) only O(#blocks/P) memory per process is required to store the mesh structure (as compared to O(#blocks) in the traditional algorithms); and 7) the simulation progresses asynchronously, with no barriers after each iteration, and each block enters the mesh restructuring phase asynchronously. The associated mesh restructuring algorithm requires only two system quiescence states (where the complexity of detecting quiescence is O(log P)), as compared to O(d) global reductions taking O(log P) time each in traditional approaches (details in the paper).
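
The two quiescence states can be detected with Charm++'s CkStartQD (a real API call; the callback target below is an illustrative name):

    // After blocks have asynchronously sent their refine/coarsen decisions,
    // quiescence (no messages in flight anywhere) signals that the phase is
    // complete, replacing the O(d) rounds of global reductions.
    CkStartQD(CkCallback(CkIndex_Main::restructureDone(), mainProxy));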

We implemented this design in Charm++. Each block is an element of a dynamic chare array, with customized bit-vector indices that identify a block by its location in the mesh. Since blocks are the fundamental unit of work, they need to be distributed equally across processes for load balance. We plugged in the gossip-based distributed load balancer, Grapevine, provided by the Charm++ runtime. Grapevine's performance is competitive with centralized load balancers while incurring negligible load balancing overhead. To empirically test our approach, we benchmarked a finite-difference simulation of advection using a first-order upwind method in two-dimensional space (see figure below). We achieved a strong-scaling efficiency of 76.7% while scaling from 1K cores to 16K cores of BG/Q (see figure above), on a mesh with a maximum depth of 15. Our new algorithm is fully distributed and highly asynchronous, enabling high performance for much more deeply refined computations than are currently practiced. For more details, please refer to the paper.
Energy Efficiency of Fault Tolerance Protocols
12/20/2012
An exascale machine is expected to be delivered in the 2018-2020 time frame. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of nature and the universe. To make that a reality, however, the HPC community has to solve a few important challenges. Resilience will become a prominent problem, because an exascale machine will experience frequent failures due to the sheer number of components it will encompass. Some form of fault tolerance has to be incorporated into the system to keep applications progressing as fast as possible. In parallel, the system will have to be more careful about power management. Power has two dimensions. First, in a power-limited environment, all layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power matters because of energy consumption: an exascale installation will have to pay a large energy bill. It is therefore fundamental to increase our understanding of the energy profiles of different fault tolerance schemes.

We evaluated a set of three different fault tolerance approaches: checkpoint/restart, message logging, and parallel recovery. Using programs from different programming models, we showed that parallel recovery is the most energy-efficient solution for an execution with failures; at the same time, parallel recovery finishes the execution faster than the other approaches. We explored the behavior of these approaches at extreme scale using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and its energy consumption by 13% compared to checkpoint/restart. The figure plots the relative improvement in the energy consumption of message logging and parallel recovery with respect to checkpoint/restart against the number of sockets. This study was published at the 24th International Symposium on Computer Architecture and High Performance Computing, where it received the Julio Salek Aude Best Paper Award.
PPL at Supercomputing 2012
10/10/2012
As in years past, the Parallel Programming Laboratory will be well-represented at Supercomputing 2012 in Salt Lake City, Utah. Charm++ will be in the spotlight of a half-day tutorial preceding the conference and a Birds-of-a-Feather (BoF) session. The conference's technical program will include a paper by PPLers and collaborators on tuning NAMD for Cray XK6 systems, such as the upcoming Blue Waters at Illinois' National Center for Supercomputing Applications and Titan at Oak Ridge National Laboratory. Osman Sarood, a fifth-year PhD student in PPL, will give a presentation on his work on reducing energy usage in HPC systems during the conference's Doctoral Showcase.

Professor Kale and Senior Research Programmer Eric Bohm will teach a tutorial titled "Parallel Programming with Migratable Objects for Performance and Productivity." This tutorial will focus on application development following the principles of migratable, message-driven objects that underlie the Charm++ parallel programming environment. Attendees will learn how to construct next-generation parallel applications in this setting in order to benefit from automatic load balancing, fault tolerance, and easy and efficient parallel composition. The tutorial will run from 8:30 to noon on Sunday, November 11.

PPL will host a Birds-of-a-Feather Session on the Charm++ ecosystem - its runtime, programming tools, and applications. Participants will hear about how these components are improving in performance, features, ease of use, portability, and interoperability. Several application developers will share their recent experiences and the latest advances. This is an ideal opportunity to learn more about how Charm++ is used in practice, see how it's growing, and influence its future direction. The session will take place from 12:15 to 1:15 on Thursday, November 15.

PhD candidate Yanhua Sun will present optimization techniques for the Charm++ runtime system and the molecular dynamics application NAMD on the Cray XK6 architecture, as published in "Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6". For the 100-million-atom STMV benchmark, the NAMD team improved upon the prior Jaguar XT5 result of 26 ms/step to 13 ms/step using 298,992 cores of Jaguar XK6. The performance improvement was achieved through careful tuning of Charm++ on the Gemini interconnect, along with improvements in the parallelization of the Particle Mesh Ewald algorithm for long-range force contributions, as well as GPU optimization in NAMD. This session is scheduled for Wednesday, November 14th, from 2:00 to 2:30 PM in room 355-EF.

PPL Director Professor Kale and long-standing PPL collaborator Professor Klaus Schulten will receive the Sidney Fernbach award "for outstanding contributions to the development of widely used parallel software for large biomolecular systems simulation."

Fifth-year PhD candidate Osman Sarood is focused on the challenge of meeting the energy and power requirements of huge exascale machines. Energy costs for data centers can be divided into two main categories: machine energy and cooling energy consumption. Sarood's thesis investigates reducing energy consumption for HPC data centers in both categories. His recent work shows that cooling energy consumption can be reduced by up to 63% using our temperature-aware load balancer. In this work, we also demonstrate that data centers can reduce machine energy consumption by up to 28% by running different parts of an application at different frequencies. The focus of our work is to gauge the potential for energy savings from reducing machine and cooling energy consumption separately (along with the associated execution time penalties), and then to devise a scheme that combines them optimally to reduce the total energy consumption of large HPC data centers. Sarood will present his research in the Doctoral Showcase from 11:15 to 11:30 AM on Wednesday, November 14th.

Last year, a team from PPL was awarded first place in the HPC Challenge Competition for the overall execution performance of the Charm++ benchmark implementations. This year, they will compete again to demonstrate portability to the latest extreme-scale systems, further exploitation of the load balancing and fault tolerance features of Charm++, and improved performance. If their submission is selected as a finalist, there will be a presentation during the HPC Challenge BoF session from 12:15-1:15 on Tuesday, November 13th.

In addition to these events, many PPL members will be in attendance at the conference. We look forward to seeing you there!
Meta-Balancer
09/05/2012
Fig 1
Parallel applications, such as NAMD, running on large systems often involve simulations of dynamic and complex systems. A lot of effort is spent writing these parallel applications to fully exploit the processing power of large systems and demonstrate scalability. For such applications, load balancing techniques are crucial to achieving high performance at scale, because load imbalance among processors leads to a significant drop in system utilization and hampers the application's scalability. With the ever-growing parallelism available in today's supercomputers, tackling imbalance efficiently is a difficult problem.

For applications written in Charm++, the adaptive runtime system provides measurement-based dynamic load balancing using the principle of persistence. However, performing load balancing incurs a cost that may not be known a priori. Because of this cost, it is important to determine whether invoking the load balancer is profitable, i.e., whether the overhead of load balancing is less than the gain it yields over a period of time. Typically, application behavior depends on the size of the system being simulated and the parallel system being used for the simulation. As a result, finding the time steps (or iterations) at which load balancing should be invoked to obtain the best performance is a difficult task. Most runtime systems (RTSs) depend on the application programmer to decide when to balance the load. A common practice is to choose a fixed period for invoking the load balancer, for example every 100 time steps. This, however, prevents the load balancing from adapting to changing application behavior.

The Meta-Balancer framework is a step towards automating load-balancing-related decision making. Based on the application characteristics observed at runtime and a set of guiding principles, Meta-Balancer relieves the application programmer of the critical task of deciding when the load balancer should be invoked. Meta-Balancer continuously monitors the application and, using a linear prediction model on the collected information, predicts the time steps (or iterations) at which load balancing should be performed for optimal performance. In addition, Meta-Balancer monitors for sudden changes in application behavior and invokes the load balancer if needed.
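
To see the flavor of such a prediction (our simplification, not Meta-Balancer's exact model): if one load balancing step costs C seconds and imbalance inflates the per-step time linearly at rate m, the average per-step overhead for an LB period of tau steps is C/tau + m*tau/2, which is minimized at tau = sqrt(2C/m).

    #include <cmath>

    // Ideal load balancing period (in steps) under a linear-growth model:
    // overhead(tau) = C/tau + m*tau/2 is minimized at tau = sqrt(2*C/m).
    double idealLBPeriod(double lbCostSeconds, double imbalanceGrowthRate) {
      return std::sqrt(2.0 * lbCostSeconds / imbalanceGrowthRate);
    }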

Meta-Balancer has been implemented on top of the Charm++ load balancing framework in order to take advantage of its support for dynamic load balancing. In a paper accepted at IEEE Cluster 2012 [12-29], we demonstrate that Meta-Balancer improves application performance by choosing the correct time steps at which to invoke the load balancer for iterative applications. In the paper, we compare the performance of Meta-Balancer with periodic load balancing using real-world applications such as LeanMD and Fractography3D. We show that Meta-Balancer is able to identify the ideal load balancing period, which changes as the application evolves, and automatically extracts the best performance from the applications at runtime.

Application Study:
Fractography is used to study fracture surfaces of materials. Figure 2 shows the processor utilization graph generated when Fractography3D is run on 64 cores of Jaguar.
Fig 2
It can be seen that Fractography3D has a large variation in processor utilization during a simulation run. Also, for a large portion of the execution, substantial amount of processor resources are wasted.
Fig 3
Figure 3 presents the application run time for Fractography3D using various periodic LB periods on various core counts. Figure 1 shows the processor utilization graph generated when Fractography3D is run on 64 cores with Meta-Balancer. It can be seen that Meta-Balancer increases processor utilization, which results in a performance gain of 31% in comparison to the case in which no load balancing is performed, and better performance than a hand-tuned LB period. An interesting thing to note is the frequent invocation of load balancing by Meta-Balancer in the first quarter of the execution, seen as the vertical notches in the plot. This is because the load variation among processors changes frequently in the first quarter. Thereafter, when the load variation decreases in the second half of the execution, the frequency of load balancing also goes down.
For more details on the theory behind Meta-Balancer and results on Fractography3D and LeanMD, please refer to the paper.
Scalable Asynchronous Checkpoint-based Fault Tolerance
08/02/2012
Applications running on large-scale machines will have to face a high failure frequency. The traditional way to provide applications with fault tolerance is a rollback-recovery mechanism. Checkpoint/restart is the most popular such mechanism, mainly because of its simplicity: each task in the application checkpoints its state periodically, and in case of a failure all tasks roll back to the most recent checkpoint and restart from there. The built-in migratability support in Charm++ makes using checkpoint/restart even easier. The methods required to save the state of a task are already implemented as part of the load balancing support, so adding fault tolerance support to a Charm++ application is usually just a matter of adding a function call in the code. The Charm++ runtime system will automatically handle a failure and restart the application; no user involvement is required.
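
For the double in-memory scheme that call is CkStartMemCheckpoint (CkStartCheckpoint for the disk-based scheme); a minimal sketch of periodic checkpointing from an application driver, with illustrative entry-method names, might look like this:

    // Illustrative driver; CkStartMemCheckpoint is the Charm++ API call.
    void Main::nextStep() {
      if (step % CHECKPOINT_PERIOD == 0) {
        // Execution resumes at Main::resumeFromCheckpoint() once the
        // checkpoint completes (or after a restart recovers from it).
        CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
        CkStartMemCheckpoint(cb);
      } else {
        doStep();
      }
    }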

However, checkpoint/restart may suffer from multiple bottlenecks. If checkpoints are sent to disk, the file system may become saturated and collapse. Storing the checkpoints in memory alleviates this issue, but still has to contend with another problem: a bottleneck at the network card. Even for applications with a modest memory footprint, growing checkpoint sizes will quickly saturate the network. It is estimated that the total checkpoint overhead can exceed 10% of the total execution time. Furthermore, increasing checkpoint overhead will make it hard for applications to make any progress, as the mean time between failures (MTBF) is predicted to be 2 to 60 minutes for future systems.

Members of PPL have designed a new checkpointing algorithm (called semi-blocking checkpoint/restart) that largely hides the checkpoint overhead by sending checkpoints asynchronously. The algorithm has two phases. In the first phase, after reaching a global synchronization point, each node stores its checkpoint in main memory; once every node has safely stored the newest copy of its checkpoint, execution resumes. In the second phase, the Charm++ runtime system takes responsibility for sending each node's checkpoint to another node as a redundant copy at an appropriate time: when the application is busy computing and the network is free. With this algorithm, checkpointing becomes transparent to the application and interferes less with its normal execution.

In a paper recently accepted at CLUSTER'12 [12-32], we measured our algorithm on the Trestles supercomputer at SDSC, which has 32 cores per node and a peak performance of 100 teraflops. We ran both weak- and strong-scaling experiments with two applications: wave2D and ChaNGa. As seen in the figures, the time to perform a single checkpoint is dramatically reduced, from more than 50 seconds to less than 10 seconds. Using the MTBF values predicted for future systems and an optimum checkpoint interval, applications can reduce their total execution time by up to 22% with the semi-blocking checkpoint/restart algorithm.
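
For reference, the "optimum checkpoint interval" mentioned above is commonly estimated with Young's first-order formula; this is a standard result, and we assume (it is not stated in the post) that the paper uses something of this flavor.

    #include <cmath>

    // Young's formula: optimal time between checkpoints given the cost of
    // one checkpoint (delta) and the mean time between failures (M).
    double optimalCheckpointInterval(double delta, double M) {
      return std::sqrt(2.0 * delta * M);
    }
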
Scalable Algorithms for System-ranked Process Groups
07/02/2012
Current implementations of process groups (subcommunicators) have non-scalable, O(group size), memory footprints and even worse time complexities for setting up communication. We propose system-ranked process groups, where member ranks are picked by the runtime system, as a cheaper and faster alternative for a subset of collective operations (barrier, broadcast, reduction, allreduce). Letting the runtime system assign ranks to processes enrolling in a group liberates it from having to sort user-supplied keys to identify ranks. Avoiding this O(n log n) computational complexity can significantly speed up group creation. In keeping with a pay-only-for-use policy, user-supplied ranks can be supported atop system-ranked groups by performing the sort as an additional step. Tailoring communicators (groups) to subsets of use cases permits implementations that are less resource-intensive and faster.
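
For contrast, the standard user-ranked split (real MPI API) is shown below; the key argument is precisely what forces the sort that system-ranked groups avoid by letting the runtime assign ranks.

    #include <mpi.h>

    // Split the world communicator into two groups; 'key' dictates each
    // process's rank in the new communicator, which the implementation must
    // sort. System-ranked groups skip exactly that cost.
    void splitExample(int myRank) {
      MPI_Comm newcomm;
      MPI_Comm_split(MPI_COMM_WORLD, /*color=*/myRank % 2, /*key=*/myRank,
                     &newcomm);
    }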

We developed two distributed algorithms, Shrink-and-Balance and Rank-and-Hash, for balanced spanning tree construction over system-ranked process groups obtained by splitting a parent group. Rank-and-Hash has a constant memory footprint and O(log n) time complexity, but exchanges more messages than the Shrink-and-Balance scheme, which has O(log n) space and O((log n)^2) time complexity. Our distributed algorithms outperform the centralized scheme even at modest processor counts (see figure; p is the probability with which a process participates in the new process group). They also perform orders of magnitude faster than MPICH's MPI_Comm_split implementation (see the table below) for smaller numbers of splits.

These algorithms improve on existing approaches by supporting the construction of balanced k-ary spanning trees, whereas existing work on generalized MPI_Comm_split supports only binary trees.
Num splits MPI_Comm_split Rank-and-Hash
1 134.968 0.708
2 106.573 0.713
4 96.989 0.760
8 93.536 0.785
Using Charm++ to test future many-core architectures
06/04/2012
Power dissipation and energy consumption are becoming increasingly important architectural design constraints in many types of computers, from embedded systems to large-scale supercomputers. To continue scaling performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets. The Intel Single-chip Cloud Computer (SCC) is an experimental chip intended to represent future many-core architectures. In recent research, we have used the Charm++ infrastructure to run and inspect various scalable applications on different architectures, including the SCC. Thus, we can quantitatively compare and analyze the performance, power consumption, and energy efficiency of different cutting-edge platforms.

We ported the Charm++ infrastructure to the SCC so that Charm++ and MPI codes run without any change. This task poses no conceptual difficulty, as the SCC can be thought of as a small cluster. The figure below shows the scaling of different applications on the SCC using Charm++ and Adaptive MPI. As can be seen, most applications (except CG) scale very well up to 48 cores of the SCC using the Charm++ system.

Our results (in the figures below) show that the GPGPU has outstanding performance, power consumption, and energy efficiency for many applications, but it requires significant programming effort and is not general enough to show the same level of efficiency for all applications. The "light-weight" many-core presents an opportunity for better performance per watt than the "heavy-weight" multi-core, although the multi-core is still very effective for some sophisticated applications. In addition, a low-power processor is not necessarily energy-efficient, since the effect of increased runtime can outweigh the power savings.
3-D Particle Tracking with Charm++
05/01/2012
Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. To solve this problem, a parallel tracking package was developed with Charm++ in collaboration with researchers from the Dept. of Agricultural and Biological Engineering [12-02].

Charm’s over-decomposition allows fine-grain control over the accuracy and parallel efficiency of the particle tracking algorithm, enabling researchers to output results in real-time as data is gathered from multiple high FPS cameras or process results off-line for higher accuracy. The tracking software is planned to be an integral part of a portable particle-tracking instrument for measuring particle dispersion in confined spaces, such as aircraft cabins.
A Highly-Scalable Dense LU Solver in Charm++
04/02/2012
We recently implemented a dense LU solver in Charm++ that won an HPC Challenge award for best performance [11-34]. To meet the specification (and ensure numerical stability), a dense LU solver must perform partial pivoting, which classically requires a column-by-column identification of the pivot row, swapping of the identified row across the matrix, and a corresponding update. Pivoting results in a fine decomposition and is likely to incur excessive overhead; however, we demonstrate that these overheads are ameliorated by Charm++'s ability to overlap computation and communication.

Charm++ also allows blocks of the matrix to be arbitrarily mapped to a machine by a user-defined function; this feature decouples mapping from application-specific logic and allows the user to quickly and easily experiment with various mapping schemes. In a paper accepted at IPDPS 2012 [11-51], we use this ability to explore novel mappings of block to processor that substantially increase performance.

In this paper, we build on the traditional block-cyclic mapping to consider an explicit tradeoff between memory contention effects and network locality by mapping blocks to the process grid in a hybrid of row- and column-major order (called striding). We also introduce the notion of rotation in the mapping of blocks to adjust relative degrees of parallelism between panel factorization and triangular solves to match architectural parameters.
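
In code, the baseline block-cyclic mapping and a rotated variant might look like the following sketch (ours; the paper's actual mapping functions differ in detail):

    // Map LU block (br, bc) onto a Pr x Pc process grid.
    int blockCyclic(int br, int bc, int Pr, int Pc) {
      return (br % Pr) * Pc + (bc % Pc);  // classic 2D block-cyclic
    }

    // Rotation: shift the column index by the row index so that the blocks
    // involved in panel factorization and triangular solves land on
    // different sets of processes.
    int rotatedMap(int br, int bc, int Pr, int Pc, int rotate) {
      return (br % Pr) * Pc + ((bc + rotate * br) % Pc);
    }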

Overall, our code is competitive with modern LU implementations.
uGNI-based Charm++ Runtime for Cray Gemini Network
02/28/2012
Gemini is the new interconnect for Cray XE/XK systems; it is characterized by low latency, high bandwidth, and strong scalability. A user-level Generic Network Interface (uGNI) is provided as a low-level interface for Gemini, on top of which Cray's MPI is implemented. Gemini-based interconnects have been deployed in two petaflop supercomputers: Cielo at Los Alamos National Lab and Hopper at Lawrence Berkeley National Lab. Jaguar, one of the most powerful supercomputers, is being upgraded to an XK6 system with the Gemini interconnect. The newly announced Blue Waters supercomputer, which will deliver sustained performance of 1 petaflop, will also be equipped with a Gemini interconnect. As Cray supercomputers with Gemini interconnects become one of the major platforms for large-scale, real-world science and engineering applications, it is important to port and scale the Charm++ runtime system and its parallel applications on the new interconnect.

One possible strategy is to implement the message-driven programming model on Cray's MPI implementation, which has been optimized for message passing using the low-level features provided by the Gemini hardware. However, in order to avoid the extra overhead of the MPI software stack, and to exploit the available performance as fully as possible, we targeted the new port directly on top of uGNI.

In recent months, several people from PPL have worked intensively on this port and developed a new machine layer for Gemini on uGNI. Several approaches were explored to optimize communication and best utilize Gemini's capabilities. These include reducing memory copy and registration overhead using a memory pool, exploiting persistent communication for one-sided communication, and improving intra-node communication via POSIX shared memory and the XPMEM kernel module. These optimization techniques leverage Charm++'s message-driven programming model, in which message buffers are managed by the runtime itself. Our micro-benchmarks show that the uGNI-based Charm++ outperforms the MPI-based implementation by up to 50% on a Cray XE machine. For N-Queens, an irregular program, the uGNI-based implementation scales up to 15,360 cores with 70% less execution time than the MPI-based implementation.

For NAMD, a full-fledged molecular dynamics application, the performance is significantly improved by as much as 18% in a fine-grain parallel run. To the best of our knowledge, this uGNI-based Charm++ runtime is the first public third-party runtime ported on uGNI, which achieves scalability up to tens of thousands of cores on a Cray XE machine.

This new uGNI-based machine layer will be included in the upcoming Charm++ release. A paper on this work has also been accepted for publication at IPDPS 2012 in Shanghai, China.
Efficient Checkpoint-based Fault Tolerance Methods
02/06/2012
As the size of supercomputers multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches for dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage; when a fault occurs, the application is restarted from a recent checkpoint. Leveraging Charm++'s parallel objects for checkpointing, two variants of checkpointing, a disk-based scheme and a double in-memory scheme, are incorporated in the production distribution of Charm++. One unique feature of both schemes is that the program can be restarted on the smaller number of processors available after a failure. Furthermore, after automatic load balancing, the application will continue to execute on the remaining processors without much performance penalty. In particular, compared to the disk-based method, the double in-memory checkpointing scheme takes advantage of fast memory access to checkpoint to both local memory and, through the high-speed interconnect, remote memory.

Recently, members of the Parallel Programming Laboratory have been working on further minimizing the checkpoint and restart overhead by applying more efficient collectives for barriers. We demonstrated the in-memory checkpointing scheme using MPI on very large supercomputers. One obstacle to demonstrating fault tolerance with MPI applications is that the queueing system on supercomputers kills a job when a process fails. Without support from the queueing system, we developed a scheme that mimics the failure of a process without actually killing it. This is implemented as a DieNow() function, which users insert anywhere in their program to trigger a failure. When DieNow() is called, the process hangs and stops responding to any communication, as if it had died. Charm++ then picks a spare processor from a pool and restarts the application from the recent checkpoint in memory.
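
Usage is as simple as the description suggests: the application drops a DieNow() call (the function named above) wherever a failure should be simulated, for example on one processor at a chosen iteration. The surrounding names here are illustrative:

    // Illustrative constants and names; DieNow() is the function described above.
    const int FAIL_STEP = 100;
    const int FAIL_PE = 3;

    void Jacobi::iterate(int step) {
      if (step == FAIL_STEP && CkMyPe() == FAIL_PE)
        DieNow();  // this process stops responding, as if it had crashed; the
                   // runtime detects it and restarts from the in-memory checkpoint
      computeAndExchange();
    }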

To demonstrate the performance and scalability of the newly optimized double in-memory checkpointing scheme, we used two benchmarks: leanMD (a molecular dynamics benchmark) and Jacobi (a 7-point stencil benchmark). In the experiments, we measured the overhead of checkpoint and restart for these two benchmarks on a BlueGene/L machine. The results are shown in the figures below.

The checkpoint time scales well for both applications, with their small and large memory footprints. In particular, the restart time of leanMD simulating a 1-million-atom system increases only from 0.08 seconds on 2K cores to 0.38 seconds on 32K cores.
Charm++ Case Study: Shared Memory Programming
01/05/2012
Modern processors are fast reaching the limits of CPU frequency scaling. In their continued bid to increase performance, chip vendors have now made multicore processors commonplace. However, the means to program these increasingly powerful processors have not yet matured. Traditional languages and frameworks for programming multicore systems either provide very low-level constructs, or a performance model that is not well-matched with the underlying hardware.

In recent work we have explored an alternative to these approaches, namely the use of the message-driven execution (MDE) paradigm of Charm++ to write shared memory programs. There are several productivity and performance benefits to doing so:
  1. users can explicitly control the granularity of parallel tasks
  2. dependences are specified explicitly through asynchronous messages, so that programs do not need locks (see the sketch after this list)
  3. messages are asynchronous, allowing the adaptive overlap of computation and communication
  4. since objects share a single address space, the delivery of messages is efficient and does not require copying
  5. an intelligent runtime system performs automated dynamic optimizations such as load balancing and message coalescing
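
A minimal sketch of point 2 (illustrative names; the Charm++ interface declaration is shown as a comment): a dependence is expressed by sending a message to an entry method rather than by taking a lock.

    // .ci interface declaration (as a comment):
    //   chare Node { entry Node(); entry void childDone(double result); };
    class Node : public CBase_Node {
      int pending = 2;   // waiting on two child tasks
      double sum = 0.0;
    public:
      Node() { /* spawn two child tasks that will call back childDone() */ }
      void childDone(double result) {
        // Entry methods of a single chare never run concurrently, so no lock
        // is needed; this body runs only when a child's message is delivered.
        sum += result;
        if (--pending == 0) finish(sum);
      }
      void finish(double total) { /* continue with the combined result */ }
    };
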
In this work, we examined the performance advantages that a message-driven shared memory program enjoys. We carried out our study in the context of an application that is challenging to parallelize effectively: surface-area heuristic (SAH) based kd-tree construction exhibits a tree-structured arrangement of parallelism, low intensity of floating point operations, a mix of task- and data-parallel operations of varying granularity, and extensive data movement, making it hard to scale its performance with increasing processor counts.

In writing the Charm++ kd-tree construction program, we were able to employ several optimizations enabled by the runtime system, e.g., grain size control, dynamic seed-based load balancing for data-parallel tasks, lightweight fork-join synchronization, and a novel chunked array abstraction to reduce synchronization between data-parallel tasks. The cumulative impact of these optimizations was significant: a comparison with a leading TBB-based kd-tree construction program, ParKD, showed that the Charm++ version was up to 3.1 times as fast as ParKD. As shown in the accompanying figures, the relative advantage of the Charm++ version grew as the amount of parallel work (indicated by the depth of the constructed tree) increased.

Future work will explore the shared memory MDE paradigm in the context of other applications. While the performance benefits of using MDE have been studied, we would also like to examine the improvements in programmer productivity that this paradigm engenders. More generally, we would like to create abstractions that enable the portability of shared memory MDE programs across shared- and distributed-memory machines.
PPL's Winning HPCC Submission
12/06/2011
During SuperComputing 2011, PPL won first place in the 2011 HPC Challenge contest for its parallel programming framework Charm++. The award recognizes Charm++ as the best performing system in the class 2 (productivity and performance) category of the contest.

PPL's submission this year represents its first ever participation in the contest. The submission uses Charm++ to implement five benchmark programs that are representative of relevant computational algorithms. Three of these had to be chosen from a list of recommended kernels supplied by the contest organizers; the chosen kernels were dense LU factorization (the HPL benchmark), one-dimensional FFT, and random access. Additionally, PPL submitted two optional benchmarks, selected because they are popular computational kernels, require non-trivial parallel control flow, and exhibit dynamic imbalances: molecular dynamics for the simulation of atomic systems, and the Barnes-Hut algorithm for simulating N-body systems.

Charm++ is a general-purpose parallel programming framework that offers several practical, productivity-enhancing capabilities. It allowed the submitted benchmarks to remain succinct, yet scale to more than sixty-five thousand cores of large supercomputers. Features such as asynchronous non-blocking collectives, topology-aware software routing of messages, and automatic dynamic load balancing allow these benchmarks to achieve high performance. Importantly, the developers of the benchmarks did not have to incorporate these performance-enhancing features and adaptive capabilities directly into the programs themselves.

The HPC Challenge class 2 competition is an annual event aimed at encouraging the development of high-level parallel programming models that offer both productivity and performance. The contest requires the submission of several benchmark programs implemented in the candidate programming model, along with data about the performance achieved by these benchmarks. A panel of expert judges then evaluates the submissions on the performance they achieve, as well as on metrics like code size.

Details of the benchmarks, performance and code size metrics are available in the submitted document, now a tech report on the PPL website. The document contains a more detailed discussion of the practical productivity benefits that Charm++ offers high-performance computational science applications.
Evaluating Future Exascale Interconnect Designs
11/02/2011
A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines; IBM's PERCS topology and the dragonfly network discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot spots on a few links when processes are grouped together at the lowest level to minimize total communication volume (Figure 1). Hence, routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology.

This work explores intelligent topology-aware mappings of near-neighbor communication patterns onto the physical topology to identify cases that minimize link utilization. The topology-aware mappings used in this work are based on clustering communicating tasks onto processors that are physical neighbors. The clustering can be done at different levels of the hierarchy: Blocked Node Mapping (BNM), Blocked Drawer Mapping (BDM), and Blocked Supernode Mapping (BSM). Here, drawers and supernodes refer to collections of nodes and drawers, respectively, that are at a distance of one hop. To make use of the large number of links emanating from the high-radix routers used in a two-level direct network, mappings that cluster at the node level (RNM) or drawer level (RDM) and spread these clusters uniformly at random are also considered. In addition, we explore random mapping with clustering at the drawer level using indirect routing (RDI).

We use detailed network simulations of up to 307,200 MPI tasks and three different communication patterns (a 2D stencil, a 4D stencil, and an n-targets multicast pattern) to compare the various mappings. We find that if direct routing is used, mapping blocks of MPI tasks to random nodes gives the best performance and evenly distributes the usage of second-level links. We also observe that indirect routing can achieve performance similar to an intelligent mapping and obviates the need for mapping, at the cost of increased overall traffic on the network (Figure 2). This work also highlights the utility of simulation-based predictions for analyzing algorithms and making design choices before a parallel machine is installed and available for use. This will become increasingly important as machine sizes grow, making application and hardware co-design essential.

For more details, please find the full paper at: http://charm.cs.uiuc.edu/papers/11-21

Adaptive Support for Fault Tolerance
10/05/2011
As supercomputers grow in size, their propensity to fail increases substantially. The most optimistic projections estimate that exascale machines will experience a failure every few hours. This is especially problematic for the many HPC applications that run for a long time and use a large portion of the machine. These applications will have to become resilient in order to finish execution. Our adaptive runtime system provides the scalable fault tolerance support for Charm++ and AMPI programs that is needed to weather frequent failures.

Moreover, the unique technology of Charm++ leverages most of the traditional fault tolerance techniques. First, the traditional checkpoint/restart approach works even when the number of available nodes decreases as failures bring some of them down.

Second, proactive migration of tasks fits perfectly with our infrastructure for migratable objects, making it natural to connect a failure predictor. Third, message-logging protocols benefit by allowing recovery to run in parallel, minimizing the time to redo the work lost in a failure.

Recently, we have been investigating another way in which message logging can be improved: allowing a load balancer to provide crucial information about the application. With this information, we vastly reduce the memory overhead associated with storing the messages. We call this approach Dynamic Team-based Message Logging. Nodes are divided into teams, and only messages crossing team boundaries are logged. Teams are, however, dynamic: depending on runtime conditions, the load balancer may change the teams to reflect changes in the load of different nodes.
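
The core test is simple (sketch below): only messages crossing team boundaries are logged, so the load balancer's goal is to pick teams that minimize how often this predicate fires.

    #include <vector>

    // teamOf[pe] gives the team assignment chosen by the load balancer.
    bool mustLog(int srcPE, int dstPE, const std::vector<int>& teamOf) {
      return teamOf[srcPE] != teamOf[dstPE];  // intra-team messages skip the log
    }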

The figure shows the results of the Team Load Balancer (TeamLB), which attains two goals: i) it provides good load balance across the compute nodes, and ii) it reduces the message logging overhead by grouping highly connected objects in the same team. Although this load balancer has a small overhead penalty, it drastically reduces the memory overhead of message logging.

Currently, we are working on extending our fault tolerance infrastructure to consider the multicore nature of supercomputers.
Using Load Balancing to Control Power Consumption
08/10/2011
Meeting the power requirements of the huge exascale machines of the future is one major challenge facing the HPC community. As power consumption and power costs rise, the bottom-line impact is felt by everyone involved in parallel research. Members of the Parallel Programming Laboratory (PPL) are focusing on ways to minimize cooling power for these machines. We propose a technique that uses a combination of DVFS and temperature-aware load balancing to constrain core temperatures and save cooling energy. Our scheme is specifically designed to suit parallel applications, which are typically tightly coupled. Currently, the temperature control comes at the cost of execution time, and we are working to minimize that timing penalty. We have run experiments with three parallel applications, each with a different power utilization profile, on a 128-core (32-node) cluster with a dedicated air conditioning unit. While an experiment runs, we evaluate the efficacy of our scheme on three metrics: the ability to control average core temperatures and thereby avoid hot spots, minimization of the timing penalty, and cooling energy savings. Our preliminary results show cooling energy savings of up to 57%, with a timing penalty mostly in the range of 2 to 20%.
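
A generic sketch of the control loop (ours, not PPL's implementation): step each core's frequency down when its temperature crosses the threshold, step it back up as the core cools, and then let the temperature-aware load balancer shift work off the slowed cores.

    #include <algorithm>
    #include <vector>

    struct Core { double tempC, freqGHz, fMin, fMax; };
    void temperatureAwareRebalance(std::vector<Core>&);  // see text

    void controlStep(std::vector<Core>& cores, double tMaxC) {
      for (auto& c : cores) {
        if (c.tempC > tMaxC)
          c.freqGHz = std::max(c.freqGHz - 0.1, c.fMin);   // DVFS step down
        else if (c.tempC < tMaxC - 2.0)
          c.freqGHz = std::min(c.freqGHz + 0.1, c.fMax);   // step back up
      }
      temperatureAwareRebalance(cores);  // shift work off the slowed cores
    }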

To demonstrate the effectiveness of our scheme, we use three applications with different utilization and power profiles. The first is a canonical benchmark, Jacobi2D, which uses a 5-point stencil to average values in a 2D grid using 2D decomposition. The second application, Wave2D, uses a finite differencing scheme to calculate pressure over a discretized 2D grid. The third application, Mol3D, is a real-world molecular dynamics application that simulates large biomolecular systems. The experiments show that we were able to reduce the timing penalty associated with DVFS by a large margin; using our load balancing strategy, we reduced the timing penalty to 27% for Jacobi2D. We also used performance counters to relate application characteristics to temperature control.

Looking towards the future, we plan to take the DAG of the application into account in order to reduce the timing penalty even further. We are also looking for ways to save machine energy consumption based on the application characteristics.

Scaling NAMD on 200K+ Cores
06/06/2011
One of the requirements for acceptance of the upcoming Blue Waters petascale system is the successful simulation of a 100-million-atom biomolecular system with NAMD. Simulating such a large molecular system presents great challenges, including I/O, a large memory footprint, and achieving good strong scaling. Our group has recently improved NAMD's capabilities for simulation on existing petascale machines. A new SMP model in Charm++ was designed to efficiently utilize the now-ubiquitous wide multicore clusters by extending the Charm++ asynchronous message-driven runtime. Hierarchical load balancing was further exploited to scale NAMD to the full extent of Jaguar, a petaflop Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (compared to a base case of 6,720 cores) at 9 ms per step for a simple cutoff calculation.