Time | Type | Description

8:00 am - 8:30 am | Continental Breakfast / Registration

Morning

8:30 am - 9:00 am | Welcome | Opening Remarks
Prof. Laxmikant V. Kale

9:00 am - 10:00 am | Keynote | Preparing Large Multi-physics Applications for Next Generation Advanced Architectures
Rob Neely, Lawrence Livermore National Laboratory
Not since the start of the ASCI program in the early-to-mid 90's has there been as much excitement and fear about how to develop and maintain our large multiphysics applications in the face of deep paradigm shifts in HPC architectures. However, unlike those early days, we now have a massive application and library codebase that has developed over 15+ years and is used daily in DOE mission-critical service. On one hand, this gives us tremendous insight into our requirements. On the other hand, it severely limits our agility in making wholesale changes or rewrites – yet we know we can't just keep developing applications like we have been. Like many others, LLNL is in the process of developing a strategy under the banner of co-design to collaboratively address the looming challenges presented by the increasingly complex and variable HPC landscape. In this talk, I'll survey the breadth of work occurring and planned at LLNL to tackle these challenges, discuss how we envision evaluating various emerging and established technologies, describe our plans for a portfolio of proxy applications and some of our forward-leaning research in extreme-scale computing, and relate how lessons learned (and then unlearned) from the distant past are reemerging.

10:00 am - 10:30 am

Morning Technical Session: Load Balancing and Object Mapping (Chair: Dr. Gengbin Zheng)

10:30 am - 11:00 am | Talk | Mapping Dense LU Factorization on Multicore Supercomputer Nodes
Jonathan Lifflander, University of Illinois at Urbana-Champaign
Dense LU factorization is a prominent benchmark used to rank the performance of
supercomputers. Many implementations use block-cyclic distributions of matrix
blocks onto a two-dimensional process grid. The process grid dimensions drive a
trade-off between communication and computation and are architecture- and
implementation-sensitive. The critical panel factorization steps can be made
less communication-bound by overlapping asynchronous collectives for pivoting
with the computation of rank-k updates. By shifting the
computation-communication trade-off, a modified block-cyclic distribution can
beneficially exploit more available parallelism on the critical path, and
reduce panel factorization's memory hierarchy contention on now-ubiquitous
multicore architectures.
During active panel factorization, rank-1 updates stream through memory with
minimal reuse. In a column-major process grid, the performance of this access
pattern degrades as too many streaming processors contend for access to
memory. A block-cyclic mapping in the row-major order does not encounter this
problem, but consequently sacrifices node and network locality in the critical
pivoting steps. We introduce striding to vary between the two extremes of row-
and column-major process grids.
The maximum available parallelism in the critical path work (active panel
factorization, triangular solves, and subsequent broadcasts) is bounded by the
length or width of the process grid. Increasing one dimension of the process
grid decreases the number of distinct processes and nodes in the other
dimension. To increase the harnessed parallelism in both dimensions, we start
with a tall process grid. We then apply periodic rotation to this grid to
restore exploited parallelism along the row to previous levels.
As a test-bed for further mapping experiments, we describe a dense LU
implementation that allows the block distribution to be defined as a general
function from blocks to processors. Other mappings can be tested with only
small, local changes to the code.
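
As a rough illustration of that last idea, the sketch below (all names are hypothetical, not from the talk's code) expresses a block-cyclic mapping as a single pluggable function, with a stride parameter that interpolates between column-major and row-major traversals of the process grid; changing the mapping means changing only this function.

    // Hypothetical sketch of a pluggable block-to-PE mapping. stride = 1 gives
    // a column-major process grid; strides coprime to the grid size permute
    // the column-major order, moving toward row-major-like spreading.
    #include <cstdio>

    struct BlockMap {
      int gridRows, gridCols, stride;  // stride must be coprime to gridRows*gridCols

      int blockToPE(int r, int c) const {
        int pr = r % gridRows;         // process-grid row
        int pc = c % gridCols;         // process-grid column
        int idx = pc * gridRows + pr;  // column-major position in the grid
        int n = gridRows * gridCols;
        return (idx * stride) % n;     // strided permutation of the grid
      }
    };

    int main() {
      BlockMap m{8, 8, 3};
      std::printf("block (5, 12) -> PE %d\n", m.blockToPE(5, 12));
      return 0;
    }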

11:00 am - 11:30 am | Talk | Meta-Balancer: Automated Load Balancing Based on Application Behavior
Harshitha Menon, University of Illinois at Urbana-Champaign
Understanding the characteristics of an application and making appropriate load balancing decisions is key to improving its performance. Some of these decisions involve how frequently to invoke a load balancer and which type of strategy to use. For a dynamic application, identifying these characteristics is even more challenging, and it is difficult and suboptimal to decide upfront how frequently load balancing should be done and which load balancing strategy should be used. To this end, we propose Meta-Balancer, which relieves application writers of such key load balancing decisions and improves performance. The Charm++ runtime system maintains a database of information about an application run, such as the load on each processor and the associated communication. Meta-Balancer periodically collects these statistics without incurring high overhead and analyzes them to make suitable load balancing decisions.
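
To make the flavor of such a decision concrete, here is a hedged sketch (not the actual Meta-Balancer logic; the names and the cost model are assumptions) of the kind of gain-versus-cost test a runtime can apply to collected load statistics:

    // Illustrative sketch: rebalance only if the projected savings from
    // removing the measured imbalance outweigh the migration cost.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    bool shouldRebalance(const std::vector<double>& peLoads,
                         double migrationCostSec, double stepsUntilNextCheck) {
      double maxLoad = *std::max_element(peLoads.begin(), peLoads.end());
      double avgLoad = std::accumulate(peLoads.begin(), peLoads.end(), 0.0) /
                       peLoads.size();
      // Expected time saved per step if load were perfectly balanced.
      double gainPerStep = maxLoad - avgLoad;
      return gainPerStep * stepsUntilNextCheck > migrationCostSec;
    }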

11:30 am - 12:00 pm | Talk | Process placement on multicore. Dynamic load balancing in Charm++
Emmanuel Jeannot, Guillaume Mercier and François Tessier, INRIA
TreeMatch is an algorithm and a tool for performing process placement based on process affinity and NUMA topology. We have used this algorithm to design a dynamic load balancer for Charm++ called TreeMatchLB. We will present this implementation and the results obtained with TreeMatchLB.

12:00 pm - 1:00 pm

Afternoon Technical Session: Preparing for Blue Waters (Chair: Dr. Celso Mendes)

1:00 pm - 1:30 pm | Talk | uGNI-based Charm++ Runtime for Cray Gemini Network
Yanhua Sun, University of Illinois at Urbana-Champaign
Gemini, the network for the new Cray XE/XK systems, features low latency, high bandwidth, and strong scalability. Its hardware support for remote direct memory access enables efficient implementation of global address space programming languages. Although the Generic Network Interface (GNI) is designed to support message-passing applications, it is still challenging to attain good performance for applications written in alternative programming models, such as the message-driven programming model.
In our earlier work, we showed that Charm++, an object-oriented message-driven programming model, scales up to the full Jaguar Cray machine. In this talk, we present a general, light-weight, asynchronous low-level runtime system (LRTS) for Charm++ and its implementation on the uGNI software stack for Cray XE systems. Several techniques are presented to exploit uGNI's capabilities by reducing memory copy and registration overhead, taking advantage of persistent communication, and improving intra-node communication. Our micro-benchmark results demonstrate that the uGNI-based runtime system outperforms the MPI-based implementation by up to 50% in terms of message latency. For communication-intensive applications such as N-Queens, this implementation scales up to 15,360 cores of a Cray XE6 machine and is 70% faster than an MPI-based implementation. For the molecular dynamics application NAMD, performance is also improved considerably, by as much as 18%.

1:30 pm - 2:00 pm | Talk | A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
Dr. Gengbin Zheng, University of Illinois at Urbana-Champaign
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint.
In previous work, we demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this talk, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer and demonstrate its performance on very large supercomputers. For example, when running a million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in the range of milliseconds, and the restart time was measured to be less than 0.15 seconds on 64K cores.
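
For reference, Charm++ exposes this scheme through a single documented call, CkStartMemCheckpoint(); a minimal sketch of how an application might trigger it (the surrounding chare structure and names are illustrative) looks like:

    // Minimal sketch of triggering a double in-memory checkpoint from a
    // driver chare. Assumes a matching .ci interface file declaring Main
    // and its checkpointDone() entry method.
    #include "charm++.h"

    class Main : public CBase_Main {
     public:
      void startCheckpoint() {
        // Each object's state is stored in the memory of two "buddy"
        // processors; the callback fires once the checkpoint completes.
        CkCallback cb(CkIndex_Main::checkpointDone(), thisProxy);
        CkStartMemCheckpoint(cb);
      }
      void checkpointDone() {
        CkPrintf("In-memory checkpoint taken; resuming timesteps\n");
      }
    };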

2:00 pm - 2:30 pm | Talk | Early Science Results on Blue Waters
Klaus Schulten, University of Illinois at Urbana-Champaign

2:30 pm - 3:00 pm

Afternoon Technical Session: Energy-Aware Computing and Accelerators (Chair: Ramprasad Venkataraman)

3:00 pm - 3:30 pm | Talk | Towards Saving Total Energy Consumption While Constraining Core Temperatures
Osman Sarood, University of Illinois at Urbana-Champaign
As we move to exascale machines, both peak power and total energy consumption have become major challenges. There has been a great deal of research on reducing machine energy consumption in HPC data centers. However, a significant part of the energy consumed by an HPC data center can be attributed to cooling the machine room. In previous work, we showed significant reductions in cooling energy consumption by constraining core temperatures. In this work, we strive to reduce machine energy consumption while constraining core temperatures, in order to provide a total energy solution for HPC data centers that saves both machine and cooling energy. Our approach uses Dynamic Voltage and Frequency Scaling (DVFS) to constrain core temperatures and is specifically designed to reduce the timing penalty associated with DVFS. Using a heuristic that exploits the difference in frequency sensitivity across different parts of an application, we present results showing a 17% reduction in machine energy consumption with as little as a 0.9% increase in execution time while constraining core temperatures below 60°C.
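
The control loop the talk describes can be pictured with a hedged sketch like the one below; the sysfs paths and frequency values vary by machine and driver, so treat them all as assumptions rather than the talk's implementation.

    // Hedged sketch: read a core's temperature and cap its frequency via
    // DVFS when it exceeds a threshold. Paths and values are assumptions
    // (coretemp-style hwmon sensor, frequencies in kHz); requires root.
    #include <fstream>
    #include <string>

    void constrainCoreTemp(int core, double limitC) {
      // Sensor reports millidegrees Celsius.
      std::ifstream temp("/sys/class/hwmon/hwmon0/temp" +
                         std::to_string(core + 1) + "_input");
      long milli = 0;
      temp >> milli;

      // Lower the frequency cap when hot; restore it otherwise.
      const char* freqKHz = (milli / 1000.0 > limitC) ? "1600000" : "2400000";
      std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(core) +
                      "/cpufreq/scaling_max_freq");
      f << freqKHz;
    }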

3:30 pm - 4:00 pm | Talk | Runtime System Support for Heterogeneous Systems
David Kunzman, University of Illinois at Urbana-Champaign
This talk will cover recent updates to the Charm++ runtime system that increase its support for accelerator technologies. The overall goal of this work is to provide a single portable method for expressing application code, allowing the workload of the application (or at least a portion of it) to be executed on a variety of processing elements. With this added flexibility and support from the runtime system, the application's workload can then be spread across the heterogeneous set of processing elements available to it. In particular, recent efforts to incorporate support for GPGPUs using the same accelerated entry methods previously applied to the SPEs of the Cell processor will be discussed, along with dynamic load balancing strategies that balance a workload between a host core and an attached accelerator device.

4:00 pm - 4:30 pm | Talk | Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Jonathan Lifflander, University of Illinois at Urbana-Champaign
Dynamic scheduling and varying decomposition granularity are well-known
techniques for achieving high performance in parallel computing. Heterogeneous
clusters with highly data-parallel processors, such as GPUs, present unique
problems for the application of these techniques. These systems reveal a
dichotomy between grain sizes: decompositions ideal for the CPUs may yield
insufficient data-parallelism for accelerators, and decompositions targeted at
the GPU may decrease performance on the CPU. This problem is typically
ameliorated by statically scheduling a fixed amount of work for
agglomeration. However, determining the ideal amount of work to compose
requires experimentation because it varies between architectures and problem
configurations.
We describe a novel methodology for dynamically agglomerating work units at
runtime and scheduling them on accelerators. This approach is demonstrated in
the context of two applications: an n-body particle simulation, which offloads
particle interaction work; and a parallel dense LU solver, which relocates
DGEMM kernels to the GPU. In both cases dynamic agglomeration yields comparable
or better results over statically scheduling the work across a variety of
system configurations.
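
The core mechanism can be sketched as follows; this is an illustration of the general pattern, not the talk's implementation, and offloadBatchToGPU() stands in for an assumed application kernel.

    // Illustrative sketch of dynamic agglomeration: buffer fine-grained work
    // units and offload a batch to the accelerator only once enough data
    // parallelism has accumulated; idle CPU cores drain single units.
    #include <cstddef>
    #include <vector>

    template <typename Work>
    void offloadBatchToGPU(const std::vector<Work>& batch);  // assumed kernel

    template <typename Work>
    class Agglomerator {
      std::vector<Work> pending;
      std::size_t gpuBatchSize;       // tuned threshold, found at runtime
     public:
      explicit Agglomerator(std::size_t batch) : gpuBatchSize(batch) {}

      void submit(const Work& w) {
        pending.push_back(w);
        if (pending.size() >= gpuBatchSize) {
          offloadBatchToGPU(pending); // one large, data-parallel launch
          pending.clear();
        }
      }

      bool stealOne(Work& out) {      // CPU keeps busy with leftovers
        if (pending.empty()) return false;
        out = pending.back();
        pending.pop_back();
        return true;
      }
    };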

4:30 pm - 5:30 pm | Fun | Blue Waters Facility Tour
NCSA

5:30 pm - 7:00 pm

7:00 pm onwards | Workshop Banquet (for registered participants only)

8:00 am - 8:30 am | Continental Breakfast / Registration

8:30 am - 9:30 am | Keynote | Thoughts on system software for next-generation hardware
Pete Beckman, Argonne National Laboratory

9:30 am - 10:00 am | Talk | Advances in Charm++ from the 2011 HPC Challenge Competition
Phil Miller, University of Illinois at Urbana-Champaign
At the HPC Challenge award session during SC'11, PPL members were
presented with the first-place award for their submission to the 2011
HPC Challenge Class 2 (programming environment) competition in the
performance category. This was PPL's first entry in the competition.
This talk will describe various aspects of the submission's
implementation and performance. I will discuss the infrastructure and
tooling improvements that were driven by this effort, and how they've
continued since then. I will also present possible plans for a new
submission in the coming year.

10:00 am - 10:30 am

Morning Technical Session: Languages and Programming Models (Chair: Ramprasad Venkataraman)

10:30 am - 11:00 am | Talk | Programming models for quantum chemistry applications
Jeff Hammond and James Dinan, Argonne National Laboratory
Quantum chemistry applications have long been associated with
irregular communication patterns and load-balancing challenges, which motivated
the development of Global Arrays (GA), the Distributed Data Interface
(DDI) and, more recently, the Super Instruction Assembly Language
(SIAL), which form the basis for essentially all parallel
implementations of wavefunction-based quantum chemistry methods, as
found in codes like NWChem, GAMESS, ACES III, and others. In this
talk, we describe the mathematical and algorithmic fundamentals of a
popular family of quantum chemistry methods known as coupled-cluster
methods, along with various parallelization schemes associated with
their implementation on supercomputers. First, the aforementioned
runtimes
(GA, DDI, SIAL) will be compared to Charm++ on various axes, including
asynchronous communication, dynamic load-balancing, data
decomposition, and topology awareness. Second, we describe the
Cyclops Tensor Framework, which is a completely new approach to
coupled-cluster methods that uses some of the key concepts found in
Charm++. Finally, a case is made for using Charm++ to implement
reduced-scaling coupled cluster methods.

11:00 am - 11:30 am | Talk | TASCEL: A Task Parallel Runtime System for Non-SPMD Programs
Sriram Krishnamoorthy, Pacific Northwest National Laboratory

11:30 am - 12:00 pm | Talk | Enabling Generative Recursion on Large-Scale Distributed Memory Machines
Pritish Jetley, University of Illinois at Urbana-Champaign
We consider the challenges of performing divide-and-conquer computations on large-scale distributed memory machines. In particular, we consider divide-and-conquer algorithms that exhibit generative recursion, wherein the application of a function f on a (possibly ordered) set A can be expressed as a composition of f(Ai) on a finite number of (ordered) subsets Ai of A. The paradigm of generative recursion has widespread applications in computing and computational science: sorting, adaptive quadrature, Monte-Carlo integration with adaptive sampling, various graph computations, etc., can all be expressed in this mold.
Algorithms with generative recursion are characterized by the movement of data between A and the Ai, and as such incur communication costs with every parallel invocation of the recursive function f. Whereas on shared memory systems such data movement involves only calls to memcpy, on distributed memory machines the network communication costs can prove prohibitive.
In this talk, we consider solutions to this problem of data movement. We also describe the design of an object-oriented parallel framework that helps programmers specify recursive computations without sacrificing either the visibility of global control flow or the scalability of the resulting program. Finally, we present scaling results to demonstrate the utility of this abstraction.
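
For readers unfamiliar with the pattern, a minimal sequential shape of generative recursion (here quicksort-style partitioning, with a grain-size cutoff) is sketched below; a framework like the one described would turn the two recursive calls into asynchronous tasks on possibly remote processors.

    // Minimal sketch of generative recursion with a sequential cutoff.
    #include <algorithm>
    #include <vector>

    void f(std::vector<int>& a, int lo, int hi, int grain) {
      if (hi - lo <= grain) {               // base case: solve sequentially
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
      }
      int pivot = a[lo + (hi - lo) / 2];
      auto mid = std::partition(a.begin() + lo, a.begin() + hi,
                                [pivot](int x) { return x < pivot; });
      int m = int(mid - a.begin());
      if (m == lo || m == hi) {             // degenerate split: fall back
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
      }
      f(a, lo, m, grain);                   // f(A1): would be a remote task
      f(a, m, hi, grain);                   // f(A2): would be a remote task
    }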

12:00 pm - 12:30 pm | Talk | Charj: A Language for Writing Better Charm Programs More Easily
Aaron Becker, University of Illinois at Urbana-Champaign
Charm is a powerful system, but using it effectively is not always easy
or intuitive. Because Charm is implemented as a C++ library and associated
translator, it suffers from difficult-to-interpret error messages, duplication
of code, and the inability to perform optimizations. This talk describes
ongoing work on Charj, a language and compiler targeting the Charm runtime
which aims to address these problems.

12:30 pm - 1:30 pm

1:30 pm - 3:00 pm | Panel | What do computational scientists need from computer scientists?
Panelists: Pete Beckman (Argonne National Laboratory), Jeff Hammond (Argonne National Laboratory), James Phillips (University of Illinois at Urbana-Champaign), Tom Quinn (University of Washington)
Here, what I mean by "computational scientist" is a scientist who uses computational methods, in keeping with the view that science now has three flavors: experimental, theoretical, and computational. This clarification is necessary because some people have been using the phrase "computational scientists" to mean computer scientists. Also, I want to keep "numerical methods" out of the discussion as a contribution of computer scientists. (It is very important, but not relevant for this discussion. Not controversial enough :-).) Historically, science codes were written by scientists (often renaissance people with expertise in both programming and science). Increasingly, computer scientists have been assisting with architecting, designing, and developing science and engineering codes, especially now that parallel programming has become more challenging. The question is: in what way can they help? One view could be that "the CS people are too enamored by their own abstractions, are hunting for nails now that they have a hammer, and therefore are not very useful"; another could be that "they are useful as programmers; after all, we need programmers and don't have enough of the multifaceted renaissance researchers." More seriously, what is the ideal role for computer scientists, and how much science do they need to know to be able to assist? What attitudes are helpful? How can expertise be factored so that different people can contribute to a common code without needing to become experts in each other's fields?

3:00 pm - 3:30 pm

Afternoon Technical Session: Applications and Algorithms (Chair: Eric Bohm)

3:30 pm - 4:00 pm | Talk | ChaNGa: a Charm++ N-body Treecode
Tom Quinn, University of Washington
Astrophysical simulations demand significant computational resources
because of their vast dynamic range, and because of the long range
interactions of gravity. The computational complexities can be tackled
using tree-based algorithms, but such algorithms are not easy to
implement in parallel. Charm++ has a number of features which make the
parallel implementation of tree algorithms easier, although not
effortless. Using Charm++, significant performance improvements have
been achieved compared to a legacy parallel code.

4:00 pm - 4:30 pm | Talk | Contagion Diffusion over very large networks
Keith Bisset, Virginia Tech
Modeling the diffusion of contagion over large networks is an
important and challenging problem, especially when complex
interventions are considered. One such example is modeling the spread
of an infectious disease such as influenza through the US population
when schools are closed on a county basis depending on local
prevalence. Other examples include the spread of fads and norms
through society or the immune response of the human gut to pathogenic
bacteria. We will describe the contagion diffusion problem, along
with our agent based modeling solution and our efforts to exploit the
features of Charm++ to increase the efficiency of our simulator. In
particular, we use completion detection for synchronization, mesh
streaming to increase the efficiency of sending numerous small
messages, dynamic load balancing, and the PUP (Pack/UnPack) mechanism
to serialize the initial data input.
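
As a concrete reference for that last point, serialization in Charm++ is expressed through a single pup method per class; a minimal sketch (the Person fields are illustrative, not the simulator's actual data layout) looks like:

    // Sketch of the PUP (Pack/UnPack) mechanism: one method describes an
    // object's state, and the runtime reuses it for sizing, packing, and
    // unpacking during input, migration, and checkpointing.
    #include "pup.h"
    #include "pup_stl.h"
    #include <vector>

    struct Person {
      int id;
      double susceptibility;
      std::vector<int> contacts;   // graph neighbors

      void pup(PUP::er& p) {
        p | id;
        p | susceptibility;
        p | contacts;              // pup_stl.h handles STL containers
      }
    };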

4:30 pm - 5:00 pm | Talk | Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems
Ehsan Totoni, University of Illinois at Urbana-Champaign
The solution of sparse triangular linear systems arises as a bottleneck in many methods for the solution of linear systems. In both direct methods and iterative preconditioners, it is used to solve the system and refine the solution, possibly over many iterations. However, it is resistant to parallelism because it has many structural dependencies and very limited work per data element. Existing standard parallel linear algebra packages such as Hypre and SuperLU_DIST appear unable to exploit any parallelism for this problem. We have developed a parallel algorithm with different heuristics that adapts to the structure of the matrix and tries to extract as much parallelism as possible. By analyzing and reordering the rows, our algorithm can extract parallelism even in some of the cases where most of the matrix's non-zeros are near its diagonal. We also describe our implementation in Charm++ and present promising results on up to 512 nodes of a BlueGene/P, using many sparse matrices from real application domains.
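
For background, the standard way to expose whatever parallelism such a solve has is level-set analysis: rows in the same level have no dependencies among themselves and can be solved concurrently. The sketch below shows that baseline idea over assumed CSR arrays, not the talk's adaptive algorithm.

    // Level-set analysis for a sparse lower-triangular solve (CSR format).
    #include <algorithm>
    #include <vector>

    std::vector<int> levelSets(const std::vector<int>& rowPtr,
                               const std::vector<int>& colIdx, int n) {
      std::vector<int> level(n, 0);
      for (int i = 0; i < n; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
          if (colIdx[k] < i)                  // dependency on an earlier row
            level[i] = std::max(level[i], level[colIdx[k]] + 1);
      return level;  // rows with equal level[i] may be solved in parallel
    }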

5:00 pm - 5:30 pm | Talk | Exascale Algorithms for Balanced Spanning Tree Construction in System-ranked Process Groups
Akhil Langer, University of Illinois at Urbana-Champaign
Centralized algorithms for creating balanced spanning trees over subcommunicators suffer from memory bottlenecks. This is a particular challenge for exascale machines, where memory will be a limiting factor and the memory requirements of the centralized scheme can grow by a factor of 100. We present novel distributed algorithms for the construction of balanced spanning trees that use only a small constant amount of memory per node and beat the performance of the centralized scheme even at modest processor counts.
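
To see why constant memory per node is plausible, note that for the contiguous-rank base case every process can compute its own tree position locally; the sketch below shows only this simple case (the talk's algorithms handle arbitrary system-ranked subgroups).

    // Each rank derives its parent and children in a balanced k-ary
    // spanning tree over ranks 0..n-1 with O(1) memory and no central table.
    #include <vector>

    struct TreePosition {
      int parent;                 // -1 at the root
      std::vector<int> children;
    };

    TreePosition karyTree(int rank, int n, int k) {
      TreePosition t;
      t.parent = (rank == 0) ? -1 : (rank - 1) / k;
      for (int c = rank * k + 1; c <= rank * k + k && c < n; ++c)
        t.children.push_back(c);
      return t;
    }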

5:30 pm - 6:00 pm | Fun | Annual PPL Photograph
David Kunzman

8:00 am - 9:00 am

Morning Technical Session: Tutorial

9:00 am - 10:00 am | Tutorial | Load Balancing and its implementation in Charm++
Eric Bohm
This tutorial session will cover the conceptual underpinnings of load
balancing for parallel applications, the strategies Charm++ offers to
address load imbalance, and the mechanics of applying those strategies
to users' code.
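
As a pointer to what those mechanics look like, here is a minimal sketch of the documented AtSync()/ResumeFromSync() hooks for measurement-based load balancing; the Worker chare and its methods are illustrative, and a matching .ci interface file is assumed.

    // A migratable chare array element opting in to AtSync-based balancing.
    #include "charm++.h"

    class Worker : public CBase_Worker {
     public:
      Worker() {
        usesAtSync = true;            // enable AtSync-based load balancing
      }
      Worker(CkMigrateMessage* m) {}  // required for migratable chares
      void step() {
        computeOneIteration();        // application work (assumed)
        AtSync();                     // hand control to the load balancer
      }
      void ResumeFromSync() {         // called after rebalancing completes
        thisProxy[thisIndex].step();  // continue with the next iteration
      }
     private:
      void computeOneIteration() { /* ... */ }
    };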

10:00 am - 11:00 am | Tutorial | Adaptive MPI (AMPI)
Gengbin Zheng
This tutorial session will cover Adaptive MPI (AMPI), the
implementation of the MPI standard on the Charm++ runtime system to
let MPI codes benefit from its adaptive features. It will address
porting codes to run on AMPI and demonstrate how they can exploit
the various Charm++ features.
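
For a sense of the porting effort, a plain MPI program like the sketch below typically runs under AMPI unchanged; only the build and launch steps differ (the wrapper and flag names follow the AMPI manual, but exact invocations vary by version, e.g. ampicxx hello.cpp -o hello, then ./charmrun ./hello +p4 +vp16 to run 16 virtual processes on 4 cores).

    // Ordinary MPI code; under AMPI each "rank" is a migratable virtual
    // process, so MPI_Comm_size reports the +vp count, not physical cores.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      std::printf("vp %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }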