Time | Type | Description

8:00 am - 8:30 am | Continental Breakfast / Registration

Morning

8:30 am - 9:00 am | Welcome | Opening Remarks
Prof. Laxmikant V. Kale

9:00 am - 10:00 am | Keynote | Preparing Large Multi-physics Applications for Next Generation Advanced Architectures
Rob Neely, Lawrence Livermore National Laboratory
Not since the start of the ASCI program in the early-to-mid 90's has there been as much excitement and fear about how to develop and maintain our large multiphysics applications in the face of deep paradigm shifts in HPC architectures. However, unlike those early days, we now have a massive application and library codebase that has developed over 15+ years and is used daily in DOE mission-critical service. On one hand, this gives us tremendous insight into our requirements. On the other hand, it severely limits our agility in making wholesale changes or rewrites – yet we know we can't just keep developing applications like we have been. Like many others, LLNL is in the process of developing a strategy under the banner of co-design to collaboratively address the looming challenges presented by the increasingly complex and variable HPC landscape. In this talk, I'll survey the breadth of work occurring and planned at LLNL to tackle these challenges, discuss how we envision evaluating various emerging and established technologies, describe our plans for a portfolio of proxy applications and some of our forward-leaning research in extreme-scale computing, and relate how lessons learned (and then unlearned) from the distant past are reemerging.

10:00 am - 10:30 am

Morning Technical Session: Load Balancing and Object Mapping (Chair: Dr. Gengbin Zheng)

10:30 am - 11:00 am | Talk | Mapping Dense LU Factorization on Multicore Supercomputer Nodes
Jonathan Lifflander, University of Illinois at Urbana-Champaign
Dense LU factorization is a prominent benchmark used to rank the performance of
supercomputers. Many implementations use block-cyclic distributions of matrix
blocks onto a two-dimensional process grid. The process grid dimensions drive a
trade-off between communication and computation and are architecture- and
implementation-sensitive. The critical panel factorization steps can be made
less communication-bound by overlapping asynchronous collectives for pivoting
with the computation of rank-k updates. By shifting the
computation-communication trade-off, a modified block-cyclic distribution can
beneficially exploit more available parallelism on the critical path, and
reduce panel factorization's memory hierarchy contention on now-ubiquitous
multicore architectures.
During active panel factorization, rank-1 updates stream through memory with
minimal reuse. In a column-major process grid, the performance of this access
pattern degrades as too many streaming processors contend for access to
memory. A block-cyclic mapping in the row-major order does not encounter this
problem, but consequently sacrifices node and network locality in the critical
pivoting steps. We introduce striding to vary between the two extremes of row-
and column-major process grids.
The maximum available parallelism in the critical path work (active panel
factorization, triangular solves, and subsequent broadcasts) is bounded by the
length or width of the process grid. Increasing one dimension of the process
grid decreases the number of distinct processes and nodes in the other
dimension. To increase the harnessed parallelism in both dimensions, we start
with a tall process grid. We then apply periodic rotation to this grid to
restore exploited parallelism along the row to previous levels.
As a test-bed for further mapping experiments, we describe a dense LU
implementation that allows the block distribution to be defined as a general
function from blocks to processors. Other mappings can be tested with only
small, local changes to the code.
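
As a rough illustration of that last idea, the sketch below (all names are hypothetical, not from the talk's code) expresses a block-cyclic mapping as a single pluggable function, with a stride parameter that interpolates between column-major and row-major traversals of the process grid; changing the mapping means changing only this function.

    // Hypothetical sketch of a pluggable block-to-PE mapping. stride = 1 gives
    // a column-major process grid; strides coprime to the grid size permute
    // the column-major order, moving toward row-major-like spreading.
    #include <cstdio>

    struct BlockMap {
      int gridRows, gridCols, stride;  // stride must be coprime to gridRows*gridCols

      int blockToPE(int r, int c) const {
        int pr = r % gridRows;         // process-grid row
        int pc = c % gridCols;         // process-grid column
        int idx = pc * gridRows + pr;  // column-major position in the grid
        int n = gridRows * gridCols;
        return (idx * stride) % n;     // strided permutation of the grid
      }
    };

    int main() {
      BlockMap m{8, 8, 3};
      std::printf("block (5, 12) -> PE %d\n", m.blockToPE(5, 12));
      return 0;
    }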

11:00 am - 11:30 am | Talk | Meta-Balancer: Automated Load Balancing Based on Application Behavior
Harshitha Menon, University of Illinois at Urbana-Champaign
Understanding the characteristics of an application and making appropriate load balancing decisions is key to improving its performance. Some of these decisions involve how frequently to invoke a load balancer and which type of strategy to use. For a dynamic application, identifying these characteristics is even more challenging, and it is difficult and suboptimal to decide upfront how frequently load balancing should be done and which load balancing strategy should be used. To this end, we propose Meta-Balancer, which relieves application writers of such key load balancing decisions and improves performance. The Charm++ runtime system maintains a database of information about an application run, such as the load on each processor and the associated communication. Meta-Balancer periodically collects these statistics without incurring high overhead and analyzes them to make suitable load balancing decisions.
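
To make the flavor of such a decision concrete, here is a hedged sketch (not the actual Meta-Balancer logic; the names and the cost model are assumptions) of the kind of gain-versus-cost test a runtime can apply to collected load statistics:

    // Illustrative sketch: rebalance only if the projected savings from
    // removing the measured imbalance outweigh the migration cost.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    bool shouldRebalance(const std::vector<double>& peLoads,
                         double migrationCostSec, double stepsUntilNextCheck) {
      double maxLoad = *std::max_element(peLoads.begin(), peLoads.end());
      double avgLoad = std::accumulate(peLoads.begin(), peLoads.end(), 0.0) /
                       peLoads.size();
      // Expected time saved per step if load were perfectly balanced.
      double gainPerStep = maxLoad - avgLoad;
      return gainPerStep * stepsUntilNextCheck > migrationCostSec;
    }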

11:30 am - 12:00 pm | Talk | Process placement on multicore. Dynamic load balancing in Charm++
Emmanuel Jeannot, Guillaume Mercier and François Tessier, INRIA
TreeMatch is an algorithm and a tool for performing process placement based on process affinity and NUMA topology. We have used this algorithm to design a dynamic load balancer for Charm++ called TreeMatchLB. We will present this implementation and the results obtained with TreeMatchLB.

12:00 pm - 1:00 pm

Afternoon Technical Session: Preparing for Blue Waters (Chair: Dr. Celso Mendes)

1:00 pm - 1:30 pm | Talk | uGNI-based Charm++ Runtime for Cray Gemini Network
Yanhua Sun, University of Illinois at Urbana-Champaign
Gemini, the network for the new Cray XE/XK systems, features low latency, high bandwidth, and strong scalability. Its hardware support for remote direct memory access enables efficient implementation of global address space programming languages. Although the Generic Network Interface (GNI) is designed to support message-passing applications, it is still challenging to attain good performance for applications written in alternative programming models, such as the message-driven programming model.
In our earlier work, we showed that Charm++, an object-oriented message-driven programming model, scales up to the full Jaguar Cray machine. In this talk, we present a general, light-weight, asynchronous low-level runtime system (LRTS) for Charm++ and its implementation on the uGNI software stack for Cray XE systems. Several techniques are presented to exploit uGNI's capabilities by reducing memory copy and registration overhead, taking advantage of persistent communication, and improving intra-node communication. Our micro-benchmark results demonstrate that the uGNI-based runtime system outperforms the MPI-based implementation by up to 50% in terms of message latency. For communication-intensive applications such as N-Queens, this implementation scales up to 15,360 cores of a Cray XE6 machine and is 70% faster than an MPI-based implementation. For the molecular dynamics application NAMD, performance is also improved considerably, by as much as 18%.

1:30 pm - 2:00 pm | Talk | A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale
Dr. Gengbin Zheng, University of Illinois at Urbana-Champaign
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint.
In previous work, we demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this talk, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer and demonstrate its performance on very large supercomputers. For example, when running a million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in the range of milliseconds, and the restart time was measured to be less than 0.15 seconds on 64K cores.
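
For reference, Charm++ exposes this scheme through a single documented call, CkStartMemCheckpoint(); a minimal sketch of how an application might trigger it (the surrounding chare structure and names are illustrative) looks like:

    // Minimal sketch of triggering a double in-memory checkpoint from a
    // driver chare. Assumes a matching .ci interface file declaring Main
    // and its checkpointDone() entry method.
    #include "charm++.h"

    class Main : public CBase_Main {
     public:
      void startCheckpoint() {
        // Each object's state is stored in the memory of two "buddy"
        // processors; the callback fires once the checkpoint completes.
        CkCallback cb(CkIndex_Main::checkpointDone(), thisProxy);
        CkStartMemCheckpoint(cb);
      }
      void checkpointDone() {
        CkPrintf("In-memory checkpoint taken; resuming timesteps\n");
      }
    };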

2:00 pm - 2:30 pm | Talk | Early Science Results on Blue Waters
Klaus Schulten, University of Illinois at Urbana-Champaign

2:30 pm - 3:00 pm

Afternoon Technical Session: Energy-Aware Computing and Accelerators (Chair: Ramprasad Venkataraman)

3:00 pm - 3:30 pm | Talk | Towards Saving Total Energy Consumption While Constraining Core Temperatures
Osman Sarood, University of Illinois at Urbana-Champaign
As we move to exascale machines, both peak power and total energy consumption have become major challenges. There has been a great deal of research on reducing machine energy consumption in HPC data centers. However, a significant part of the energy consumed by an HPC data center can be attributed to cooling the machine room. In previous work, we showed significant reductions in cooling energy consumption by constraining core temperatures. In this work, we strive to reduce machine energy consumption while constraining core temperatures, in order to provide a total energy solution for HPC data centers that saves both machine and cooling energy. Our approach uses Dynamic Voltage and Frequency Scaling (DVFS) to constrain core temperatures and is specifically designed to reduce the timing penalty associated with DVFS. Using a heuristic that exploits the difference in frequency sensitivity across different parts of an application, we present results showing a 17% reduction in machine energy consumption with as little as a 0.9% increase in execution time while constraining core temperatures below 60°C.
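
The control loop the talk describes can be pictured with a hedged sketch like the one below; the sysfs paths and frequency values vary by machine and driver, so treat them all as assumptions rather than the talk's implementation.

    // Hedged sketch: read a core's temperature and cap its frequency via
    // DVFS when it exceeds a threshold. Paths and values are assumptions
    // (coretemp-style hwmon sensor, frequencies in kHz); requires root.
    #include <fstream>
    #include <string>

    void constrainCoreTemp(int core, double limitC) {
      // Sensor reports millidegrees Celsius.
      std::ifstream temp("/sys/class/hwmon/hwmon0/temp" +
                         std::to_string(core + 1) + "_input");
      long milli = 0;
      temp >> milli;

      // Lower the frequency cap when hot; restore it otherwise.
      const char* freqKHz = (milli / 1000.0 > limitC) ? "1600000" : "2400000";
      std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(core) +
                      "/cpufreq/scaling_max_freq");
      f << freqKHz;
    }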

3:30 pm - 4:00 pm | Talk | Runtime System Support for Heterogeneous Systems
David Kunzman, University of Illinois at Urbana-Champaign
This talk will cover recent updates to the Charm++ runtime system that increase its support for accelerator technologies. The overall goal of this work is to provide a single portable method for expressing application code, allowing the workload of the application (or at least a portion of it) to be executed on a variety of processing elements. With this added flexibility and support from the runtime system, the application's workload can then be spread across the heterogeneous set of processing elements available to it. In particular, recent efforts to incorporate support for GPGPUs using the same accelerated entry methods previously applied to the SPEs of the Cell processor will be discussed, along with dynamic load balancing strategies that balance a workload between a host core and an attached accelerator device.

4:00 pm - 4:30 pm | Talk | Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Jonathan Lifflander, University of Illinois at Urbana-Champaign
Dynamic scheduling and varying decomposition granularity are well-known
techniques for achieving high performance in parallel computing. Heterogeneous
clusters with highly data-parallel processors, such as GPUs, present unique
problems for the application of these techniques. These systems reveal a
dichotomy between grain sizes: decompositions ideal for the CPUs may yield
insufficient data-parallelism for accelerators, and decompositions targeted at
the GPU may decrease performance on the CPU. This problem is typically
ameliorated by statically scheduling a fixed amount of work for
agglomeration. However, determining the ideal amount of work to compose
requires experimentation because it varies between architectures and problem
configurations.
We describe a novel methodology for dynamically agglomerating work units at
runtime and scheduling them on accelerators. This approach is demonstrated in
the context of two applications: an n-body particle simulation, which offloads
particle interaction work; and a parallel dense LU solver, which relocates
DGEMM kernels to the GPU. In both cases dynamic agglomeration yields comparable
or better results over statically scheduling the work across a variety of
system configurations.
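
The core mechanism can be sketched as follows; this is an illustration of the general pattern, not the talk's implementation, and offloadBatchToGPU() stands in for an assumed application kernel.

    // Illustrative sketch of dynamic agglomeration: buffer fine-grained work
    // units and offload a batch to the accelerator only once enough data
    // parallelism has accumulated; idle CPU cores drain single units.
    #include <cstddef>
    #include <vector>

    template <typename Work>
    void offloadBatchToGPU(const std::vector<Work>& batch);  // assumed kernel

    template <typename Work>
    class Agglomerator {
      std::vector<Work> pending;
      std::size_t gpuBatchSize;       // tuned threshold, found at runtime
     public:
      explicit Agglomerator(std::size_t batch) : gpuBatchSize(batch) {}

      void submit(const Work& w) {
        pending.push_back(w);
        if (pending.size() >= gpuBatchSize) {
          offloadBatchToGPU(pending); // one large, data-parallel launch
          pending.clear();
        }
      }

      bool stealOne(Work& out) {      // CPU keeps busy with leftovers
        if (pending.empty()) return false;
        out = pending.back();
        pending.pop_back();
        return true;
      }
    };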

4:30 pm - 5:30 pm | Fun | Blue Waters Facility Tour
NCSA

5:30 pm - 7:00 pm

7:00 pm onwards | Workshop Banquet (for registered participants only)

8:00 am - 8:30 am | Continental Breakfast / Registration

8:30 am - 9:30 am | Keynote | Thoughts on system software for next-generation hardware
Pete Beckman, Argonne National Laboratory

9:30 am - 10:00 am | Talk | Advances in Charm++ from the 2011 HPC Challenge Competition
Phil Miller, University of Illinois at Urbana-Champaign
At the HPC Challenge award session during SC'11, PPL members were
presented with the first-place award for their submission to the 2011
HPC Challenge Class 2 (programming environment) competition in the
performance category. This was PPL's first entry in the competition.
This talk will describe various aspects of the submission's
implementation and performance. I will discuss the infrastructure and
tooling improvements that were driven by this effort, and how they've
continued since then. I will also present possible plans for a new
submission in the coming year.

10:00 am - 10:30 am

Morning Technical Session: Languages and Programming Models (Chair: Ramprasad Venkataraman)

10:30 am - 11:00 am | Talk | Programming models for quantum chemistry applications
Jeff Hammond and James Dinan, Argonne National Laboratory
Quantum chemistry applications have long been associated with
irregular communication patterns and load-balancing challenges, which motivated
the development of Global Arrays (GA), the Distributed Data Interface
(DDI) and, more recently, the Super Instruction Assembly Language
(SIAL), which form the basis for essentially all parallel
implementations of wavefunction-based quantum chemistry methods, as
found in codes like NWChem, GAMESS, ACES III, and others. In this
talk, we describe the mathematical and algorithmic fundamentals of a
popular family of quantum chemistry methods known as coupled-cluster
methods, along with various parallelization schemes associated with
their implementation on supercomputers. First, the aforementioned
runtimes
(GA, DDI, SIAL) will be compared to Charm++ on various axes, including
asynchronous communication, dynamic load-balancing, data
decomposition, and topology awareness. Second, we describe the
Cyclops Tensor Framework, which is a completely new approach to
coupled-cluster methods that uses some of the key concepts found in
Charm++. Finally, a case is made for using Charm++ to implement
reduced-scaling coupled cluster methods.

11:00 am - 11:30 am | Talk | TASCEL: A Task Parallel Runtime System for Non-SPMD Programs
Sriram Krishnamoorthy, Pacific Northwest National Laboratory

11:30 am - 12:00 pm | Talk | Enabling Generative Recursion on Large-Scale Distributed Memory Machines
Pritish Jetley, University of Illinois at Urbana-Champaign
We consider the challenges of performing divide-and-conquer computations on large-scale distributed memory machines. In particular, we consider divide-and-conquer algorithms that exhibit generative recursion, wherein the application of a function f on a (possibly ordered) set A can be expressed as a composition of f(Ai) on a finite number of (ordered) subsets Ai of A. The paradigm of generative recursion has widespread applications in computing and computational science: sorting, adaptive quadrature, Monte-Carlo integration with adaptive sampling, various graph computations, etc., can all be expressed in this mold.
Algorithms with generative recursion are characterized by the movement of data between A and the Ai, and as such incur communication costs with every parallel invocation of the recursive function f. Whereas on shared memory systems such data movement involves only calls to memcpy, on distributed memory machines the network communication costs can prove prohibitive.
In this talk, we consider solutions to this problem of data movement. We also describe the design of an object-oriented parallel framework that helps programmers specify recursive computations without sacrificing either the visibility of global control flow or the scalability of the resulting program. Finally, we present scaling results to demonstrate the utility of this abstraction.
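
For readers unfamiliar with the pattern, a minimal sequential shape of generative recursion (here quicksort-style partitioning, with a grain-size cutoff) is sketched below; a framework like the one described would turn the two recursive calls into asynchronous tasks on possibly remote processors.

    // Minimal sketch of generative recursion with a sequential cutoff.
    #include <algorithm>
    #include <vector>

    void f(std::vector<int>& a, int lo, int hi, int grain) {
      if (hi - lo <= grain) {               // base case: solve sequentially
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
      }
      int pivot = a[lo + (hi - lo) / 2];
      auto mid = std::partition(a.begin() + lo, a.begin() + hi,
                                [pivot](int x) { return x < pivot; });
      int m = int(mid - a.begin());
      if (m == lo || m == hi) {             // degenerate split: fall back
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
      }
      f(a, lo, m, grain);                   // f(A1): would be a remote task
      f(a, m, hi, grain);                   // f(A2): would be a remote task
    }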

12:00 pm - 12:30 pm | Talk | Charj: A Language for Writing Better Charm Programs More Easily
Aaron Becker, University of Illinois at Urbana-Champaign
Charm is a powerful system, but using it effectively is not always easy
or intuitive. Because Charm is implemented as a C++ library and associated
translator, it suffers from difficult-to-interpret error messages, duplication
of code, and the inability to perform optimizations. This talk describes
ongoing work on Charj, a language and compiler targeting the Charm runtime
which aims to address these problems.

12:30 pm - 1:30 pm

1:30 pm - 3:00 pm | Panel | What do computational scientists need from computer scientists?
Panelists: Pete Beckman (Argonne National Laboratory), Jeff Hammond (Argonne National Laboratory), James Phillips (University of Illinois at Urbana-Champaign), Tom Quinn (University of Washington)
Here, what I mean by "computational scientist" is a scientist who uses computational methods, in keeping with the view that science now has three flavors: experimental, theoretical, and computational. This clarification is necessary because some people have been using the phrase "computational scientists" to mean computer scientists. Also, I want to keep "numerical methods" out of the discussion as a contribution of computer scientists. (It is very important, but not relevant for this discussion. Not controversial enough :-).) Historically, science codes were written by scientists (often renaissance people with expertise in both programming and science). Increasingly, computer scientists have been assisting with architecting, designing, and developing science and engineering codes, especially now that parallel programming has become more challenging. The question is: in what way can they help? One view could be that "the CS people are too enamored by their own abstractions, are hunting for nails now that they have a hammer, and therefore are not very useful"; another could be that "they are useful as programmers; after all, we need programmers and don't have enough of the multifaceted renaissance researchers." More seriously, what is the ideal role for computer scientists, and how much science do they need to know to be able to assist? What attitudes are helpful? How can expertise be factored so that different people can contribute to a common code without needing to become experts in each other's fields?

3:00 pm - 3:30 pm

Afternoon Technical Session: Applications and Algorithms (Chair: Eric Bohm)

3:30 pm - 4:00 pm | Talk | ChaNGa: a Charm++ N-body Treecode
Tom Quinn, University of Washington
Astrophysical simulations demand significant computational resources
because of their vast dynamic range, and because of the long range
interactions of gravity. The computational complexities can be tackled
using tree-based algorithms, but such algorithms are not easy to
implement in parallel. Charm++ has a number of features which make the
parallel implementation of tree algorithms easier, although not
effortless. Using Charm++, significant performance improvements have
been achieved compared to a legacy parallel code.

4:00 pm - 4:30 pm | Talk | Contagion Diffusion over very large networks
Keith Bisset, Virginia Tech
Modeling the diffusion of contagion over large networks is an
important and challenging problem, especially when complex
interventions are considered. One such example is modeling the spread
of an infectious disease such as influenza through the US population
when schools are closed on a county basis depending on local
prevalence. Other examples include the spread of fads and norms
through society or the immune response of the human gut to pathogenic
bacteria. We will describe the contagion diffusion problem, along
with our agent based modeling solution and our efforts to exploit the
features of Charm++ to increase the efficiency of our simulator. In
particular, we use completion detection for synchronization, mesh
streaming to increase the efficiency of sending numerous small
messages, dynamic load balancing, and the PUP (Pack/UnPack) mechanism
to serialize the initial data input.
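
As a concrete reference for that last point, serialization in Charm++ is expressed through a single pup method per class; a minimal sketch (the Person fields are illustrative, not the simulator's actual data layout) looks like:

    // Sketch of the PUP (Pack/UnPack) mechanism: one method describes an
    // object's state, and the runtime reuses it for sizing, packing, and
    // unpacking during input, migration, and checkpointing.
    #include "pup.h"
    #include "pup_stl.h"
    #include <vector>

    struct Person {
      int id;
      double susceptibility;
      std::vector<int> contacts;   // graph neighbors

      void pup(PUP::er& p) {
        p | id;
        p | susceptibility;
        p | contacts;              // pup_stl.h handles STL containers
      }
    };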

4:30 pm - 5:00 pm | Talk | Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems
Ehsan Totoni, University of Illinois at Urbana-Champaign
The solution of sparse triangular linear systems arises as a bottleneck in many methods for the solution of linear systems. In both direct methods and iterative preconditioners, it is used to solve the system and refine the solution, possibly over many iterations. However, it is resistant to parallelism because it has many structural dependencies and very limited work per data element. Existing standard parallel linear algebra packages such as Hypre and SuperLU_DIST appear unable to exploit any parallelism for this problem. We have developed a parallel algorithm with different heuristics that adapts to the structure of the matrix and tries to extract as much parallelism as possible. By analyzing and reordering the rows, our algorithm can extract parallelism even in some of the cases where most of the matrix's non-zeros are near its diagonal. We also describe our implementation in Charm++ and present promising results on up to 512 nodes of a BlueGene/P, using many sparse matrices from real application domains.
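
For background, the standard way to expose whatever parallelism such a solve has is level-set analysis: rows in the same level have no dependencies among themselves and can be solved concurrently. The sketch below shows that baseline idea over assumed CSR arrays, not the talk's adaptive algorithm.

    // Level-set analysis for a sparse lower-triangular solve (CSR format).
    #include <algorithm>
    #include <vector>

    std::vector<int> levelSets(const std::vector<int>& rowPtr,
                               const std::vector<int>& colIdx, int n) {
      std::vector<int> level(n, 0);
      for (int i = 0; i < n; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
          if (colIdx[k] < i)                  // dependency on an earlier row
            level[i] = std::max(level[i], level[colIdx[k]] + 1);
      return level;  // rows with equal level[i] may be solved in parallel
    }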

5:00 pm - 5:30 pm | Talk | Exascale Algorithms for Balanced Spanning Tree Construction in System-ranked Process Groups
Akhil Langer, University of Illinois at Urbana-Champaign
Centralized algorithms for creating balanced spanning trees over subcommunicators suffer from memory bottlenecks. This is a particular challenge for exascale machines, where memory will be a limiting factor and the memory requirements of the centralized scheme can grow by a factor of 100. We present novel distributed algorithms for the construction of balanced spanning trees that use only a small constant amount of memory per node and beat the performance of the centralized scheme even at modest processor counts.
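
To see why constant memory per node is plausible, note that for the contiguous-rank base case every process can compute its own tree position locally; the sketch below shows only this simple case (the talk's algorithms handle arbitrary system-ranked subgroups).

    // Each rank derives its parent and children in a balanced k-ary
    // spanning tree over ranks 0..n-1 with O(1) memory and no central table.
    #include <vector>

    struct TreePosition {
      int parent;                 // -1 at the root
      std::vector<int> children;
    };

    TreePosition karyTree(int rank, int n, int k) {
      TreePosition t;
      t.parent = (rank == 0) ? -1 : (rank - 1) / k;
      for (int c = rank * k + 1; c <= rank * k + k && c < n; ++c)
        t.children.push_back(c);
      return t;
    }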

5:30 pm - 6:00 pm | Fun | Annual PPL Photograph
David Kunzman

8:00 am - 9:00 am

Morning Technical Session: Tutorial

9:00 am - 10:00 am | Tutorial | Load Balancing and its implementation in Charm++
Eric Bohm
This tutorial session will cover the conceptual underpinnings of load
balancing for parallel applications, the strategies Charm++ offers to
address load imbalance, and the mechanics of applying those strategies
to users' code.
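
As a pointer to what those mechanics look like, here is a minimal sketch of the documented AtSync()/ResumeFromSync() hooks for measurement-based load balancing; the Worker chare and its methods are illustrative, and a matching .ci interface file is assumed.

    // A migratable chare array element opting in to AtSync-based balancing.
    #include "charm++.h"

    class Worker : public CBase_Worker {
     public:
      Worker() {
        usesAtSync = true;            // enable AtSync-based load balancing
      }
      Worker(CkMigrateMessage* m) {}  // required for migratable chares
      void step() {
        computeOneIteration();        // application work (assumed)
        AtSync();                     // hand control to the load balancer
      }
      void ResumeFromSync() {         // called after rebalancing completes
        thisProxy[thisIndex].step();  // continue with the next iteration
      }
     private:
      void computeOneIteration() { /* ... */ }
    };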

10:00 am - 11:00 am | Tutorial | Adaptive MPI (AMPI)
Gengbin Zheng
This tutorial session will cover Adaptive MPI (AMPI), the
implementation of the MPI standard on the Charm++ runtime system to
let MPI codes benefit from its adaptive features. It will address
porting codes to run on AMPI and demonstrate how they can exploit
the various Charm++ features.
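
For a sense of the porting effort, a plain MPI program like the sketch below typically runs under AMPI unchanged; only the build and launch steps differ (the wrapper and flag names follow the AMPI manual, but exact invocations vary by version, e.g. ampicxx hello.cpp -o hello, then ./charmrun ./hello +p4 +vp16 to run 16 virtual processes on 4 cores).

    // Ordinary MPI code; under AMPI each "rank" is a migratable virtual
    // process, so MPI_Comm_size reports the +vp count, not physical cores.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      std::printf("vp %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }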