Location: 2405 Siebel Center
The schedule below is tentative and subject to change.
8:15 am - 8:30 am
Morning
8:30 am - 9:00 am
8:45 am - 09:30 am | Keynote
Architecture-aware Algorithms and Software for Peta and Exascale Computing
Prof. Jack Dongarra, University Distinguished Professor of Electrical Engineering and Computer Science, University of Tennessee
In this talk we examine how high performance computing has changed over the
last ten years and look toward the future in terms of trends. These changes have
had and will continue to have a major impact on our software. Some of the
software and algorithm challenges have already been encountered, such as
management of communication and memory hierarchies through a combination of
compile-time and run-time techniques, but the increased scale of computation,
depth of memory hierarchies, range of latencies, and increased run-time
environment variability will make these problems much harder.
We will look at five areas of research that will have an important impact on
the development of software and algorithms. We will focus on the following themes:
- Redesign of software to fit multicore and hybrid architectures
- Automatically tuned application software
- Exploiting mixed precision for performance
- The importance of fault tolerance
- Communication avoiding algorithms
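As a hedged illustration of the mixed-precision theme (not material from the keynote), the classic approach is iterative refinement: factor and solve in single precision, then compute residuals and apply corrections in double precision. The toy diagonally dominant system, names, and iteration count below are purely illustrative.

```cpp
// Minimal sketch of mixed-precision iterative refinement (illustrative only):
// solve A x = b by factoring/solving in float, refining the residual in double.
#include <cstdio>
#include <cmath>
#include <vector>

// Solve A x = b in single precision with Gaussian elimination (no pivoting;
// acceptable here because the toy matrix below is diagonally dominant).
static std::vector<float> solve_single(std::vector<float> A, std::vector<float> b, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            float m = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; ++j) A[i*n + j] -= m * A[k*n + j];
            b[i] -= m * b[k];
        }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return x;
}

int main() {
    const int n = 4;
    // Toy diagonally dominant system, kept in double precision.
    std::vector<double> A = { 10,1,2,0,  1,12,0,3,  2,0,9,1,  0,3,1,11 };
    std::vector<double> b = { 13, 16, 12, 15 };

    std::vector<float> Af(A.begin(), A.end()), bf(b.begin(), b.end());
    std::vector<float> xf = solve_single(Af, bf, n);
    std::vector<double> x(xf.begin(), xf.end());

    // Refinement loop: residual and update in double, correction solve in float.
    for (int it = 0; it < 5; ++it) {
        std::vector<double> r(n);
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < n; ++j) s -= A[i*n + j] * x[j];
            r[i] = s;
        }
        std::vector<float> rf(r.begin(), r.end());
        std::vector<float> d = solve_single(Af, rf, n);
        double norm = 0.0;
        for (int i = 0; i < n; ++i) { x[i] += d[i]; norm += r[i] * r[i]; }
        std::printf("iter %d, residual norm %.3e\n", it, std::sqrt(norm));
    }
    return 0;
}
```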
9:30 am - 10:00 am | Talk
Charm++ Research Agenda: Recent Developments and Plans
Prof. Laxmikant V. Kale, Professor of Computer Science, University of Illinois
Charm++ and the rich research agenda engendered by its idea of object-based
over-decomposition made significant progress during the past year. I will
review the basic concepts that have been the foundation of our approach to
parallel programming, and highlight specific achievements of the past year.
These include progress on our production-quality, collaboratively developed
science and engineering applications, including NAMD (biophysics), OpenAtom
(quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the
progress and challenges in our agenda of developing higher level parallel
languages.
10:00 am - 10:15 am
Morning Technical Session: Load Balancing (Chair: Dr. Gengbin Zheng)
10:15 am - 10:45 am | Submitted Paper
Improving Charm++ Performance with a NUMA-aware Load Balancer
Laercio Lima Pilla, Federal University of Rio Grande do Sul, Brazil
The importance of Non-Uniform Memory Access (NUMA) machines has been increasing
as a solution to ease the memory wall problem and to provide better scalability
for multi-core machines. On such machines, the shared memory is physically
distributed into memory banks connected by a network. Due to this, memory
access costs may vary depending on the distance between the desired memory bank
and the requesting processing unit. We propose a NUMA-aware load balancer that
combines the information about the NUMA topology with the statistics captured
by the Charm++ RTS. We present improvements of up to 10% over existing load
balancing strategies in benchmark performance. In addition, our algorithm
incurs up to seven times less overhead than the other strategies by
avoiding unnecessary migrations.
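As a rough illustration of the idea (not the paper's algorithm and not the Charm++ load balancing API), a NUMA-aware strategy can fold a topology-distance penalty into the cost used to pick each task's destination core. The toy topology, loads, and weight below are made up.

```cpp
// Toy, framework-agnostic sketch of a NUMA-aware placement heuristic: greedily
// assign each task to the core minimizing (core load + task load + a penalty
// proportional to the NUMA distance from the task's current core).
#include <cstdio>
#include <vector>

int main() {
    const int cores = 4;
    // Hypothetical NUMA distance between cores (same socket = 1, remote = 2).
    int dist[cores][cores] = { {0,1,2,2}, {1,0,2,2}, {2,2,0,1}, {2,2,1,0} };
    // Measured per-task loads (e.g., wall times from the RTS) and current placement.
    std::vector<double> taskLoad = { 5.0, 3.0, 4.0, 2.0, 6.0, 1.0 };
    std::vector<int>    taskHome = { 0,   0,   1,   2,   3,   3   };

    std::vector<double> coreLoad(cores, 0.0);
    const double migrationWeight = 0.5;   // how much NUMA distance costs us

    for (size_t t = 0; t < taskLoad.size(); ++t) {
        int best = 0; double bestCost = 1e300;
        for (int c = 0; c < cores; ++c) {
            double cost = coreLoad[c] + taskLoad[t]
                        + migrationWeight * dist[taskHome[t]][c];
            if (cost < bestCost) { bestCost = cost; best = c; }
        }
        coreLoad[best] += taskLoad[t];
        std::printf("task %zu: core %d -> core %d\n", t, taskHome[t], best);
    }
    for (int c = 0; c < cores; ++c)
        std::printf("core %d load %.1f\n", c, coreLoad[c]);
    return 0;
}
```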
10:45 am - 11:15 am | Talk
Temperature aware Load Balancing for Parallel Applications
Osman Sarood
Increasing numbers of cores and clock speeds on a smaller chip area imply more heat dissipation and an ever-increasing heat density. This increased heat, in turn, leads to higher cooling cost and occurrence of hot spots. Effective use of dynamic voltage and frequency scaling (DVFS) can help us alleviate this problem. But there is an associated execution time penalty which can get amplified in parallel applications. In high performance computing, applications are typically tightly coupled and even a single overloaded core can adversely affect the execution time of the entire application. This makes load balancing of utmost value. In this paper, we outline a temperature-aware load balancing scheme, which uses DVFS to keep core temperatures below a user-defined threshold with minimum timing penalty. While doing so, it also reduces the possibility of hot spots. We apply our scheme to three parallel applications with different energy consumption profiles. Results from our technique show that we save up to 14% in execution time and 12% in machine energy consumption as compared to frequency scaling without using load balancing. We are also able to bound the average temperature of all the cores and reduce the temperature deviation amongst the cores by a factor of 3.
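A minimal sketch of the control idea only, assuming a Linux-style sysfs interface and the userspace cpufreq governor; this is not the paper's implementation, and the paths, threshold, core count, and frequency table are placeholders.

```cpp
// Periodically read each core's temperature and lower its frequency via DVFS
// when it exceeds a user-defined threshold, raising it again once it cools.
// Sysfs paths below are typical Linux locations but vary by system.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

static long readTempMilliC(int zone) {
    std::ifstream f("/sys/class/thermal/thermal_zone" + std::to_string(zone) + "/temp");
    long t = 0; f >> t; return t;                 // millidegrees Celsius
}

static void setFreqKHz(int cpu, long khz) {       // requires the "userspace" governor
    std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_setspeed");
    f << khz;
}

int main() {
    const long thresholdMilliC = 60000;           // example 60 C threshold
    const std::vector<long> freqsKHz = {1600000, 2000000, 2400000};
    std::vector<size_t> level(4, freqsKHz.size() - 1);   // 4 cores, start at max

    for (int step = 0; step < 10; ++step) {       // one balancing period per step
        for (int c = 0; c < 4; ++c) {
            long t = readTempMilliC(c);
            if (t > thresholdMilliC && level[c] > 0) --level[c];
            else if (t < thresholdMilliC - 5000 && level[c] + 1 < freqsKHz.size()) ++level[c];
            setFreqKHz(c, freqsKHz[level[c]]);
            std::cout << "core " << c << ": " << t / 1000.0 << " C, "
                      << freqsKHz[level[c]] / 1000 << " MHz\n";
        }
        // A load balancer would now migrate work away from the slowed-down cores.
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
    return 0;
}
```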
11:15 am - 11:45 am | Talk
New Developments in the Charm++ Load Balancing Framework and its Applications
Dr. Abhinav Bhatele
The theme of this year's workshop is load balancing. Over the past year,
several improvements have been made to the Charm++ load balancing framework and
new strategies have been added. We will highlight new developments and also
discuss new load balancers and their performance. Some new load balancing
strategies are based on recursive bi-partitioning and Scotch (a graph
partitioning library).
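For readers new to the framework, the fragment below sketches how an application opts into measurement-based load balancing; the module name, decl/def headers, and balancing period are assumptions, and the strategy (for example GreedyLB) is selected at job launch with the +balancer option rather than in code.

```cpp
// Schematic fragment: array elements declare usesAtSync, periodically call
// AtSync(), and the runtime invokes ResumeFromSync() after migration.
// Strategy chosen at launch, e.g.:  ./charmrun +p8 ./app +balancer GreedyLB
// (Assumes a matching .ci file declaring the Worker array; the decl/def
// headers are generated by charmc and omitted here.)
#include "worker.decl.h"

class Worker : public CBase_Worker {
  int step = 0;
public:
  Worker() { usesAtSync = true; }            // enable measurement-based LB
  Worker(CkMigrateMessage *m) {}
  void pup(PUP::er &p) {                     // element state must be serializable
    CBase_Worker::pup(p);
    p | step;
  }
  void doStep() {
    // ... one iteration of application work ...
    if (++step % 20 == 0) AtSync();          // hand control to the balancer
    else thisProxy[thisIndex].doStep();
  }
  void ResumeFromSync() {                    // called when balancing completes
    thisProxy[thisIndex].doStep();
  }
};

#include "worker.def.h"
```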
11:45 am - 12:15 pm | Talk
Impact of Type Ia Supernova Ejecta on the Binary Companion
Kuo-Chuan Pan
Type Ia supernovae are thought to be due to thermonuclear explosions of
carbon-oxygen white dwarfs in close binary systems.
In the single-degenerate scenario, the companion star is non-degenerate and
can be significantly altered by the explosion. We explore this interaction
by means of three-dimensional adaptive mesh refinement (AMR) simulations
using the FLASH code. Since the simulations are computationally expensive,
we optimize FLASH in two ways: (1) using an automatic MPI-to-AMPI conversion
tool to benefit from object migration in Charm++, and (2) redesigning the AMR
framework in Charm++ for dynamic scheduling and processor virtualization.
We consider several different companion types,
including red giants, main-sequence-like stars, and helium stars, and we
include the symmetry-breaking effects of orbital motion, Roche-lobe
overflow, and pre-supernova mass loss.
Our analysis focuses on mass loss by the companion, contamination of the companion's atmosphere by
supernova ejecta, and post-supernova motion of the companion relative to
the ejecta. We discuss the implications of our results for variation in
Type Ia supernova properties and searches for remnant companion stars.
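To illustrate the MPI-to-AMPI route mentioned above: the source stays ordinary MPI, and recompiling against AMPI turns each rank into a migratable user-level thread. The wrapper name and run-time flags in the comments are typical but should be treated as assumptions about the local Charm++ build.

```cpp
// Illustrative sketch of running an unmodified MPI code under AMPI.
// Typical usage (treat as assumptions):
//   ampicxx -o jacobi jacobi.cpp
//   ./charmrun +p4 ./jacobi +vp32 +balancer GreedyLB
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // under AMPI, "size" is the number of
                                             // virtual processors (+vp), not cores
    double local = rank + 1.0, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum over %d virtual ranks = %g\n", size, sum);
    MPI_Finalize();
    return 0;
}
```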
12:15 pm - 01:15 pm
Afternoon Technical Session: Exascale Efforts (Chair: Ryan Mokos)
01:15 pm - 01:45 pm | Talk
Fault Tolerance Support for Supercomputers with Multicore Nodes
Esteban Meneses and Xiang Ni
The widespread adoption of multicore chips as the basis to build supercomputers brings new design options for fault tolerance strategies. In this talk, we will describe how we evolved our two major fault tolerance techniques, checkpoint/restart and message logging, to take full advantage of the opportunities offered by this architecture.
01:45 pm - 02:15 pm | Talk
Architectural constraints to attain 1 Exaflop/s for molecular dynamics and cosmology simulations
Dr. Abhinav Bhatele and Pritish Jetley
The first Teraflop/s computer, the ASCI Red, became
operational in 1997, and it took more than 11 years for a Petaflop/s
performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts
have begun to study the hardware and software challenges for building an
exascale machine. It is important to understand and meet these challenges in
order to reach Exaflop/s performance. In this talk, we will present a
feasibility study of two important application classes to formulate the
constraints that these classes will impose on the machine architecture for
achieving a sustained performance of 1 Exaflop/s. The application classes are
classical molecular dynamics and cosmological simulations. We will analyze the
problem sizes required for representative applications to achieve 1 Exaflop/s
and the hardware requirements in terms of the network and memory. Based on the
analysis for achieving an Exaflop/s, we will also discuss the performance of
these applications for much smaller problem sizes.
02:15 pm - 02:45 pm | Talk
Large scale simulations enabled by BigSim
Dr. Gengbin Zheng and Ehsan Totoni
Petaflop/s class computers are currently being deployed and even larger exascale
computers are being planned. Our BigSim project is aimed at developing tools
that allow one to develop, debug and tune/scale/predict the performance of
applications before such machines are available so that the applications can be
ready when the machine first becomes operational. It also allows easier
"offline" experimentation of parallel performance tuning strategies --- without
using the full parallel computer. This talk will focus on our recent
developments in the BigSim project, including work on simulating petascale machines
running NAMD with improved accuracy, a new, faster implementation of a sequential
network simulator, and several case studies of using BigSim.
02:45 pm - 03:00 pm
Afternoon Technical Session: Applications (Chair: Eric Bohm)
03:00 pm - 03:30 pm | Submitted Paper
Large-Scale Computational Epidemiology Simulations using Charm++
Keith Bisset, Virginia Bioinformatics Institute at Virginia Tech
Contagion (or diffusion) models are pervasive in the social and physical
sciences; an example is a potential pandemic caused by avian
influenza. Developing computational models to reason about these
systems is complicated and scientifically challenging. The size and
scale of these systems can be extremely large (e.g., pandemic planning
at a global scale requires models with 6 billion agents). Developing
scientific foundations for practical global-scale problems requires
one to model systems comprised of multiple interacting behaviors,
networks, and contagions. In our example of epidemiology, we are
interested in at least two separate contagion processes: spread of
disease through the population, and spread of fear, influence, and
information in response to the epidemic.
We present preliminary results from our work applying the Charm++
framework to the domain of Computational Epidemiology. Computational
Epidemiology involves a type of computation that is different from that of
typical HPC codes. It is characterized by unstructured, dynamically
changing communication patterns. Our initial results are promising,
showing significant improvements over the original MPI-based
implementation. In addition, the task-based decomposition dictated by
Charm++ is a natural fit for these types of problems.
03:30 pm - 04:00 pm | Talk
Scaling Dense LU Factorization in Charm++
Jonathan Lifflander
We describe a new implementation of LU factorization of dense matrices
in Charm++. This is a popular supercomputer benchmark, whose
best-known implementation, HPL, is used to generate the Top500
rankings. Our implementation focuses on maximising dynamic execution
flexibility, based on the arrival of input data. We will describe the
tools and techniques this application uses, and present scaling
results obtained on Cray XT5 and Blue Gene/P systems.
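For context, here is a tiny sequential sketch (not the Charm++ code) of the kernel being parallelized: right-looking LU with partial pivoting, the computation at the heart of HPL. The block decomposition and data-driven updates of the actual implementation are not shown.

```cpp
// Sequential right-looking LU factorization with partial pivoting.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Factor A (n x n, row-major) in place into L and U; piv records row swaps.
bool lu_factor(std::vector<double> &A, std::vector<int> &piv, int n) {
    for (int k = 0; k < n; ++k) {
        int p = k;                                   // find the pivot row
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i*n + k]) > std::fabs(A[p*n + k])) p = i;
        if (A[p*n + k] == 0.0) return false;         // singular
        piv[k] = p;
        if (p != k)
            for (int j = 0; j < n; ++j) std::swap(A[k*n + j], A[p*n + j]);
        for (int i = k + 1; i < n; ++i) {            // multipliers (column of L)
            A[i*n + k] /= A[k*n + k];
            for (int j = k + 1; j < n; ++j)          // trailing-submatrix update
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
    return true;
}

int main() {
    int n = 3;
    std::vector<double> A = { 2, 1, 1,   4, 3, 3,   8, 7, 9 };
    std::vector<int> piv(n);
    if (lu_factor(A, piv, n))
        std::printf("U diagonal: %g %g %g\n", A[0], A[4], A[8]);
    return 0;
}
```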
04:00 pm - 05:00 pm | CS Distinguished Lecture Series (1404 SC)
Constraint-based Synchronization for Strong Scaled Execution
Prof. Thomas Sterling, Arnaud and Edwards Professor of Computer Science, Louisiana State University
Weak-scaled parallel execution has demonstrated dramatic advances in the last
two decades through the combination of MPP and commodity cluster architectures
with communicating sequential processes programming methods (e.g., MPI).
Increased performance in that time frame for suitable applications (e.g.,
Linpack) has been observed in excess of four to five orders of magnitude.
However, many applications requiring reduction of execution time for fixed
sized data sets have exhibited less favorable progress. These strong scaled
applications are constrained by limited parallelism, high overheads, delays due
to remote access latencies, and contention for shared resources. Conventional
practices mitigate such challenges through regular, coarse-grained tasks that
avoid significant effects of these sources of performance degradation. This
presentation will discuss recent results of experiments in strong scaling using
lightweight synchronization constructs including dataflow and futures objects
combined with message-driven computation and dynamic thread scheduling. The
ParalleX execution model is an experimental synthesis of these and other
methods that expose and exploit parallelism inherent to the meta-data of
dynamic graph-based applications. A proof-of-concept reference implementation
of ParalleX, HPX-3, is a runtime system that runs on conventional SMPs and
commodity clusters with Unix-like operating system interfaces. The goal of this
research has been to investigate the potential of constraint-based
synchronization for detection and dynamic scheduling of this form of
parallelism for strong scaling. An advanced Adaptive Mesh Refinement code used
for numerical relativity studies was originally developed in MPI and has been
ported to the HPX library. Initial measurements suggest that such methods may
exhibit significant improvements of scalability compared to conventional
practices. This talk will present these findings and conclude with requirements
for future work.
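A minimal, standard C++ illustration of the futures idea (not HPX or ParalleX code): the consumer is constrained only by the availability of the value it needs, so independent work can overlap with its production. The function name and workload are placeholders.

```cpp
// Futures as constraint-based synchronization: block only when the value is needed.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

double expensive_stencil_sweep() {                   // stand-in for remote work
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return 42.0;
}

int main() {
    // Launch the producer; the future is the synchronization object.
    std::future<double> f = std::async(std::launch::async, expensive_stencil_sweep);

    double local = 0.0;
    for (int i = 0; i < 1000000; ++i) local += 1e-6;  // overlapped local work

    double remote = f.get();                          // wait only here, if at all
    std::printf("local=%g remote=%g\n", local, remote);
    return 0;
}
```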
05:00 pm - 06:30 pm
06:30 pm onwards | Workshop Banquet (for registered participants only)
8:15 am - 8:30 am
Morning Technical Session: Invited Talks (Chair: Prof. Laxmikant Kale)
08:30 am - 09:15 am | Keynote
TSUBAME2.0, or the long road from tiny clusters to petascale, and its possible contributions to high-resolution natural disaster simulations
Prof. Satoshi Matsuoka, Professor, Tokyo Institute of Technology and National Institute of Informatics, Japan
TSUBAME2.0 is the latest incarnation of the series of clusters that have been
built at Tokyo Institute of Technology, and has become the first supercomputer
in Japan to reach the Petaflops plateau. TSUBAME 2.0 embodies many unique
features derived from years of research into HPC, especially retaining or
improving bandwidth scalability, fault tolerance, and energy efficiency, using the
latest hardware components such as GPUs and SSDs, as well as employing some of
the latest software research results from labs at Tokyo Tech. I will also touch
upon its possible use in simulations of natural disasters that have hit Japan
recently, demonstrating that, despite its relatively small size and its adoption
of a hybrid architecture, it scales well to thousands of GPUs and delivers
performance topping the largest machines such as the ORNL Jaguar.
09:15 am - 09:45 am | Talk
Towards a Usable Programming Model for GPGPUs
Prof. Orion Sky Lawlor, University of Alaska at Fairbanks
The enormous performance potential of the modern Graphics Processing Unit (GPU)
for general purpose programming (GPGPU) is matched by the enormous difficulty
of writing correct and maintainable high-performance GPGPU software. In
particular, today's mainstream GPGPU languages like CUDA and OpenCL manage to
combine many of the worst features of both shared memory programming, such as
locking and race conditions, with the worst features of distributed memory
parallel programming, such as explicit byte-level memory copies. As an
alternative, we present a simple restricted programming model for GPGPU with
clean syntax, guaranteed race condition free memory access, and excellent
performance.
We compare this new model against Charm++'s accelerator interface, Charm++
Arrays, SDAG, and MSA. We conclude by exploring methods by which Charm++'s
flagship dynamic migration and automatic load balancing capabilities could be
more naturally extended to the GPGPU era.
09:45 am - 10:15 am | Talk
Exploring Novel Parallel Implementations of Stochastic Programs
Prof. Udatta Palekar, Professor of Business Administration, University of Illinois
Real-life stochastic programs typically involve integer programs, which are
hard to solve. The problem we are interested in involves the scheduling of
aircraft for transportation of passengers and cargo. Typical real-life
applications involve millions of equations and constraints.
We present interesting parallelization challenges, both inherent to the
algorithms and imposed by the use of popular numeric libraries. We will also
discuss a design that should significantly enhance the scalability of a
parallel implementation. Decomposing the stochastic program into multi-stage
linear programs and adopting a branch-and-bound technique can enhance the
scalability.
10:15 am - 10:30 am
Morning Technical Session: Languages (Chair: Dr. Celso Mendes)
10:30 am - 11:00 am | Talk
Adventures in Load-Balancing at Large Scale: Successes, Fizzles, and Next Steps
Ewing "Rusty" Lusk Mathematics and Computer Science Division, Argonne National Laboratory
Click here to expand description
This talk will describe an ongoing project to scale simple load-balancing
approaches to hundreds of thousands of processes. The project has developed
the Asynchronous, Dynamic Load-Balancing Library (ADLB) interface, and
experimented with multiple implementations. The API is small and easy to use,
yet flexible enough to serve both as a high-level manager/worker programming
model in its own right and as a low-level execution model for higher-level
approaches. This talk will describe a sophisticated application that has used
ADLB to scale to today's largest machines while simplifying its programming
approach, an alternate implementation that has many good properties but scales
less well, and plans for future improvements.
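As a toy, shared-memory rendering of the manager/worker pattern the library generalizes (this is explicitly not the ADLB API, which provides an analogous put/get work-pool abstraction across MPI processes): workers repeatedly take work units from a shared pool, may put new ones back, and stop once the pool drains.

```cpp
// Toy work-pool illustration; all numbers and the splitting rule are made up.
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex mtx;
std::queue<int> pool;          // shared pool of work units
int outstanding = 0;           // units taken but not yet finished

bool take(int &w) {
    std::lock_guard<std::mutex> lk(mtx);
    if (pool.empty()) return false;
    w = pool.front(); pool.pop(); ++outstanding;
    return true;
}
void put(int w) { std::lock_guard<std::mutex> lk(mtx); pool.push(w); }
void done()     { std::lock_guard<std::mutex> lk(mtx); --outstanding; }
bool finished() { std::lock_guard<std::mutex> lk(mtx); return pool.empty() && outstanding == 0; }

void worker(int id) {
    int w;
    while (!finished()) {
        if (!take(w)) { std::this_thread::yield(); continue; }
        if (w > 1) { put(w / 2); put(w - w / 2); }   // split large units
        else        std::printf("worker %d did a unit\n", id);
        done();
    }
}

int main() {
    for (int i = 0; i < 4; ++i) put(5);              // seed the pool
    std::vector<std::thread> ts;
    for (int i = 0; i < 3; ++i) ts.emplace_back(worker, i);
    for (auto &t : ts) t.join();
    return 0;
}
```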
11:00 am - 11:30 am | Talk
Using distributed shared-array abstractions in a virtualized message-driven execution environment
Phil Miller
The strength of the Charm++ programming model is the flexibility it
affords to remap computational objects for load balance and
locality. However, programs using the Multiphase Shared Arrays (MSA)
library have not been able to easily exploit this flexibility due to
restrictions of the implementation. This talk will outline the work
done to lift those limitations, and present some preliminary results
on the performance improvements that can be achieved.
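As a toy, single-process illustration of the phase discipline that MSA enforces (this is not the MSA API): accesses alternate between write/accumulate and read phases separated by an explicit sync, which is what gives the runtime freedom to place and migrate the data between phases. The class and method names below are invented for illustration.

```cpp
// Toy phased array: writes and reads may not be mixed within a phase.
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

class PhasedArray {
    std::vector<double> data;
    enum Phase { WRITE, READ } phase = WRITE;
public:
    explicit PhasedArray(std::size_t n) : data(n, 0.0) {}
    void accumulate(std::size_t i, double v) { assert(phase == WRITE); data[i] += v; }
    double read(std::size_t i) const         { assert(phase == READ);  return data[i]; }
    void syncToRead()  { phase = READ;  }    // in MSA this is a collective sync
    void syncToWrite() { phase = WRITE; }
};

int main() {
    PhasedArray hist(4);
    for (int worker = 0; worker < 3; ++worker)       // "workers" contribute
        for (std::size_t i = 0; i < 4; ++i) hist.accumulate(i, worker + 1);
    hist.syncToRead();                               // phase change
    for (std::size_t i = 0; i < 4; ++i) std::printf("bin %zu = %g\n", i, hist.read(i));
    return 0;
}
```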
11:30 am - 12:00 pm | Talk
Charj: A Language for Writing Better Charm Programs More Easily
Aaron Becker
Charm is a powerful system, but using it effectively is not always easy
or intuitive. Because Charm is implemented as a C++ library and associated
translator, it suffers from difficult-to-interpret error messages, duplication
of code, and the inability to perform optimizations. This talk describes
ongoing work on Charj, a language and compiler targeting the Charm runtime,
which aims to address these problems.
12:00 pm - 12:30 pm | Talk
Asynchronous message-driven programming in a shared-memory context
Pritish Jetley
In the context of shared-memory programs, application performance is typically
confounded by the following interrelated issues: dynamic task creation and
placement, synchronization and data movement costs, and critical path delays.
Charm++ offers a solution to these problems in the form of an object-based,
message-driven approach to the programming of irregular, shared-memory
programs. The defining features of this approach are:
- Overdecomposition of problems into medium-grained tasks, which engenders data locality;
- Implicit synchronization between objects, which is realized through the exchange of messages;
- Asynchrony of communication, which allows communication latency to be overlapped with useful computation;
- Automatic scheduling of dynamically generated tasks, which removes load imbalance;
- Prioritization of tasks, which prevents the critical path from being delayed.
We outline the productivity and performance benefits of this paradigm in the
context of two tree-based applications, namely, N-body computations using
the Barnes-Hut method, and the construction of SAH-balanced kd-trees for
efficient rendering of three-dimensional scenes.
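A minimal Charm++ fragment illustrating this message-driven style: invoking an entry method on a (possibly remote, possibly migrated) chare enqueues a message, and the runtime schedules the method when the message is delivered. The interface file, header names, and proxy usage in the comments are schematic assumptions.

```cpp
// Assumes a matching .ci interface file, e.g.
//   chare Counter { entry Counter(); entry void add(int value); };
// whose charmc-generated decl/def headers are included below.
#include "counter.decl.h"

class Counter : public CBase_Counter {
  int total = 0;
public:
  Counter() {}
  void add(int value) {                    // runs whenever an 'add' message is delivered
    total += value;
    CkPrintf("total is now %d on PE %d\n", total, CkMyPe());
  }
};

// Elsewhere in the program, possibly on another processor:
//   CProxy_Counter c = CProxy_Counter::ckNew();
//   c.add(5);      // asynchronous: enqueue a message, do not wait
//   c.add(7);      // the sender continues immediately with other work

#include "counter.def.h"
```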
12:30 pm - 01:30 pm
01:30 pm - 03:00 pm | Panel
Message-driven execution and migratability: niche or necessity?
Panelists: William D. Gropp, Laxmikant V. Kale, Ewing 'Rusty' Lusk, Satoshi Matsuoka, Thomas Sterling. Moderator: Orion Lawlor
These two ideas have been at the core of the Charm++ approach
to parallel programming. Some or all of the elements of these ideas (especially MDE)
have also occurred in the actor model, the J-machine, macro-dataflow, Earth,
ParalleX, and active messages.
Message-driven execution (MDE) is the idea that computation on a processor
should be scheduled based on the availability of data, typically arriving from
remote processors. It can also be called data-driven execution, except that
that phrase has been overloaded in different contexts. Benefits of MDE
include adaptive overlap of communication and computation, compositionality,
and prefetching enabled by the concomitant scheduler's queue.
Migratability is the notion that the work units and data units of
parallel programs should not be tied to processors by the programming model,
but rather allowed to migrate across processors under the control of an
adaptive runtime system. Migratability is useful for automated resource
management and fault tolerance, for example. Migratability requires
overdecomposition (aka virtualization, but that's another overloaded word),
i.e., the number of work units and data units should be larger than the number of
processors.
The question before the panel is: Will these features remain
in a small niche of parallel programming approaches, or are they so essential
in parallel programming of the future that they must be part of every
approach?
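As a sketch of what migratability asks of Charm++ application code (schematic, with assumed header and module names): an over-decomposed array element exposes its state through a PUP routine, after which the runtime can serialize it and move it to another processor for load balance or fault tolerance, without the programmer pinning it to a PE.

```cpp
// Migratable chare array element with a PUP routine for its state.
#include "particles.decl.h"
#include "pup_stl.h"      // PUP support for STL containers
#include <vector>

class Patch : public CBase_Patch {
  std::vector<double> positions;   // element state
  int step = 0;
public:
  Patch() {}
  Patch(CkMigrateMessage *m) {}    // constructor used on the destination PE

  void pup(PUP::er &p) {           // called for both checkpointing and migration
    CBase_Patch::pup(p);
    p | positions;
    p | step;
  }
};

#include "particles.def.h"
```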
03:00 pm - 03:15 pm
Afternoon Technical Session: Applications (Chair: Ramprasad Venkataraman)
03:15 pm - 03:45 pm | Talk
ChaNGa
Prof. Thomas Quinn, Professor, Department of Astronomy, University of Washington at Seattle
Astrophysical N-body simulation is a uniquely challenging application for
parallel performance. The large dynamic range in space calls for irregular
data structures, while the large dynamic range in time poses challenges for
load balancing. The successes and pitfalls of developing the parallel N-body
code, ChaNGa, in Charm++ will be presented.
03:45 pm - 04:15 pm | Talk
Asynchronous Message-driven Runtime Innovations to Enable Biomolecular Simulations of 100 Million Atoms on Petascale Machines
Dr. James C. Phillips and Chao Mei
A 100-million-atom biomolecular simulation with NAMD is one of the three
scientific benchmarks for the Blue Waters machine, the NSF-funded
sustained-petascale machine at the University of Illinois. To simulate such a huge molecular
system on a machine with hundreds of thousands of cores presents great
challenges. These issues include not only the traditional optimization problem
of achieving good strong-scaling results, but also new problems
due to the problem size, such as loading input data into the simulation and
outputting trajectory data to the file system. Unlike other application
optimizations, we adopted a holistic approach to optimizing both the
application itself (NAMD) and its underlying asynchronous message-driven
runtime (Charm++) for this enablement. In this paper, we examine the issues
one by one, and explore the techniques employed to overcome them. In
particular, we designed and optimized a new mode in the runtime to take
advantage of the wide multicore nodes installed on petascale machines, then
demonstrated how this new mode improves the performance of this simulation by
about 20%. In addition, a new hybrid scheme is introduced in the runtime to handle
the large memory footprint required when doing load balancing for the
simulation. Without the techniques described in this paper, the 100M-atom
simulation in NAMD would not be able to run, not to mention scale on the
petascale machine. By using those techniques, we were able to achieve very good
performance for the 100M-atom simulation, scaling up to 100K cores, on the
Jaguar Cray XT5 machine at NCCS.
04:15 pm - 04:45 pm | Talk
OpenAtom
Dr. Glenn Martyna (IBM Research) and Eric Bohm
The goal of simulation studies is to provide insight into important systems of scientific and technical interest. Today, approaching these systems involves accurately treating complex heterogeneous interfaces. The modeling of nanostructures is reviewed with application to problems in engineering, physics, and biochemistry. In particular, computer models of phase change memories and transparent electrodes for solar cells are described along with the novel parallel algorithms underlying the computations. Of particular interest is the discovery chemistry underlying the doping of graphene sheets for use in photovoltaic cells.
04:45 pm - 05:15 pm | Talk
Towards Message-driven Mixed-Quantum/Classical Dynamics in Atomic Simulation
Dr. Chris Harrison
Massively parallel simulations of complex atomic and molecular processes using
mixed quantum/classical dynamics remain an important goal of computational
physics and chemistry. Mixed quantum/classical simulations promise new insight
into the coupling between large scale dynamics, treated classically, and key
quantum events, such as chemical reactions, excited state transitions and
similar electronic phenomena treated quantum mechanically. Example problems to
benefit include the conversion of light into energy in photosynthetic and
solar-cell systems, or the coupling between chemical reactions and protein
dynamics in enzyme performance. Previous quantum/classical simulation codes
using leapfrog algorithms offered limited parallelism and concurrency. We
present the early evolution of a message-driven mixed-quantum/classical
dynamics interface using multiphase shared arrays, and some preliminary
results.
5:15 pm - 5:45 pm | Fun
Annual PPL Photograph
David Kunzman
8:45 am - 09:00 am
Morning Technical Session: Tutorial
09:00 am - 10:30 am | Tutorial
Charm++
Coordinator: Eric Bohm
12:00 pm - 01:00 pm
Afternoon Technical Session: Tutorial
01:00 pm - 02:00 pm | Fun
Tour of the Blue Waters Facility
Organized by NCSA
02:00 pm - 05:00 pm | Tutorial
BigSim Simulation Framework
Dr. Celso Mendes and Ryan Mokos
BigSim is a simulation system that allows application programmers to
develop, debug and tune/scale/predict the performance of applications
on large machines. One of its main advantages is to allow the study of
application behavior on future machines, so that those applications
can be ready when the machine first becomes operational. It also allows
easy "offline" experimentation of parallel performance tuning strategies,
without using the full parallel computer.
This tutorial will cover the main aspects involved in using BigSim,
including presentation of its major components: emulation and simulation.
It will also show how to prepare MPI applications for use in BigSim,
how to analyze the emulation traces used in the simulation, how to
obtain and visualize performance data for the simulated code, and
how to produce various statistics about the network of the machine
being simulated. To illustrate how different network models can be
used in BigSim, some examples will be presented covering both simple
models and more advanced models capable of modeling network congestion,
such as the model for the Blue Waters interconnect.