Time | Type | Description | Slides | Webcast
8:30 am - 9:00 am |
|
Morning |
|
9:00 am - 9:15 am |
Talk
|
Opening Remarks
Prof. Laxmikant V. Kale
|
|
|
9:15 am - 10:00 am |
Talk
|
Charm++ Research Agenda: Recent Developments and Plans
Prof. Laxmikant V. Kale
Charm++ and the rich research agenda engendered by its idea of object-based over-decomposition made significant progress during the past year. I will review the basic concepts that have been the foundation of our approach to parallel programming, and highlight specific achievements of the past year. These include progress on our production-quality, collaboratively developed science and engineering applications, including NAMD (biophysics), OpenAtom (quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the progress and challenges in our agenda of developing higher-level parallel languages.
|
|
|
10:00 am - 10:15 am |
|
Morning |
Technical Session: Charm++ on Blue Waters (Chair: Eric Bohm) |
10:15 am - 10:45 am |
Talk
|
Adaptive MPI
Celso Mendes
In this talk, we discuss Adaptive MPI (AMPI), an implementation of the popular MPI standard.
AMPI is based on Charm++ and implements traditional MPI processes as user-level migratable threads.
Thus, AMPI brings advanced features such as dynamic load balancing and automatic overlap of
computation and communication to traditional MPI codes. Porting legacy MPI codes to AMPI typically
involves no change to the sources. We will review AMPI's basic features and discuss its current status.
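As a concrete illustration (not taken from the talk), the kind of code AMPI targets is a plain MPI program such as the sketch below; under AMPI each rank would execute as a user-level, migratable thread, so the rank count can exceed the physical processor count.

```cpp
// Minimal MPI program; under AMPI each rank would run as a user-level,
// migratable thread rather than an OS process (unmodified MPI code).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each (virtual) rank reports in; with AMPI, 'size' can exceed the
    // number of physical cores because ranks are virtualized.
    std::printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Hypothetically, such a program would be built with AMPI's compiler wrapper and launched through charmrun, with a runtime option (commonly +vp) selecting how many virtual ranks to place on the available processors; the exact wrapper and option names here are assumptions about typical AMPI usage, not details from the talk.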
|
|
|
10:45 am - 11:15 am |
Talk
|
The BigSim Parallel Simulation System
Gengbin Zheng and Ryan Mokos
PetaFLOPS-class computers are currently being developed and even larger computers are being planned. Our BigSim project is aimed at developing tools that allow one to develop, debug, and tune/scale/predict the performance of applications before such machines are available, so that the applications can be ready when the machine first becomes operational. It also allows easier "offline" experimentation with parallel performance tuning strategies, without using the full parallel computer. For machine architects, BigSim provides a method for modeling the impact of architectural choices (including the communication network) on actual, full-scale applications. In this talk, we will present our simulation framework, which consists of an emulator and a simulator; we will focus on recent progress in integrating instruction-level simulation into our framework and on out-of-core emulation support.
|
|
|
11:15 am - 11:45 am |
Submitted Paper
|
Automatic MPI to AMPI Conversion using Photran
Stas Negara
Adaptive MPI (AMPI) is an implementation of the Message Passing Interface (MPI)
standard. AMPI provides MPI programs with features such as dynamic load
balancing, virtualization, and checkpointing. AMPI runs each MPI process in a
user-level thread, which causes problems when an MPI program uses global
variables. Manually removing the global variables from a program is tedious and
error-prone. In this talk, we present a tool that automates this task with a
source-to-source transformation for Fortran. We evaluate our tool on the
real-world, large-scale FLASH code and present preliminary results of running
FLASH on AMPI. Our results demonstrate that the tool makes it easier to use
AMPI.
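To illustrate the issue the tool addresses (the actual transformation operates on Fortran sources via Photran; the snippet below is only a C++ analogue of the idea): a global variable is shared by all user-level threads within a process, so each virtual MPI rank must receive its own copy, for instance by moving the global into per-rank state that is passed explicitly.

```cpp
// Sketch: why globals break user-level-threaded MPI ranks, and one fix.
// (Illustrative C++ analogue of the Fortran transformation described above.)
#include <cstdio>

// BEFORE: a global is shared by every virtual rank in the same process,
// so concurrent ranks overwrite each other's value.
// int iteration_count;

// AFTER: the former global lives in per-rank state passed explicitly.
struct RankState {
    int iteration_count = 0;  // was a global variable
};

void do_step(RankState& s) {
    s.iteration_count++;  // each virtual rank updates only its own copy
}

int main() {
    RankState rank0, rank1;   // stand-ins for two virtual ranks
    do_step(rank0);
    do_step(rank0);
    do_step(rank1);
    std::printf("rank0: %d, rank1: %d\n",
                rank0.iteration_count, rank1.iteration_count);  // prints 2, 1
    return 0;
}
```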
|
|
|
11:45 am - 12:15 pm |
Submitted Paper
|
BigDFT with AMPI: Preliminary Results
Jean-Francois Mehaut
In this paper, we show what we have done to adapt BigDFT, an atomistic simulation code, to AMPI. We compare the performance
of the MPI and AMPI versions of BigDFT. We also evaluate the impact of GPUs on this performance.
|
|
|
12:15 pm - 1:00 pm |
|
Afternoon |
Technical Session: Large scale MD Simulations (Chair: Celso Mendes) |
1:00 pm - 1:45 pm |
Talk
|
NAMD Preparation for Blue Waters
Eric Bohm
Blue Waters offers exciting new opportunities for the simulation of
large biological molecular systems. This talk will cover some of the
ways we are extending NAMD to shine on the Blue Waters architecture,
such as support for molecular systems larger than 100 million atoms,
parallel startup, parallel I/O, performance tuning for Power 7,
conversion of SSE routines to VSX, uses of SMT and SMP, and more.
|
|
|
1:45 pm - 2:30 pm |
Talk
|
Charm++ Hits and Misses - A NAMD Perspective
Jim Phillips, Beckman Institute, University of Illinois
NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique now used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond, developed by IBM and D. E. Shaw respectively. NAMD is developed using Charm++ and benefits from its adaptive communication-computation overlap and dynamic load balancing.
|
|
|
2:30 pm - 3:00 pm |
Talk
|
Hierarchical Load Balancing for Large Scale Supercomputers
Gengbin Zheng
Parallel machines with hundreds of thousands of processors are already in use.
Ensuring good load balance is critical for
scaling certain classes of parallel applications on these machines.
Centralized load balancing algorithms face scalability problems,
especially on machines with a relatively small amount of memory. Fully
distributed load balancing algorithms, on the other hand, tend to yield poor
load balance on very large machines. In this talk, we present an automatic
adaptive hierarchical load balancing method that overcomes the scalability
challenges of centralized schemes and the poor solutions of traditional distributed
schemes. This is done by creating multiple levels of aggressive load balancing
domains which form a tree.
We show performance data of the hierarchical load balancing method
on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also
demonstrate the successful deployment of the method in NAMD.
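As a rough illustration of balancing within one domain of such a tree (a sketch with an assumed greedy heuristic, not the algorithm presented in the talk): processors are grouped into domains, load is balanced inside each domain, and the same step is repeated one level up across domain totals.

```cpp
// Sketch: one level of a hierarchical load balancer. Within a domain of
// numProcs processors, object loads are assigned greedily to the currently
// least-loaded processor (longest-processing-time heuristic). A hierarchical
// scheme repeats this across domain roots, forming a tree of domains.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> balanceDomain(const std::vector<double>& objLoads, int numProcs) {
    std::vector<int> order(objLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    // Heaviest objects first.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoads[a] > objLoads[b]; });

    // Min-heap of (current load, processor id).
    using P = std::pair<double, int>;
    std::priority_queue<P, std::vector<P>, std::greater<P>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (int obj : order) {
        auto [load, p] = procs.top();   // least-loaded processor so far
        procs.pop();
        assignment[obj] = p;
        procs.push({load + objLoads[obj], p});
    }
    return assignment;
}

int main() {
    // Toy example: 8 objects balanced over a domain of 4 processors.
    std::vector<double> loads = {5, 1, 3, 2, 8, 1, 4, 2};
    std::vector<int> a = balanceDomain(loads, 4);
    for (size_t i = 0; i < a.size(); ++i)
        std::printf("object %zu -> proc %d\n", i, a[i]);
    return 0;
}
```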
|
|
|
3:00 pm - 3:15 pm |
|
Afternoon |
Technical Session (Chair: Gengbin Zheng) |
3:15 pm - 3:45 pm |
Talk
|
Scalable Fault Tolerance with Charm++
Esteban Meneses
In the next few years we will witness the deployment of unprecedentedly large supercomputers, comprising hundreds of thousands of cores. Rough estimates predict that the mean time between failures on those machines will be shorter than one day. For an application that runs for a long time and uses a large core count, the only way to survive in this environment is to use fault tolerance mechanisms. One possibility is to rely on a runtime system that can recover from failures automatically, like Charm++. In this talk, we will present recent developments in the Charm++ infrastructure that enable us to deal with failures at high core counts.
|
|
|
3:45 pm - 4:30 pm |
Talk
|
ChaNGa: Charm N-body GrAvity solver
Thomas R. Quinn, Professor, Department of Astronomy, University of Washington
Simulations of galaxies forming in their cosmological context pose a
number of challenges to performance on large parallel machines. The
first is the very non-local nature of gravitational forces: galaxies
are influenced by gravitational forces originating tens of
megaparsecs away, requiring significant communication in the force
solver. Second is the enormous spatial dynamic range involved, from
megaparsecs to sub-parsec scales, requiring dynamic hierarchical data
structures. Third is the vast range of time scales involved, from less than one
million years to the age of the Universe, posing significant
challenges for load balancing. This talk will present how these
challenges have been addressed in the design of ChaNGa, the Charm
N-body GrAvity solver.
|
|
|
4:30 pm - 6:00 pm |
Panel Discussion
|
Exascale by 2018. Really?
Your desktop can probably do a few billion operations per second. If you multiply that a million-fold, you get a petaFLOP/s machine. At Illinois, we will be deploying a multi-petaFLOP/s machine during the next year or so. Now imagine a machine a thousand times more powerful than that! That is an exascale machine, and scientists and funding agencies have been discussing the development of such a machine in eight years' time, by 2018. Can we build such a powerful machine? What could we do with it? Can society afford it? How much electricity will it consume, and can we reduce that to a practical number? What kinds of software innovations are needed to program it? We will discuss these questions with experts.
|
|
|
6:00 pm - 7:00 pm |
|
7:00 pm onwards |
Workshop Banquet (for registered participants only) |
|
8:30 am - 9:00 am |
|
Morning |
Technical Session (Chair: L. V. Kale) |
9:00 am - 10:00 am |
Keynote
|
An Off-The-Wall, Possibly CHARMing View of Future Parallel Applications
James C. Browne, Regents Chair in Computer Sciences, University of Texas at Austin
Development methods for HPC applications change slowly and will continue to change slowly.
It is thus safe to suggest radical changes, because the chance they will be adopted quickly
is low. This talk will sketch a few possible futures for HPC application development which
are considerably different from current practice. The first part of the talk will sketch
possible influences on development practices, and the second some responses to these influences,
including components, self-management, a merger of grid and HPC developments, and tools based
on expert systems technology.
|
|
|
10:00 am - 10:45 am |
Talk
|
Processor Virtualization in Weather Models
Eduardo Rodrigues
In this work, we investigate the usefulness of processor virtualization in
weather models as a tool for load balancing this type of application. This
strategy can address both issues raised by Xue et al.: (1) it
simplifies the implementation of a load balancing scheme, and (2) it can hide
the overhead of migrating load by overlapping it with computation. In our experiments we
used the weather model BRAMS.
|
|
|
10:45 am - 11:00 am |
|
11:00 am - 11:30 am |
Submitted Paper
|
NUMA Support for Charm++
Christiane Pousa Ribeiro
This paper presents the work we have done on Charm++ to provide transparent NUMA support. This support
is based on three parts: a command-line option to bind application data to memory banks, an
interleaved heap, and a NUMA-aware memory allocator.
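For background on the building blocks named above (this sketch uses libnuma directly and is not the paper's Charm++ implementation): on Linux, interleaved and node-bound allocations can be obtained as follows.

```cpp
// Sketch of NUMA-aware allocation with libnuma (link with -lnuma).
// Interleaving spreads pages round-robin across memory banks, which is one
// of the mechanisms an interleaved heap can build on.
#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA not available on this system\n");
        return 1;
    }

    const std::size_t bytes = 1 << 20;  // 1 MiB

    // Pages interleaved across all allowed NUMA nodes.
    void* interleaved = numa_alloc_interleaved(bytes);

    // Pages bound to a specific node (node 0 here).
    void* on_node0 = numa_alloc_onnode(bytes, 0);

    // ... application data structures would be placed accordingly ...

    numa_free(interleaved, bytes);
    numa_free(on_node0, bytes);
    return 0;
}
```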
|
|
|
11:30 am - 12:00 pm |
Talk
|
Parallel Sorting
Edgar Solomonik
Efficiently scaling parallel sorting on modern supercomputers is inhibited by the communication-intensive problem of migrating large amounts of data between processors. The challenge is to design a highly scalable sorting algorithm that uses minimal communication, maximizes overlap between computation and communication, and uses memory efficiently. This talk presents a scalable extension of the Histogram Sorting method which modifies the original algorithm to minimize message contention and exploit overlap. We compare the performance of Histogram Sort, Sample Sort, and Radix Sort, all implemented in Charm++. The choice of algorithm as well as the importance of the optimizations are validated by performance tests on two predominant modern supercomputer architectures: the Cray XT4 at ORNL (Jaguar) and the Blue Gene/P at ANL (Intrepid).
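As a rough sketch of the histogramming idea behind Histogram Sort (a serial stand-in, not the Charm++ implementation from the talk): each processor counts how many of its local keys fall below each candidate splitter; a reduction sums these counts globally, and the splitters are refined until the division of keys across processors is sufficiently even.

```cpp
// Sketch: one histogramming step of Histogram Sort. This is the per-processor
// local counting; in the parallel algorithm these counts are summed with a
// reduction and used to refine the candidate splitters.
#include <algorithm>
#include <cstdio>
#include <vector>

// For each candidate splitter, count local keys strictly below it.
std::vector<long> localHistogram(const std::vector<int>& localKeys,
                                 const std::vector<int>& splitters) {
    std::vector<int> sorted = localKeys;
    std::sort(sorted.begin(), sorted.end());
    std::vector<long> counts;
    counts.reserve(splitters.size());
    for (int s : splitters) {
        counts.push_back(std::lower_bound(sorted.begin(), sorted.end(), s) -
                         sorted.begin());
    }
    return counts;
}

int main() {
    std::vector<int> keys = {42, 7, 19, 88, 3, 55, 61, 27};
    std::vector<int> splitters = {20, 50, 80};  // candidate cut points
    std::vector<long> h = localHistogram(keys, splitters);
    for (size_t i = 0; i < h.size(); ++i)
        std::printf("keys below %d: %ld\n", splitters[i], h[i]);
    return 0;
}
```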
|
|
|
Noon |
|
Afternoon |
Technical Session (Chair: Ryan Mokos) |
1:30 pm - 2:15 pm |
Talk
|
David Kunzman and Lukasz Wesolowski
Accelerators such as Graphics Processing Units (GPUs) and specialized
cores, such as the Synergistic Processing Elements (SPEs) on the Cell
processor, are being used with increasing frequency in parallel
computing to speed up computationally heavy portions of code. These systems
comprise multiple types of processing elements, each with unique
characteristics, strengths, weaknesses, and programming paradigms. Developing
applications for them can be challenging since many architectural details must be taken
into account. In this talk we will summarize ongoing efforts to allow the
Charm++ runtime system to utilize accelerators while abstracting away as
many architectural details as possible. Specifically, we will cover work
related to the Cell processor and GPUs.
|
|
|
2:15 pm - 3:00 pm |
Talk
|
Pritish Jetley
We study the use of clusters of general purpose graphics processors for tree-based N-body simulations.
We investigate key performance issues in the context of clusters of GPUs. These include
kernel organization and efficiency, the balance between tree traversal and force computation
work, grain size selection through the tuning of offloaded work request sizes, and the
reduction of sequential bottlenecks. The effects of various application parameters are
studied and experiments are carried out to quantify gains in performance. Our studies
are carried out in the context of a production-quality parallel cosmological simulator
called ChaNGa. We highlight the re-engineering of the application to make it more suitable
for GPU-based environments. Finally, we present scaling performance results from experiments
on NCSA's Lincoln GPU cluster.
|
|
|
3:00 pm - 3:15 pm |
|
3:15 pm - 4:00 pm |
Talk
|
Debugging Large Scale Parallel Applications
Filippo Gioachin
In this talk, I will present recent research in the field of parallel debugging. The main question discussed will be: how can we debug an application on thousands of processors without burning our entire allocation on the machine? The two techniques I will present are Virtualized Debugging and Processor Extraction.
|
|
|
4:00 pm - 4:45 pm |
Talk
|
Automating Topology Aware Mapping for Supercomputers
Abhinav Bhatele
Parallel computing is entering the era of petascale machines. This era brings
enormous computing power to us and new challenges to harness this power
efficiently. Machines with hundreds of thousands of processors already exist,
connected by complex interconnect topologies. Network contention is becoming an
increasingly important factor affecting overall performance. The farther
messages travel on the network, the greater the chance of resource
sharing between messages and hence of contention. Recent studies on IBM Blue
Gene and Cray XT machines have shown that under contention, message latencies
can be severely affected.
Mapping communicating tasks onto nearby processors can minimize contention and
lead to better application performance. In this talk, I will propose algorithms
and techniques for automatic mapping of parallel applications, to relieve
application developers of this burden. I will first demonstrate the effect of
contention on message latencies and use these studies to guide the design of
mapping algorithms. I will introduce the hop-bytes metric for the evaluation of
mapping algorithms and argue that it is a better metric than the previously
used maximum dilation metric. I will then discuss in some detail the mapping
framework, which comprises topology-aware mapping algorithms for parallel
applications with regular and irregular communication patterns.
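For reference, a minimal sketch of how the hop-bytes metric could be computed for a given mapping (the data structures and 3D-mesh assumption here are illustrative, not the talk's framework): hop-bytes weights each message by the number of network hops it traverses.

```cpp
// Sketch: computing hop-bytes for a task-to-processor mapping on a 3D mesh.
// hop-bytes = sum over messages of (message size in bytes) * (hops traveled),
// where hops is taken here as the Manhattan distance between the sender's and
// receiver's processor coordinates (torus wraparound ignored for simplicity).
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };

struct Message {
    int src, dst;   // task ids
    long bytes;     // message size
};

long long hopBytes(const std::vector<Message>& msgs,
                   const std::vector<Coord>& taskToProc) {
    long long total = 0;
    for (const Message& m : msgs) {
        const Coord& a = taskToProc[m.src];
        const Coord& b = taskToProc[m.dst];
        int hops = std::abs(a.x - b.x) + std::abs(a.y - b.y) + std::abs(a.z - b.z);
        total += (long long)m.bytes * hops;
    }
    return total;
}

int main() {
    // Two tasks mapped to nearby processors, one mapped far away.
    std::vector<Coord> mapping = {{0, 0, 0}, {1, 0, 0}, {3, 2, 1}};
    std::vector<Message> msgs = {{0, 1, 1024}, {0, 2, 1024}};
    std::printf("hop-bytes = %lld\n", hopBytes(msgs, mapping));  // 1024 + 6144
    return 0;
}
```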
|
|
|
4:45 pm - 5:30 pm |
Event
|
Annual PPL Group Photograph
|
|
|
|
9:30 am - 10:00 am |
|
Morning |
Technical Session (Chair: Ramprasad Venkataraman) |
10:00 am - 10:30 am |
Talk
|
Implementing Dense LU Factorizations in Parallel
Isaac Dooley
This talk will give an overview of how dense matrix LU factorizations are performed in parallel. LU factorization is an important problem because it is used to test the speed of supercomputers for the Top500 list. A new Charm++ implementation will be discussed, along with the common MPI implementation, HPL, and a UPC implementation.
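As background (a minimal serial sketch without pivoting, not the parallel Charm++ implementation discussed in the talk): LU factorization rewrites a matrix A as the product of a unit lower triangular L and an upper triangular U; production codes such as HPL add partial pivoting and operate on distributed blocks.

```cpp
// Sketch: in-place LU factorization without pivoting (Doolittle style).
// After the call, the strictly lower triangle of A holds L (unit diagonal
// implied) and the upper triangle holds U. Parallel implementations
// distribute the matrix in blocks and pipeline these updates.
#include <cstdio>
#include <vector>

void luFactorize(std::vector<std::vector<double>>& A) {
    const size_t n = A.size();
    for (size_t k = 0; k < n; ++k) {
        for (size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];               // multiplier l_ik
            for (size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j]; // trailing-submatrix update
        }
    }
}

int main() {
    std::vector<std::vector<double>> A = {{4, 3}, {6, 3}};
    luFactorize(A);
    // Expected: L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
    std::printf("l21=%.2f  u11=%.2f u12=%.2f u22=%.2f\n",
                A[1][0], A[0][0], A[0][1], A[1][1]);
    return 0;
}
```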
|
|
|
10:30 am - 11:00 am |
Talk
|
Stochastic Programming : Aircraft Allocation Problem
Gagan Gupta
We present our initial results on the parallelization of a classic
example of two-stage stochastic programming with linear recourse. This
problem, inspired by the Air Mobility Command's operational problem,
involves the assignment of aircraft to various bases for a
period of one month so that the subsequent disruptions due to
emergencies and variable demands are minimized. We discuss some of the
peculiar aspects of the problem (coarse-grain computation, dependency
in the execution times, large message sizes) and the approaches we
plan to take to achieve good speedup.
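For readers unfamiliar with the terminology, the generic form of a two-stage stochastic linear program with recourse (a textbook formulation, not the talk's specific aircraft-allocation model) is:

```latex
% First stage: choose x (e.g., the allocation) before uncertainty is revealed.
\min_{x \ge 0} \; c^{T} x + \mathbb{E}_{\xi}\!\left[ Q(x, \xi) \right]
\quad \text{subject to} \quad A x = b

% Second stage (linear recourse): after scenario \xi = (q, h, T, W) is
% revealed, corrective action y is taken at minimum cost.
Q(x, \xi) \;=\; \min_{y \ge 0} \; q^{T} y
\quad \text{subject to} \quad W y = h - T x
```

Each scenario's second-stage problem is an independent linear program, which is what gives the problem the coarse-grain structure mentioned above.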
|
|
|
11:00 am - 11:30 am |
Talk
|
OpenAtom
Glenn Martyna, IBM Research
The goal of simulation studies is to provide insight into important systems of scientific and technical interest. Today, approaching these systems involves treating complex heterogeneous interfaces accurately. The modeling of nanostructures is reviewed, with applications to problems in engineering, physics, and biochemistry. In particular, computer models of phase-change memories and transparent electrodes for solar cells are described, along with the novel parallel algorithms underlying the computations. Of particular interest is the discovery of the chemistry underlying the doping of graphene sheets for use in photovoltaic cells.
|
|
|
11:30 am - 12:00 pm |
|
12:00 pm - 1:00 pm |
Tour
|
NCSA Blue Waters Facility Tour
|
|
|