Time | Type | Description | Slides | Webcast |
8:00 am - 8:45 am |
Continental Breakfast / Registration |
Morning |
|
8:45 am - 9:00 am |
Welcome
|
Opening Remarks
Prof. Laxmikant V. Kale
|
|
|
9:00 am - 9:45 am |
Invited Talk
|
X10 at Scale
Olivier Tardieu, IBM
X10 is an open-source imperative concurrent object-oriented programming language developed by IBM Research to ease the programming of scalable concurrent and distributed applications. In this talk, I will report and reflect on our experience running HPC application kernels and graph algorithms implemented in X10 on a Petaflop IBM Power 775 supercomputer (with up to 55,000 Power7 cores). I will discuss design and implementation decisions that make it possible to achieve competitive performance at scale while retaining X10's productivity. In particular, I'll describe our implementation of the Unbalanced Tree Search benchmark (UTS), which illustrates X10's handling of irregular parallelism.
|
|
|
9:45 am - 10:30 am |
Invited Talk
|
The Chapel Runtime
Greg Titus, Cray
The Chapel runtime provides a variety of services during Chapel program execution. Among these services are managing memory, performing remote communication, handling parallelism and synchronization, and many others. In this talk I'll give an overview of the runtime's architecture and implementation, describe its relationship with Chapel's built-in modules, and give several examples of how Chapel language constructs translate into runtime activities. I'll also discuss how we see the responsibilities of the runtime and its relationships with other parts of the Chapel software stack changing in the future.
|
|
|
10:30 am - 11:00 am |
|
Morning |
Technical Session: Run Time System (Chair: Gengbin Zheng) |
11:00 am - 11:30 am |
Talk
|
A Multi-Paradigm Approach to High-Performance Scientific Programming
Pritish Jetley, University of Illinois at Urbana-Champaign
In this talk, we will consider the following key questions:
(1) Is it possible to write parallel programs in a modular manner, so that independently developed modules (e.g. a finite element code and a computational fluid dynamics code, or even a chemical kinetics module and a cortical neuron simulator) can be composed in a noninvasive and efficient manner?
(2) Is it possible for non-expert programmers to write parallel code in abstract specifications without losing performance? That is to say, can we reconcile the two opposing forces of performance and productivity?
These questions are of immediate importance to the developers of complex multi-physics codes intended to scale to hundreds of thousands of processors. Our approach to this problem is characterized by the following salient features:
1. Specialization: programs are written using a set of abstract and specialized, but individually incomplete frameworks/languages (collectively, programming paradigms) to engender productivity of programming.
2. Interoperability: we provide a common runtime substrate for the communication of data between modules written in different frameworks/languages, thereby enabling completeness of expression.
3. Runtime Management: our abstractions, frameworks and languages are based on an adaptive run time system (ARTS), namely Charm++, that dynamically optimizes the execution of a running program.
We will discuss these inter-related ideas, and examine a few languages and frameworks (specialized and otherwise) in the Charm++ ecosystem.
|
|
|
11:30 am - 12:00 pm |
Talk
|
LRTS: a Portable High Performance Lower-level Communication Interface
Yanhua Sun, University of Illinois at Urbana-Champaign
As modern interconnection networks grow increasingly complex, it is challenging to obtain good performance for a variety of parallel applications across different supercomputers. Over the years, the asynchronous message-driven, runtime-based parallel programming model has proven to be scalable and productive, with Charm++ as one major instance. To better exploit the performance and portability of Charm++ on modern supercomputers, we define a set of functions that form a universal lower-level communication interface. In this talk, I will present this set of functions and how they support the asynchronous runtime system. Features such as persistent communication and intra-node communication are also discussed. The interface has been implemented on top of MPI, Cray uGNI, and IBM PAMI. I will present results for small benchmarks and real applications on state-of-the-art supercomputers.
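To give a flavor of what a machine-independent lower-level communication layer can look like, here is a minimal C++ sketch of such an interface; the names and signatures below are illustrative assumptions, not the actual LRTS API.

```cpp
// Illustrative sketch of a minimal machine-layer communication interface.
// All names below are hypothetical, not the actual LRTS functions.
#include <cstddef>

// Handler the runtime registers to consume messages delivered by the machine layer.
typedef void (*recv_handler_t)(void *msg, std::size_t len);

struct machine_layer_ops {
  void (*init)(int *argc, char ***argv);                  // bring up the network layer
  void (*send)(int dest_pe, void *msg, std::size_t len);  // asynchronous send to a processor
  void (*advance)();                                      // poll/progress pending communication
  void (*barrier)();                                      // machine-level barrier
  void (*exit_layer)();                                   // tear down
};

// A portable runtime selects one implementation (MPI, uGNI, PAMI, ...) at build
// time and registers its message handler with it.
void register_recv_handler(recv_handler_t handler);
```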
|
|
|
12:00 pm - 12:30 pm |
Talk
|
Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure
Xiang Ni, University of Illinois at Urbana-Champaign
As the scale of machines increases, the HPC community has seen a steady decrease in system reliability, and hence an increase in down time. Moreover, soft errors such as bit flips do not prevent execution but generate incorrect results. Checkpoint/restart is by far the most commonly used fault tolerance method for hard errors, and its efficiency and scalability have been improved with recent research. In this talk, we will discuss the asynchronous double in-memory checkpoint scheme, which can significantly hide the checkpoint overhead by overlapping checkpointing with application execution. Soft errors are becoming more important even at terrestrial altitudes because of shrinking feature sizes in processor manufacturing. Long-running programs with only traditional fault tolerance support, such as checkpointing, have a high chance of ending up with incorrect results due to soft-error corruption. We will also talk about how replication can enhance the checkpoint/restart technique to provide soft error protection for applications. At the same time, replication can increase program efficiency in environments where the fail-stop failure rate is high. We evaluate our approach using multiple benchmarks written in Charm++ and MPI, including stencil codes and molecular dynamics mini-applications. The benchmarks show minimal overhead when scaled to 32768 cores.
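The core of the double in-memory checkpoint idea is that each process keeps one checkpoint copy locally and sends a second copy to a buddy process, so the data survives a single node failure. The MPI-based sketch below illustrates only that structure; it is not the Charm++ implementation, and the ring buddy mapping and equal state sizes are simplifying assumptions.

```cpp
// Sketch of double in-memory checkpointing with a buddy process (illustration only).
#include <mpi.h>
#include <vector>

struct Checkpoint {
  std::vector<char> local_copy;   // checkpoint of my own state, kept in memory
  std::vector<char> buddy_copy;   // checkpoint received from my buddy
};

// Store one checkpoint copy locally and start sending a second copy to a buddy
// rank; nonblocking calls let the exchange overlap with continued execution.
void double_checkpoint(const std::vector<char> &state, Checkpoint &ckpt,
                       MPI_Comm comm, MPI_Request reqs[2]) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int buddy = (rank + 1) % size;            // simple ring buddy mapping (assumption)
  int prev  = (rank - 1 + size) % size;

  ckpt.local_copy = state;                  // first copy stays in local memory
  ckpt.buddy_copy.resize(state.size());     // assumes all ranks have equal state sizes

  MPI_Isend(ckpt.local_copy.data(), (int)ckpt.local_copy.size(), MPI_BYTE,
            buddy, /*tag=*/7, comm, &reqs[0]);
  MPI_Irecv(ckpt.buddy_copy.data(), (int)ckpt.buddy_copy.size(), MPI_BYTE,
            prev, /*tag=*/7, comm, &reqs[1]);
  // The caller waits on reqs (MPI_Waitall) before taking the next checkpoint.
}
```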
|
|
|
12:30 pm - 01:30 pm |
|
Afternoon |
Technical Session: Performance at Scale (Chair: Pritish Jetley) |
1:30 pm - 2:00 pm |
Invited Talk
|
Intuitive Visualizations for Performance Analysis at Scale
Todd Gamblin, Lawrence Livermore National Laboratory
Performance data is highly complex and often difficult to interpret for even a single process. As concurrency levels and on-node complexity rise in modern supercomputers, the difficulty of understanding performance problems also increases. Traditional performance analysis tools can provide some insight into where a code spends its time, but they typically leave to the programmer the task of ascribing meaning to the numbers.
At LLNL, the PAVE project is developing techniques to map performance data to more intuitive domains, and to visualize the result. For example, to analyze a data-dependent problem in a physics simulation, we may want to know how performance measurements relate to the physics mesh. Or, to optimize communication within a parallel load balancer, we need to know more about communication patterns to understand which processes are waiting. In this talk, we give an overview of the tools and techniques we have developed for performance measurement and visualization, and we describe the insights we have gained from this approach.
|
|
|
2:00 pm - 2:30 pm |
Talk
|
Distributed Load Balancing
Harshitha Menon, University of Illinois at Urbana-Champaign
As we move towards the exascale era, dynamic load balancing will become critical for achieving good system utilization. With a large number of cores, centralized load balancing schemes, which collect global information and compute decisions at a central location, are not scalable. In contrast, fully distributed strategies are scalable but do not produce balanced work distributions because they tend to consider only local information. We will talk about a fully distributed algorithm for persistence-based load balancing which achieves good load balance with low overhead, and compare it with other load balancing strategies.
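To illustrate the flavor of a fully distributed, persistence-based scheme, the sketch below shows one possible per-processor decision step: using the average load learned through gossip and a sample of underloaded processors, an overloaded processor greedily plans migrations. This is an assumption about how such a step could look, not the algorithm presented in the talk.

```cpp
// Sketch of one distributed load-balancing decision step (illustration only).
#include <random>
#include <utility>
#include <vector>

struct Task { int id; double load; };

// Given my measured load, the average load learned via gossip, and a sample of
// underloaded processors, greedily pick tasks to migrate until I am near average.
std::vector<std::pair<int, int>>  // (task id, destination processor)
plan_migrations(const std::vector<Task> &my_tasks, double my_load, double avg_load,
                const std::vector<int> &underloaded_pes, std::mt19937 &rng) {
  std::vector<std::pair<int, int>> plan;
  if (underloaded_pes.empty() || my_load <= avg_load) return plan;

  std::uniform_int_distribution<std::size_t> pick(0, underloaded_pes.size() - 1);
  double excess = my_load - avg_load;
  // Persistence assumption: a task's recent load predicts its near-future load.
  for (const Task &t : my_tasks) {
    if (excess <= 0.0) break;
    if (t.load <= excess) {
      plan.emplace_back(t.id, underloaded_pes[pick(rng)]);
      excess -= t.load;
    }
  }
  return plan;
}
```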
|
|
|
2:30 pm - 3:00 pm |
Talk
|
Load Balancing for Cloud Environments
Abhishek Gupta, University of Illinois at Urbana-Champaign
Driven by the benefits of elasticity and pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, heterogeneous and dynamic environment, and interference by other virtual machines (VMs) are some bottlenecks for efficient HPC in cloud. For tightly-coupled iterative applications, one slow processor slows down the entire application, resulting in poor CPU utilisation.
In this talk, we present a dynamic load balancer for tightly-coupled iterative HPC applications in cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by the interference arising due to multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of task distribution to VMs, our load balancer adapts to the dynamic variations in cloud resources. Through experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, we demonstrate performance benefits up to 45%. Finally, we analyse the effect of load balancing frequency, problem size and computational granularity (problem decomposition) on the performance and scalability of our techniques.
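One simple way to turn measured heterogeneity into a new task distribution is to give each VM a share of tasks proportional to its observed speed over the last monitoring window. The sketch below illustrates such a refinement step under simplifying assumptions (uniform task cost, speeds already measured); it is not the balancer described in the talk.

```cpp
// Sketch: recompute per-VM task shares from measured speeds (illustration only).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// speeds[i]: iterations (or tasks) per second observed on VM i in the last window.
// Returns how many of total_tasks each VM should own after the refinement step.
std::vector<int> proportional_shares(const std::vector<double> &speeds, int total_tasks) {
  std::vector<int> shares(speeds.size(), 0);
  double total_speed = std::accumulate(speeds.begin(), speeds.end(), 0.0);
  if (speeds.empty() || total_speed <= 0.0) return shares;

  int assigned = 0;
  for (std::size_t i = 0; i < speeds.size(); ++i) {
    shares[i] = static_cast<int>(total_tasks * speeds[i] / total_speed);
    assigned += shares[i];
  }
  // Hand any rounding leftovers to the fastest VM.
  std::size_t fastest = std::max_element(speeds.begin(), speeds.end()) - speeds.begin();
  shares[fastest] += total_tasks - assigned;
  return shares;
}
```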
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Interoperability (Chair: Ramprasad Venkataraman) |
3:30 pm - 4:00 pm |
Talk
|
Charm++ Interoperability
Nikhil Jain, University of Illinois at Urbana-Champaign
Charm++ is a unique parallel programming paradigm based on message-driven execution powered by a runtime system. Overdecomposition into migratable work units by application writers and (almost) total control of execution by the runtime system allow Charm++ to improve application performance along with programmer productivity. This is in stark contrast with MPI, which generally follows a bulk-synchronous programming model with all decisions taken by the programmer. Given the wide range of applications and their characteristics, either of these two styles of programming may be the more suitable one for a given application. In fact, given the complexity of present-day applications, different programming paradigms may suit different parts of an application. This talk explores interoperability of Charm++ with other programming paradigms (with a focus on MPI). The topics touched upon include Adaptive MPI, hybrid programming in Charm++ and MPI, and interoperability with OpenMP.
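The sketch below shows the general shape of such a hybrid program: an MPI driver keeps overall control and temporarily hands execution to a message-driven runtime for the phases that benefit from it. The charm_* functions are hypothetical placeholders (stubbed out here so the sketch compiles), not the actual Charm++ interoperability API.

```cpp
// Sketch of a hybrid driver that time-shares control between MPI and a
// message-driven runtime (illustration only).
#include <mpi.h>

void charm_library_init(MPI_Comm, int, char **) { /* stand-in for runtime startup */ }
void charm_run_module(const char *)             { /* stand-in for a runtime-managed module */ }
void charm_library_exit()                       { /* stand-in for runtime shutdown */ }

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Phase 1: bulk-synchronous MPI work, fully controlled by the programmer.
  MPI_Barrier(MPI_COMM_WORLD);

  // Phase 2: hand control to the message-driven runtime for the part of the
  // application that benefits from overdecomposition and load balancing.
  charm_library_init(MPI_COMM_WORLD, argc, argv);
  charm_run_module("overdecomposed_solver");
  charm_library_exit();

  // Phase 3: control returns to MPI, which remains in charge overall.
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
```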
|
|
|
4:00 pm - 4:30 pm |
Invited Talk
|
Charm++ Implementation of a Detonation Shock Dynamics Algorithm
Brian McCandless, Lawrence Livermore National Laboratory
This work presents a Charm++ implementation of a narrow-band algorithm for the Detonation Shock Dynamics (DSD) model to simulate the propagation of detonations. This algorithm is regarded as a fast and accurate method; however, there are significant challenges to implementing it in a scalable way. In this algorithm, nearly all of the computational work is located in the small region near the propagation front. If the algorithm is implemented with a simple domain decomposition, the problem becomes load imbalanced, with only a small fraction of the processors having work to do at any given time step. One approach to solving the load balancing problem, using over-decomposition with Charm++, has been investigated. The second focus of this work is MPI interoperability. This work explores the relatively new Charm++/MPI interoperability feature. The DSD algorithm is a small part of a much larger MPI-based application. It is not practical to rewrite the entire application in Charm++, but it is possible to incrementally rewrite certain algorithms, provided that they work well with MPI in control. Initial exploration of the interoperability will be discussed.
|
|
|
4:30 pm - 5:00 pm |
Submitted Paper
|
Optimizing Charm++ over MPI
Ralf Gunter, Argonne National Laboratory
Charm++ may employ any of a myriad of network-specific APIs for handling communication, which are usually promoted as being faster than its catch-all MPI module. Such a performance difference not only causes development effort to be spent on tuning vendor-specific APIs, but also discourages hybrid Charm++/MPI applications. We investigate this disparity across several machines and applications, ranging from small InfiniBand clusters to Blue Gene/Q supercomputers, and from synthetic benchmarks to large-scale biochemistry codes. Finally, we demonstrate how one feature from the recent MPI-3 standard can be used to bridge this gap where applicable, and discuss what can be done today.
|
|
|
5:00 pm - 6:00 pm |
Fun
|
Blue Waters Facility Tour
NCSA
|
|
|
6:00 pm - 7:00 pm |
|
07:00 pm onwards |
Workshop Banquet (for registered participants only). Located in the 2nd floor Atrium outside 2405 Siebel Center |
|
8:00 am - 9:00 am |
Continental Breakfast / Registration |
9:00 am - 10:00 am |
Keynote Talk
|
Towards an Ecosystem for Heterogeneous Parallel Computing
Wu-chun Feng, Virginia Tech
With processor core counts doubling every 18-24 months and penetrating all markets from high-end servers in supercomputers to desktops and laptops down to even mobile phones, we sit at the dawn of a world of ubiquitous parallelism, one where extracting performance via parallelism is paramount. That is, the "free lunch" to better performance, where programmers could rely on substantial increases in single-threaded performance to improve software, is over. The burden falls on developers to exploit parallel hardware for performance gains. But how do we lower the cost of extracting such parallelism, particularly in the face of the increasing heterogeneity of processing cores? To address this issue, this talk will present a vision for an ecosystem for delivering accessible and personalized supercomputing to the masses, one with a heterogeneity of (hardware) processing cores on a die or in a package, coupled with enabling software that tunes the parameters of the processing cores with respect to performance, power, and portability via a benchmark suite of computational dwarfs and applications.
|
|
|
10:00 am - 10:30 am |
Talk
|
Dynamic Power Management in Charm++
Osman Sarood, University of Illinois at Urbana-Champaign
As we move to exascale machines, both peak power demand and total energy consumption have become prominent challenges. A significant portion of that power and energy consumption is devoted to cooling, which is overlooked by most HPC researchers. It is possible to reduce cooling energy consumption provided that processor cores do not overheat. Due to the exponential relationship between core temperature and fault rate, saving cooling energy at the expense of higher core temperatures might imply an increased number of faults. In our work, we propose a scheme which leverages DVFS to constrain core temperatures, which allows us to reduce cooling energy consumption while reducing the fault rate at the same time.
Our approach is particularly designed for parallel applications, which are typically tightly coupled, and tries to minimize the timing penalty associated with temperature control. We formulate a model that captures the expected reduction in execution time that can result from the better reliability brought by temperature control. We demonstrate the use of our scheme with five different applications on a 32-node cluster with a dedicated Computer Room Air-conditioning Unit (CRAC).
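A minimal sketch of the control idea follows: periodically compare the hottest core's temperature against a threshold and step the DVFS level down or up accordingly. The thresholds, frequency levels, and hysteresis band below are illustrative assumptions, not the scheme evaluated in the talk.

```cpp
// Sketch: choose the next DVFS level from the hottest core's temperature
// (illustrative control logic; thresholds and levels are assumptions).
#include <cstddef>
#include <vector>

// Available frequency levels in MHz, highest first.
static const std::vector<int> kFreqLevelsMHz = {2400, 2100, 1800, 1500, 1200};

// Step down one level when the hottest core exceeds the threshold, step back up
// once there is enough headroom, otherwise hold the current frequency.
std::size_t next_freq_level(double max_core_temp_c, std::size_t current_level,
                            double threshold_c = 70.0, double headroom_c = 5.0) {
  if (max_core_temp_c > threshold_c && current_level + 1 < kFreqLevelsMHz.size())
    return current_level + 1;   // too hot: slow down
  if (max_core_temp_c < threshold_c - headroom_c && current_level > 0)
    return current_level - 1;   // comfortably cool: speed back up
  return current_level;         // inside the band: no change
}
```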
|
|
|
10:30 am - 11:00 am |
|
Morning |
Technical Session: Power and Energy (Chair: Osman Sarood) |
11:00 am - 11:30 am |
Invited Talk
|
Power-performance modeling, analyses and challenges
Kirk W. Cameron, Virginia Tech
The power consumption of supercomputers ultimately limits their performance. The current challenge is not whether we can build an exaflop system by 2018, but whether we can do it in less than 20 megawatts. The SCAPE Laboratory at Virginia Tech has been studying the tradeoffs between performance and power for over a decade. We've developed an extensive tool chain for monitoring and managing power and performance in supercomputers. We will discuss our power-performance modeling efforts and the implications of our findings for exascale systems, as well as some research directions ripe for innovation.
|
|
|
11:30 am - 12:00 pm |
Talk
|
Toward Runtime Power Management of Exascale Networks by On/Off Control of Links
Ehsan Totoni, University of Illinois at Urbana-Champaign
Higher-radix networks, such as high-dimensional tori and multi-level directly connected networks, are being used for supercomputers as they become larger but need lower diameter. These networks have more resources (e.g. links) in order to provide good performance for a range of applications. We observe that a sizeable fraction of the links in the interconnect are never used or are underutilized during execution of common parallel applications. Thus, in order to save power, we propose the addition of hardware support for on/off control of links in software and their management using adaptive runtime systems. We study the effectiveness of our approach using real applications (NAMD, MILC) and application benchmarks (NAS Parallel Benchmarks, Jacobi), simulated on representative networks such as a 6-D torus and IBM PERCS (similar to Dragonfly). For common applications, our approach can save up to 20% of the machine's total power and energy, without any performance penalty.
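As a rough illustration of the runtime's decision, the sketch below flags links whose observed utilization over a sampling window falls below a threshold as candidates for powering off; the data layout and threshold are assumptions for illustration, not the mechanism studied in the talk.

```cpp
// Sketch: flag underutilized links as candidates for powering off
// (illustration only; data layout and threshold are assumptions).
#include <vector>

struct LinkStats {
  int id;
  double bytes_sent;       // traffic observed during the sampling window
  double capacity_bytes;   // link bandwidth multiplied by the window length
};

// Links whose utilization stays below the threshold can be switched off by the
// runtime until the application's communication pattern changes.
std::vector<int> links_to_power_off(const std::vector<LinkStats> &links,
                                    double utilization_threshold = 0.01) {
  std::vector<int> off;
  for (const LinkStats &l : links)
    if (l.capacity_bytes > 0.0 &&
        l.bytes_sent / l.capacity_bytes < utilization_threshold)
      off.push_back(l.id);
  return off;
}
```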
|
|
|
12:00 pm - 12:30 pm |
Talk
|
Energy Profile of Fault Tolerance Methods
Esteban Meneses, University of Illinois at Urbana-Champaign
An exascale machine is expected to be delivered in the 2018-2022 time frame. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large number of components it will encompass. Some form of fault tolerance has to be incorporated in the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This talk presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message logging, and parallel recovery. Using programs from different programming models, we show that parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.
|
|
|
12:30 pm - 01:30 pm |
|
01:30 pm - 03:00 pm |
Panel
|
Temperature, Power, Time and Energy: Can software control it all?
Panelists: Kirk W. Cameron (Virginia Tech), Martin Schulz (Lawrence Livermore National Laboratory), Wu-chun Feng (Virginia Tech), Mitsuhisa Sato (University of Tsukuba), Laxmikant V. Kale (University of Illinois at Urbana-Champaign). Moderator: William Gropp (University of Illinois at Urbana-Champaign)
As we move from Petascale to Exascale, these factors are becoming increasingly important. To avoid overheating the chips, frequencies have stopped increasing.
Instantaneous power needs to be kept within the facility's limits. The energy per FLOP needs to be kept within bounds for attaining exaFLOP/s rates in a practical manner. Architecture innovations are clearly needed. But the question is: given a particular machine's hardware, can software do something so as to (a) keep instantaneous power within pre-defined limits, (b) keep chip temperatures within pre-set thresholds, (c) minimize execution time for a given job, and (d) minimize the energy used per job? This panel will seek answers from leading researchers in this area.
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Applications (Chair: Celso L. Mendes) |
3:30 pm - 4:00 pm |
Talk
|
Scaling Agent-based Simulation of Contagion Diffusion over Dynamic Networks on Petascale Machines
Keith Bisset, Virginia Tech
Applications that model dynamical systems involve large-scale, irregular graph processing. These applications are difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics implements a graph-based system that captures dynamics among co-evolving entities, while simulating contagion diffusion in extremely large and realistic social contact networks. EpiSimdemics relies on individual-based models, thus allowing studies in great detail. This paper presents a novel implementation of EpiSimdemics in Charm++, which enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific decomposition of graph data and predictive dynamic load migration, and demonstrate the effectiveness of these methods on Cray XE6/XK7 and IBM Blue Gene/Q.
|
|
|
4:00 pm - 4:30 pm |
Talk
|
ChaNGa: a Charm++ N-body Treecode
Tom Quinn, University of Washington
Simulations of cosmological structure formation demand significant computational resources because of the vast range of scales involved: from the size of star formation regions to, literally, the size of the Universe. I will describe the cosmology problems we plan to address with our Blue Waters allocation. ChaNGa, a Charm++ N-body/Smoothed Particle Hydrodynamics tree-code, is the application we will be running with this allocation. I will describe the improvements in both astrophysical modeling and parallel performance we have made over the past year in preparation for running these simulations.
|
|
|
4:30 pm - 5:00 pm |
Talk
|
Scalable Algorithms for Structured Adaptive Mesh Refinement
Akhil Langer, University of Illinois at Urbana-Champaign
We present scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for these computations that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a distributed load balancer, in contrast to traditional linear numbering schemes, which incur unnecessary synchronization for load balancing. In contrast to existing approaches, which take O(P) time and storage per process, our approach takes only constant time and has a very small memory footprint. With these optimizations, our algorithm is scalable and suitable for large, highly refined meshes. We present strong-scaling experiments up to 16k ranks on IBM Blue Gene/Q.
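One common way to avoid global renumbering during refinement is to identify each block by its refinement path, so a block can compute its children's indices locally. The sketch below illustrates that idea for a quadtree; it is an assumption for illustration, not necessarily the exact indexing used in this work.

```cpp
// Sketch: identify an AMR block by its refinement path so that children indices
// are computed locally, with no global renumbering (illustration only).
#include <array>
#include <cstdint>

struct BlockId {
  std::uint64_t path;  // 2 bits per level of a quadtree: which child at each level
  int depth;           // number of levels below the root block
};

// Refining a block yields four children whose ids simply extend the parent's path.
std::array<BlockId, 4> refine(const BlockId &parent) {
  std::array<BlockId, 4> children;
  for (int quadrant = 0; quadrant < 4; ++quadrant) {
    children[quadrant].path  = (parent.path << 2) | static_cast<std::uint64_t>(quadrant);
    children[quadrant].depth = parent.depth + 1;
  }
  return children;
}
```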
|
|
|
5:00 pm - 5:30 pm |
Talk
|
TRAM: Improving Fine-grained Communication Performance with Topological Routing and Aggregation of Messages
Lukasz Wesolowski, University of Illinois at Urbana-Champaign
Fine-grained communication in supercomputing applications often limits performance due to high communication overhead and saturation of network bandwidth. In this talk I describe how to optimize fine-grained communication performance using the Topological Routing and Aggregation Module (TRAM). TRAM collects units of fine-grained communication from application code and combines them into aggregate messages with a common intermediate destination. It routes these messages along a virtual mesh topology mapped onto the physical topology of the network, recombining message fragments at intermediate destinations. TRAM leads to improved network bandwidth utilization and reduced message overhead. It is particularly effective in improving performance of patterns with global communication and a large number of messages, such as all-to-all and many-to-many paradigms. This will be demonstrated with performance results from two scientific applications: EpiSimdemics, a simulator of the spread of contagion, and ChaNGa, an N-Body cosmological simulator.
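The sketch below illustrates the aggregation side of this idea: fine-grained items that share an intermediate destination are buffered and flushed as a single combined message. The buffering policy and interfaces are illustrative assumptions, not the TRAM implementation; in a mesh topology, for instance, the intermediate destination could be the peer that shares the sender's row.

```cpp
// Sketch of a message aggregation buffer keyed by intermediate destination
// (illustration only, not the TRAM implementation).
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

template <typename Item>
class Aggregator {
 public:
  Aggregator(std::size_t flush_threshold,
             std::function<void(int, std::vector<Item> &&)> send)
      : threshold_(flush_threshold), send_(std::move(send)) {}

  // Buffer an item; once enough items share an intermediate destination,
  // they leave this node as one combined message.
  void submit(int intermediate_dest, const Item &item) {
    std::vector<Item> &buf = buffers_[intermediate_dest];
    buf.push_back(item);
    if (buf.size() >= threshold_) flush(intermediate_dest);
  }

  // Flush everything, e.g. at the end of a communication phase.
  void flush_all() {
    for (auto &entry : buffers_)
      if (!entry.second.empty()) flush(entry.first);
  }

 private:
  void flush(int dest) {
    send_(dest, std::move(buffers_[dest]));
    buffers_[dest].clear();
  }

  std::size_t threshold_;
  std::function<void(int, std::vector<Item> &&)> send_;
  std::unordered_map<int, std::vector<Item>> buffers_;
};
```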
|
|
|
5:30 pm - 6:00 pm |
Fun
|
Annual PPL Photograph
Jonathan Lifflander
|
|
|