Time | Type | Description
8:15 am - 8:45 am | Continental Breakfast / Registration - 2nd floor atrium outside 2405 Siebel Center
Morning | Opening Session - 2405 Siebel Center
8:45 am - 9:00 am | Welcome
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
9:00 am - 10:00 am | Keynote
Directive-Based Programming at Scale?
Prof. Barbara Chapman, Stony Brook University
There have been a number of attempts to define compiler directives that enable an application developer to specify concurrency at a reasonably high level. Recent years have seen the growing use of both OpenMP and OpenACC directives for scientific computation, the expansion of their feature sets, and the growing maturity of the commercial implementations. We discuss how these have evolved to meet the needs of large-scale applications and systems, and the opportunities they offer for run-time adaptation.
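For readers unfamiliar with the directive-based style the keynote surveys, here is a minimal illustrative sketch (not taken from the talk) of the same loop annotated with OpenMP and with OpenACC; the pragmas are hints layered on otherwise ordinary serial C++:

```cpp
// Illustrative sketch only: the same loop expressed with OpenMP and OpenACC
// directives. A compiler without the corresponding support simply ignores
// the pragmas, and the code remains valid serial C++.
#include <cstddef>

void scale_omp(double* x, const double* y, double a, std::size_t n) {
    #pragma omp parallel for          // OpenMP: split iterations across CPU threads
    for (std::size_t i = 0; i < n; ++i)
        x[i] = a * y[i];
}

void scale_acc(double* x, const double* y, double a, std::size_t n) {
    #pragma acc parallel loop copyin(y[0:n]) copyout(x[0:n])  // OpenACC: offload to an accelerator
    for (std::size_t i = 0; i < n; ++i)
        x[i] = a * y[i];
}
```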
10:00 am - 10:45 am | Talk
Handling Transient and Persistent Imbalance Together in Distributed and Shared Memory
Harshitha Menon, Seonmyeong Bak
The recent trend of rapid increase in the number of cores per chip has resulted in a vast amount of on-node parallelism on HPC systems. Not only is the number of cores per node increasing substantially, but the cores are also becoming heterogeneous. Science and engineering applications, in turn, are becoming more complex and dynamic in order to exploit the growing computational capability of these systems. One of the critical factors affecting the performance of many applications is load imbalance. We leverage multi-core shared-memory systems, together with the persistent object-based Charm++ model, to mitigate load imbalance in Charm++ applications.
OpenMP is the de facto standard for task-level parallel programming. Because OpenMP keeps its own thread pool, interoperation of OpenMP with Charm++ incurs overheads such as oversubscription. We extend OpenMP to use Charm++ threads, so that Charm++ applications can make efficient use of OpenMP for better intra-node performance. This talk describes how the OpenMP integration works in the Charm++ RTS and shows performance results for several scientific applications.
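A rough sketch of the usage model described above, assuming a Charm++ build with the integrated OpenMP runtime (the chare name, its .ci declaration, and the build options are hypothetical and omitted):

```cpp
// Sketch only: with the integrated OpenMP support, the threads spawned by the
// pragma below are scheduled as Charm++ user-level threads on idle PEs of the
// same node rather than by a separate OpenMP thread pool, avoiding
// oversubscription. "Block" and its interface (.ci) file are hypothetical.
#include "block.decl.h"   // generated from the omitted .ci file
#include <omp.h>
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;
public:
  explicit Block(int n) : data(n, 1.0) {}

  // Entry method: intra-node parallelism via OpenMP, while decomposition and
  // load balancing across the machine remain under Charm++ control.
  void relax(double a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] *= a;
    // ... contribute to a reduction / send results to the next phase ...
  }
};

#include "block.def.h"
```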
10:45 am - 11:15 am |
Morning | Technical Session: Charm++ Interfaces (Chair: Dr. Juan Galvez) - 2405 Siebel Center
11:15 am - 11:45 am | Talk
Adaptive MPI: Overview and Recent Work
Samuel White
Adaptive MPI (AMPI) is an implementation of the MPI standard written on top of Charm++. AMPI provides high-level, application-independent features such as over-decomposition, dynamic load balancing, and automatic fault tolerance to MPI codes. This talk gives a high-level overview of AMPI, its features, recent improvements, and performance results.
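As a rough illustration of what AMPI asks of an application (the migration call and info names below follow recent AMPI releases and may differ in older ones), an ordinary MPI program only needs to expose points at which the runtime may rebalance its over-decomposed ranks:

```cpp
// Sketch: a plain MPI code run under AMPI. Build with ampicxx and launch with
// more virtual ranks than cores, e.g.
//   ./charmrun ./app +p4 +vp64 +balancer GreedyLB
// AMPI_Migrate / AMPI_INFO_LB_SYNC follow recent AMPI releases; older versions
// expose the same functionality under different names.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  for (int iter = 0; iter < 1000; ++iter) {
    // ... the application's usual computation and MPI communication ...

    // Collective hint: a safe point for the runtime to measure load and
    // migrate virtual ranks between processors.
    if (iter % 100 == 99)
      AMPI_Migrate(AMPI_INFO_LB_SYNC);
  }

  MPI_Finalize();
  return 0;
}
```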
11:45 am - 12:15 pm | Invited Talk
Argobots and its Application to Charm++
Sangmin Seo
In this talk, we first introduce Argobots, a lightweight low-level threading and tasking framework, which aims at providing high-level runtimes and domain-specific libraries with efficient threading and tasking mechanisms. Argobots is designed to deal with massive on-node parallelism by exposing two levels of execution components (execution streams for hardware resources, i.e., cores or hardware threads, and work units such as user-level threads and tasklets for user tasks) and providing mapping and scheduling mechanisms between them. Then, we demonstrate how an existing parallel programming model, Charm++, can be implemented on top of Argobots. In our implementation, we basically replace the Converse runtime in the Charm++ infrastructure with Argobots. We show the performance results of our implementation using LeanMD. Finally, we show how we incorporate the shrink-expand capability of Argobots into Charm++ and discuss its performance and power effects.
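For context, a minimal Argobots sketch (C API, error handling omitted) showing the two levels the abstract mentions: an execution stream bound to a hardware resource, and user-level threads pushed into its pool:

```cpp
#include <abt.h>
#include <cstdio>

// Each user-level thread (ULT) simply reports that it ran.
static void hello(void* arg) {
    std::printf("ULT %d running\n", (int)(size_t)arg);
}

int main(int argc, char** argv) {
    ABT_init(argc, argv);

    // Execution stream: maps to a hardware resource (core or hardware thread).
    ABT_xstream xstream;
    ABT_xstream_create(ABT_SCHED_NULL, &xstream);

    // Work units (here ULTs) are pushed into the stream's main pool.
    ABT_pool pool;
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    ABT_thread threads[4];
    for (size_t i = 0; i < 4; ++i)
        ABT_thread_create(pool, hello, (void*)i, ABT_THREAD_ATTR_NULL, &threads[i]);
    for (int i = 0; i < 4; ++i) {
        ABT_thread_join(threads[i]);
        ABT_thread_free(&threads[i]);
    }

    ABT_xstream_join(xstream);
    ABT_xstream_free(&xstream);
    ABT_finalize();
    return 0;
}
```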
12:15 pm - 1:30 pm | Lunch - Provided - 2nd floor atrium outside 2405 Siebel Center
Afternoon | Technical Session: Virtualization (Chair: Eric Bohm) - 2405 Siebel Center
1:30 pm - 2:00 pm | Submitted Paper
Using SimGrid to Evaluate the Impact of AMPI's Load Balancing in a Geophysics HPC Application
Rafael Keller Tesser
Iterative parallel applications based on MPI are commonplace in High Performance Computing. Some have intrinsic load balancing issues that are very difficult to address at the application level. One way to tackle this imbalance is over-decomposition coupled with dynamic load balancing. This approach, however, requires intrusive modifications of the application before the potential benefit can even be evaluated. We propose a simulation-based performance evaluation methodology that requires minimal application modification and allows quick exploration of over-decomposition parameters and load balancing strategies at very low cost. We present preliminary but convincing validation results on a real geophysics code, comparing simulation with real executions.
2:00 pm - 2:30 pm | Talk
Reducing Checkpoint Size in PlascomCM with Lossy Compression
Jon Calhoun
Data movement, particularly I/O, is starting to limit application scalability. I/O performance can be improved by employing compression techniques; however, traditional lossless compression fails to reduce floating-point dataset sizes significantly. Lossy compression can reduce dataset sizes by 5-50x, but adds error into the simulation. In this talk, we investigate lossy compression to reduce checkpoint size in the AMPI application PlascomCM. In particular, we discuss how application knowledge can be used to set the compression error tolerance so that any error added is masked by the application.
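As a toy illustration of the error-bounded idea (not the compressor used in the talk), quantizing values so that the reconstruction error stays within a chosen tolerance looks roughly like this; a real compressor would additionally entropy-code the quantized stream:

```cpp
// Toy illustration of error-bounded lossy compression: values are quantized to
// integers so that the reconstruction error never exceeds a user-chosen
// absolute tolerance.
#include <cstdint>
#include <vector>
#include <cmath>
#include <cassert>

std::vector<int64_t> quantize(const std::vector<double>& data, double tol) {
    std::vector<int64_t> q(data.size());
    for (size_t i = 0; i < data.size(); ++i)
        q[i] = static_cast<int64_t>(std::llround(data[i] / (2.0 * tol)));
    return q;
}

std::vector<double> dequantize(const std::vector<int64_t>& q, double tol) {
    std::vector<double> out(q.size());
    for (size_t i = 0; i < q.size(); ++i)
        out[i] = q[i] * 2.0 * tol;     // bin width 2*tol -> max error tol
    return out;
}

int main() {
    std::vector<double> field = {1.000, 1.0004, 3.1415, -2.7182};
    double tol = 1e-3;                              // checkpoint error tolerance
    auto restored = dequantize(quantize(field, tol), tol);
    for (size_t i = 0; i < field.size(); ++i)
        assert(std::fabs(restored[i] - field[i]) <= tol);
    return 0;
}
```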
2:30 pm - 3:00 pm | Invited Talk
A parallel library for multidimensional array computations with runtime tuning
Edgar Solomonik
Cyclops Tensor Framework (CTF) is a distributed-memory library targeted at scientific applications working with multidimensional datasets. The framework follows in the footsteps of Charm++, providing efficient runtime data orchestration derived from user-directed high-level algorithm expression. CTF has been developed alongside applications for electronic structure calculations, facilitating expression of many quantum chemistry methods as sequences of algebraic operations on tensors (multidimensional matrices). The library uses virtualization and topology-aware mapping to efficiently decompose tensors, selecting the best layout and algorithm by an online parallel evaluation of performance models. Recent new features in CTF include support for sparse tensors and user-defined algebraic structures (e.g. semirings), stepping forward from quantum chemistry to arbitrary (hyper)graph computations.
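A small sketch of CTF's index-expression interface (names follow the library's public examples; details may vary across versions): a distributed matrix multiplication written as a tensor contraction, with the data layout and algorithm chosen by the runtime:

```cpp
#include <mpi.h>
#include <ctf.hpp>
using namespace CTF;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    {
        World world(argc, argv);            // spans all MPI ranks
        int n = 1000;
        Matrix<double> A(n, n, NS, world);  // NS: no symmetry
        Matrix<double> B(n, n, NS, world);
        Matrix<double> C(n, n, NS, world);
        B.fill_random(0.0, 1.0);
        C.fill_random(0.0, 1.0);
        // Contraction over k; CTF selects the processor grid, mapping, and
        // algorithm via its online performance models.
        A["ij"] += B["ik"] * C["kj"];
    }
    MPI_Finalize();
    return 0;
}
```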
3:00 pm - 3:30 pm |
Afternoon | Technical Session: New Runtime System Features (Chair: Xiang Ni) - 2405 Siebel Center
3:30 pm - 4:00 pm | Submitted Paper
An Extension of Charm++ to Optimize Fine-Grained Applications
Alexander Frolov
In this presentation, an extension to the Charm++ parallel programming language is proposed to enable additional optimizations for fine-grained Charm++ applications. The extension provides a concept of microchares (or uchares), incorporated into standard Charm++ chares, that allows the Charm++ message-driven programming model to be used for very fine-grained objects (up to tens of thousands of uchares per PE) with significantly less runtime overhead than traditional Charm++ chares would incur. The Charm++ translator is modified to support the uchare array construct, and a uchare library has been developed. Preliminary evaluation shows a significant performance gain over pure Charm++ and the TRAM aggregation library on the RandomAccess and asynchronous Breadth-First Search benchmarks when the number of chares per core is sufficiently large.
4:00 pm - 4:20 pm | Talk
One-sided: Accelerating Large Charm++ Messages Using RDMA
Nitin Bhat
One-sided communication over RDMA-capable hardware has been shown to provide lower latencies and higher bandwidth for large payloads in HPC networks. With the advent of exascale computing, the number and size of messages are expected to increase. The existing one-sided communication path in Charm++ makes a copy of the payload at the sender side; as the payload gets larger, the cost of this copy grows proportionally. This talk focuses on the implementation and benefits of the new zero-copy one-sided communication feature in Charm++, which eliminates the large payload copy. The Charm++ user-level API for performing zero-copy RDMA calls will also be discussed.
4:20 pm - 4:50 pm | Talk
Heterogeneous Task Execution Frameworks in Charm++
Michael Robson
With the increased adoption of and reliance on accelerators, particularly GPUs,
to achieve more performance in current and next generation supercomputers,
effectively utilizing these devices has become very important.
However, there has not been a commensurate increase in the ability to program
and interact with these devices.
We seek to bridge the GPU usability and programmability gap in Charm++ through
a variety of GPU frameworks that programmers can utilize.
Our ultimate goal is to enable our users to easily and automatically leverage
the compute power of these devices without having to rewrite significant
portions of their code.
In this talk we will present the various frameworks available in Charm++ for
programmers interacting with accelerators, their current features and trade-offs,
and a brief overview of some major Charm applications that currently utilize
various pieces of the Charm++ accelerator stack.
We will also present some preliminary performance results and review the
programmability enhancements these frameworks offer.
Finally, we will examine Charm's future directions as nodes grow in size, new
accelerators are introduced, and heterogeneous load balancing at various levels
and across different node types becomes increasingly important.
4:50 pm - 5:10 pm | Discussion
Ongoing Research and Upcoming Features in Charm++
Prof. Laxmikant V. Kale
6:30 pm onwards | Workshop Banquet (for registered participants only) - 2nd floor atrium outside 2405 Siebel Center
8:30 am - 9:00 am | Continental Breakfast / Registration - Atrium outside NCSA Auditorium
Morning | Opening Session - NCSA Auditorium
9:00 am - 10:00 am | Keynote
HPC Runtimes: Opportunities, Requirements, and Examples
Prof. Thomas Sterling, Indiana University Bloomington/CREST
The future of High Performance Computing (HPC) is in the midst of severe controversy as the community decides how it is to attempt to achieve exascale capability. The choice appears to be between the almost static and the truly dynamic, or, to put it another way, between ballistic computing and guided computing. Advanced runtime systems like Charm++ provide the basis for guided computing, where continual introspection of system and application status enables dynamic adaptive control of resource management and task scheduling. Ballistic computing largely determines where and when activities are going to happen at programming, compile, and load times. The problem is that ballistic computing does not know everything about the computation, while guided computing requires more work to get right. Which way the HPC community and government-sponsored programs to exascale will go is still unresolved, but there is a danger that they will pursue a risk-averse approach and focus on "MPI + OpenMP" as an incremental step beyond current methods. For some applications this will work adequately, perhaps even at the intended scale. Why, then, go to the extra effort of investing in the development and application of dynamic runtime systems? This keynote presentation will discuss the opportunities that emerging runtime systems may deliver to future computations at extreme scale, and will also consider the challenges and requirements they will have to satisfy in order to be effective. Examples from the HPX-5 experimental runtime system, based on the ParalleX execution model, will be provided. Part of this discussion will center on how computer architecture may have to change in order to maximize the scalability and efficiency of runtimes and deliver the best performance and time to solution. Questions from the audience are encouraged throughout the presentation.
10:00 am - 10:30 am | Invited Talk
The Parallel Research Kernels: an Objective Tool for Parallel System Research
Maria Garzaran, Rob van der Wijngaart, and Tim Mattson
Much research in parallel computing is driven by anecdote: “we built it, it worked, and we really like it”. This is not because researchers are lazy. Rather, exploring parallel systems with available workloads is difficult and too time consuming for most groups to tolerate. Benchmarks and mini-applications help, but they still require more work to port to multiple parallel systems than most research groups can justify.
In response to this situation, we have embarked on a minimalist approach. We created a set of programs that probe the bottlenecks application programmers encounter when writing parallel applications. These programs are small, designed around particular scalability issues, and self-testing (to assess correctness as well as performance). We call these the parallel research kernels (https://github.com/ParRes/Kernels). Using these kernels, we have collected objective data on an unusually large collection of programming systems. In this talk, we discuss results from the parallel research kernels on a number of programming systems with exascale ambitions (including Charm++).
10:30 am - 11:00 am |
Morning | Technical Session: Tools (Chair: Harshitha Menon) - NCSA Auditorium
11:00 am - 11:30 am | Talk
Performance Analysis and Projections
Ronak Buch
One of the chief challenges in developing and running extreme scale applications is actually achieving high performance in practice. It is crucial for developers to be able to identify, diagnose, and fix computational imbalance, message bottlenecks, underutilization of processors, and other performance problems. To this end, we have developed Projections, a performance analysis tool for Charm++. In this talk, we discuss recent additions to Projections and how they can be used to analyze various performance properties. Additionally, we will describe how Projections has been used to solve performance woes in production applications.
11:30 am - 12:00 pm | Talk
Variation Among Processors Under Turbo Boost in HPC Systems
Bilge Acun
The design and manufacture of present-day CPUs cause inherent variation among chips in supercomputers, such as differences in power and temperature. The variation also manifests itself as frequency differences among processors under Turbo Boost dynamic overclocking. This can lead to unpredictable and suboptimal performance in tightly coupled HPC applications. In this study, we use compute-intensive kernels and applications to analyze the variation among processors in four top supercomputers: Edison, Cab, Stampede, and Blue Waters. We observe an execution time difference of up to 16% among processors on the Turbo Boost-enabled supercomputers (Edison, Cab, Stampede), and less than 1% variation on Blue Waters, which does not have a dynamic overclocking feature. We analyze measurements from temperature and power instrumentation and find that intrinsic differences in the chips' power efficiency are the culprit behind the frequency variation. Moreover, we analyze potential mitigations such as disabling Turbo Boost, leaving idle cores, and replacing slow chips. We also propose a speed-aware dynamic task redistribution (load balancing) algorithm to reduce the negative effects of performance variation. Our speed-aware load balancing algorithm improves performance by up to 18% compared to no load balancing, and by 6% over its speed-agnostic counterpart.
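To make the idea of speed-aware redistribution concrete, here is a minimal greedy sketch (not the algorithm from the talk): tasks are assigned, heaviest first, to the processor that would finish them earliest given its measured clock frequency; a frequency-oblivious balancer is the same code with all frequencies set equal.

```cpp
// Minimal illustration of speed-aware task redistribution.
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>
#include <cstdio>

std::vector<int> speedAwareGreedy(const std::vector<double>& taskLoads,
                                  const std::vector<double>& freqGHz) {
    // Min-heap of (projected completion time, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < (int)freqGHz.size(); ++p) procs.push({0.0, p});

    // Consider the heaviest tasks first.
    std::vector<int> order(taskLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return taskLoads[a] > taskLoads[b]; });

    std::vector<int> assignment(taskLoads.size());
    for (int t : order) {
        auto [time, p] = procs.top(); procs.pop();
        assignment[t] = p;
        // A faster chip completes the same work in proportionally less time.
        procs.push({time + taskLoads[t] / freqGHz[p], p});
    }
    return assignment;
}

int main() {
    std::vector<double> loads = {4, 3, 3, 2, 2, 1};   // work per task
    std::vector<double> freqs = {3.5, 3.1, 2.9};      // measured GHz per core
    auto a = speedAwareGreedy(loads, freqs);
    for (size_t t = 0; t < a.size(); ++t)
        std::printf("task %zu -> core %d\n", t, a[t]);
    return 0;
}
```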
12:00 pm - 12:30 pm | Talk
FlipBack: Automatic Targeted Protection Against Silent Data Corruption
Xiang Ni
The decreasing size of transistors has been critical to the increase in capacity of supercomputers. It is predicted that transistors will likely be one third of their current size by the time exascale computers are available. The smaller the transistors are, the less energy is required to flip a bit, and thus silent data corruptions (SDCs) are likely to occur more frequently. Traditional approaches to protecting applications from SDCs come at the cost of either doubling the hardware resources used or at least doubling application execution time. In this paper, we present FlipBack, a novel automatic software-based approach that protects applications from SDCs. FlipBack enables targeted protection for different types of data and calculations based on their characteristics. We evaluate FlipBack with HPC proxy applications that capture the behavior of real scientific applications and show that FlipBack is able to fully protect applications from silent data corruptions with only 6-20% performance degradation.
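As a generic illustration of software-level SDC detection (not FlipBack's actual mechanism), one common building block is a checksum kept alongside data that should only be read between updates, paired with a duplicate copy for repair; targeted schemes like the one described in the talk choose such protection per data type:

```cpp
// Generic illustration of checksum-based SDC detection and repair.
#include <cstdint>
#include <vector>
#include <cstdio>

uint64_t checksum(const std::vector<double>& v) {
    uint64_t h = 1469598103934665603ull;               // FNV-1a over raw bytes
    const unsigned char* p = reinterpret_cast<const unsigned char*>(v.data());
    for (size_t i = 0; i < v.size() * sizeof(double); ++i)
        h = (h ^ p[i]) * 1099511628211ull;
    return h;
}

int main() {
    std::vector<double> readOnlyField(1024, 1.5);
    std::vector<double> shadowCopy = readOnlyField;    // selective duplication
    uint64_t expected = checksum(readOnlyField);

    // ... computation that reads (but must not modify) readOnlyField ...

    if (checksum(readOnlyField) != expected) {         // silent bit flip detected
        std::puts("silent data corruption detected; restoring from copy");
        readOnlyField = shadowCopy;
    }
    return 0;
}
```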
12:30 pm - 1:30 pm | Lunch - Provided - 2nd floor atrium outside 2405 Siebel Center
1:30 pm - 3:00 pm | Panel
Higher Level Abstractions for Parallel Programming: Promise or Distraction?
Panelists: Dr. Thomas Sterling, Dr. Laxmikant Kale, Dr. William Gropp, Dr. Maria Garzaran
After several decades of experience, parallel programming today is still fairly low level, and explicit decomposition of data and work is prevalent. Charm++ alleviates the burden of resource management, but it still requires explicit decomposition. Higher-level general-purpose languages, viz. Chapel, X10, and Legion, are promising but still not quite there. On the other hand, "low level" languages are doing reasonably well as the workhorses of parallel programming. This panel will examine whether the promise of improved productivity via higher-level languages is realizable and, if so, what forms it will take.
3:00 pm - 3:30 pm |
Afternoon | Technical Session: Applications (Chair: Bilge Acun) - 2405 Siebel Center
3:30 pm - 4:00 pm | Talk
Load Balancing and Asynchrony in PDES Applications
Eric Mikida
Parallel Discrete Event Simulation (PDES) differs in many ways from traditional HPC applications in the science and engineering domains. PDES applications often involve a large number of autonomous agents responding to very fine-grained messages. These messages are often spread non-uniformly through time, and the communication pattern among agents can be difficult to know a priori. The result is simulations with complex and irregular behavior that can be difficult to manage effectively. In this talk we'll show how the Charm++ model is well suited to PDES applications, and how certain features of the Charm++ runtime system can help manage some of the complexity present in them.
4:00 pm - 4:30 pm | Talk
Scriptable Asynchronous Multi-Copy Algorithms in NAMD via Charm++ Partitions
Jim Phillips
The Charm++ partitions feature allows a parallel run to be divided into a
number of independent Charm++ contexts (at most one per process) that can
exchange Converse messages via CmiInter... send functions. The partitions
feature is analogous to MPI_Comm_split, which is how it was prototyped as
a patch for the Charm++ 6.4 MPI machine layer in NAMD 2.9 before being
integrated into the LRTS layer in Charm++ 6.6 for NAMD 2.10.
Inter-partition communication is exposed in the NAMD Tcl scripting
interface as MPI-style replicaSend, replicaRecv, and replicaSendRecv
commands. These commands provide synchronous data exchange between pairs
of paused simulations, as required by basic multi-copy algorithms such as
replica exchange. NAMD 2.11 extends this interface, exploiting the
message-driven nature of Charm++ to provide asynchronous script evaluation
and in-memory checkpointing on remote partitions with currently-running
simulations. These new asynchronous multi-copy scripting capabilities
enable workflow-style programming (without a dedicated master partition),
and hence multiplexing of replica simulations on a smaller number of
Charm++ partitions.
4:30 pm - 5:20 pm | Talk
OpenAtom: On the fly ab initio molecular dynamics on the ground state surface with instantaneous GW-BSE level spectra
Glenn Martyna, Eric Bohm, and Subhasish Mandal
The goal of the OpenAtom project is to statistically sample complex environments in order to understand physical systems, with application across Science and Solutions. These include studying biological function enabled by fluctuations in both the environment and the biomolecules, detecting pollutants in complex aqueous systems via spectroscopic signatures, and understanding chemical reactions in dense arrays whose complex many-body reaction paths challenge simple models. We describe our progress towards fast ab initio computations that provide the ground-state energy surface describing the systems of interest within Density Functional Theory, together with excited-state properties and spectroscopy within the GW / Bethe-Salpeter approach.