Time | Type | Description
8:15 am - 8:45 am | Continental Breakfast / Registration - 2nd floor atrium outside 2405 Siebel Center
Morning | Opening Session - 2405 Siebel Center
8:45 am - 9:00 am | Welcome
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
9:00 am - 10:00 am | Keynote
Directive-Based Programming at Scale?
Prof. Barbara Chapman, Stony Brook University
There have been a number of attempts to define compiler directives that enable an application developer to specify concurrency at a reasonably high level. Recent years have seen the growing use of both OpenMP and OpenACC directives for scientific computation, the expansion of their feature sets, and the growing maturity of the commercial implementations. We discuss how these have evolved to meet the needs of large-scale applications and systems, and the opportunities they offer for run-time adaptation.
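For readers unfamiliar with the directive-based style the keynote surveys, here is a minimal illustrative sketch (not taken from the talk) of the same loop annotated with OpenMP and with OpenACC; the pragmas are hints layered on otherwise ordinary serial C++:

```cpp
// Illustrative sketch only: the same loop expressed with OpenMP and OpenACC
// directives. A compiler without the corresponding support simply ignores
// the pragmas, and the code remains valid serial C++.
#include <cstddef>

void scale_omp(double* x, const double* y, double a, std::size_t n) {
    #pragma omp parallel for          // OpenMP: split iterations across CPU threads
    for (std::size_t i = 0; i < n; ++i)
        x[i] = a * y[i];
}

void scale_acc(double* x, const double* y, double a, std::size_t n) {
    #pragma acc parallel loop copyin(y[0:n]) copyout(x[0:n])  // OpenACC: offload to an accelerator
    for (std::size_t i = 0; i < n; ++i)
        x[i] = a * y[i];
}
```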
10:00 am - 10:45 am | Talk
Handling Transient and Persistent Imbalance Together in Distributed and Shared Memory
Harshitha Menon, Seonmyeong Bak
The recent trend of rapid increase in the number of cores per chip has resulted in a vast amount of on-node parallelism on HPC systems. Not only is the number of cores per node increasing substantially, but the cores are also becoming heterogeneous. Science and engineering applications, in turn, are becoming more complex and dynamic in order to exploit the growing computational capability of these systems. One of the critical factors affecting the performance of many applications is load imbalance. We leverage multi-core shared-memory systems, together with the persistent object-based Charm++ model, to mitigate load imbalance in Charm++ applications.
OpenMP is the de facto standard for task-level parallel programming. Because OpenMP keeps its own thread pool, interoperation of OpenMP with Charm++ incurs overheads such as oversubscription. We extend OpenMP to use Charm++ threads, so that Charm++ applications can make efficient use of OpenMP for better intra-node performance. This talk describes how the OpenMP integration works in the Charm++ RTS and shows performance results for several scientific applications.
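A rough sketch of the usage model described above, assuming a Charm++ build with the integrated OpenMP runtime (the chare name, its .ci declaration, and the build options are hypothetical and omitted):

```cpp
// Sketch only: with the integrated OpenMP support, the threads spawned by the
// pragma below are scheduled as Charm++ user-level threads on idle PEs of the
// same node rather than by a separate OpenMP thread pool, avoiding
// oversubscription. "Block" and its interface (.ci) file are hypothetical.
#include "block.decl.h"   // generated from the omitted .ci file
#include <omp.h>
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;
public:
  explicit Block(int n) : data(n, 1.0) {}

  // Entry method: intra-node parallelism via OpenMP, while decomposition and
  // load balancing across the machine remain under Charm++ control.
  void relax(double a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] *= a;
    // ... contribute to a reduction / send results to the next phase ...
  }
};

#include "block.def.h"
```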
10:45 am - 11:15 am |
Morning | Technical Session: Charm++ Interfaces (Chair: Dr. Juan Galvez) - 2405 Siebel Center
11:15 am - 11:45 am | Talk
Adaptive MPI: Overview and Recent Work
Samuel White
Adaptive MPI (AMPI) is an implementation of the MPI standard written on top of Charm++. AMPI provides high-level, application-independent features such as over-decomposition, dynamic load balancing, and automatic fault tolerance to MPI codes. This talk gives a high-level overview of AMPI, its features, recent improvements, and performance results.
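As a rough illustration of what AMPI asks of an application (the migration call and info names below follow recent AMPI releases and may differ in older ones), an ordinary MPI program only needs to expose points at which the runtime may rebalance its over-decomposed ranks:

```cpp
// Sketch: a plain MPI code run under AMPI. Build with ampicxx and launch with
// more virtual ranks than cores, e.g.
//   ./charmrun ./app +p4 +vp64 +balancer GreedyLB
// AMPI_Migrate / AMPI_INFO_LB_SYNC follow recent AMPI releases; older versions
// expose the same functionality under different names.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  for (int iter = 0; iter < 1000; ++iter) {
    // ... the application's usual computation and MPI communication ...

    // Collective hint: a safe point for the runtime to measure load and
    // migrate virtual ranks between processors.
    if (iter % 100 == 99)
      AMPI_Migrate(AMPI_INFO_LB_SYNC);
  }

  MPI_Finalize();
  return 0;
}
```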
11:45 am - 12:15 pm | Invited Talk
Argobots and its Application to Charm++
Sangmin Seo
In this talk, we first introduce Argobots, a lightweight low-level threading and tasking framework, which aims at providing high-level runtimes and domain-specific libraries with efficient threading and tasking mechanisms. Argobots is designed to deal with massive on-node parallelism by exposing two levels of execution components (execution streams for hardware resources, i.e., cores or hardware threads, and work units such as user-level threads and tasklets for user tasks) and providing mapping and scheduling mechanisms between them. Then, we demonstrate how an existing parallel programming model, Charm++, can be implemented on top of Argobots. In our implementation, we basically replace the Converse runtime in the Charm++ infrastructure with Argobots. We show the performance results of our implementation using LeanMD. Finally, we show how we incorporate the shrink-expand capability of Argobots into Charm++ and discuss its performance and power effects.
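For context, a minimal Argobots sketch (C API, error handling omitted) showing the two levels the abstract mentions: an execution stream bound to a hardware resource, and user-level threads pushed into its pool:

```cpp
#include <abt.h>
#include <cstdio>

// Each user-level thread (ULT) simply reports that it ran.
static void hello(void* arg) {
    std::printf("ULT %d running\n", (int)(size_t)arg);
}

int main(int argc, char** argv) {
    ABT_init(argc, argv);

    // Execution stream: maps to a hardware resource (core or hardware thread).
    ABT_xstream xstream;
    ABT_xstream_create(ABT_SCHED_NULL, &xstream);

    // Work units (here ULTs) are pushed into the stream's main pool.
    ABT_pool pool;
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    ABT_thread threads[4];
    for (size_t i = 0; i < 4; ++i)
        ABT_thread_create(pool, hello, (void*)i, ABT_THREAD_ATTR_NULL, &threads[i]);
    for (int i = 0; i < 4; ++i) {
        ABT_thread_join(threads[i]);
        ABT_thread_free(&threads[i]);
    }

    ABT_xstream_join(xstream);
    ABT_xstream_free(&xstream);
    ABT_finalize();
    return 0;
}
```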
12:15 pm - 1:30 pm | Lunch - Provided - 2nd floor atrium outside 2405 Siebel Center
Afternoon | Technical Session: Virtualization (Chair: Eric Bohm) - 2405 Siebel Center
1:30 pm - 2:00 pm | Submitted Paper
Using SimGrid to Evaluate the Impact of AMPI's Load Balancing in a Geophysics HPC Application
Rafael Keller Tesser
Iterative parallel applications based on MPI are commonplace in High Performance Computing. Some have intrinsic load balancing issues that are very difficult to address at the application level. One way to tackle this imbalance is over-decomposition coupled with dynamic load balancing. This approach, however, requires intrusive modifications of the application before the potential benefit can even be evaluated. We propose a simulation-based performance evaluation methodology that requires minimal application modification and allows quick exploration of over-decomposition parameters and load balancing strategies at very low cost. We present preliminary but convincing validation results on a real geophysics code, comparing simulation with real executions.
2:00 pm - 2:30 pm | Talk
Reducing Checkpoint Size in PlascomCM with Lossy Compression
Jon Calhoun
Data movement, particularly I/O, is starting to limit application scalability. I/O performance can be improved by employing compression techniques; however, traditional lossless compression fails to reduce floating-point dataset sizes significantly. Lossy compression can reduce dataset sizes by 5-50x, but adds error into the simulation. In this talk, we investigate lossy compression to reduce checkpoint size in the AMPI application PlascomCM. In particular, we discuss how application knowledge can be used to set the compression error tolerance so that any error added is masked by the application.
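As a toy illustration of the error-bounded idea (not the compressor used in the talk), quantizing values so that the reconstruction error stays within a chosen tolerance looks roughly like this; a real compressor would additionally entropy-code the quantized stream:

```cpp
// Toy illustration of error-bounded lossy compression: values are quantized to
// integers so that the reconstruction error never exceeds a user-chosen
// absolute tolerance.
#include <cstdint>
#include <vector>
#include <cmath>
#include <cassert>

std::vector<int64_t> quantize(const std::vector<double>& data, double tol) {
    std::vector<int64_t> q(data.size());
    for (size_t i = 0; i < data.size(); ++i)
        q[i] = static_cast<int64_t>(std::llround(data[i] / (2.0 * tol)));
    return q;
}

std::vector<double> dequantize(const std::vector<int64_t>& q, double tol) {
    std::vector<double> out(q.size());
    for (size_t i = 0; i < q.size(); ++i)
        out[i] = q[i] * 2.0 * tol;     // bin width 2*tol -> max error tol
    return out;
}

int main() {
    std::vector<double> field = {1.000, 1.0004, 3.1415, -2.7182};
    double tol = 1e-3;                              // checkpoint error tolerance
    auto restored = dequantize(quantize(field, tol), tol);
    for (size_t i = 0; i < field.size(); ++i)
        assert(std::fabs(restored[i] - field[i]) <= tol);
    return 0;
}
```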
2:30 pm - 3:00 pm | Invited Talk
A parallel library for multidimensional array computations with runtime tuning
Edgar Solomonik
Cyclops Tensor Framework (CTF) is a distributed-memory library targeted at scientific applications working with multidimensional datasets. The framework follows in the footsteps of Charm++, providing efficient runtime data orchestration derived from user-directed high-level algorithm expression. CTF has been developed alongside applications for electronic structure calculations, facilitating expression of many quantum chemistry methods as sequences of algebraic operations on tensors (multidimensional matrices). The library uses virtualization and topology-aware mapping to efficiently decompose tensors, selecting the best layout and algorithm by an online parallel evaluation of performance models. Recent new features in CTF include support for sparse tensors and user-defined algebraic structures (e.g. semirings), stepping forward from quantum chemistry to arbitrary (hyper)graph computations.
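A small sketch of CTF's index-expression interface (names follow the library's public examples; details may vary across versions): a distributed matrix multiplication written as a tensor contraction, with the data layout and algorithm chosen by the runtime:

```cpp
#include <mpi.h>
#include <ctf.hpp>
using namespace CTF;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    {
        World world(argc, argv);            // spans all MPI ranks
        int n = 1000;
        Matrix<double> A(n, n, NS, world);  // NS: no symmetry
        Matrix<double> B(n, n, NS, world);
        Matrix<double> C(n, n, NS, world);
        B.fill_random(0.0, 1.0);
        C.fill_random(0.0, 1.0);
        // Contraction over k; CTF selects the processor grid, mapping, and
        // algorithm via its online performance models.
        A["ij"] += B["ik"] * C["kj"];
    }
    MPI_Finalize();
    return 0;
}
```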
3:00 pm - 3:30 pm |
Afternoon | Technical Session: New Runtime System Features (Chair: Xiang Ni) - 2405 Siebel Center
3:30 pm - 4:00 pm | Submitted Paper
An Extension of Charm++ to Optimize Fine-Grained Applications
Alexander Frolov
In this presentation, an extension to the Charm++ parallel programming language is proposed to enable additional optimizations for fine-grained Charm++ applications. The extension provides a concept of microchares (or uchares), incorporated into standard Charm++ chares, that allows the Charm++ message-driven programming model to be used for very fine-grained objects (up to tens of thousands of uchares per PE) with significantly less runtime overhead than traditional Charm++ chares would incur. The Charm++ translator is modified to support the uchare array construct, and a uchare library has been developed. Preliminary evaluation shows a significant performance gain over pure Charm++ and the TRAM aggregation library on the RandomAccess and asynchronous Breadth-First Search benchmarks when the number of chares per core is sufficiently large.
4:00 pm - 4:20 pm | Talk
One-sided: Accelerating Large Charm++ Messages Using RDMA
Nitin Bhat
One-sided communication over RDMA-capable hardware has been shown to provide lower latencies and higher bandwidth for large payloads in HPC networks. With the advent of exascale computing, the number and size of messages are expected to increase. The existing one-sided communication path in Charm++ makes a copy of the payload at the sender side; as the payload gets larger, the cost of this copy grows proportionally. This talk focuses on the implementation and benefits of the new zero-copy one-sided communication feature in Charm++, which eliminates the large payload copy. The Charm++ user-level API for performing zero-copy RDMA calls will also be discussed.
4:20 pm - 4:50 pm | Talk
Heterogeneous Task Execution Frameworks in Charm++
Michael Robson
With the increased adoption of and reliance on accelerators, particularly GPUs,
to achieve more performance in current and next generation supercomputers,
effectively utilizing these devices has become very important.
However, there has not been a commensurate increase in the ability to program
and interact with these devices.
We seek to bridge the GPU usability and programmability gap in Charm++ through
a variety of GPU frameworks that programmers can utilize.
Our ultimate goal is to enable our users to easily and automatically leverage
the compute power of these devices without having to rewrite significant
portions of their code.
In this talk we will present the various frameworks available in Charm++ for
programmers interacting with accelerators, their current features and trade-offs,
and a brief overview of some major Charm applications that currently utilize
various pieces of the Charm++ accelerator stack.
We will also present some preliminary performance results and review the
programmability enhancements these frameworks offer.
Finally, we will examine Charm's future directions as nodes grow in size, new
accelerators are introduced, and heterogeneous load balancing at various levels
and across different node types becomes increasingly important.
4:50 pm - 5:10 pm | Discussion
Ongoing Research and Upcoming Features in Charm++
Prof. Laxmikant V. Kale
6:30 pm onwards | Workshop Banquet (for registered participants only) - 2nd floor atrium outside 2405 Siebel Center
8:30 am - 9:00 am | Continental Breakfast / Registration - Atrium outside NCSA Auditorium
Morning | Opening Session - NCSA Auditorium
9:00 am - 10:00 am | Keynote
HPC Runtimes: Opportunities, Requirements, and Examples
Prof. Thomas Sterling, Indiana University Bloomington/CREST
The future of High Performance Computing (HPC) is in the midst of severe controversy as the community decides how it is to attempt to achieve exascale capability. The choice appears to be between the almost static and the truly dynamic, or, to put it another way, between ballistic computing and guided computing. Advanced runtime systems like Charm++ provide the basis for guided computing, where continual introspection of system and application status enables dynamic adaptive control of resource management and task scheduling. Ballistic computing largely determines where and when activities are going to happen at programming, compile, and load times. The problem is that ballistic computing does not know everything about the computation, while guided computing requires more work to get right. Which way the HPC community and government-sponsored programs to exascale will go is still unresolved, but there is a danger that they will pursue a risk-averse approach and focus on "MPI + OpenMP" as an incremental step beyond current methods. For some applications this will work adequately, perhaps even at the intended scale. Why, then, go to the extra effort of investing in the development and application of dynamic runtime systems? This keynote presentation will discuss the opportunities that emerging runtime systems may deliver to future computations at extreme scale, and will also consider the challenges and requirements they will have to satisfy in order to be effective. Examples from the HPX-5 experimental runtime system, based on the ParalleX execution model, will be provided. Part of this discussion will center on how computer architecture may have to change in order to maximize the scalability and efficiency of runtimes and deliver the best performance and time to solution. Questions from the audience are encouraged throughout the presentation.
10:00 am - 10:30 am | Invited Talk
The Parallel Research Kernels: an Objective Tool for Parallel System Research
Maria Garzaran, Rob van der Wijngaart, and Tim Mattson
Much research in parallel computing is driven by anecdote: “we built it, it worked, and we really like it”. This is not because researchers are lazy. Rather, exploring parallel systems with available workloads is difficult and too time consuming for most groups to tolerate. Benchmarks and mini-applications help, but they still require more work to port to multiple parallel systems than most research groups can justify.
In response to this situation, we have embarked on a minimalist approach. We created a set of programs that probe the bottlenecks application programmers encounter when writing parallel applications. These programs are small, designed around particular scalability issues, and self-testing (to assess correctness as well as performance). We call these the parallel research kernels (https://github.com/ParRes/Kernels). Using these kernels, we have collected objective data on an unusually large collection of programming systems. In this talk, we discuss results from the parallel research kernels on a number of programming systems with exascale ambitions (including Charm++).
10:30 am - 11:00 am |
Morning | Technical Session: Tools (Chair: Harshitha Menon) - NCSA Auditorium
11:00 am - 11:30 am | Talk
Performance Analysis and Projections
Ronak Buch
One of the chief challenges in developing and running extreme scale applications is actually achieving high performance in practice. It is crucial for developers to be able to identify, diagnose, and fix computational imbalance, message bottlenecks, underutilization of processors, and other performance problems. To this end, we have developed Projections, a performance analysis tool for Charm++. In this talk, we discuss recent additions to Projections and how they can be used to analyze various performance properties. Additionally, we will describe how Projections has been used to solve performance woes in production applications.
11:30 am - 12:00 pm | Talk
Variation Among Processors Under Turbo Boost in HPC Systems
Bilge Acun
The design and manufacture of present-day CPUs cause inherent variation among chips in supercomputers, such as differences in power and temperature. The variation also manifests itself as frequency differences among processors under Turbo Boost dynamic overclocking. This can lead to unpredictable and suboptimal performance in tightly coupled HPC applications. In this study, we use compute-intensive kernels and applications to analyze the variation among processors in four top supercomputers: Edison, Cab, Stampede, and Blue Waters. We observe an execution time difference of up to 16% among processors on the Turbo Boost-enabled supercomputers (Edison, Cab, Stampede), and less than 1% variation on Blue Waters, which does not have a dynamic overclocking feature. We analyze measurements from temperature and power instrumentation and find that intrinsic differences in the chips' power efficiency are the culprit behind the frequency variation. Moreover, we analyze potential mitigations such as disabling Turbo Boost, leaving idle cores, and replacing slow chips. We also propose a speed-aware dynamic task redistribution (load balancing) algorithm to reduce the negative effects of performance variation. Our speed-aware load balancing algorithm improves performance by up to 18% compared to no load balancing, and by 6% over its speed-agnostic counterpart.
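To make the idea of speed-aware redistribution concrete, here is a minimal greedy sketch (not the algorithm from the talk): tasks are assigned, heaviest first, to the processor that would finish them earliest given its measured clock frequency; a frequency-oblivious balancer is the same code with all frequencies set equal.

```cpp
// Minimal illustration of speed-aware task redistribution.
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>
#include <cstdio>

std::vector<int> speedAwareGreedy(const std::vector<double>& taskLoads,
                                  const std::vector<double>& freqGHz) {
    // Min-heap of (projected completion time, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < (int)freqGHz.size(); ++p) procs.push({0.0, p});

    // Consider the heaviest tasks first.
    std::vector<int> order(taskLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return taskLoads[a] > taskLoads[b]; });

    std::vector<int> assignment(taskLoads.size());
    for (int t : order) {
        auto [time, p] = procs.top(); procs.pop();
        assignment[t] = p;
        // A faster chip completes the same work in proportionally less time.
        procs.push({time + taskLoads[t] / freqGHz[p], p});
    }
    return assignment;
}

int main() {
    std::vector<double> loads = {4, 3, 3, 2, 2, 1};   // work per task
    std::vector<double> freqs = {3.5, 3.1, 2.9};      // measured GHz per core
    auto a = speedAwareGreedy(loads, freqs);
    for (size_t t = 0; t < a.size(); ++t)
        std::printf("task %zu -> core %d\n", t, a[t]);
    return 0;
}
```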
12:00 pm - 12:30 pm | Talk
FlipBack: Automatic Targeted Protection Against Silent Data Corruption
Xiang Ni
The decreasing size of transistors has been critical to the increase in capacity of supercomputers. It is predicted that transistors will likely be one third of their current size by the time exascale computers are available. The smaller the transistors are, the less energy is required to flip a bit, and thus silent data corruptions (SDCs) are likely to occur more frequently. Traditional approaches to protecting applications from SDCs come at the cost of either doubling the hardware resources used or at least doubling application execution time. In this paper, we present FlipBack, a novel automatic software-based approach that protects applications from SDCs. FlipBack enables targeted protection for different types of data and calculations based on their characteristics. We evaluate FlipBack with HPC proxy applications that capture the behavior of real scientific applications and show that FlipBack is able to fully protect applications from silent data corruptions with only 6-20% performance degradation.
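As a generic illustration of software-level SDC detection (not FlipBack's actual mechanism), one common building block is a checksum kept alongside data that should only be read between updates, paired with a duplicate copy for repair; targeted schemes like the one described in the talk choose such protection per data type:

```cpp
// Generic illustration of checksum-based SDC detection and repair.
#include <cstdint>
#include <vector>
#include <cstdio>

uint64_t checksum(const std::vector<double>& v) {
    uint64_t h = 1469598103934665603ull;               // FNV-1a over raw bytes
    const unsigned char* p = reinterpret_cast<const unsigned char*>(v.data());
    for (size_t i = 0; i < v.size() * sizeof(double); ++i)
        h = (h ^ p[i]) * 1099511628211ull;
    return h;
}

int main() {
    std::vector<double> readOnlyField(1024, 1.5);
    std::vector<double> shadowCopy = readOnlyField;    // selective duplication
    uint64_t expected = checksum(readOnlyField);

    // ... computation that reads (but must not modify) readOnlyField ...

    if (checksum(readOnlyField) != expected) {         // silent bit flip detected
        std::puts("silent data corruption detected; restoring from copy");
        readOnlyField = shadowCopy;
    }
    return 0;
}
```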
12:30 pm - 1:30 pm | Lunch - Provided - 2nd floor atrium outside 2405 Siebel Center
1:30 pm - 3:00 pm | Panel
Higher Level Abstractions for Parallel Programming: Promise or Distraction?
Panelists: Dr. Thomas Sterling, Dr. Laxmikant Kale, Dr. William Gropp, Dr. Maria Garzaran
After several decades of experience, parallel programming today is still fairly low level, and explicit decomposition of data and work is prevalent. Charm++ alleviates the burden of resource management, but it still requires explicit decomposition. Higher-level general-purpose languages, viz. Chapel, X10, and Legion, are promising but still not quite there. On the other hand, "low level" languages are doing reasonably well as the workhorses of parallel programming. This panel will examine whether the promise of improved productivity via higher-level languages is realizable and, if so, what forms it will take.
3:00 pm - 3:30 pm |
Afternoon | Technical Session: Applications (Chair: Bilge Acun) - 2405 Siebel Center
3:30 pm - 4:00 pm | Talk
Load Balancing and Asynchrony in PDES Applications
Eric Mikida
Parallel Discrete Event Simulation (PDES) differs in many ways from traditional HPC applications in the science and engineering domains. PDES applications often involve a large number of autonomous agents responding to very fine-grained messages. These messages are often spread non-uniformly through time, and the communication pattern among agents can be difficult to know a priori. The result is simulations with complex and irregular behavior that can be difficult to manage effectively. In this talk we'll show how the Charm++ model is well suited to PDES applications, and how certain features of the Charm++ runtime system can help manage some of the complexity present in them.
4:00 pm - 4:30 pm | Talk
Scriptable Asynchronous Multi-Copy Algorithms in NAMD via Charm++ Partitions
Jim Phillips
The Charm++ partitions feature allows a parallel run to be divided into a
number of independent Charm++ contexts (at most one per process) that can
exchange Converse messages via CmiInter... send functions. The partitions
feature is analogous to MPI_Comm_split, which is how it was prototyped as
a patch for the Charm++ 6.4 MPI machine layer in NAMD 2.9 before being
integrated into the LRTS layer in Charm++ 6.6 for NAMD 2.10.
Inter-partition communication is exposed in the NAMD Tcl scripting
interface as MPI-style replicaSend, replicaRecv, and replicaSendRecv
commands. These commands provide synchronous data exchange between pairs
of paused simulations, as required by basic multi-copy algorithms such as
replica exchange. NAMD 2.11 extends this interface, exploiting the
message-driven nature of Charm++ to provide asynchronous script evaluation
and in-memory checkpointing on remote partitions with currently-running
simulations. These new asynchronous multi-copy scripting capabilities
enable workflow-style programming (without a dedicated master partition),
and hence multiplexing of replica simulations on a smaller number of
Charm++ partitions.
4:30 pm - 5:20 pm | Talk
OpenAtom: On the fly ab initio molecular dynamics on the ground state surface with instantaneous GW-BSE level spectra
Glenn Martyna, Eric Bohm, and Subhasish Mandal
The goal of the OpenAtom project is to statistically sample complex environments in order to understand physical systems, with application across Science and Solutions. These include studying biological function enabled by fluctuations in both the environment and the biomolecules, detecting pollutants in complex aqueous systems via spectroscopic signatures, and understanding chemical reactions in dense arrays whose complex many-body reaction paths challenge simple models. We describe our progress towards fast ab initio computations that provide the ground-state energy surface describing the systems of interest within Density Functional Theory, together with excited-state properties and spectroscopy within the GW / Bethe-Salpeter approach.