Time | Type | Description
Morning | Tutorial Session: Charm++ - Thomas M. Siebel Center for Computer Science, Room 3405 for Charm++ basic features, Room 4124 for advanced topics
9:00 am - 1:00 pm | Tutorial
Charm++
PPL
8:00 am - 9:00 am | Continental Breakfast / Registration - NCSA 1st Floor Lobby
Morning | Opening Session - NCSA Auditorium 1122 (Chair: Sanjay Kale)
9:00 am - 9:20 am | Welcome
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
9:20 am - 10:20 am | Keynote
Resource Management Challenges in the Era of Extreme Heterogeneity
Ron Brightwell, Sandia National Laboratories
Future HPC systems will be characterized by a large number and variety of complex, interacting components including processing units, accelerators, deep memory hierarchies, multiple interconnects, and alternative storage technologies. In addition to extreme hardware diversity, there is a broadening community of computational scientists using HPC as a tool to address challenging problems. Systems will be expected to efficiently support a wider variety of applications, including not only traditional HPC modeling and simulation codes, but also data analytics and machine learning workloads. This era of extreme heterogeneity creates several challenges that will need to be addressed to enable future systems to be effective tools for enabling scientific discovery. A recent DOE/ASCR workshop discussed these challenges and potential research directions to address them. In this talk, I will give my perspective on the resource management challenges stemming from extreme heterogeneity and offer my views on the most important system software capabilities that will need to be explored to meet these challenges.
10:20 am - 10:45 am | Break
Morning | Technical Session: Applications I - NCSA Auditorium 1122 (Chair: Ronak Buch)
10:45 am - 11:15 am | Talk
Experiences with Charm++ and NAMD on the Summit POWER9/Volta Supercomputer
Dr. James Phillips, University of Illinois at Urbana-Champaign
The highly parallel molecular dynamics code NAMD has long been used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64-million-atom model of the HIV virus capsid. In 2007 NAMD was one of the first codes to run on a GPU cluster, and it is now one of the first on the new ORNL Summit supercomputer, which features IBM POWER9 CPUs, NVIDIA Volta GPUs, the NVLink CPU-GPU interconnect, and a dual-rail EDR InfiniBand inter-node network. This talk will cover the latest NAMD performance improvements and scaling results on Summit and other leading supercomputers, with an emphasis on recent Charm++ features and optimizations of particular relevance to NAMD.
11:15 am - 11:45 am | Talk
Improving NAMD Performance on Multi-GPU Platforms
Dr. David J. Hardy, University of Illinois at Urbana-Champaign
NAMD was, in 2007, the first full-featured production molecular dynamics software to use CUDA for accelerating its costliest computations. The initial effort was to offload the most demanding computation of NAMD, evaluation of the short-range non-bonded forces between atoms. Multiple Charm++ compute object workloads had to be aggregated to provide sufficient work for a GPU. Although there was extra cost due to managing the workload aggregation, as well as transferring data between host and device, the strategy was successful in harnessing GPU-accelerated clusters. Refinements soon followed that included asynchronous CUDA kernel execution, asynchronous streaming of host-device data transfers, and prioritized kernel scheduling to favor returning results needed off-node before those needed locally. This strategy to use CUDA devices as coprocessors made perfect sense at the time, enabling good parallel scalability on the GPU-accelerated supercomputers Blue Waters and Titan, while requiring minimal changes to NAMD’s existing codebase.
As GPU technology has evolved, NAMD has benefited from moving greater amounts of work to the GPU. The current NAMD codes include CUDA kernels for performing the PME grid point charge spreading and force interpolation, kernels for calculating forces for bonded interactions and cross-terms and exclusions, and even a set of kernels for calculating the entire PME long-range force contribution, intended only for single-node simulation. Effectively, the entire force calculation can now be offloaded to the GPU. Furthermore, NAMD has completely revised device management code, providing more effective use of multiple GPUs per node.
NVIDIA’s release of Volta has now shifted the balance of compute power almost entirely to the GPU, with the small remaining CPU calculations often posing a bottleneck to NAMD’s performance. Profiling results collected on a multi-GPU Volta platform show that effective use of just a single Volta GPU requires NAMD to employ a sufficient number of CPU cores, where the number of cores needed scales with the problem size. Also, the profiling shows that the parallel decomposition produces for each Charm++ user-level thread such a small amount of computational work that the cost of useful work is rivaled by the cost of the Charm++ scheduling functions.
This presentation will share details from the NAMD profiling and the optimization strategies being undertaken to improve the CPU vectorization so as to facilitate the offloading of these remaining CPU calculations to the GPU.
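To illustrate the offload pattern described above (asynchronous host-device transfers and kernel launches in prioritized CUDA streams, overlapped with CPU work), here is a minimal host-side sketch using the CUDA runtime API. It is not NAMD code: the kernel, buffer names, and sizes are placeholders, and a production code would use pinned host memory and completion callbacks rather than a blocking synchronize.

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for a short-range non-bonded force kernel.
    __global__ void computeForces(const float* coords, float* forces, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) forces[i] = 0.5f * coords[i];  // dummy work
    }

    void offloadStep(const float* h_coords, float* h_forces, int n) {
        float *d_coords, *d_forces;
        cudaMalloc(&d_coords, n * sizeof(float));
        cudaMalloc(&d_forces, n * sizeof(float));

        // Prioritized stream: results needed off-node would go into the
        // highest-priority stream so they are returned first.
        int leastPrio, greatestPrio;
        cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);
        cudaStream_t stream;
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatestPrio);

        // Asynchronous copy in, kernel launch, copy out; the host thread is free
        // to do other work (e.g., Charm++ scheduling) while these are in flight.
        cudaMemcpyAsync(d_coords, h_coords, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        int block = 256, grid = (n + block - 1) / block;
        computeForces<<<grid, block, 0, stream>>>(d_coords, d_forces, n);
        cudaMemcpyAsync(h_forces, d_forces, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);  // a real code would poll or use a callback here
        cudaStreamDestroy(stream);
        cudaFree(d_coords);
        cudaFree(d_forces);
    }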
11:45 am - 12:45 pm | Lunch - Provided - NCSA 1st Floor Lobby
Afternoon | Technical Session: Applications I & Runtime System - NCSA Auditorium 1122 (Chair: Karthik Senthil)
12:45 pm - 1:15 pm | Talk
ChaNGa: from cosmology to a flexible, parallel tree-code framework
Prof. Tom Quinn, University of Washington
1:15 pm - 1:45 pm | Talk
Integrated Runtime of Charm++ and OpenMP
Seonmyeong Bak, University of Illinois at Urbana-Champaign
The recent trend of increasing numbers of cores per chip has resulted in vast amounts of on-node parallelism. These high core counts result in hardware variability that introduces imbalance. Applications are also becoming more complex, resulting in dynamic load imbalance. Load imbalance of any kind can result in loss of performance and system utilization. We address the challenge of handling both transient and persistent load imbalances while maintaining locality with low overhead.
In this work, we propose an integrated runtime system that combines the Charm++ distributed programming model with concurrent tasks to mitigate load imbalances within and across shared memory address spaces. It utilizes a periodic assignment of work to cores based on load measurement, in combination with user-created tasks to handle load imbalance. We integrate OpenMP with Charm++ to enable creation of potential tasks via OpenMP’s parallel loop construct. This is also available to MPI applications through the Adaptive MPI implementation. We demonstrate the benefits of our work on three applications. We show improvements for Lassen of 29.6% on Cori and 46.5% on Theta. We also demonstrate the benefits on a Charm++ application, ChaNGa, by 25.7% on Theta, as well as on an MPI proxy application, Kripke, using Adaptive MPI.
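As a schematic illustration of the user-facing side of such an integration, a Charm++ object (or an AMPI rank) can expose loop-level parallelism to the node-level runtime simply through OpenMP's parallel loop construct. The class below is a hypothetical sketch; the .ci declarations and the Charm++-generated base class are omitted.

    #include <vector>

    // Hypothetical Charm++ chare; with the integrated Charm++/OpenMP runtime
    // described in the talk, the loop iterations become potential tasks that
    // idle cores within the node can help execute.
    class Block /* : public CBase_Block, generated from the omitted .ci file */ {
      std::vector<double> a, b, c;
    public:
      void compute() {
        const int n = static_cast<int>(a.size());
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
          c[i] = a[i] * b[i];  // per-iteration work shared across cores
        }
        // contribute(...);  // a reduction back to the driver chare would go here
      }
    };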
1:45 pm - 2:15 pm | Talk
Recent Developments in Dynamic Load Balancing
Ronak Buch, University of Illinois at Urbana-Champaign
One of the main features of Charm++ is its ability to dynamically balance load to improve resource utilization and communication performance. The evolving hardware landscape, however, with larger machine scales, slower cores, and heterogeneous architectures, presents challenges for fast and effective load balancing. In this talk, we will discuss several important developments in the Charm++ load balancing framework to overcome these challenges. These include: (a) new efficient load balancing strategies aimed at preserving the mapping of chares to their machine/network topology locations; (b) upcoming Metabalancer features; (c) load balancing performance improvements; (d) redesign of the load balancing framework, which will simplify the task of writing new load balancing strategies, improve performance of load balancing, and introduce a more powerful and flexible hierarchical load balancing framework; (e) recent advances in heterogeneous and vector load balancers (which allow load to be dynamically balanced across accelerator devices) and their integration within the new framework.
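For context, applications typically opt into Charm++'s measurement-based load balancing through the AtSync protocol and pick a strategy at launch. The sketch below follows the pattern described in the Charm++ manual; the class name is hypothetical, the .ci file is omitted, and a pup() method for migrating state is left out.

    #include "worker.decl.h"   // generated from the omitted worker.ci file

    class Worker : public CBase_Worker {
    public:
      Worker() {
        usesAtSync = true;          // opt in to measurement-based load balancing
      }
      Worker(CkMigrateMessage*) {}  // migration constructor
      void iterate() {
        doWork();                   // work measured by the runtime for the LB database
        AtSync();                   // periodically hand control to the load balancer
      }
      void ResumeFromSync() {       // called once any migrations are complete
        iterate();
      }
    private:
      void doWork() { /* ... */ }
    };

    #include "worker.def.h"

A strategy is then chosen on the command line, for example ./charmrun +p8 ./app +balancer GreedyLB (with the corresponding load balancer module linked in).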
2:15 pm - 2:30 pm | Break
Afternoon | Technical Session: Programming Languages & Interfaces - NCSA Auditorium 1122 (Chair: Eric Mikida)
2:30 pm - 3:00 pm | Talk
A SpECTRE With a New Face
Nils Deppe, Cornell University
Advanced LIGO has begun the exciting new era of gravitational wave astronomy with its groundbreaking discoveries of gravitational waves (GWs) from the merger of two black holes (BHBH). In addition to BHBH mergers, LIGO has observed binary neutron star (NSNS) mergers with telescopes observing the merger event in the electromagnetic spectrum. This means that NSNS mergers are multi-messenger events: they can be detected using both gravitational wave detectors and telescopes. The gravitational and electromagnetic waves emitted by NSNS mergers do not just depend on the initial masses of the stars, but also on their microphysics. For example, the way neutrinos interact and are emitted, or the equation of state inside the stars, plays a crucial role in both the gravitational and electromagnetic waves emitted. Unfortunately, the treatment of neutrinos and the equation of state are both not yet very well understood. The cores of neutron stars are well above nuclear density, and so no experiments on Earth have been able to tell us anything about physics at such scales. One of the major goals of studying NSNS mergers is to understand physics at these extremely high densities. Currently, however, state-of-the-art simulations do not have enough accuracy to be able to constrain models very tightly. Specifically, the computational errors are in the range of 1–10% and often not even quantifiable with current algorithmic and hardware limitations. Also, the simulations take too long: several months on present supercomputers even at the current low accuracy. Moreover, the methods do not scale well to upcoming exascale machines. Our code SpECTRE [1] is designed from the ground up to solve problems our current codes struggle to solve, with a focus on petascale and exascale platforms. It features two key new methods that will make breakthrough simulations of neutron star mergers and core-collapse supernovae possible: Discontinuous Galerkin (DG) discretization and task-based parallelism.
Over the past year we have ported and cleaned up the majority of the code we previously had in our private repository, and SpECTRE is now available at https://github.com/sxs-collaboration/spectre. We have implemented a variety of new features focusing on efficiency, code clarity, and ease of use. One of the major new features is the computational domain decomposition. The domain is divided into blocks, which are deformed hexahedra and make up the coarsest refinement level of the computational domain for our under-development adaptive mesh refinement implementation. Currently spheres, shells, wedges, and a general trilinear transformation are implemented, as well as several other shapes required for performing simulations of compact object binaries.
In order to successfully simulate black hole binaries, a moving grid, dual-frame, or Arbitrary-Lagrangian-Eulerian (ALE) approach must be used. Specifically, the partial differential equations are solved in what we call the “grid” frame, which is connected to the “inertial” frame (the physical frame) by a series of time-dependent coordinate transformations. The net result is that the excision spheres around the black holes rotate, move inwards, and deform with the black holes. A dual-frame approach has been found to be required for high-accuracy binary black hole evolutions using spectral-type methods such as DG. We have implemented some of the basic time-dependent coordinate maps as well as functions of time that are the parameters of the maps, and are starting development on the control system that will dynamically adjust the time-dependent coordinate maps. The time-dependent coordinate maps must track the black holes, which requires us to be able to find them. A basic horizon finder to find black holes is implemented, and work on advanced features such as adaptive refinement of the horizon surfaces is under way.
Another major new feature that has been implemented is a novel high-order conservative local time stepping algorithm. The ability to do local time stepping is expected to be especially useful with adaptive mesh refinement, since astrophysical simulations often span many orders of magnitude in spatiotemporal scales.
Design decisions while moving SpECTRE to GitHub have focused on making the code easier to use for physicists. To this end we developed a layer on top of Charm++ that eliminates the need for Charm++’s interface files. The motivation for removing interface files is that we have found them difficult to write and prone to producing runtime errors that are even more difficult to debug, especially when writing generic code. Our interface eliminates many of these pitfalls. Any pitfalls that are not eliminated trigger static assertions that provide compile-time error messages to the developer. We perform automatic registration of chares as well as of all entry methods invoked, even entry method templates, which have in the past been the source of difficult-to-find bugs.
I will briefly review the motivation behind SpECTRE, outline what we have accomplished, and then present our interface to Charm++ and how it could be used to replace interface files in the next generation of Charm++.
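For readers unfamiliar with the boilerplate being eliminated, a hand-written Charm++ interface (.ci) file typically looks like the following minimal, generic example (not taken from SpECTRE); the charmxi translator turns it into the registration and dispatch code that SpECTRE's layer instead generates automatically.

    // hello.ci - a typical hand-written Charm++ interface file
    module hello {
      array [1D] HelloChare {
        entry HelloChare();
        entry void sayHello(int step);
        // Templated entry methods must also be declared (and instantiated) here,
        // one source of the hard-to-debug errors mentioned above.
      };
    };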
3:00 pm - 3:30 pm | Talk
Concept-based runtime polymorphism with Charm++ chare arrays using value semantics
Dr. Jozsef Bakosi, Los Alamos National Laboratory
We discuss a generic and migratable C++ helper class that enables runtime
polymorphism among chare arrays using value semantics. This enables hiding,
behind a single type, different Charm++ array proxy types that model a single
concept, i.e., define some common functions as Charm++ entry methods. As a
result, member functions can be invoked by client code without knowing the
underlying type that models the concept. Since the solution uses value semantics,
it entirely hides the dispatch and keeps client code easy to read.
Full implementation and more details at:
[1] https://github.com/quinoacomputing/quinoa/blob/develop/src/Inciter/Scheme.h
[2] https://github.com/quinoacomputing/quinoa/blob/develop/src/Inciter/SchemeBase.h
[3] https://github.com/quinoacomputing/quinoa/blob/develop/src/Base/Variant.h
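As a rough sketch of the general idea, and not the Quinoa implementation linked above, a variant over proxy types can hide the dispatch behind ordinary member functions while keeping value semantics. The proxy types and method names below are hypothetical stand-ins for Charm++ array proxies.

    #include <utility>
    #include <variant>

    // Stand-ins for two chare array proxy types that model the same concept,
    // i.e., both expose the same entry methods (hypothetical names).
    struct ProxyA { void setup(double t) { /* thisProxy.setup(t) in real code */ } };
    struct ProxyB { void setup(double t) { /* ... */ } };

    // Value-semantics wrapper: client code holds a Scheme by value and never
    // sees which concrete proxy type is underneath.
    class Scheme {
      std::variant<ProxyA, ProxyB> proxy_;
    public:
      explicit Scheme(std::variant<ProxyA, ProxyB> p) : proxy_(std::move(p)) {}
      void setup(double t) {
        std::visit([t](auto& p) { p.setup(t); }, proxy_);  // dispatch hidden here
      }
    };

    // Usage:  Scheme s{ProxyA{}};  s.setup(0.1);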
3:30 pm - 4:00 pm | Talk
Parallel Programming with Charm++ in Python
Dr. Juan Galvez, University of Illinois at Urbana-Champaign
Charmpy is a parallel/distributed programming framework for Python based on the Charm++ programming model. It is built on top of the Charm++ runtime, and provides asynchronous remote method invocation, automatic load balancing and overlap of computation and communication. Charmpy takes advantage of the dynamic language features of Python to offer a high-level, simple and powerful API that simplifies writing and debugging parallel applications. Applications, written as collections of distributed objects, are expressed in Python in a natural way, and no specialized languages, preprocessing or compilation steps are necessary. Through Python, Charmpy also offers automatic memory management and object serialization. High performance for Charmpy applications is achieved using the various performance options available for Python, for example: the NumPy scientific computing library, compilation of Python to native instructions using Numba, or interfacing with existing C/Fortran/OpenMP code. Using Charmpy, we have written Python mini-apps that achieve similar performance and scaling characteristics as their C++ counterparts.
4:00 pm - 4:15 pm | Break
Afternoon | Technical Session: Applications II - NCSA Auditorium 1122 (Chair: Michael Robson)
4:15 pm - 4:45 pm | Talk
Projector Augmented Wave based Kohn-Sham Density Functional Theory in OpenAtom with N^2 log(N) scaling
Dr. Qi Li, University of Illinois at Urbana-Champaign & IBM TJ Watson Research
In order to treat metals and metal-semiconductor heterostructures, plane-wave based Kohn-Sham Density Functional Theory (KS-DFT) computations are typically performed using norm-conserving non-local pseudopotentials, yielding a fast, highly parallelized method. However, for many systems of interest, the Projector Augmented Wave (PAW) approach of Blöchl has many important advantages, most notably a less demanding plane-wave cutoff energy, due to the splitting of the KS states into delocalized and core parts, and direct access to the core states if needed for NMR and other applications. However, traditional PAW based KS-DFT scales as ~N^3 and the parallel performance is poor compared to the plane-wave approach. To solve this problem, we have developed a PAW based KS-DFT with ~N^2 log(N) scaling enabled by the Euler Exponential Spline (EES) and FFTs to form a grid multi-resolution method. We have also increased the accuracy of the long-range interactions by applying Ewald summation methods to the Hartree and external components. Our development, as implemented in OpenAtom, inherits all the merits of the PAW method with significantly higher parallel and scalar efficiency.
4:45 pm - 5:15 pm | Talk
OpenAtom: First Principles GW method for electronic excitations
Dr. Minjung Kim, Yale University
We highlight our progress towards developing, implementing, testing, and parallelizing an implementation of the GW method for electronic excitations within the OpenAtom software. We will summarize the standard GW algorithm and its current implementation and parallel scaling as well as our recently completed development of a much faster GW approach.
5:15 pm - 5:30 pm | Discussion
Upcoming Improvements and Features in Charm++
Eric Bohm, Charmworks, Inc.
6:45 pm - 8:45 pm | Workshop Banquet (for registered participants only) - Located at the 2nd floor atrium of Siebel Center

8:00 am - 9:00 am | Continental Breakfast - NCSA 1st Floor Lobby
Morning | Opening Session (Chair: Sanjay Kale) - NCSA Auditorium 1122
9:00 am - 10:00 am | Keynote
Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library
Prof. Dhabaleswar K. (DK) Panda, The Ohio State University
With modern networking technologies providing a rich set of features and mechanisms, there is an increasing trend to design runtimes that exploit maximum overlap between computation and communication. This talk will focus on the set of features available in the MVAPICH2 library to exploit overlap of computation and communication on modern clusters. Sample features will include: job start-up, point-to-point operations, RMA operations, kernel-based collectives, and non-blocking collectives (with and without CORE-Direct support). For the MVAPICH2-GDR library (the optimized GPU version), we will additionally focus on the use of GPUDirect RDMA and kernel-based reduction and datatype operations. Performance benefits of these features will be presented.
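As a generic illustration of the kind of overlap being discussed (plain MPI, not MVAPICH2-specific code), a non-blocking collective lets independent computation proceed while the library and the network hardware progress the communication:

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      std::vector<double> local(1 << 20, 1.0), global(1 << 20);
      MPI_Request req;

      // Start the collective without blocking...
      MPI_Iallreduce(local.data(), global.data(), static_cast<int>(local.size()),
                     MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

      // ...overlap it with independent computation...
      double acc = 0.0;
      for (double x : local) acc += x * x;

      // ...and complete it only when the result is actually needed.
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      MPI_Finalize();
      return 0;
    }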
10:00 am - 10:30 am | Talk
Adaptive MPI: Features and Recent Developments
Sam White, University of Illinois at Urbana-Champaign
Adaptive MPI (AMPI) is an implementation of the MPI standard written on top of Charm++. AMPI provides support to existing MPI codes for features such as over-decomposition, dynamic load balancing, and automatic fault tolerance. This talk provides an overview of AMPI's features and highlights recent work on its performance, particularly within shared-memory. We compare AMPI's performance to other MPI implementations and show recent results from applications using AMPI.
10:30 am - 11:00 am | Break
Morning | Technical Session: Libraries - NCSA Auditorium 1122 (Chair: Jaemin Choi)
11:00 am - 11:30 am | Talk
Charades: An Adaptive Parallel Discrete Event Simulation Framework on Charm++
Eric Mikida, University of Illinois at Urbana-Champaign
This presentation highlights recent advances in the Charades Parallel Discrete Event Simulation system, which rely heavily on the natural asynchrony and adaptive overlap present in the Charm++ runtime system. Furthermore, recent improvements to low overhead, distributed load balancers in Charm++ provide further room for exploration and improvement.
11:30 am - 12:00 pm | Talk
A Highly Scalable Graph Clustering Library based on Parallel Union-Find
Karthik Senthil, University of Illinois at Urbana-Champaign
Connected components detection or clustering is a popular graph algorithm used in various domains of science and engineering. There has been active research into scalable algorithms on parallel and distributed machines, where most implementations are synchronous in nature.
To remove the overhead of synchronization and improve performance on large-scale graphs, we introduce a novel parallel connectivity algorithm based on asynchronous Union-Find operations. We have implemented this algorithm as a library in Charm++ and it can be used in any generic Charm++ application. In this talk we present the current status of the library and analyze its performance using various synthetic and real-world graphs. We also discuss various existing strategies and planned optimizations targeting real-world applications like ChaNGa to perform Friends-of-friends based galaxy detection.
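For context, the data structure at the heart of such a connectivity algorithm is the classic union-find (disjoint-set) structure; a minimal sequential version with path compression and union by rank is sketched below. The library itself performs these operations asynchronously over a distributed graph, which this sketch does not attempt to show.

    #include <numeric>
    #include <utility>
    #include <vector>

    // Sequential union-find with path compression (halving) and union by rank.
    class UnionFind {
      std::vector<int> parent, rank_;
    public:
      explicit UnionFind(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);  // each vertex starts as its own root
      }
      int find(int x) {
        while (parent[x] != x) {
          parent[x] = parent[parent[x]];  // path halving keeps trees shallow
          x = parent[x];
        }
        return x;
      }
      void unite(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank_[ra] < rank_[rb]) std::swap(ra, rb);
        parent[rb] = ra;                  // attach the shallower tree under the deeper one
        if (rank_[ra] == rank_[rb]) ++rank_[ra];
      }
    };

    // Two vertices u and v belong to the same cluster iff find(u) == find(v).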
12:00 pm - 1:15 pm | Lunch - Provided - NCSA 1st Floor Lobby
1:15 pm - 2:15 pm | Panel
Architectural Convergence of Big Data and Extreme-Scale Computing: Marriage of Convenience or Conviction
Panelists: Prof. Dhabaleswar K. (DK) Panda, Prof. Marc Snir, Mr. Ron Brightwell, Dr. Bill Kramer, Prof. Laxmikant Kale
Moderator: Dr. Edgar Solomonik
2:15 pm - 2:30 pm | Break
Afternoon | Technical Session: Applications III - NCSA Auditorium 1122 (Chair: Sam White)
2:30 pm - 3:00 pm | Talk
A more robust SIMD library for ChaNGa and Charm++
Tim Haines, University of Wisconsin-Madison
We present an improved SIMD library for ChaNGa and Charm++ with support for all major variants of the x86 SSE and AVX instruction sets. We also explore the AVX-512 instruction set used on the Intel(R) Xeon Phi(R) coprocessors, with particular emphasis on the Knights Landing (KNL) revision. Using C++ best practices, our library allows users to write a single source that can compile to any supported SIMD instruction set (including none) without any explicit intervention. This allows for automatic performance improvements as SIMD platforms evolve, and it drastically simplifies porting of ChaNGa's source to the KNL platform, where SIMD utilization is mandatory in order to achieve peak performance. We discuss the process of converting ChaNGa's gravity kernels to use the new library and provide benchmarking results. We conclude with a discussion of ongoing work to expand the mathematical functionality to include trigonometric and special functions.
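A toy version of the single-source idea, not the actual library interface, might wrap a native vector register behind a fixed-width type and fall back to scalar code when no SIMD instruction set is enabled:

    #include <cstddef>

    #if defined(__AVX__)
      #include <immintrin.h>
      // 8-wide single-precision vector backed by an AVX register.
      struct simd_float {
        __m256 v;
        static constexpr std::size_t width = 8;
        static simd_float load(const float* p) { return { _mm256_loadu_ps(p) }; }
        void store(float* p) const { _mm256_storeu_ps(p, v); }
        friend simd_float operator+(simd_float a, simd_float b) { return { _mm256_add_ps(a.v, b.v) }; }
      };
    #else
      // Scalar fallback with the same interface, so user code compiles everywhere.
      struct simd_float {
        float v;
        static constexpr std::size_t width = 1;
        static simd_float load(const float* p) { return { *p }; }
        void store(float* p) const { *p = v; }
        friend simd_float operator+(simd_float a, simd_float b) { return { a.v + b.v }; }
      };
    #endif

    // User code is written once against simd_float and picks up whatever
    // instruction set the compiler targets (SSE/AVX/AVX-512 in the real library).
    void add(const float* a, const float* b, float* c, std::size_t n) {
      for (std::size_t i = 0; i + simd_float::width <= n; i += simd_float::width)
        (simd_float::load(a + i) + simd_float::load(b + i)).store(c + i);
      // Remainder elements would be handled with a scalar loop in real code.
    }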
3:00 pm - 3:30 pm | Talk
Progress towards development of discontinuous Galerkin finite-element methods for compressible flows using Charm++
Aditya K Pandare, North Carolina State University
The discontinuous Galerkin methods (DGM) have become popular for the solution of systems of conservation laws in computational fluid dynamics in the past few decades. The DG methods combine two advantageous features commonly associated with finite element and finite volume methods. As in classical finite element methods, the DGM achieve high order accuracy by means of high-order polynomial approximation within an element rather than by use of wider stencils as in the case of finite volume methods. The physics of wave propagation is, however, accounted for by solving Riemann problems that arise from the discontinuous representation of the solution at element interfaces, which makes them similar to finite volume methods. High-order solution polynomials allow the resolution of fine-scale physics without the need for mesh refinement. In other words, by increasing the order of the solution polynomial, the size of the problem that needs to be solved is reduced. Increasing the order comes at the price of solving more unknowns on each mesh cell. However, the improvement in the fine-scale resolution outweighs this increase in cost. Thus, the DG methods have gained popularity not only in CFD but also in the computational physics community.
The discontinuous Galerkin methods have many attractive features: (1) Their mathematical rigor implies useful mathematical properties with respect to conservation, stability and convergence; (2) The methods can be easily extended to compact higher-order (> 2nd) approximations even on unstructured meshes, suitable for complex engineering simulations; (3) They can also handle non-conforming elements, where the grids are allowed to have hanging nodes; (4) The methods are highly parallelizable, as they are compact and each element is independent; (5) Since the elements are discontinuous, and the inter-element communications are minimal, domain decomposition can be efficiently employed. The compactness also allows for structured and simplified implementation and coding; (6) They can easily handle adaptive strategies, since refining or coarsening a grid can be achieved without considering the continuity restriction commonly associated with conforming elements; (7) The methods allow easy implementation of hp-refinement; for example, the order of accuracy, or shape, can vary from element to element. p-refinement can be achieved by simply increasing the order of the approximation polynomial.
However, the DGM have their own weaknesses. In particular, compared to the finite element methods and finite volume methods, the DGM require solutions of systems of equations with more unknowns for the same grids. Consequently, these methods have been recognized as expensive in terms of computational costs. However, this cost, as mentioned before, is localized to grid cells, and thus the increased computational cost (FLOPs/memory) may be advantageous on future hardware. I will give an overview of DG methods and the basic ingredients of their software implementation. Then I will discuss our near-future plan towards a compressible Navier-Stokes solver for 3D unstructured meshes in Charm++ combined with solution-adaptive refinement.
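In equations, the element-local weak form underlying the method (standard DG notation, not taken from the speaker's solver) is obtained by multiplying a conservation law \partial_t u + \nabla \cdot \mathbf{F}(u) = 0 by a test function \phi_h from the element's polynomial space and integrating by parts over each element K:

    \frac{\mathrm{d}}{\mathrm{d}t}\int_K u_h\,\phi_h\,\mathrm{d}x
      - \int_K \mathbf{F}(u_h)\cdot\nabla\phi_h\,\mathrm{d}x
      + \oint_{\partial K} \hat{\mathbf{F}}\left(u_h^-,u_h^+\right)\cdot\mathbf{n}\,\phi_h\,\mathrm{d}s = 0,

where u_h and \phi_h live in the element-local polynomial space P^p(K) and \hat{\mathbf{F}} is the numerical (Riemann) flux coupling the discontinuous states on the two sides of each face. Raising p increases the order of accuracy without widening the stencil, at the cost of more unknowns per element, which is exactly the trade-off described in the abstract.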
3:30 pm - 4:00 pm | Talk
On the Suitability of Charm++ for Extreme-Scale Simulations of Mesoscale Particle Flow and Transport Phenomena
Dr. Yidong Xia, Idaho National Laboratory
Particle-based mesoscale numerical methods such as smoothed particle hydrodynamics (SPH), discrete element methods (DEM) and dissipative particle dynamics (DPD) are used in a variety of energy material related research areas at Idaho National Laboratory (INL), e.g. 1) recovery of unconventional fossil fuels from nanoporous tight shale, 2) carbon oxidation in nuclear graphite, and 3) biomass feedstock particle flow. Further development of these mesoscale models to enable efficient parallel computing on supercomputing resources (e.g. hundreds of thousands or even millions of processors) would substantially increase their range of scientific applications at the desired computational scales (e.g. 10^10 ~ 10^12 particles). Extreme-scale simulations with these particle-based methods might be achieved only if they were implemented with a highly effective load balancing scheme. General-purpose parallel particle simulators such as LAMMPS provide dynamic load balancing based on adaptive domain decomposition. However, this talk will show that the resulting simulation codes are still far from scalable to the desired spatial and temporal scales in problems of interest, e.g., 1) nanoscale fluidics in porous materials and 2) engineered biomass particle flowability characterization, mainly because of the highly non-uniform transient location distribution of particles in those problems. The core philosophy of Charm++, namely automatic load balancing (available to MPI codes through Adaptive MPI, AMPI) and smart rank scheduling, is especially attractive for the aforementioned models. However, a “good” Charm++ implementation of those models might not be trivial, especially when various engineered processes need to be considered in the programming, and when sophisticated boundary conditions need to be implemented by strictly following the Charm++ paradigm. This talk will present an overview of those mesoscale particle models and examples of their current scientific and engineering applications at INL, and an assessment of the suitability of Charm++ for the development of production-level application codes based on those models. In addition, simulation results of a few nanofluid dynamics benchmark problems as well as an engineered shale oil recovery process by flow injection in high-resolution realistic nanoporous shale pore networks will be demonstrated as a reference.
4:00 pm - 4:20 pm | Break
Afternoon | Technical Session: PPL Talks - NCSA Auditorium 1122 (Chair: Juan Galvez)
4:20 pm - 4:40 pm | Talk
Recent Communication Optimizations in Charm++
Nitin Bhat, Charmworks, Inc.
As we progress towards exascale, communication optimizations are crucial for improving performance and avoiding communication bottlenecks. In this talk, I'll introduce the existing copy-based messaging API in Charm++ and discuss the recent advancements and ongoing work in the area of messaging optimizations in Charm++. These include the new zerocopy API, which uses Remote Direct Memory Access (RDMA), and shared memory optimizations for inter-process communication using Cross Memory Attach (CMA). I'll also present results that highlight the performance improvements achieved with these optimizations.
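As a rough sketch of how a zerocopy send path is exposed to applications (following the zerocopy entry method API described in the Charm++ manual; exact syntax is version-dependent and all names below are placeholders):

    // In the .ci interface file, large parameters are marked nocopy so the runtime
    // can move them with RDMA (or CMA within a node) rather than copying them
    // through an intermediate message:
    //   entry void recvData(nocopy double data[n], size_t n);

    // On the sending side, the buffer is wrapped so it is transferred directly
    // from its original location; 'workers', 'peer', 'localBuffer', and
    // 'numElements' are placeholders.
    workers[peer].recvData(CkSendBuffer(localBuffer), numElements);
    // An optional completion callback can be attached to CkSendBuffer to learn
    // when localBuffer is safe to reuse.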
4:40 pm - 5:00 pm | Talk
User-facing improvements to Charm++ process launching
Evan Ramos, Charmworks, Inc.
Work has recently been undertaken at Charmworks to provide additional options and capabilities to users when running jobs with the Charmrun process launcher or as standalone executables. These center around the use of the Portable Hardware Locality library (hwloc) to query processor topology and use the resulting information to spawn processes and threads as requested through user-specified parameters without requiring a manual count of the full total.
5:00 pm - 5:20 pm | Talk
Using OpenMP offloading in Charm++
Dr. Matthias Diener, University of Illinois at Urbana-Champaign
High Performance Computing relies on accelerators (such as GPGPUs) to achieve fast execution of scientific applications. Traditionally, such accelerators have been programmed with specialized languages, such as CUDA or OpenCL. In recent years, OpenMP emerged as an alternative for supporting accelerators, providing advantages such as maintaining a single code base for the host and different accelerator types and providing a simple way to extend support for accelerators to existing code. This talk gives a brief overview of OpenMP offloading and discusses how Charm++ applications can benefit from it.
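A minimal example of the OpenMP offloading model referred to here (generic OpenMP 4.5+ target directives, independent of Charm++):

    #include <cstddef>

    // Offload a simple vector update to the default accelerator device using
    // OpenMP target directives; without a device, the loop runs on the host.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
      #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
      for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }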
5:20 pm - 5:45 pm | Talk
Recent Advances in Heterogeneous Computing using Charm++
Jaemin Choi and Michael Robson, University of Illinois at Urbana-Champaign
With the rise of heterogeneous systems in high performance computing, how we utilize accelerators has become a critical factor in achieving optimal performance. We first review the basics of heterogeneous computing and how to use GPU devices with the current support in Charm++. Then we explore some issues with using accelerators in Charm++, and discuss how the runtime could alleviate performance bottlenecks by enabling concurrent execution of heterogeneous tasks.
5:45 pm - 6:00 pm | Closing Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
6:00 pm - 8:00 pm |

Morning | Tutorial Session: Adaptive MPI - Thomas M. Siebel Center for Computer Science, Room 4405
10:00 am - 12:00 pm | Tutorial
Adaptive MPI
Sam White, University of Illinois at Urbana-Champaign