Workshop Program

October 20 (Tue)


Time (CDT, UTC-5) Type Title Speaker Affiliation Slides Video
Session: Keynote I (Chair: Laxmikant V. Kale)
09:00 - 09:20 Opening Remarks Laxmikant V. Kale University of Illinois Urbana-Champaign Video
09:20 - 10:20 Keynote Preparing for Extreme Heterogeneity in High Performance Computing Jeffrey S. Vetter Oak Ridge National Laboratory Slides Video
10:20 - 10:40 Break
Session: Asynchronous Applications (Chair: Justin Szaday)
10:40 - 11:10 Talk Asynchronous Distributed-Memory Task-Parallel Algorithm for Compressible Flows on 3D Unstructured Grids Aditya Pandare Los Alamos National Laboratory Slides Video
11:10 - 11:40 Talk Asynchronous Distributed-Memory Task-Parallel 3D Unstructured Mesh-to-Mesh Data Transfer Eric Mikida Charmworks, Inc. Video
11:40 - 12:10 Talk Bounded Asynchrony and Nested Parallelism for Scalable Graph Processing Lawrence Rauchwerger University of Illinois Urbana-Champaign Video
12:10 - 13:00 Lunch Break
Session: Astrophysical Simulations (Chair: Sam White)
13:00 - 13:30 Talk Enzo-E/Cello Computational Astrophysics and Cosmology James Bordner University of California, San Diego Slides Video
13:30 - 14:00 Talk Moving-mesh Hydrodynamics in ChaNGa Philip Chang University of Wisconsin-Milwaukee Slides Video
14:00 - 14:30 Talk SpECTRE: Toward Simulations of Binary Black Hole Mergers Using Charm++ Francois Hebert Caltech Slides Video
14:30 - 14:50 Break
Session: Quantum Chemistry (Chair: Eric Bohm)
14:50 - 15:20 Talk Projector Augmented Wave based Kohn-Sham Density Functional Theory Simulations with Reduced Order Scaling Glenn Martyna Pimpernel Science, Software and Information Technology Slides Video
15:20 - 15:50 Talk Scalable GW software for excited electrons using OpenAtom Kavitha Chandrasekar, Kayahan Saritas University of Illinois Urbana-Champaign, Yale University Slides Video
Session: Adaptive MPI & Parallel Tasks (Chair: Matthias Diener)
15:50 - 16:20 Talk Hurricane Storm Surge Analysis and Prediction Using ADCIRC-CG and AMPI Joannes Westerink, Eric Bohm Notre Dame University, University of Illinois Urbana-Champaign Slides-1 Slides-2 Video
16:20 - 16:50 Talk Recent Progress on Adaptive MPI Samuel White, Evan Ramos University of Illinois Urbana-Champaign, Charmworks, Inc. Slides Video
16:50 - 17:20 Talk Flexible Hierarchical Execution of Parallel Task Loops Michael Robson, Kavitha Chandrasekar Villanova University, University of Illinois Urbana-Champaign Slides Video


October 21 (Wed)


Time (CDT, UTC-5) Type Title Speaker Affiliation Slides Video
Session: Keynote II (Chair: Laxmikant V. Kale)
09:00 - 10:00 Keynote Asynchronous Programming in Modern C++ Hartmut Kaiser Louisiana State University Slides Video
Session: Adaptive Runtime (Chair: Ronak Buch)
10:00 - 10:30 Talk Design and Implementation Techniques for an MPI-Oriented AMT Runtime Jonathan Lifflander Sandia National Labs Slides Video
10:30 - 11:00 Talk Efforts to Bridge Theory and Practice on Distributed Scheduling Algorithms Laercio L. Pilla Laboratoire de Recherche en Informatique, Univ. Paris-Sud - CNRS Slides Video
11:00 - 11:20 Break
Session: Performance & Languages (Chair: Zane Fink)
11:20 - 11:50 Talk Analyzing Call Graphs using Hatchet Abhinav Bhatele University of Maryland Video
11:50 - 12:20 Talk Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems Jaemin Choi University of Illinois Urbana-Champaign Slides Video
12:20 - 12:50 Talk Advances in Charm++-based Languages Justin Szaday University of Illinois Urbana-Champaign Video
12:50 - 14:00 Lunch Break
Session: Keynote III (Chair: Laxmikant V. Kale)
14:00 - 15:00 Keynote Towards Performance Tools for Emerging GPU-Accelerated Exascale Supercomputers John Mellor-Crummey Rice University Slides Video
Session: Load Balancing & Molecular Dynamics (Chair: Kavitha Chandrasekar)
15:00 - 15:30 Talk Advances in VT's load balancing infrastructure and algorithms Phil Miller Intense Computing Slides Video
15:30 - 16:00 Talk Vector Load Balancing in Charm++ Ronak Buch University of Illinois Urbana-Champaign Slides Video
16:00 - 16:30 Talk Improving NAMD Performance and Scaling on Heterogeneous Architectures David Hardy, Julio Maia University of Illinois Urbana-Champaign Slides Video
Session: Charm++ Discussion (Chair: Eric Bohm)
16:30 - 17:00 Discussion Charm++ Release 6.11 Eric Bohm University of Illinois Urbana-Champaign Video
17:00 - 17:20 Closing Remarks Laxmikant V. Kale University of Illinois Urbana-Champaign Video

List of Talks


Keynote

Preparing for Extreme Heterogeneity in High Performance Computing

Jeffrey S. Vetter, Oak Ridge National Laboratory

While computing technologies have remained relatively stable for nearly two decades, new architectural features, such as heterogeneous cores, deep memory hierarchies, non-volatile memory (NVM), and near-memory processing, have emerged as possible solutions to address the concerns of energy-efficiency and cost. However, we expect this 'golden age' of architectural change to lead to extreme heterogeneity, which will have a major impact on software systems and applications. Software will need to be redesigned to exploit these new capabilities and provide some level of performance portability across these diverse architectures. In this talk, I will survey these emerging technologies, discuss their architectural and software implications, and describe several new approaches (e.g., domain specific languages, intelligent runtime systems) to address these challenges.


Keynote

Asynchronous Programming in Modern C++

Hartmut Kaiser, Louisiana State University

With the advent of modern computer architectures characterized by many-core nodes, deep and complex memory hierarchies, heterogeneous subsystems, and power-aware components, it is becoming increasingly difficult to achieve best possible application scalability and satisfactory parallel efficiency. The community is experimenting with new programming models that rely on finer-grain parallelism, flexible and lightweight synchronization, combined with work-queue-based, message-driven computation. The recently growing interest in the C++ programming language increases the demand for libraries implementing those programming models for the language. We present a new asynchronous C++ parallel programming model that is built around lightweight tasks and mechanisms to orchestrate massively parallel and distributed execution. This model uses the concept of Futures to make data dependencies explicit, employs explicit and implicit asynchrony to hide latencies and to improve utilization, and manages finer-grain parallelism with a work-stealing scheduling system enabling automatic load balancing of tasks. We have developed and implemented such a model as a C++ library exposing a higher-level parallelism API that is fully conforming to the existing C++11/14/17 standards and is aligned with the ongoing standardization work. This API and programming model have been shown to enable writing highly efficient parallel applications for heterogeneous resources with excellent performance and scaling characteristics.
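The futures-with-explicit-dependencies idea described above can be sketched in a few lines. The example below uses Python's concurrent.futures rather than the C++ API the talk presents (HPX exposes analogous facilities such as hpx::async and hpx::future); it only illustrates how futures make data dependencies explicit so independent work can overlap:

```python
from concurrent.futures import ThreadPoolExecutor

def build_pipeline(pool):
    # Two independent tasks start asynchronously...
    a = pool.submit(lambda: 2 + 3)
    b = pool.submit(lambda: 4 * 5)
    # ...and the dependent task makes its inputs explicit by calling
    # .result() on exactly the futures it needs, nothing more.
    c = pool.submit(lambda: a.result() + b.result())
    return c

with ThreadPoolExecutor(max_workers=4) as pool:
    print(build_pipeline(pool).result())  # -> 25
```

Because c touches a and b only through their futures, a scheduler is free to run the two independent tasks concurrently and to start c as soon as both inputs are ready.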


Keynote

Towards Performance Tools for Emerging GPU-Accelerated Exascale Supercomputers

John Mellor-Crummey, Rice University

To tune applications for emerging exascale supercomputers, application developers need tools that measure the performance of applications on GPU-accelerated platforms and attribute application performance back to program source code. This talk will describe work in progress developing extensions to Rice University's HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications on current supercomputers based on NVIDIA GPUs and forthcoming exascale systems based on AMD and Intel GPUs. At present, HPCToolkit's support for NVIDIA's GPUs is the most mature. To help developers understand the performance of accelerated applications as a whole, HPCToolkit's measurement and analysis tools attribute metrics to calling contexts that span both CPUs and GPUs. HPCToolkit measures both profiles and traces of GPU execution. To measure GPU-accelerated applications efficiently, HPCToolkit employs novel wait-free data structures to coordinate monitoring and attribution of GPU performance metrics. To help developers understand the performance of complex GPU code generated from high-level template-based programming models, HPCToolkit's hpcprof constructs sophisticated approximations of call path profiles for GPU computations. To support fine-grain analysis and tuning, HPCToolkit uses platform-dependent hardware and software measurement capabilities to attribute GPU performance metrics to source lines and loops. We illustrate HPCToolkit's emerging capabilities for analyzing GPU-accelerated applications with several case studies.


Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems

Jaemin Choi, University of Illinois Urbana-Champaign

The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware are becoming increasingly important in achieving high performance. Overdecomposition, which enables a logical decomposition of the problem domain without being constrained by the number of processors, has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems, especially given the perceived overheads associated with smaller kernels that overdecomposition entails. In this work, we address the challenges in applying overdecomposition to GPU-accelerated applications and ensuring asynchronous progress of GPU operations using the Charm++ parallel programming system. Combining prioritization of communication in the application and support for asynchronous progress in the runtime system, we obtain improvements in overall performance of up to 50% and 47% with proxy applications Jacobi3D and MiniMD, respectively.
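As a rough illustration of why overdecomposition buys overlap, the toy sketch below splits the work into many more chunks than workers, so one chunk's communication latency hides behind another chunk's computation. This is plain Python threads with a simulated 10 ms "halo exchange", not the Charm++ mechanisms the talk describes:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk_id):
    """One overdecomposed piece of the domain: a stand-in communication
    latency followed by a stand-in compute kernel."""
    time.sleep(0.01)              # pretend halo exchange
    return chunk_id * chunk_id    # pretend compute

def run(num_chunks, workers=4):
    # Overdecomposition: many more chunks than workers. While one chunk
    # is stalled in "communication", another chunk's work proceeds, so
    # the per-chunk latencies overlap instead of serializing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, range(num_chunks)))

start = time.perf_counter()
result = run(16)
print(result, "in %.0f ms" % (1e3 * (time.perf_counter() - start)))
```

Serialized, 16 chunks of 10 ms latency each would cost about 160 ms; with four workers the latencies overlap and the run takes roughly a quarter of that.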


Vector Load Balancing in Charm++

Ronak Buch, University of Illinois Urbana-Champaign

Load balancing has long been a fundamental tenet of Charm++, but it has been limited by the amount of data available to it. Recently, we have added support for vector load balancing, a scheme in which the runtime system and/or the user record multiple load metrics (e.g. recording separate, orthogonal phases of execution or CPU and GPU load). This enables the load balancing system to make more precise adjustments of objects and improve the quality of its placement decisions. Doing so has required an overhaul of the load balancing infrastructure, APIs, and strategies in Charm++, which we will discuss.
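A minimal sketch of the idea, assuming a hypothetical greedy strategy rather than any of Charm++'s actual load balancers: each object carries a vector of load metrics, and placement tries to keep the worst per-dimension load small instead of balancing a single scalar.

```python
def vector_greedy_balance(objects, num_pes):
    """Place objects carrying vector loads (e.g. (cpu, gpu) or per-phase
    times) so the worst single metric on any PE stays small.
    A hypothetical greedy heuristic for illustration only."""
    dims = len(objects[0])
    pe_load = [[0.0] * dims for _ in range(num_pes)]
    placement = []
    # Heaviest objects first, judged by their largest metric component.
    for idx, obj in sorted(enumerate(objects), key=lambda t: -max(t[1])):
        # Choose the PE whose worst metric after placement is smallest.
        best = min(range(num_pes),
                   key=lambda p: max(l + o for l, o in zip(pe_load[p], obj)))
        for d in range(dims):
            pe_load[best][d] += obj[d]
        placement.append((idx, best))
    return placement, pe_load

# Four objects with (cpu, gpu) loads, distributed across two PEs.
placement, loads = vector_greedy_balance([(4, 1), (1, 4), (3, 3), (2, 2)], 2)
print(placement, loads)
```

A scalar balancer that summed cpu+gpu would treat (4, 1) and (1, 4) as identical, even though co-locating them balances both resources while separating two (4, 1) objects would not.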


Moving-mesh Hydrodynamics in ChaNGa

Philip Chang, University of Wisconsin-Milwaukee
Zachariah Etienne, West Virginia University

Moving-mesh hydrodynamics has changed the face of numerical simulation of galaxies and structure formation over the last 10 years. In this talk, I discuss the design and construction of moving-mesh hydrodynamic algorithms as applied to dynamical stellar problems. In particular, I will briefly describe the construction of Voronoi meshes and the application of unstructured moving-meshes to solving flux-conservative equations, as well as the challenges to strong scalability. I will also very briefly describe recent results obtained by my group using the MANGA code, the moving-mesh hydrodynamics solver for ChaNGa. I will discuss results obtained for common-envelope evolution and tidal disruption events. I will close with a brief discussion of the extension of these moving-mesh algorithms toward general relativistic hydrodynamics with an eye toward moving-mesh NS-NS and NS-BH simulations.


Projector Augmented Wave based Kohn-Sham Density Functional Theory Simulations with Reduced Order Scaling

Glenn Martyna, Pimpernel Science, Software and Information Technology

Currently, our massively parallel implementation of Kohn-Sham Density Functional Theory (KS-DFT) implemented in the OpenAtom software under Charm++ employs norm-conserving non-local pseudopotentials to describe the electron-ion interactions. While sufficient to examine many interesting systems of technological and scientific interest, it cannot address the electronic properties of heavy elements such as iron and copper efficiently due to the requirement of a large basis set or equivalently a large plane-wave cutoff to describe the electronic states. The Projector Augmented Wave approach (PAW) in contrast has many important features such as: (1) less demanding plane-wave cutoff energy due to the splitting of delocalized and core parts of the electron wavefunctions and (2) direct access to the core-states if needed for NMR and other related responses. However, traditional PAW based KS-DFT scales as ~N^3 and the parallel performance is quite poor. To solve this problem, we developed a PAW based KS-DFT in OpenAtom with ~N^2 log N scaling enabled by Euler Exponential Spline (EES) and FFTs to form a four-grid multi-resolution method. We increase the accuracy of the long-range interactions by applying Ewald summation methods to the Hartree and external interactions. Our development inherits all the merits of the PAW method with significantly higher efficiency and builds naturally on our previous Charm++ implementation.


Asynchronous Distributed-Memory Task-Parallel Algorithm for Compressible Flows on 3D Unstructured Grids

Jozsef Bakosi, Los Alamos National Laboratory
Marc Charest, Los Alamos National Laboratory
Aditya Pandare, Los Alamos National Laboratory
Jacob Waltz, Los Alamos National Laboratory

We discuss the implementation of a finite element method, used to numerically solve the Euler equations of compressible flows, using an asynchronous runtime system (RTS). The algorithm is implemented for distributed-memory machines, using unstructured 3D meshes, combining data- and task-parallelism on top of the Charm++ RTS. Charm++'s execution model is asynchronous by default, allowing arbitrary overlap of computation and communication. Task-parallelism allows scheduling parts of an algorithm independently of, or dependent on, each other. Built-in automatic load balancing enables continuous redistribution of computational load by migration of work units based on real-time CPU load measurement. The RTS also features automatic checkpointing, fault tolerance, resilience against hardware failures, and supports power- and energy-aware computation. We demonstrate scalability up to O(10^9) cells at O(10^4) compute cores and the benefits of automatic load balancing for irregular workloads. The full source code with documentation is available at https://quinoacomputing.org.


Asynchronous Distributed-Memory Task-Parallel 3D Unstructured Mesh-to-Mesh Data Transfer

Eric Mikida, Charmworks, Inc.
Nitin Bhat, Charmworks, Inc.
Eric Bohm, Charmworks, Inc.
Laxmikant Kale, Charmworks, Inc.
Jozsef Bakosi, Los Alamos National Laboratory

We are developing a distributed-memory-parallel adaptive multiphysics simulation software tool, co-designing physics solvers and advanced load balancing techniques. Our target application is fluid-structure interaction of large engineering problems, O(10^7) mesh cells, for O(10^4) CPUs. We will use overset unstructured mesh methods to compute structural mechanics of solids interacting with compressible fluids within complex 3D geometries. To couple fluid and solid mechanics solvers, we are implementing a two-way data transfer algorithm on top of the Charm++ asynchronous tasking runtime system (https://www.hpccharm.com). We will report on the initial implementation and scalability of a mesh-to-mesh solution data transfer algorithm that will be used as a library, coupling physics solvers in Quinoa (https://quinoacomputing.org), developed at Los Alamos National Laboratory.


Scalable GW software for excited electrons using OpenAtom

Kavitha Chandrasekar, University of Illinois Urbana-Champaign
Kayahan Saritas, Yale University

OpenAtom is an open-source, massively parallel software application that performs ab initio molecular dynamics simulations and ground- and excited-state calculations utilizing a planewave basis set, and relies on the Charm++ runtime system. We describe the status of the excited-state GW implementation in OpenAtom: the GW method is an accurate but computationally expensive method for describing the dynamics of excited electrons in solid-state systems. We briefly describe the current O(N^4) scaling GW method implemented in the public version of OpenAtom (where N is the number of atoms in the simulation cell). We then present our progress in implementing an O(N^3) scaling GW method in OpenAtom. In addition to the formalism and physical principles, our parallelization method and scaling results will be presented.


Advances in Charm++-based Languages

Justin Szaday, University of Illinois Urbana-Champaign

Charj, a general-purpose language for parallel computing based on Charm++, has had many advances since it was last discussed at the Charm++ workshop. Its most recent iteration of the language has a Scala-like syntax, and boasts a variety of features supported by static analysis. Included among these features are embedded domain-specific languages for orchestration (i.e. Charisma) and the Multiphase Shared Arrays (MSA) Charm++ library. This presentation will provide an overview of these topics, and our language development plans going forward.


Enzo-E/Cello Computational Astrophysics and Cosmology

James Bordner, University of California, San Diego
Michael Norman, University of California, San Diego

Enzo-E is an adaptive mesh refinement (AMR) astrophysics and cosmology application, and Cello is the AMR framework on which Enzo-E is built. Enzo-E supports an increasingly rich collection of physics kernels, including hydrodynamics, self-gravity, cosmological expansion, chemistry and cooling, and magnetohydrodynamics. Enzo-E's physics kernels are implemented on top of Cello's fully-distributed array-of-octree AMR hierarchy, which is in turn represented using a Charm++ chare array of blocks. We are currently in the final stages of implementing and testing the last functionality required to run large-scale cosmological structure formation simulations that include both dark matter and baryonic matter. These latest features include scalable gravity, robust ghost-zone refresh, flux-correction, and improved interpolation. After the last of these is completed, Enzo-E/Cello will begin its production phase, running scientifically viable astrophysics and cosmology simulations on some of the largest HPC platforms available today.

Design and Implementation Techniques for an MPI-Oriented AMT Runtime

Jonathan Lifflander, Sandia National Labs
Phil Miller, Intense Computing
Nicole Lemaster Slattengren, Sandia National Labs
Nicolas Morales, Sandia National Labs
Paul Stickney, NexGen Analytics, Inc.
Philippe P. Pebay, NexGen Analytics, Inc.

We present the execution model of Virtual Transport (VT), a new Asynchronous Many-Task (AMT) runtime system that provides unprecedented integration and interoperability with MPI. We have developed VT in conjunction with large production applications to provide a highly incremental, high-value path to AMT adoption in the dominant ecosystem of MPI applications, libraries, and developers. Our aim is that the 'MPI+X' model of hybrid parallelism can smoothly extend to become 'MPI+VT+X'. We illustrate a set of design and implementation techniques that have been useful in building VT. We believe that these ideas and the code embodying them will be useful to others building similar systems, and perhaps provide insight into how MPI might evolve to better support them. We motivate our approach with two applications that are adopting VT and have begun to benefit from increased asynchrony and dynamic load balancing.


Advances in VT's load balancing infrastructure and algorithms

Jonathan Lifflander, Sandia National Labs
Phil Miller, Intense Computing
Philippe Pebay, NexGen Analytics

We present major changes to our load balancing runtime infrastructure to make load balancing algorithms more adaptable/customizable for applications of interest. We discuss the design and implementation of load models as a mechanism for applications to provide more control over how raw instrumented data is presented to the runtime system and used to redistribute workloads.


Analyzing Call Graphs using Hatchet

Abhinav Bhatele, University of Maryland

Performance analysis is critical for eliminating scalability bottlenecks in parallel codes. There are many profiling tools that can instrument codes and gather performance data. However, analytics and visualization tools that are general, easy to use, and programmable are limited. In this talk, we focus on the analytics of structured profiling data, such as that obtained from calling context trees or nested region timers in code. We present a set of techniques and operations that build on the pandas data analysis library to enable analysis of parallel profiles. We have implemented these techniques in a Python-based library called Hatchet that allows structured data to be filtered, aggregated, and pruned. Using performance datasets obtained from profiling parallel codes, we demonstrate performing common performance analysis tasks reproducibly with a few lines of Hatchet code. Hatchet brings the power of modern data science tools to bear on performance analysis.
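The filter and aggregate operations described above can be pictured with a small stand-in for a calling context tree. This is illustrative plain Python, not Hatchet's real GraphFrame/pandas API; the node names and helper functions are invented for the example:

```python
class Node:
    """One frame in a calling context tree, with an exclusive time metric."""
    def __init__(self, name, time, children=()):
        self.name, self.time, self.children = name, time, list(children)

def walk(node):
    # Pre-order traversal of the tree.
    yield node
    for c in node.children:
        yield from walk(c)

def filter_tree(root, predicate):
    """Keep only the nodes matching the predicate (a filter operation)."""
    return [n for n in walk(root) if predicate(n)]

def inclusive_time(node):
    """Aggregate a subtree's exclusive times into an inclusive metric."""
    return node.time + sum(inclusive_time(c) for c in node.children)

cct = Node("main", 1.0, [
    Node("solve", 5.0, [Node("mpi_allreduce", 3.0)]),
    Node("io", 2.0),
])
hot = filter_tree(cct, lambda n: n.time >= 2.0)
print([n.name for n in hot])   # hot spots by exclusive time
print(inclusive_time(cct))     # -> 11.0
```

Hatchet's contribution is doing exactly these kinds of structured filter, prune, and aggregate operations over real profiles, with pandas supplying the tabular machinery.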


SpECTRE: Toward Simulations of Binary Black Hole Mergers Using Charm++

Francois Hebert, Caltech

Recent groundbreaking observations have led to a new era for the study of black holes and neutron stars. The Advanced LIGO and Virgo gravitational-wave observatories have measured gravitational waves from merging pairs of black holes and neutron stars, proving that these systems exist in nature and enabling us to study them in detail. The Event Horizon Telescope has directly imaged the accretion flows around the supermassive black hole at the center of the galaxy M87, giving us a new look into these extreme environments. To interpret these observations, however, we rely on simulations to bridge the gap between the highly non-linear governing physics equations and the signals seen in our instruments. Improving the accuracy of simulation methods, as well as their parallelization on large future supercomputers, will be a key step toward improving the science output from future observations.

SpECTRE is a next-generation multi-physics code implemented using Charm++, designed to improve on the accuracy and scaling limitations of current codes in relativistic astrophysics. SpECTRE solves the governing equations (general relativistic magnetohydrodynamics for the matter and/or Einstein's equations for dynamical spacetimes) using a discontinuous Galerkin (DG) method. The DG method provides the accuracy of high-order spectral-type methods where the solution is smooth, with various robust shock-capturing methods available near discontinuities (e.g., shocks in the matter). SpECTRE uses task-based parallelism implemented using Charm++, and achieves excellent scaling to hundreds of thousands of processors.

In this talk, I will provide an overview of SpECTRE and the algorithms we've implemented. I will give an update on our progress toward simulating binary black hole mergers, highlighting a simulation of the initial "inspiral" phase of the evolution before the black holes merge. I will present our initial investigations into load-balancing and checkpoint-restarting using Charm++'s built-in features, highlighting successes and challenges we've faced and features we would like to see in Charm++. Finally, I will outline future development plans and science goals.


Efforts to Bridge Theory and Practice on Distributed Scheduling Algorithms

Laercio L. Pilla, Laboratoire de Recherche en Informatique, Univ. Paris-Sud - CNRS
Johanne Cohen, Laboratoire de Recherche en Informatique, Univ. Paris-Sud - CNRS

In this talk, we discuss some of our recent work on the subject of distributed scheduling algorithms for high-performance computing applications. We have found that distributed scheduling algorithms are not very commonly used in contemporary runtime systems, whereas many more scheduling algorithms are found in the literature. These distributed scheduling algorithms are usually based on simple models using little to no information, which leads to scenarios far from what we are able to achieve in our runtime systems. We present some of our initial steps on the adaptation of a distributed scheduling algorithm to use basic global information (the average resource load) and we detail simulation experiments that point to the next steps. We finish this talk by discussing the importance of reducing the need or requirements for experimental evaluations on real high-performance computing platforms.
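A toy, centralized rendering of the average-load idea (an illustration of the principle, not the authors' distributed algorithm): each overloaded processor sheds only its surplus relative to the global average, and underloaded processors absorb exactly their deficit.

```python
def rebalance_with_average(loads):
    """Move work from processors above the global average load to those
    below it. In a distributed scheme only the average would be shared
    globally; here everything is centralized for clarity."""
    avg = sum(loads) / len(loads)
    new = loads[:]
    transfers = []
    donors = [i for i, l in enumerate(loads) if l > avg]
    takers = [i for i, l in enumerate(loads) if l < avg]
    for d in donors:
        give = new[d] - avg            # shed only the surplus
        for t in takers:
            if give <= 0:
                break
            room = avg - new[t]        # absorb only up to the average
            if room <= 0:
                continue
            amt = min(give, room)
            new[d] -= amt
            new[t] += amt
            give -= amt
            transfers.append((d, t, amt))
    return new, transfers

balanced, moves = rebalance_with_average([8.0, 2.0, 6.0, 4.0])
print(balanced)   # every processor ends at the average load of 5.0
```

Knowing even this one global quantity prevents the pathologies of purely local schemes, where work can slosh between neighbors without ever converging toward the global optimum.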


Bounded Asynchrony and Nested Parallelism for Scalable Graph Processing

Lawrence Rauchwerger, University of Illinois Urbana-Champaign

In this work we develop two broad techniques for improving the performance of graph traversals and general parallel algorithms. 1. Bounded asynchrony: increasing asynchrony in a bounded manner allows one to avoid costly global synchronization at scale, while still avoiding the penalties of unbounded asynchrony, such as redundant work. In addition, asynchronous processing enables a new family of approximate algorithms when applications can tolerate a fixed amount of error. 2. Nested parallelism: expressing graph algorithms in a naturally nested parallel manner enables us to fully exploit all of the available parallelism inherent in graph algorithms.
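One way to picture bounded asynchrony is a label-propagation sketch in which asynchronous updates are capped before resynchronizing, so labels can be stale, but only boundedly so. This is a hypothetical single-process illustration, not the authors' algorithm:

```python
def bounded_async_components(edges, n, bound=4):
    """Connected components by min-label propagation. Updates are applied
    asynchronously (each edge sees whatever labels are current), but after
    at most `bound` updates the sweep stops and restarts, limiting how far
    the computation can run ahead on stale data."""
    label = list(range(n))
    changed = True
    while changed:
        changed = False
        steps = 0
        for u, v in edges:              # asynchronous sweep...
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
                steps += 1
                if steps >= bound:      # ...bounded: resynchronize
                    break
        # (a distributed version would exchange boundary labels here)
    return label

print(bounded_async_components([(0, 1), (1, 2), (3, 4)], 5))
```

With bound set to infinity this degenerates to fully chaotic relaxation; with bound set to one it degenerates to a barrier after every update. The interesting regime is in between.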


Hurricane Storm Surge Analysis and Prediction Using ADCIRC-CG and AMPI

Eric Bohm, University of Illinois Urbana-Champaign
Samuel White, University of Illinois Urbana-Champaign
Joannes Westerink, Notre Dame University
Justin Szaday, University of Illinois Urbana-Champaign
Damrongsak Wirasaet, Notre Dame University
Dylan Wood, Notre Dame University

Storm-driven coastal flooding is influenced by many physical processes including riverine discharges, regional rainfall, wind, atmospheric pressure, wave-induced setup, wave runup, tides, and fluctuating baseline ocean water levels. Operational storm surge models such as NOAA’s ESTOFS incorporate a variety of these processes including riverine discharges, atmospheric winds and pressure, waves, and tides. However, coastal surge models do not typically incorporate the impact of rainfall across the coastal flood-plain nor fluctuations in background water levels due to the oceanic density structure. Efficient simulation of these phenomena is challenging due to the dynamic load imbalance associated with fluctuating water levels. This talk will review the science of accurately simulating these scenarios and the initial progress that has been made in improving simulation efficiency by refactoring the ADCIRC-CG code to use Adaptive MPI (AMPI).


Recent Progress on Adaptive MPI

Samuel White, University of Illinois Urbana-Champaign
Evan Ramos, Charmworks, Inc.

Adaptive MPI is a full-fledged MPI implementation on top of Charm++. As such, it provides the application programming interface of MPI with the dynamic runtime features of Charm++. In this talk, we provide a quick overview of AMPI and its features before discussing two directions of recent work on its implementation: 1) collective communication optimizations taking advantage of shared memory and new Charm++ semantics, and 2) automatic global variable privatization methods to enable running legacy applications on AMPI.
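The privatization problem can be illustrated with a small sketch. AMPI's real mechanisms transform compiled globals (for example into thread-local storage), whereas this Python stand-in just routes "global" accesses through a per-virtual-rank store; all names here are invented for the example:

```python
# Legacy MPI codes often keep per-rank state in global variables. When
# several virtual ranks share one OS process, as in AMPI, each rank needs
# its own private copy of every global, or ranks silently clobber each
# other's state.
class PrivatizedGlobals:
    def __init__(self, num_vranks):
        self.store = [dict() for _ in range(num_vranks)]
        self.current = 0                 # which virtual rank is running

    def __getitem__(self, name):
        return self.store[self.current][name]

    def __setitem__(self, name, value):
        self.store[self.current][name] = value

g = PrivatizedGlobals(num_vranks=2)
for rank in (0, 1):
    g.current = rank
    g["iteration_count"] = 100 + rank    # each vrank writes "the" global
g.current = 0
print(g["iteration_count"])              # -> 100: rank 0's copy survives
```

Without the per-rank indirection, the second rank's write would overwrite the first rank's value, which is exactly the failure mode automatic privatization exists to prevent.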


Flexible Hierarchical Execution of Parallel Task Loops

Michael Robson, Villanova University
Kavitha Chandrasekar, University of Illinois Urbana-Champaign

We demonstrate the effectiveness of combining the techniques of overdecomposition and work-sharing to improve application performance. Our key insight is that tuning these two parameters in combination optimizes performance by facilitating the trade-off between communication overlap and overhead. We explore this new space of potential optimization by varying both the problem size decomposition (grain size) and the number of cores assigned to execute a particular task (spreading). Utilizing these two variables in concert, we can shape the execution timeline of applications in order to more smoothly inject messages on the network, improve cache performance, and decrease the overall execution time of an application. As single-node performance continues to outpace network bandwidth, ensuring smooth and continuous injection of messages into the network will continue to be of crucial importance. Our preliminary results demonstrate a greater than two-fold improvement in performance over a naive OpenMP-only baseline and a thirty percent speedup over the previously best performing implementation of the same code.


Improving NAMD Performance and Scaling on Heterogeneous Architectures

David Hardy, University of Illinois Urbana-Champaign
Julio Maia, University of Illinois Urbana-Champaign

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Given the diverse features and methodologies available in NAMD, it is considered in the field to be a major biomolecular simulation engine, used by around 20,000 users, with many running the code at various supercomputing facilities, including DOE and XSEDE resources. NAMD is one of the premier Charm++ applications, and has been developed in conjunction with Charm++ since the mid-90s. NAMD is also the first full-featured molecular dynamics code to adopt CUDA acceleration for scaling on NVIDIA GPUs. Development efforts in both NAMD and Charm++ have enabled scaling on heterogeneous architectures, most notably on the ORNL Summit supercomputer. In spite of the efforts taken to improve performance and scaling of NAMD on Summit by offloading the force calculations to the GPU, performance is limited by the remaining CPU calculations required for time stepping. The overlap between CPU and GPU calculations that used to benefit NAMD on earlier CUDA-capable GPUs became imbalanced as subsequent generation GPU performance outpaced that of CPUs, making NAMD performance CPU-bound, an effect that is exacerbated by multi-GPU nodes with fewer CPU cores dedicated to each GPU. Development efforts over the past two years have improved NAMD performance on GPUs by moving the remaining CPU work for each time step to the GPU while reducing host-device memory transfers, changing NAMD from employing a GPU-offload strategy to a GPU-resident one, in which data is maintained on the GPU across time steps. The result has been released as NAMD version 3.0, featuring a fast single-GPU code path that doubles performance by more fully utilizing the GPU with very little calculation remaining on the CPU.
This new simulation modality can be used together with the existing multiple copy framework to scale smaller system sizes (under 1M atoms) across all available GPUs in a parallel computer to achieve very large amounts of aggregate sampling. Recent work, not yet available as production-quality code, has succeeded in extending the single-GPU implementation to scale to multiple GPUs on a single node. There are important challenges to overcome in order to realize a scalable GPU-resident version of NAMD. The PME (particle mesh Ewald) algorithm for approximating long-range electrostatics poses an immediate issue for multi-GPU scaling. The 3D FFTs are too small to parallelize effectively across multiple GPUs, either through the use of direct multi-GPU support in cuFFT or the manual assignment of pencils to GPUs along each dimension. The best alternative for supporting PME appears to be assigning the FFTs together with the reciprocal space calculation to a single GPU, which requires sending all grid data to a designated GPU while taking care to not overload it. A better approach appears to be replacing PME with the MSM (multilevel summation method), an alternative algorithm yielding a similar approximation. The hierarchical nature of the MSM calculation produces a scalable tree-like communication structure with nearest-neighbor communication at each level, much like the well-known FMM (fast multipole method). New theoretical development reveals that the MSM can achieve the same order of accuracy as PME through the use of periodic B-spline basis functions, while relegating the 3D FFTs and reciprocal space calculation to the coarsest grid level, thus allowing the FFT to be made as small as desired. Another important challenge is how to reintroduce Charm++ effectively into a GPU-resident NAMD for multi-node scaling. The current single-GPU implementation side-steps NAMD's current Charm++ infrastructure, avoiding extra latency by putting unused user-level threads to sleep.
It might be possible to employ the new Charm++ GPU direct messaging API to minimize CPU involvement.
