Workshop Program
May 1st (Wed)
Time | Type | Title | Speaker | Affiliation | Slides | Video |
---|---|---|---|---|---|---|
08:00 - 09:00 | Continental Breakfast | |||||
Opening Session | Chair: Laxmikant V. Kale | |||||
09:00 - 09:20 | Opening Remarks | | Laxmikant V. Kale | University of Illinois Urbana-Champaign | Link | |
09:20 - 10:20 | Keynote | Molecular Dynamics: Looking Ahead to Exascale | Steve Plimpton | Sandia National Laboratories | Link | Link |
10:20 - 10:40 | Break | |||||
Session: Applications I | Chair: Michael Robson | |||||
10:40 - 11:10 | Talk | Experiences with Charm++ and NAMD on the Summit POWER9/Volta Supercomputer | James Phillips | National Center for Supercomputing Applications | Link | Link |
11:10 - 11:40 | Talk | SpECTRE: Towards Improved Simulations of Relativistic Astrophysical Systems | Nils Deppe | Cornell University | Link | Link |
11:40 - 12:10 | Talk | Enzo-E / Cello: Adaptive Mesh Refinement Astrophysics using Charm++ | James Bordner | San Diego Supercomputer Center | Link | Link |
12:10 - 13:10 | Lunch | |||||
Session: Heterogeneous Computing | Chair: Ronak Buch | |||||
13:10 - 13:40 | Talk | Xilinx's Adaptable FPGA Acceleration Platforms for HPC Applications | Viraj R. Paropkari | Xilinx Inc. | ||
13:40 - 14:00 | Talk | Charm++ and the Future of GPUs | Michael Robson | University of Illinois Urbana-Champaign | Link | Link |
14:00 - 14:20 | Talk | Distributed Deep Learning: Leveraging Heterogeneity and Data-Parallelism | Jaemin Choi | University of Illinois Urbana-Champaign | Link | Link |
14:20 - 14:40 | Break | |||||
Session: Frameworks & Communication | Chair: Sam White | |||||
14:40 - 15:10 | Talk | FleCSI: Compile-time Configurable Framework for Multi-Physics Applications | Li-Ta Lo | Los Alamos National Laboratory | Link | Link |
15:10 - 15:40 | Talk | Accelerating Large Messages by Avoiding Memory Operations in Charm++ | Nitin Bhat | Charmworks, Inc. | Link | Link |
15:40 - 16:10 | Talk | Charm4Py: Parallel Programming with Python and Charm++ | Juan Galvez | University of Illinois Urbana-Champaign | Link | Link |
16:10 - 16:30 | Break | |||||
Session: Adaptive MPI | Chair: Matthias Diener | |||||
16:30 - 17:00 | Talk | AMPI for LAMMPS USER-MESO | Yidong Xia, Jiaoyan Li | Idaho National Laboratory | Link | Link |
17:00 - 17:30 | Talk | Recent Developments in Adaptive MPI | Sam White, Evan Ramos | University of Illinois Urbana-Champaign, Charmworks, Inc. | Link | Link |
17:30 - 18:00 | Discussion | Charm++ Release 6.10.0 | Eric Bohm | Charmworks, Inc. | Link | |
18:30 - 20:45 | Banquet | Siebel Center 2nd floor Atrium | | | | |
Workshop Program
May 2nd (Thu)
Time | Type | Title | Speaker | Affiliation | Slides | Video |
---|---|---|---|---|---|---|
08:00 - 09:00 | Continental Breakfast | |||||
Opening Session | Chair: Laxmikant V. Kale | |||||
09:00 - 10:00 | Keynote | Machine Learning and Predictive Simulation: HPC and the U.S. Cancer Moonshot on Sierra | Fred Streitz | Lawrence Livermore National Laboratory | Link | |
10:00 - 10:30 | Talk | Software Sustainability and Software Citation | Daniel S. Katz | National Center for Supercomputing Applications | Link | Link |
10:30 - 10:50 | Break | |||||
Session: Applications II | Chair: Eric Bohm | |||||
10:50 - 11:20 | Talk | Distributed Garbage Collection for General Graphs | Steven R. Brandt | Louisiana State University | Link | Link |
11:20 - 11:50 | Talk | OpenAtom: The GW Method for Electronic Excitations | Minjung Kim, Kavitha Chandrasekar | Yale University, University of Illinois Urbana-Champaign | Link | Link |
11:50 - 12:20 | Talk | Adaptive Discontinuous Galerkin Method for Compressible Flows Using Charm++ | Weizhao Li | North Carolina State University | Link | Link |
12:20 - 13:20 | Lunch | |||||
Session: Applications III | Chair: Jaemin Choi | |||||
13:20 - 13:35 | Talk | Adaptive Techniques for Scalable Optimistic Parallel Discrete Event Simulation | Eric Mikida | University of Illinois Urbana-Champaign | Link | Link |
13:35 - 13:50 | Talk | Dynamically Turning Cores Off for Performance and Energy in Charm++ | Kavitha Chandrasekar | University of Illinois Urbana-Champaign | Link | Link |
13:50 - 14:10 | Talk | From Cosmology to Planets: The ChaNGa N-body/Hydrodynamics Code | Tom Quinn | University of Washington | Link | Link |
14:10 - 14:25 | Talk | ParaTreeT: A Fast, General Framework for Tree Traversal | Joseph Hutter | University of Illinois Urbana-Champaign | Link | Link |
14:25 - 14:40 | Talk | Efficient GPU-only Local Tree Walks in ChaNGa | Milind Kulkarni | Purdue University | Link | Link |
14:40 - 14:50 | Break | |||||
Session: Runtimes & Load Balancing | Chair: Raghavendra Kanakagiri | |||||
14:50 - 15:10 | Talk | Optimizing a New DARMA Runtime for Load Balancing EMPIRE | Jonathan Lifflander | Sandia National Laboratories | Link | |
15:10 - 15:30 | Talk | Distributed Load Balancing Utilizing the Communication Graph | Philippe P. Pebay | NextGen Analytics | Link | |
15:30 - 15:50 | Talk | Recent Developments in Dynamic Load Balancing | Ronak Buch | University of Illinois Urbana-Champaign | Link | Link |
15:50 - 16:10 | Break | |||||
Session: Other Charm++ Features | Chair: Kavitha Chandrasekar | |||||
16:10 - 16:40 | Talk | The Effect of UCX Machine layer on Charm++ Simulations | Yong Qin | Mellanox Technologies | Link | Link |
16:40 - 17:00 | Talk | Improving Throughput of Fine-grained Messages with Aggregation | Venkatasubrahmanian Narayanan | University of Illinois Urbana-Champaign | Link | Link |
17:00 - 17:20 | Talk | Interoperability of Shared Memory Parallel Programming Models with Charm++ | Jaemin Choi | University of Illinois Urbana-Champaign | Link | Link |
17:20 - 17:40 | Closing Remarks | | Laxmikant V. Kale | University of Illinois Urbana-Champaign | Link | |
18:30 - 20:00 | Workshop Dinner | Venue: Jupiter's Pizzeria & Billiards, 39 E Main St, Champaign, IL 61820 | | | | |
Keynote
Molecular Dynamics: Looking Ahead to Exascale
Steve Plimpton, Sandia National Laboratories (SNL)
Coming exascale machines will provide greatly increased computational resources for all kinds of scientific computing and simulation. They will also be harder to use than current machines, both at the single-node and full-machine level. This is a challenge not just for applications, but also for creators of programming models and software frameworks, including the Charm++ community.
In the first portion of my talk, I'll discuss algorithmic work we've recently done to enable the hyperdynamics (HD) method to run in parallel in our molecular dynamics (MD) code LAMMPS. HD can greatly extend the timescale accessible to an MD simulation, at least for solids where relatively rare events trigger transitions between potential energy basins. I'll describe how HD works in parallel and also present some results for long timescale HD modeling of diffusion on a Pt(100) surface. Used in tandem with multiple-replica techniques, this could be an effective way to use an exascale machine to run small problems for really long timescales.
In the second portion, I'll broaden the scope and give some numbers and examples that illustrate the challenges exascale platforms pose for all kinds of computational modeling. Hopefully you will see ways to turn challenges into opportunities for new research.
Experiences with Charm++ and NAMD on the Summit POWER9/Volta Supercomputer
James Phillips, National Center for Supercomputing Applications (NCSA)
The highly parallel molecular dynamics code NAMD has long been used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64-million-atom model of the HIV virus capsid. In 2007 NAMD was one of the first codes to run on a GPU cluster, and it is now one of the first on the world's fastest GPU-accelerated supercomputer, ORNL Summit, which features IBM POWER9 CPUs, NVIDIA Volta GPUs, the NVLink CPU-GPU interconnect, and a dual-rail EDR InfiniBand inter-node network. This talk will cover the latest NAMD performance improvements and scaling results on Summit, with an emphasis on recent Charm++ features and optimizations of relevance to NAMD.
SpECTRE: Towards Improved Simulations of Relativistic Astrophysical Systems
Nils Deppe, Cornell University
We are now at the beginning of the exciting new era of multi-messenger astrophysics. The first binary neutron star merger event was detected on August 17, 2017 both in gravitational waves and in the electromagnetic spectrum, and we expect that the Event Horizon Telescope will allow for studying accretion onto black holes. In order to understand the observations that will be possible, we must develop more accurate models and computer simulations than are currently available. Spectral methods have proven themselves to be an invaluable tool for generating gravitational waveform models from numerical relativity simulations. However, these methods cannot be directly applied to hydrodynamics because spectral methods assume a smooth solution, i.e. no shocks. Discontinuous Galerkin methods promise to remedy this, behaving as high-order spectral-type methods where the solution is smooth, and robust shock-capturing methods at discontinuities. The equations of general relativistic magnetohydrodynamics (GRMHD) alone and coupled with the Einstein equations prove to be especially challenging to solve.
SpECTRE is a next-generation multiphysics code implemented using Charm++ and is designed to overcome the limitations of current codes used in relativistic astrophysics. How to most effectively solve the GRMHD equations is still a very active area of research. SpECTRE is designed to be extremely modular so that new algorithms can easily be added into the code and combined with existing algorithms. I will provide an overview of the simulations we are currently capable of running with SpECTRE, the algorithms we've implemented in order to highlight SpECTRE's flexibility, and the near- and long-term science goals.
Enzo-E / Cello: Adaptive Mesh Refinement Astrophysics using Charm++
James Bordner, San Diego Supercomputer Center (SDSC)
Cello is a highly-scalable "array-of-octree" adaptive mesh refinement (AMR) software framework, implemented using Charm++. Cello is being developed concurrently with the driving application Enzo-E (formerly Enzo-P), a highly scalable branch of the MPI-parallel astrophysics and cosmology application "ENZO". The Cello AMR framework provides its scientific application with mesh adaptivity, ghost cell refresh, generic distributed field and particle data types, and sequencing of user methods for computing on block data. Further advanced parallel software support, including data-driven execution, fully-distributed data structures, dynamic load-balancing, and checkpoint / restart, is provided by Charm++. We give an overview of the Enzo-E / Cello / Charm++ software layers, discuss the design and implementation of Enzo-E / Cello, and present parallel scaling results on Blue Waters.
Xilinx's Adaptable FPGA Acceleration Platforms for HPC Applications
Viraj R. Paropkari, Xilinx Inc.
Exponential growth of compute requirements in HPC applications is now driving the need for heterogeneous computing architectures, which rely on accelerators to deliver power-efficient scaling of compute performance. Further compounding the computing challenges is the dawn of AI and the explosion in the sheer amount of data that needs to be stored and processed. A new class of compute and storage acceleration platforms is needed to enable tomorrow's exascale supercomputers, and the compute and storage nodes require an intelligent network fabric to communicate with each other. These accelerators will need to be easy to deploy and manage, and highly adaptable to the ever-changing workloads within HPC centers. This talk will focus on Xilinx FPGA-based accelerator platforms for compute, storage, and networking, along with the software programming stack, with case studies in HPC applications.
Distributed Deep Learning: Leveraging Heterogeneity and Data-Parallelism
Jaemin Choi, University of Illinois Urbana-Champaign
With deep learning models becoming more complex and the volume of training data ever growing, training a deep learning model on a single device can take an intolerable amount of time. Distributed deep learning tackles this problem by distributing the training process across multiple compute devices. However, the current landscape of distributed deep learning is dominated by GPUs, due to their superior performance on tensor computations; CPUs have naturally taken a back seat, used only when GPUs are not available. But for areas such as natural language processing (NLP) that utilize sequence-to-sequence models, CPU performance does not fall too far behind, especially with the use of optimized libraries such as Intel's MKL. We aim to take advantage of this fact and accelerate training using both CPUs and GPUs, with the popular data-parallel approach to distributed deep learning. We evaluate two deep learning NLP applications: machine translation using Google's Transformer model, and image captioning with a pre-trained CNN as the image encoder and an LSTM as the decoder.
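As a concrete illustration of the data-parallel pattern the talk builds on, the sketch below averages per-worker gradients with MPI_Allreduce so that every worker applies the same update. This is a generic, minimal example (the model and backward pass are stand-ins), not the CPU/GPU training setup described in the talk.

```cpp
// Minimal sketch of synchronous data-parallel training: every rank computes
// gradients on its own data shard, the gradients are summed with
// MPI_Allreduce, and each rank applies the same averaged update.
#include <mpi.h>
#include <cstdlib>
#include <vector>

// Stand-in for a real model's backward pass on the local mini-batch.
static void compute_local_gradients(std::vector<double>& grad) {
    for (double& g : grad) g = std::rand() / static_cast<double>(RAND_MAX);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nparams = 1 << 20;                 // illustrative model size
    const double lr = 0.01;                      // learning rate
    std::vector<double> params(nparams, 0.0);
    std::vector<double> grad(nparams, 0.0);

    for (int step = 0; step < 10; ++step) {
        compute_local_gradients(grad);
        // Sum gradients across all workers in place.
        MPI_Allreduce(MPI_IN_PLACE, grad.data(), nparams,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        // Every rank performs the identical averaged update.
        for (int i = 0; i < nparams; ++i)
            params[i] -= lr * grad[i] / nranks;
    }

    MPI_Finalize();
    return 0;
}
```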
The Effect of UCX Machine layer on Charm++ Simulations
Yong Qin, Mellanox Technologies
From concept to engineering, and from design to test and manufacturing, engineers from a wide range of industries face the ever-increasing need for complex and realistic models to analyze the most challenging industrial problems; analysis is performed to secure quality and speed up the development process. Sophisticated programming models and software have been developed to tackle the need for computational simulations with superior robustness, speed, and accuracy. These simulations are designed to run effectively on large-scale High-Performance Computing (HPC) systems.

The new generation of InfiniBand In-Network Computing technology includes several elements, such as Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™, smart MPI hardware tag matching, rendezvous protocols, and more network offload mechanisms. These offload technologies are in use at several of the recently deployed large-scale supercomputers around the world, including the top TOP500 platforms.

Unified Communication X (UCX) is an open-source, production-grade communication framework for data-centric and high-performance applications. UCX is a collaboration between industry, laboratories, government (DoD, DoE), and academia that enables the highest performance through co-design of software-hardware interfaces.

Mellanox has implemented the UCX machine layer for Charm++ and conducted performance investigations, including low-level and application benchmarks, to evaluate its performance and scaling capabilities with the InfiniBand interconnect.

In this session we will present the test results and performance benefits of the UCX machine layer, as well as discuss the potential use cases of this component.
Accelerating Large Messages by Avoiding Memory Operations in Charm++
Nitin Bhat, Charmworks, Inc.
With memory performance not scaling at the rate of CPU performance, memory-bound operations are one of the biggest bottlenecks for application performance and scaling. The cost of memory allocation and copying increases drastically with message size. The Zero Copy API in Charm++ allows users to avoid additional allocations and copies by taking advantage of the RDMA capability of the network. In this talk, I will present the recent additions to the Zero Copy API for point-to-point and collective operations. I will discuss methods and use cases, and present results for the different flavors of the API, which can be used to reduce the memory footprint and improve the performance of applications.
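For readers unfamiliar with the API, the following is a minimal sketch of the zero copy entry method send style described in the Charm++ manual: the interface file marks a parameter as nocopy, and the sender wraps its buffer in CkSendBuffer (optionally with a callback that fires once the buffer may be reused). The chare and file names are hypothetical; this is not code from the talk.

```cpp
// Sketch of the Charm++ zero copy entry method send API. The .ci fragment
// below declares the receiving entry method with a "nocopy" parameter; the
// sender wraps its buffer in CkSendBuffer so the runtime can deliver it via
// RDMA rather than copying it into a message.
//
// receiver.ci (interface file fragment, hypothetical module name):
//   module receiver {
//     array [1D] Receiver {
//       entry Receiver();
//       entry void recv(int n, nocopy double data[n]);
//     };
//   };

#include "receiver.decl.h"   // generated from the .ci file above

class Receiver : public CBase_Receiver {
 public:
  Receiver() {}
  Receiver(CkMigrateMessage*) {}
  void recv(int n, double* data) {
    // On RDMA-capable networks 'data' arrives without intermediate copies.
    CkPrintf("[%d] received %d doubles, first = %f\n", thisIndex, n, data[0]);
  }
};

// Sender side: the callback fires once 'buf' is safe to modify or free.
void send_block(CProxy_Receiver receivers, double* buf, int n) {
  CkCallback done(CkCallback::ignore);
  receivers[0].recv(n, CkSendBuffer(buf, done));
}

#include "receiver.def.h"
```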
Charm4Py: Parallel Programming with Python and Charm++
Juan Galvez, University of Illinois Urbana-Champaign
Charm4py is a parallel/distributed programming framework for Python built on top of the Charm++ runtime, thus providing asynchronous remote method invocation, overdecomposition, dynamic load balancing, and overlap of computation and communication. Charm4py takes advantage of the dynamic language features of Python to offer a high-level, simple, and powerful API that simplifies writing, running, and debugging parallel applications. Futures are an optional but core feature of the framework; they allow taking advantage of the capabilities of the Charm++ programming model and runtime while using direct-style programming, the style most programmers are accustomed to. In this talk we will present new features of Charm4py, discuss how high-performance applications can be developed using this framework, and show how its performance compares to other popular Python frameworks like Dask.
AMPI for LAMMPS USER-MESO
Yidong Xia, Jiaoyan Li, Idaho National Laboratory (INL)
Particle-based nano- to micro-scale fluid flow and transport models and high-performance computing (HPC) techniques are deployed to study source rocks such as shale at Idaho National Laboratory (INL). The dynamic processes of fluid flow are usually inhomogeneous, in the sense that the distribution of particles as well as the computation of interparticle potentials can exhibit high spatial and temporal variability. On HPC clusters, these types of inhomogeneity usually lead to severe load imbalance across processors and consequently poor scalability, making it extremely difficult for most current codes to reach the required ranges of spatial and temporal scales.

The core capability of Charm++, automatic load balancing based on over-decomposition and smart rank scheduling, is especially attractive for particle flow and transport models in general. However, a "good" Charm++ implementation of these models is not trivial, especially when various engineered processes are considered in programming, and when sophisticated boundary conditions need to be implemented by strictly following the Charm++ paradigm.

In this work, we present the latest progress in the implementation of Adaptive Message Passing Interface (AMPI) support for the LAMMPS-based USER-MESO particle flow simulation package. First, the mesoscale particle model implemented in LAMMPS USER-MESO, dissipative particle dynamics (DPD), will be briefly introduced, and examples of the scientific and engineering applications of DPD at INL will be overviewed. The existing load imbalance problem in the simulation of an engineered shale oil recovery process, in which fluid is injected into high-resolution, realistic nanoporous shale pore networks, will be highlighted. Then, the computing performance will be carefully evaluated for fluid flow simulations run with and without AMPI support. Finally, we will demonstrate the possibility of cooperative work between AMPI and the native load balancer in LAMMPS, recursive coordinate bisectioning (RCB).
Recent Developments in Adaptive MPI
Sam White, University of Illinois Urbana-Champaign / Evan Ramos, Charmworks, Inc.
Adaptive MPI is an MPI library implemented on top of Charm++. As such, it provides the application programming interface of MPI with the dynamic runtime features of Charm++. In this talk, we provide a quick overview of AMPI and its features before discussing two directions of recent work on its implementation: 1) communication optimizations for taking advantage of communication locality and for collective routines, and 2) automatic global variable privatization methods to enable running legacy applications on AMPI.
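Because AMPI implements the standard MPI interface, an ordinary MPI program like the one below can run unchanged on AMPI, with many virtual ranks multiplexed onto a few cores. The build and launch commands in the trailing comment follow the AMPI documentation as I understand it (ampicxx, charmrun, +vp); exact wrapper names may vary by installation.

```cpp
// An ordinary MPI program; under AMPI each rank becomes a migratable
// user-level thread, so many virtual ranks can share a few physical cores
// and be load balanced by the Charm++ runtime.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("virtual rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

// Assumed build/launch commands (per the AMPI documentation):
//   ampicxx hello.cpp -o hello
//   ./charmrun +p4 ./hello +vp16   # 16 virtual MPI ranks on 4 cores
```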
Keynote
Machine Learning and Predictive Simulation: HPC and the U.S. Cancer Moonshot on Sierra
Fred Streitz, Lawrence Livermore National Laboratory (LLNL)
The marriage of experimental science with simulation has been a fruitful one: the fusion of HPC-based simulation and experimentation moves science forward faster than either discipline alone, rapidly testing hypotheses and identifying promising directions for future research. The emergence of machine learning at scale promises to bring a new type of thinking into the mix, incorporating data analytics techniques alongside traditional HPC to accompany experiment. Such explorations can develop highly complex workflows that benefit greatly from heterogeneous computing. I will discuss one such complex workflow: a multi-scale investigation of Ras biology on a realistic membrane that makes effective use of the Sierra supercomputer at LLNL.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Software Sustainability and Software Citation
Daniel S. Katz, National Center for Supercomputing Applications (NCSA)
Software sustainability means different things to different groups of people, including the persistence of working software, the persistence of people, or funding. While we can generally define sustainability as having an inflow of resources sufficient to do the needed work, where those resources both include and are somewhat transferrable into human effort, users, funders, managers, and developers (or maintainers) all mean somewhat different things when they use "sustainable" in the context of software. This talk will illustrate some of these different views and their corresponding aims. It will also provide some guidance on quantifying software sustainability from some of these views. In particular, one of the methods that can be used to bring resources to a project is to provide incentives (often non-financial) for contributing to it. In academia, a key incentive is the potential for one's work to be cited, and this talk will also discuss how software is cited today and how we might use software citations to increase sustainability.
Distributed Garbage Collection for General Graphs
Steven R. Brandt, Louisiana State University
We propose a scalable, cycle-collecting, decentralized, reference-counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O(log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O(Ea) time, where Ea is the number of edges in the induced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.
OpenAtom: The GW Method for Electronic Excitations
Minjung Kim, Yale University
Kavitha Chandrasekar, University of Illinois Urbana-Champaign
OpenAtom is an open-source, massively parallel software application that performs ab initio molecular dynamics simulations as well as ground- and excited-state calculations utilizing a planewave basis set, and relies on the Charm++ runtime system. We highlight our recent progress toward developing, implementing, testing, and parallelizing an implementation of the GW method, a computationally expensive but accurate approach for describing the dynamics of excited electrons in materials. We will summarize the standard O(N^4) GW approach, where N is the number of atoms in the simulation cell, its current implementation and parallel scaling in OpenAtom, as well as our recently completed development of a much faster O(N^3) GW approach along with its parallel implementation.
Adaptive Discontinuous Galerkin Method for Compressible Flows Using Charm++
Weizhao Li, North Carolina State University
In this presentation, we propose a parallel hp-adaptive discontinuous Galerkin method for compressible flows that exercises the load balancing strategies of Charm++. The general algorithm of this adaptive numerical method, the utilization of the load balancing strategies, and a set of numerical results will be discussed.

Adaptive finite element methods are widely used in computational fluid dynamics because of their reliability, robustness, and efficiency. During the adaptive computation process, portions of the discretized domain are spatially refined or coarsened (h-refinement) and the solution polynomial order is also varied (p-refinement). This enables concentration of computational effort in regions of the problem domain where the solution varies most. In our adaptive discontinuous Galerkin scheme, a posteriori estimates of spatial errors are obtained from the computed numerical solution. These error estimates are used to update the computation order on local elements during each time step.

By applying the above adaptive computation process, the workload on each processing element varies as the computation proceeds. This necessitates a fast and efficient dynamic load-balancing strategy, which is an important factor in the efficiency of adaptive numerical methods, especially for large-scale parallel computations. All of this makes Charm++ an excellent choice for improving hardware utilization and efficiency in our project.

In the present work, we verify the accuracy and robustness of our adaptive discontinuous Galerkin method with a wide range of numerical experiments with analytical solutions. Moreover, performance comparisons and analyses with and without load balancers, as well as with different load balancing strategies, will be discussed to demonstrate that the combination of our adaptive discontinuous Galerkin scheme and the implementation of load balancing strategies significantly reduces total execution times while yielding comparable accuracy.

In the future, we will combine mesh and polynomial-order refinement to obtain an hp-adaptive discontinuous Galerkin method capable of computing real-world physics processes. A set of performance tests with thousands of compute cores will be conducted to verify the scalability and efficiency of our project.
Adaptive Techniques for Scalable Optimistic Parallel Discrete Event Simulation
Eric Mikida, University of Illinois Urbana-Champaign
Discrete Event Simulation (DES) can be an important tool across various domains such as engineering, military applications, biology, high-performance computing, and many others. Interacting systems in these domains can be simulated with a high degree of fidelity and accuracy. Charades is a Parallel Discrete Event Simulation (PDES) engine built on top of Charm++ that utilizes the adaptive and asynchronous nature of Charm++ to execute simulations effectively. Charades is primarily an optimistic simulator, which means that it executes events speculatively in order to increase parallelism. To manage this speculative execution and deal with event rollbacks and timestamp synchronization, we have developed new techniques for synchronization within optimistic PDES simulations. We have also harnessed Charm++'s dynamic load balancing framework in some unique ways to further enhance these techniques.
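To make the optimistic execution model concrete, here is a toy, single-process sketch of one logical process that executes events speculatively, checkpoints its state, and rolls back when a straggler event arrives with an earlier timestamp. It is a generic illustration of the idea, not Charades code, and it omits everything a real engine needs (GVT computation, anti-messages, fossil collection).

```cpp
// Toy sketch of optimistic event execution with state saving and rollback
// for a single logical process (LP). Generic illustration only.
#include <map>
#include <queue>
#include <vector>

struct Event { double timestamp; int payload; };
struct CompareTs {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;   // min-heap on timestamp
    }
};

class LogicalProcess {
    int state_ = 0;
    double now_ = 0.0;
    std::priority_queue<Event, std::vector<Event>, CompareTs> pending_;
    std::map<double, int> checkpoints_;   // timestamp -> state before that event
    std::vector<Event> processed_;        // events we may need to re-execute

public:
    void schedule(const Event& ev) {
        if (ev.timestamp < now_) rollback(ev.timestamp);   // straggler detected
        pending_.push(ev);
    }

    // Speculatively execute everything we currently know about.
    void executeAvailable() {
        while (!pending_.empty()) {
            Event ev = pending_.top();
            pending_.pop();
            checkpoints_[ev.timestamp] = state_;   // save state before the event
            state_ += ev.payload;                  // the "event handler"
            now_ = ev.timestamp;
            processed_.push_back(ev);
        }
    }

private:
    // Undo every event at or after time 't' and requeue it for re-execution.
    void rollback(double t) {
        while (!processed_.empty() && processed_.back().timestamp >= t) {
            Event ev = processed_.back();
            processed_.pop_back();
            state_ = checkpoints_[ev.timestamp];   // restore saved state
            checkpoints_.erase(ev.timestamp);
            pending_.push(ev);                     // will re-execute later
        }
        now_ = processed_.empty() ? 0.0 : processed_.back().timestamp;
    }
};
```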
From Cosmology to Planets: The ChaNGa N-body/Hydrodynamics Code
Tom Quinn, University of Washington
The largest gravitationally bound objects in the Universe, galaxy clusters, are a few megaparsecs across, while the smallest known gravitationally bound objects, asteroids and comets, are a few kilometers across. While the scale changes by 19 orders of magnitude, gravity is scale-free, so the same simulation techniques can be used for modeling both systems. The other similarity is the range of scales that need to be modeled within a system: clusters need to be modeled within their cosmological context, and asteroids need to be modeled within the context of the Solar System. The massively parallel tree-code ChaNGa handles these scales and uses the features of Charm++ to address issues of load balancing and communication latency.
ParaTreeT: A Fast, General Framework for Tree Traversal
Joseph Hutter, University of Illinois Urbana-Champaign
Mathematicians and scientists have been developing parallel programs for tree-based algorithms independently for decades. We seek to aggregate their past developments and offer them high productivity with the Charm++-based Parallel Tree Toolkit (ParaTreeT). We present our framework's abstractions, which facilitate wide-ranging applications such as astronomy, granular dynamics, scientific visualization, and graphics. We then discuss the primary optimizations and Charm++ features central to our generic tree traversal. Finally, we showcase promising performance results and minimal user code for a Barnes-Hut application of ParaTreeT.
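As an example of the kind of traversal such a toolkit generalizes, the sketch below shows a plain Barnes-Hut gravity walk: visit a node, accept its multipole approximation if it is far enough away, otherwise open it and recurse. The types and the opening criterion are generic illustrations, not ParaTreeT's API.

```cpp
// Generic Barnes-Hut-style tree walk: either accept a node's center-of-mass
// approximation or open the node and recurse into its children.
#include <array>
#include <cmath>
#include <memory>
#include <vector>

struct Particle { std::array<double, 3> pos{}, acc{}; double mass = 0.0; };

struct TreeNode {
    std::array<double, 3> com{};   // center of mass
    double mass = 0.0;
    double size = 0.0;             // edge length of the node's bounding box
    std::vector<Particle> bodies;  // non-empty only at leaves
    std::vector<std::unique_ptr<TreeNode>> children;
    bool isLeaf() const { return children.empty(); }
};

static double dist(const std::array<double, 3>& a, const std::array<double, 3>& b) {
    double s = 0.0;
    for (int d = 0; d < 3; ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

static void addForce(Particle& p, const std::array<double, 3>& pos, double mass) {
    double r = dist(p.pos, pos) + 1e-9;    // softened; no self-interaction check
    double f = mass / (r * r * r);         // G = 1 for simplicity
    for (int d = 0; d < 3; ++d) p.acc[d] += f * (pos[d] - p.pos[d]);
}

// Walk the tree for one target particle. 'theta' is the usual opening angle.
void traverse(const TreeNode& node, Particle& p, double theta = 0.5) {
    if (node.isLeaf()) {
        for (const Particle& src : node.bodies) addForce(p, src.pos, src.mass);
        return;
    }
    if (node.size / dist(p.pos, node.com) < theta) {
        addForce(p, node.com, node.mass);  // far enough: use the multipole
    } else {
        for (const auto& child : node.children) traverse(*child, p, theta);
    }
}
```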
Optimizing a New DARMA Runtime for Load Balancing EMPIRE
Jonathan Lifflander, Sandia National Laboratories (SNL)
In the past two years, DARMA has evolved to include a new runtime, VT (Virtual Transport). VT enables virtualization/overdecomposition with C++ objects, similar to the Charm++ programming model. In FY19-Q1, we overhauled EMPIRE, a full-blown ATDM PIC application utilizing Trilinos, in a proof-of-concept implementation that overdecomposes the mesh and particles to support long-term load balancing needs. We discuss the programming model needs of EMPIRE-PIC and the design decisions it led to in building the VT runtime.
Distributed Load Balancing Utilizing the Communication Graph
Philippe P. Pebay, Sandia National Laboratories (SNL)
We build on previous research by Menon and Kale to enhance fully distributed, gossip-based load balancing algorithms to consider the task-to-task communication graph. We present a new, agile load balancing simulator developed in Python for studying task-DAG statistics and exploring new LB algorithms. We present highly scalable, iterative LB algorithms that use the distributed communication graph to optimize task-to-processor placement.
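To illustrate the gossip-style information propagation such distributed strategies are built on, the following toy, single-process simulation has every PE forward its current view of per-PE loads to a few random peers each round, so an approximately global picture emerges in roughly log P rounds. The communication-graph-aware placement described in the talk is not modeled here.

```cpp
// Toy simulation of gossip (epidemic) load-information propagation across P
// simulated PEs. Task migration decisions are left out; only the spread of
// load knowledge is shown.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int P = 64;        // number of simulated PEs
    const int fanout = 2;    // peers contacted per round
    const int rounds = 6;    // roughly log2(P) rounds

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, P - 1);
    std::uniform_real_distribution<double> loadDist(0.0, 10.0);

    // view[p][q] is PE p's current estimate of PE q's load (-1 = unknown).
    std::vector<std::vector<double>> view(P, std::vector<double>(P, -1.0));
    for (int p = 0; p < P; ++p) view[p][p] = loadDist(rng);

    for (int r = 0; r < rounds; ++r) {
        auto next = view;                        // "messages" delivered next round
        for (int p = 0; p < P; ++p) {
            for (int k = 0; k < fanout; ++k) {
                int peer = pick(rng);
                for (int q = 0; q < P; ++q)      // merge p's knowledge into peer's
                    if (view[p][q] >= 0.0) next[peer][q] = view[p][q];
            }
        }
        view = std::move(next);
    }

    // Report how complete PE 0's view of the system is after gossiping.
    int known = 0;
    for (int q = 0; q < P; ++q) if (view[0][q] >= 0.0) ++known;
    std::printf("PE 0 knows the load of %d of %d PEs\n", known, P);
    return 0;
}
```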
Interoperability of Shared Memory Parallel Programming Models with Charm++
Jaemin Choi, University of Illinois Urbana-Champaign
Parallel programming models such as Kokkos and RAJA aim to achieve performance portability while providing users with abstractions for parallel execution and data management. However, they are limited to shared-memory environments (i.e., a single node) and rely on distributed-memory frameworks such as MPI for large-scale runs that involve many physical nodes. In this talk, we will demonstrate basic interoperability of the aforementioned models with Charm++ for distributed-memory execution and discuss potential improvements for a more streamlined integration.
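A minimal sketch of the basic interoperation pattern follows: a Charm++ chare keeps its partition of the data and hands node-local parallel work to a Kokkos parallel_for inside an entry method, while inter-node parallelism and messaging remain with Charm++. The .ci fragment, chare name, and setup details are assumptions for illustration, not the talk's code.

```cpp
// Sketch: node-local parallelism via Kokkos inside a Charm++ chare array
// element. The .ci fragment is summarized below; all names are hypothetical.
//
// worker.ci:
//   module worker {
//     array [1D] Worker {
//       entry Worker(int n);
//       entry void compute();
//     };
//   };

#include <Kokkos_Core.hpp>
#include "worker.decl.h"

class Worker : public CBase_Worker {
  Kokkos::View<double*> data_;
 public:
  explicit Worker(int n) : data_("data", n) {}
  Worker(CkMigrateMessage*) {}   // needed for migratable array elements

  void compute() {
    auto data = data_;                               // capture the View by value
    const int n = static_cast<int>(data.extent(0));
    // Shared-memory (or GPU) parallelism handled by Kokkos within one chare;
    // inter-chare parallelism and messaging stay with Charm++.
    Kokkos::parallel_for("scale", n, KOKKOS_LAMBDA(const int i) {
      data(i) = 2.0 * i;
    });
    Kokkos::fence();
    contribute(CkCallback(CkCallback::ckExit));      // signal completion
  }
};

#include "worker.def.h"

// Kokkos::initialize()/Kokkos::finalize() are assumed to be called once per
// process, e.g. from the main chare; the details vary by setup.
```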
About Us
Workshop General Chair: Jaemin Choi