Workshop Program
May 1st (Wed)
Time | Type | Title | Speaker | Affiliation | Slides | Video |
---|---|---|---|---|---|---|
08:00 - 09:00 | Continental Breakfast | |||||
Opening Session | Chair: Laxmikant V. Kale | |||||
09:00 - 09:20 | Opening Remarks | | Laxmikant V. Kale | University of Illinois Urbana-Champaign | Link | |
09:20 - 10:20 | Keynote | Molecular Dynamics: Looking Ahead to Exascale | Steve Plimpton | Sandia National Laboratories | Link | Link |
10:20 - 10:40 | Break | |||||
Session: Applications I | Chair: Michael Robson | |||||
10:40 - 11:10 | Talk | Experiences with Charm++ and NAMD on the Summit POWER9/Volta Supercomputer | James Phillips | National Center for Supercomputing Applications | Link | Link |
11:10 - 11:40 | Talk | SpECTRE: Towards Improved Simulations of Relativistic Astrophysical Systems | Nils Deppe | Cornell University | Link | Link |
11:40 - 12:10 | Talk | Enzo-E / Cello: Adaptive Mesh Refinement Astrophysics using Charm++ | James Bordner | San Diego Supercomputer Center | Link | Link |
12:10 - 13:10 | Lunch | |||||
Session: Heterogeneous Computing | Chair: Ronak Buch | |||||
13:10 - 13:40 | Talk | Xilinx's Adaptable FPGA Acceleration Platforms for HPC Applications | Viraj R. Paropkari | Xilinx Inc. | ||
13:40 - 14:00 | Talk | Charm++ and the Future of GPUs | Michael Robson | University of Illinois Urbana-Champaign | Link | Link |
14:00 - 14:20 | Talk | Distributed Deep Learning: Leveraging Heterogeneity and Data-Parallelism | Jaemin Choi | University of Illinois Urbana-Champaign | Link | Link |
14:20 - 14:40 | Break | |||||
Session: Frameworks & Communication | Chair: Sam White | |||||
14:40 - 15:10 | Talk | FleCSI: Compile-time Configurable Framework for Multi-Physics Applications | Li-Ta Lo | Los Alamos National Laboratory | Link | Link |
15:10 - 15:40 | Talk | Accelerating Large Messages by Avoiding Memory Operations in Charm++ | Nitin Bhat | Charmworks, Inc. | Link | Link |
15:40 - 16:10 | Talk | Charm4Py: Parallel Programming with Python and Charm++ | Juan Galvez | University of Illinois Urbana-Champaign | Link | Link |
16:10 - 16:30 | Break | |||||
Session: Adaptive MPI | Chair: Matthias Diener | |||||
16:30 - 17:00 | Talk | AMPI for LAMMPS USER-MESO | Yidong Xia, Jiaoyan Li | Idaho National Laboratory | Link | Link |
17:00 - 17:30 | Talk | Recent Developments in Adaptive MPI | Sam White, Evan Ramos | University of Illinois Urbana-Champaign, Charmworks, Inc. | Link | Link |
17:30 - 18:00 | Discussion | Charm++ Release 6.10.0 | Eric Bohm | Charmworks, Inc. | Link | |
18:30 - 20:45 | Banquet | Siebel Center 2nd floor Atrium | | | | |
Workshop Program
May 2nd (Thu)
Time | Type | Title | Speaker | Affiliation | Slides | Video |
---|---|---|---|---|---|---|
08:00 - 09:00 | Continental Breakfast | |||||
Opening Session | Chair: Laxmikant V. Kale | |||||
09:00 - 10:00 | Keynote | Machine Learning and Predictive Simulation: HPC and the U.S. Cancer Moonshot on Sierra | Fred Streitz | Lawrence Livermore National Laboratory | Link | |
10:00 - 10:30 | Talk | Software Sustainability and Software Citation | Daniel S. Katz | National Center for Supercomputing Applications | Link | Link |
10:30 - 10:50 | Break | |||||
Session: Applications II | Chair: Eric Bohm | |||||
10:50 - 11:20 | Talk | Distributed Garbage Collection for General Graphs | Steven R. Brandt | Louisiana State University | Link | Link |
11:20 - 11:50 | Talk | OpenAtom: The GW Method for Electronic Excitations | Minjung Kim, Kavitha Chandrasekar | Yale University, University of Illinois Urbana-Champaign | Link | Link |
11:50 - 12:20 | Talk | Adaptive Discontinuous Galerkin Method for Compressible Flows Using Charm++ | Weizhao Li | North Carolina State University | Link | Link |
12:20 - 13:20 | Lunch | |||||
Session: Applications III | Chair: Jaemin Choi | |||||
13:20 - 13:35 | Talk | Adaptive Techniques for Scalable Optimistic Parallel Discrete Event Simulation | Eric Mikida | University of Illinois Urbana-Champaign | Link | Link |
13:35 - 13:50 | Talk | Dynamically Turning Cores Off for Performance and Energy in Charm++ | Kavitha Chandrasekar | University of Illinois Urbana-Champaign | Link | Link |
13:50 - 14:10 | Talk | From Cosmology to Planets: The ChaNGa N-body/Hydrodynamics Code | Tom Quinn | University of Washington | Link | Link |
14:10 - 14:25 | Talk | ParaTreeT: A Fast, General Framework for Tree Traversal | Joseph Hutter | University of Illinois Urbana-Champaign | Link | Link |
14:25 - 14:40 | Talk | Efficient GPU-only Local Tree Walks in ChaNGa | Milind Kulkarni | Purdue University | Link | Link |
14:40 - 14:50 | Break | |||||
Session: Runtimes & Load Balancing | Chair: Raghavendra Kanakagiri | |||||
14:50 - 15:10 | Talk | Optimizing a New DARMA Runtime for Load Balancing EMPIRE | Jonathan Lifflander | Sandia National Laboratories | Link | |
15:10 - 15:30 | Talk | Distributed Load Balancing Utilizing the Communication Graph | Philippe P. Pebay | NextGen Analytics | Link | |
15:30 - 15:50 | Talk | Recent Developments in Dynamic Load Balancing | Ronak Buch | University of Illinois Urbana-Champaign | Link | Link |
15:50 - 16:10 | Break | |||||
Session: Other Charm++ Features | Chair: Kavitha Chandrasekar | |||||
16:10 - 16:40 | Talk | The Effect of UCX Machine layer on Charm++ Simulations | Yong Qin | Mellanox Technologies | Link | Link |
16:40 - 17:00 | Talk | Improving Throughput of Fine-grained Messages with Aggregation | Venkatasubrahmanian Narayanan | University of Illinois Urbana-Champaign | Link | Link |
17:00 - 17:20 | Talk | Interoperability of Shared Memory Parallel Programming Models with Charm++ | Jaemin Choi | University of Illinois Urbana-Champaign | Link | Link |
17:20 - 17:40 | Closing Remarks | | Laxmikant V. Kale | University of Illinois Urbana-Champaign | Link | |
18:30 - 20:00 | Workshop Dinner | Venue: Jupiter's Pizzeria & Billiards, 39 E Main St, Champaign, IL 61820 | | | | |
Keynote
Molecular Dynamics: Looking Ahead to Exascale
Steve Plimpton, Sandia National Laboratories (SNL)
Coming exascale machines will provide greatly increased computational resources for all kinds of scientific computing and simulation. They will also be harder to use than current machines, both at the single-node and full-machine level. This is a challenge not just for applications, but also for creators of programming models and software frameworks, including the Charm++ community.
In the first portion of my talk, I'll discuss algorithmic work we've recently done to enable the hyperdynamics (HD) method to run in parallel in our molecular dynamics (MD) code LAMMPS. HD can greatly extend the timescale accessible to an MD simulation, at least for solids where relatively rare events trigger transitions between potential energy basins. I'll describe how HD works in parallel and also present some results for long timescale HD modeling of diffusion on a Pt(100) surface. Used in tandem with multiple-replica techniques, this could be an effective way to use an exascale machine to run small problems for really long timescales.
In the second portion, I'll broaden the scope and give some numbers and examples that illustrate the challenges exascale platforms pose for all kinds of computational modeling. Hopefully you will see ways to turn challenges into opportunities for new research.
Experiences with Charm++ and NAMD on the Summit POWER9/Volta Supercomputer
James Phillips, National Center for Supercomputing Applications (NCSA)
The highly parallel molecular dynamics code NAMD has long been used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64-million-atom model of the HIV virus capsid. In 2007 NAMD was one of the first codes to run on a GPU cluster, and it is now one of the first on the world's fastest GPU-accelerated supercomputer, ORNL Summit, which features IBM POWER9 CPUs, NVIDIA Volta GPUs, the NVLink CPU-GPU interconnect, and a dual-rail EDR InfiniBand inter-node network. This talk will cover the latest NAMD performance improvements and scaling results on Summit, with an emphasis on recent Charm++ features and optimizations of relevance to NAMD.
SpECTRE: Towards Improved Simulations of Relativistic Astrophysical Systems
Nils Deppe, Cornell University
We are now at the beginning of the exciting new era of multi-messenger astrophysics. The first binary neutron star merger event was detected on August 17, 2017 both in gravitational waves and in the electromagnetic spectrum, and we expect that the Event Horizon Telescope will allow for studying accretion onto black holes. In order to understand the observations that will be possible, we must develop more accurate models and computer simulations than are currently available. Spectral methods have proven themselves to be an invaluable tool for generating gravitational waveform models from numerical relativity simulations. However, these methods cannot be directly applied to hydrodynamics because spectral methods assume a smooth solution, i.e. no shocks. Discontinuous Galerkin methods promise to remedy this, behaving as high-order spectral-type methods where the solution is smooth, and robust shock-capturing methods at discontinuities. The equations of general relativistic magnetohydrodynamics (GRMHD) alone and coupled with the Einstein equations prove to be especially challenging to solve.
SpECTRE is a next-generation multiphysics code implemented using Charm++ and is designed to overcome the limitations of current codes used in relativistic astrophysics. How to most effectively solve the GRMHD equations is still a very active area of research. SpECTRE is designed to be extremely modular so that new algorithms can easily be added into the code and combined with existing algorithms. I will provide an overview of the simulations we are currently capable of running with SpECTRE, the algorithms we've implemented in order to highlight SpECTRE's flexibility, and the near- and long-term science goals.
Enzo-E / Cello: Adaptive Mesh Refinement Astrophysics using Charm++
James Bordner, San Diego Supercomputer Center (SDSC)
Cello is a highly-scalable "array-of-octree" adaptive mesh refinement (AMR) software framework, implemented using Charm++. Cello is being developed concurrently with the driving application Enzo-E (formerly Enzo-P), a highly scalable branch of the MPI-parallel astrophysics and cosmology application "ENZO". The Cello AMR framework provides its scientific application with mesh adaptivity, ghost cell refresh, generic distributed field and particle data types, and sequencing of user methods for computing on block data. Further advanced parallel software support, including data-driven execution, fully-distributed data structures, dynamic load-balancing, and checkpoint / restart, is provided by Charm++. We give an overview of the Enzo-E / Cello / Charm++ software layers, discuss the design and implementation of Enzo-E / Cello, and present parallel scaling results on Blue Waters.
Xilinx's Adaptable FPGA Acceleration Platforms for HPC Applications
Viraj R. Paropkari, Xilinx Inc.
Exponential growth of compute requirements in HPC applications is now driving the need for heterogeneous computing architectures, which rely on accelerators to deliver power-efficient scaling of compute performance. Further compounding the computing challenges is the dawn of AI and the explosion in the sheer amount of data that needs to be stored and processed. A new class of compute and storage acceleration platforms is needed to enable tomorrow's exascale supercomputers, and the compute and storage nodes require an intelligent network fabric to communicate with each other. These accelerators will need to be easy to deploy and manage, and highly adaptable to the ever-changing workloads within HPC centers. This talk will focus on Xilinx FPGA-based accelerator platforms for compute, storage, and networking, along with the software programming stack, with case studies in HPC applications.
Distributed Deep Learning: Leveraging Heterogeneity and Data-Parallelism
Jaemin Choi, University of Illinois Urbana-Champaign
With deep learning models becoming more complex and the volume of training data ever growing, training a deep learning model on a single device can take an intolerable amount of time. Distributed deep learning tackles this problem by distributing the training process across multiple compute devices. However, the current landscape of distributed deep learning is dominated by GPUs, due to their superior performance on tensor computations; CPUs have naturally taken a back seat, used only when GPUs are not available. But for areas such as natural language processing (NLP) that utilize sequence-to-sequence models, CPU performance does not fall too far behind, especially with the use of optimized libraries such as Intel's MKL. We aim to take advantage of this fact and accelerate training using both CPUs and GPUs, with the popular data-parallel approach to distributed deep learning. We evaluate two deep learning NLP applications: machine translation using Google's Transformer model, and image captioning with a pre-trained CNN as the image encoder and an LSTM as the decoder.
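As a concrete illustration of the data-parallel pattern the talk builds on, the sketch below averages per-worker gradients with MPI_Allreduce so that every worker applies the same update. This is a generic, minimal example (the model and backward pass are stand-ins), not the CPU/GPU training setup described in the talk.

```cpp
// Minimal sketch of synchronous data-parallel training: every rank computes
// gradients on its own data shard, the gradients are summed with
// MPI_Allreduce, and each rank applies the same averaged update.
#include <mpi.h>
#include <cstdlib>
#include <vector>

// Stand-in for a real model's backward pass on the local mini-batch.
static void compute_local_gradients(std::vector<double>& grad) {
    for (double& g : grad) g = std::rand() / static_cast<double>(RAND_MAX);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nparams = 1 << 20;                 // illustrative model size
    const double lr = 0.01;                      // learning rate
    std::vector<double> params(nparams, 0.0);
    std::vector<double> grad(nparams, 0.0);

    for (int step = 0; step < 10; ++step) {
        compute_local_gradients(grad);
        // Sum gradients across all workers in place.
        MPI_Allreduce(MPI_IN_PLACE, grad.data(), nparams,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        // Every rank performs the identical averaged update.
        for (int i = 0; i < nparams; ++i)
            params[i] -= lr * grad[i] / nranks;
    }

    MPI_Finalize();
    return 0;
}
```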
The Effect of UCX Machine layer on Charm++ Simulations
Yong Qin, Mellanox Technologies
From concept to engineering, and from design to test and manufacturing, engineers from a wide range of industries face the ever-increasing need for complex and realistic models to analyze the most challenging industrial problems; analysis is performed to secure quality and speed up the development process. Sophisticated programming models and software have been developed to tackle the need for computational simulations with superior robustness, speed, and accuracy. These simulations are designed to run effectively on large-scale High-Performance Computing (HPC) systems.

The new generation of InfiniBand In-Network Computing technology includes several elements, such as Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™, smart MPI hardware tag matching, rendezvous protocols, and more network offload mechanisms. These offload technologies are in use at several of the recently deployed large-scale supercomputers around the world, including the top TOP500 platforms.

Unified Communication X (UCX) is an open-source, production-grade communication framework for data-centric and high-performance applications. UCX is a collaboration between industry, laboratories, government (DoD, DoE), and academia that enables the highest performance through co-design of software-hardware interfaces.

Mellanox has implemented the UCX machine layer for Charm++ and conducted performance investigations, including low-level and application benchmarks, to evaluate its performance and scaling capabilities with the InfiniBand interconnect.

In this session we will present the test results and performance benefits of the UCX machine layer, as well as discuss the potential use cases of this component.
Accelerating Large Messages by Avoiding Memory Operations in Charm++
Nitin Bhat, Charmworks, Inc.
With memory performance not scaling at the rate of CPU performance, memory-bound operations are one of the biggest bottlenecks for application performance and scaling. The cost of memory allocation and copying increases drastically with message size. The Zero Copy API in Charm++ allows users to avoid additional allocations and copies by taking advantage of the RDMA capability of the network. In this talk, I will present the recent additions to the Zero Copy API for point-to-point and collective operations. I will discuss methods and use cases, and present results for the different flavors of the API, which can be used to reduce the memory footprint and improve the performance of applications.
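For readers unfamiliar with the API, the following is a minimal sketch of the zero copy entry method send style described in the Charm++ manual: the interface file marks a parameter as nocopy, and the sender wraps its buffer in CkSendBuffer (optionally with a callback that fires once the buffer may be reused). The chare and file names are hypothetical; this is not code from the talk.

```cpp
// Sketch of the Charm++ zero copy entry method send API. The .ci fragment
// below declares the receiving entry method with a "nocopy" parameter; the
// sender wraps its buffer in CkSendBuffer so the runtime can deliver it via
// RDMA rather than copying it into a message.
//
// receiver.ci (interface file fragment, hypothetical module name):
//   module receiver {
//     array [1D] Receiver {
//       entry Receiver();
//       entry void recv(int n, nocopy double data[n]);
//     };
//   };

#include "receiver.decl.h"   // generated from the .ci file above

class Receiver : public CBase_Receiver {
 public:
  Receiver() {}
  Receiver(CkMigrateMessage*) {}
  void recv(int n, double* data) {
    // On RDMA-capable networks 'data' arrives without intermediate copies.
    CkPrintf("[%d] received %d doubles, first = %f\n", thisIndex, n, data[0]);
  }
};

// Sender side: the callback fires once 'buf' is safe to modify or free.
void send_block(CProxy_Receiver receivers, double* buf, int n) {
  CkCallback done(CkCallback::ignore);
  receivers[0].recv(n, CkSendBuffer(buf, done));
}

#include "receiver.def.h"
```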
Charm4Py: Parallel Programming with Python and Charm++
Juan Galvez, University of Illinois Urbana-Champaign
Charm4py is a parallel/distributed programming framework for Python built on top of the Charm++ runtime, thus providing asynchronous remote method invocation, overdecomposition, dynamic load balancing, and overlap of computation and communication. Charm4py takes advantage of the dynamic language features of Python to offer a high-level, simple, and powerful API that simplifies writing, running, and debugging parallel applications. Futures are an optional but core feature of the framework; they allow taking advantage of the capabilities of the Charm++ programming model and runtime while using direct-style programming, the style most programmers are accustomed to. In this talk we will present new features of Charm4py, discuss how high-performance applications can be developed using this framework, and show how its performance compares to other popular Python frameworks like Dask.
AMPI for LAMMPS USER-MESO
Yidong Xia, Jiaoyan Li, Idaho National Laboratory (INL)
Particle-based nano- to micro-scale fluid flow and transport models and high-performance computing (HPC) techniques are deployed to study source rocks such as shale at Idaho National Laboratory (INL). The dynamic processes of fluid flow are usually inhomogeneous, in the sense that the distribution of particles as well as the computation of interparticle potentials can exhibit high spatial and temporal variability. On HPC clusters, these types of inhomogeneity usually lead to severe load imbalance across processors and consequently poor scalability, making it extremely difficult for most current codes to reach the required ranges of spatial and temporal scales.

The core capability of Charm++, automatic load balancing based on over-decomposition and smart rank scheduling, is especially attractive for particle flow and transport models in general. However, a "good" Charm++ implementation of these models is not trivial, especially when various engineered processes are considered in programming, and when sophisticated boundary conditions need to be implemented by strictly following the Charm++ paradigm.

In this work, we present the latest progress in the implementation of Adaptive Message Passing Interface (AMPI) support for the LAMMPS-based USER-MESO particle flow simulation package. First, the mesoscale particle model implemented in LAMMPS USER-MESO, dissipative particle dynamics (DPD), will be briefly introduced, and examples of the scientific and engineering applications of DPD at INL will be overviewed. The existing load imbalance problem in the simulation of an engineered shale oil recovery process, in which fluid is injected into high-resolution, realistic nanoporous shale pore networks, will be highlighted. Then, the computing performance will be carefully evaluated for fluid flow simulations run with and without AMPI support. Finally, we will demonstrate the possibility of cooperative work between AMPI and the native load balancer in LAMMPS, recursive coordinate bisectioning (RCB).
Recent Developments in Adaptive MPI
Sam White, University of Illinois Urbana-Champaign / Evan Ramos, Charmworks, Inc.
Adaptive MPI is an MPI library implemented on top of Charm++. As such, it provides the application programming interface of MPI with the dynamic runtime features of Charm++. In this talk, we provide a quick overview of AMPI and its features before discussing two directions of recent work on its implementation: 1) communication optimizations for taking advantage of communication locality and for collective routines, and 2) automatic global variable privatization methods to enable running legacy applications on AMPI.
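Because AMPI implements the standard MPI interface, an ordinary MPI program like the one below can run unchanged on AMPI, with many virtual ranks multiplexed onto a few cores. The build and launch commands in the trailing comment follow the AMPI documentation as I understand it (ampicxx, charmrun, +vp); exact wrapper names may vary by installation.

```cpp
// An ordinary MPI program; under AMPI each rank becomes a migratable
// user-level thread, so many virtual ranks can share a few physical cores
// and be load balanced by the Charm++ runtime.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("virtual rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

// Assumed build/launch commands (per the AMPI documentation):
//   ampicxx hello.cpp -o hello
//   ./charmrun +p4 ./hello +vp16   # 16 virtual MPI ranks on 4 cores
```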
Keynote
Machine Learning and Predictive Simulation: HPC and the U.S. Cancer Moonshot on Sierra
Fred Streitz, Lawrence Livermore National Laboratory (LLNL)
The marriage of experimental science with simulation has been a fruitful one: the fusion of HPC-based simulation and experimentation moves science forward faster than either discipline alone, rapidly testing hypotheses and identifying promising directions for future research. The emergence of machine learning at scale promises to bring a new type of thinking into the mix, incorporating data analytics techniques alongside traditional HPC to accompany experiment. Such explorations can develop highly complex workflows that benefit greatly from heterogeneous computing. I will discuss one such complex workflow: a multi-scale investigation of Ras biology on a realistic membrane that makes effective use of the Sierra supercomputer at LLNL.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Software Sustainability and Software Citation
Daniel S. Katz, National Center for Supercomputing Applications (NCSA)
Software sustainability means different things to different groups of people, including the persistence of working software, the persistence of people, or funding. While we can generally define sustainability as having an inflow of resources sufficient to do the needed work, where those resources both include and are somewhat transferrable into human effort, users, funders, managers, and developers (or maintainers) all mean somewhat different things when they use "sustainable" in the context of software. This talk will illustrate some of these different views and their corresponding aims. It will also provide some guidance on quantifying software sustainability from some of these views. In particular, one of the methods that can be used to bring resources to a project is to provide incentives (often non-financial) for contributing to it. In academia, a key incentive is the potential for one's work to be cited, and this talk will also discuss how software is cited today and how we might use software citations to increase sustainability.
Distributed Garbage Collection for General Graphs
Steven R. Brandt, Louisiana State University
We propose a scalable, cycle-collecting, decentralized, reference-counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O(log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O(Ea) time, where Ea is the number of edges in the induced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.
OpenAtom: The GW Method for Electronic Excitations
Minjung Kim, Yale University
Kavitha Chandrasekar, University of Illinois Urbana-Champaign
OpenAtom is an open-source, massively parallel software application that performs ab initio molecular dynamics simulations as well as ground- and excited-state calculations utilizing a planewave basis set, and relies on the Charm++ runtime system. We highlight our recent progress toward developing, implementing, testing, and parallelizing an implementation of the GW method, a computationally expensive but accurate approach for describing the dynamics of excited electrons in materials. We will summarize the standard O(N^4) GW approach, where N is the number of atoms in the simulation cell, its current implementation and parallel scaling in OpenAtom, as well as our recently completed development of a much faster O(N^3) GW approach along with its parallel implementation.
Adaptive Discontinuous Galerkin Method for Compressible Flows Using Charm++
Weizhao Li, North Carolina State University
In this presentation, we propose a parallel hp-adaptive discontinuous Galerkin method for compressible flows that exercises the load balancing strategies of Charm++. The general algorithm of this adaptive numerical method, the utilization of the load balancing strategies, and a set of numerical results will be discussed.

Adaptive finite element methods are widely used in computational fluid dynamics because of their reliability, robustness, and efficiency. During the adaptive computation process, portions of the discretized domain are spatially refined or coarsened (h-refinement) and the solution polynomial order is also varied (p-refinement). This enables concentration of computational effort in regions of the problem domain where the solution varies most. In our adaptive discontinuous Galerkin scheme, a posteriori estimates of spatial errors are obtained from the computed numerical solution. These error estimates are used to update the computation order on local elements during each time step.

By applying the above adaptive computation process, the workload on each processing element varies as the computation proceeds. This necessitates a fast and efficient dynamic load-balancing strategy, which is an important factor in the efficiency of adaptive numerical methods, especially for large-scale parallel computations. All of this makes Charm++ an excellent choice for improving hardware utilization and efficiency in our project.

In the present work, we verify the accuracy and robustness of our adaptive discontinuous Galerkin method with a wide range of numerical experiments with analytical solutions. Moreover, performance comparisons and analyses with and without load balancers, as well as with different load balancing strategies, will be discussed to demonstrate that the combination of our adaptive discontinuous Galerkin scheme and the implementation of load balancing strategies significantly reduces total execution times while yielding comparable accuracy.

In the future, we will combine mesh and polynomial-order refinement to obtain an hp-adaptive discontinuous Galerkin method capable of computing real-world physics processes. A set of performance tests with thousands of compute cores will be conducted to verify the scalability and efficiency of our project.
Adaptive Techniques for Scalable Optimistic Parallel Discrete Event Simulation
Eric Mikida, University of Illinois Urbana-Champaign
Discrete Event Simulation (DES) can be an important tool across various domains such as engineering, military applications, biology, high-performance computing, and many others. Interacting systems in these domains can be simulated with a high degree of fidelity and accuracy. Charades is a Parallel Discrete Event Simulation (PDES) engine built on top of Charm++ that utilizes the adaptive and asynchronous nature of Charm++ to execute simulations effectively. Charades is primarily an optimistic simulator, which means that it executes events speculatively in order to increase parallelism. To manage this speculative execution and deal with event rollbacks and timestamp synchronization, we have developed new techniques for synchronization within optimistic PDES simulations. We have also harnessed Charm++'s dynamic load balancing framework in some unique ways to further enhance these techniques.
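To make the optimistic execution model concrete, here is a toy, single-process sketch of one logical process that executes events speculatively, checkpoints its state, and rolls back when a straggler event arrives with an earlier timestamp. It is a generic illustration of the idea, not Charades code, and it omits everything a real engine needs (GVT computation, anti-messages, fossil collection).

```cpp
// Toy sketch of optimistic event execution with state saving and rollback
// for a single logical process (LP). Generic illustration only.
#include <map>
#include <queue>
#include <vector>

struct Event { double timestamp; int payload; };
struct CompareTs {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;   // min-heap on timestamp
    }
};

class LogicalProcess {
    int state_ = 0;
    double now_ = 0.0;
    std::priority_queue<Event, std::vector<Event>, CompareTs> pending_;
    std::map<double, int> checkpoints_;   // timestamp -> state before that event
    std::vector<Event> processed_;        // events we may need to re-execute

public:
    void schedule(const Event& ev) {
        if (ev.timestamp < now_) rollback(ev.timestamp);   // straggler detected
        pending_.push(ev);
    }

    // Speculatively execute everything we currently know about.
    void executeAvailable() {
        while (!pending_.empty()) {
            Event ev = pending_.top();
            pending_.pop();
            checkpoints_[ev.timestamp] = state_;   // save state before the event
            state_ += ev.payload;                  // the "event handler"
            now_ = ev.timestamp;
            processed_.push_back(ev);
        }
    }

private:
    // Undo every event at or after time 't' and requeue it for re-execution.
    void rollback(double t) {
        while (!processed_.empty() && processed_.back().timestamp >= t) {
            Event ev = processed_.back();
            processed_.pop_back();
            state_ = checkpoints_[ev.timestamp];   // restore saved state
            checkpoints_.erase(ev.timestamp);
            pending_.push(ev);                     // will re-execute later
        }
        now_ = processed_.empty() ? 0.0 : processed_.back().timestamp;
    }
};
```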
From Cosmology to Planets: The ChaNGa N-body/Hydrodynamics Code
Tom Quinn, University of Washington
The largest gravitationally bound objects in the Universe, galaxy clusters, are a few megaparsecs across, while the smallest known gravitationally bound objects, asteroids and comets, are a few kilometers across. While the scale changes by 19 orders of magnitude, gravity is scale-free, so the same simulation techniques can be used for modeling both systems. The other similarity is the range of scales that need to be modeled within a system: clusters need to be modeled within their cosmological context, and asteroids need to be modeled within the context of the Solar System. The massively parallel tree-code ChaNGa handles these scales and uses the features of Charm++ to address issues of load balancing and communication latency.
ParaTreeT: A Fast, General Framework for Tree Traversal
Joseph Hutter, University of Illinois Urbana-Champaign
Mathematicians and scientists have been developing parallel programs for tree-based algorithms independently for decades. We seek to aggregate their past developments and offer them high productivity with the Charm++-based Parallel Tree Toolkit (ParaTreeT). We present our framework's abstractions, which facilitate wide-ranging applications such as astronomy, granular dynamics, scientific visualization, and graphics. We then discuss the primary optimizations and Charm++ features central to our generic tree traversal. Finally, we showcase promising performance results and minimal user code for a Barnes-Hut application of ParaTreeT.
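As an example of the kind of traversal such a toolkit generalizes, the sketch below shows a plain Barnes-Hut gravity walk: visit a node, accept its multipole approximation if it is far enough away, otherwise open it and recurse. The types and the opening criterion are generic illustrations, not ParaTreeT's API.

```cpp
// Generic Barnes-Hut-style tree walk: either accept a node's center-of-mass
// approximation or open the node and recurse into its children.
#include <array>
#include <cmath>
#include <memory>
#include <vector>

struct Particle { std::array<double, 3> pos{}, acc{}; double mass = 0.0; };

struct TreeNode {
    std::array<double, 3> com{};   // center of mass
    double mass = 0.0;
    double size = 0.0;             // edge length of the node's bounding box
    std::vector<Particle> bodies;  // non-empty only at leaves
    std::vector<std::unique_ptr<TreeNode>> children;
    bool isLeaf() const { return children.empty(); }
};

static double dist(const std::array<double, 3>& a, const std::array<double, 3>& b) {
    double s = 0.0;
    for (int d = 0; d < 3; ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

static void addForce(Particle& p, const std::array<double, 3>& pos, double mass) {
    double r = dist(p.pos, pos) + 1e-9;    // softened; no self-interaction check
    double f = mass / (r * r * r);         // G = 1 for simplicity
    for (int d = 0; d < 3; ++d) p.acc[d] += f * (pos[d] - p.pos[d]);
}

// Walk the tree for one target particle. 'theta' is the usual opening angle.
void traverse(const TreeNode& node, Particle& p, double theta = 0.5) {
    if (node.isLeaf()) {
        for (const Particle& src : node.bodies) addForce(p, src.pos, src.mass);
        return;
    }
    if (node.size / dist(p.pos, node.com) < theta) {
        addForce(p, node.com, node.mass);  // far enough: use the multipole
    } else {
        for (const auto& child : node.children) traverse(*child, p, theta);
    }
}
```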
Optimizing a New DARMA Runtime for Load Balancing EMPIRE
Jonathan Lifflander, Sandia National Laboratories (SNL)
In the past two years, DARMA has evolved to include a new runtime, VT (Virtual Transport). VT enables virtualization/overdecomposition with C++ objects, similar to the Charm++ programming model. In FY19-Q1, we overhauled EMPIRE, a full-blown ATDM PIC application utilizing Trilinos, in a proof-of-concept implementation that overdecomposes the mesh and particles to support long-term load balancing needs. We discuss the programming model needs of EMPIRE-PIC and the design decisions it led to in building the VT runtime.
Distributed Load Balancing Utilizing the Communication Graph
Philippe P. Pebay, Sandia National Laboratories (SNL)
We build on previous research by Menon and Kale to enhance fully distributed, gossip-based load balancing algorithms to consider the task-to-task communication graph. We present a new, agile load balancing simulator developed in Python for studying task-DAG statistics and exploring new LB algorithms. We present highly scalable, iterative LB algorithms that use the distributed communication graph to optimize task-to-processor placement.
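To illustrate the gossip-style information propagation such distributed strategies are built on, the following toy, single-process simulation has every PE forward its current view of per-PE loads to a few random peers each round, so an approximately global picture emerges in roughly log P rounds. The communication-graph-aware placement described in the talk is not modeled here.

```cpp
// Toy simulation of gossip (epidemic) load-information propagation across P
// simulated PEs. Task migration decisions are left out; only the spread of
// load knowledge is shown.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int P = 64;        // number of simulated PEs
    const int fanout = 2;    // peers contacted per round
    const int rounds = 6;    // roughly log2(P) rounds

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, P - 1);
    std::uniform_real_distribution<double> loadDist(0.0, 10.0);

    // view[p][q] is PE p's current estimate of PE q's load (-1 = unknown).
    std::vector<std::vector<double>> view(P, std::vector<double>(P, -1.0));
    for (int p = 0; p < P; ++p) view[p][p] = loadDist(rng);

    for (int r = 0; r < rounds; ++r) {
        auto next = view;                        // "messages" delivered next round
        for (int p = 0; p < P; ++p) {
            for (int k = 0; k < fanout; ++k) {
                int peer = pick(rng);
                for (int q = 0; q < P; ++q)      // merge p's knowledge into peer's
                    if (view[p][q] >= 0.0) next[peer][q] = view[p][q];
            }
        }
        view = std::move(next);
    }

    // Report how complete PE 0's view of the system is after gossiping.
    int known = 0;
    for (int q = 0; q < P; ++q) if (view[0][q] >= 0.0) ++known;
    std::printf("PE 0 knows the load of %d of %d PEs\n", known, P);
    return 0;
}
```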
Interoperability of Shared Memory Parallel Programming Models with Charm++
Jaemin Choi, University of Illinois Urbana-Champaign
Parallel programming models such as Kokkos and RAJA aim to achieve performance portability while providing users with abstractions for parallel execution and data management. However, they are limited to shared-memory environments (i.e., a single node) and rely on distributed-memory frameworks such as MPI for large-scale runs that involve many physical nodes. In this talk, we will demonstrate basic interoperability of the aforementioned models with Charm++ for distributed-memory execution and discuss potential improvements for a more streamlined integration.
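A minimal sketch of the basic interoperation pattern follows: a Charm++ chare keeps its partition of the data and hands node-local parallel work to a Kokkos parallel_for inside an entry method, while inter-node parallelism and messaging remain with Charm++. The .ci fragment, chare name, and setup details are assumptions for illustration, not the talk's code.

```cpp
// Sketch: node-local parallelism via Kokkos inside a Charm++ chare array
// element. The .ci fragment is summarized below; all names are hypothetical.
//
// worker.ci:
//   module worker {
//     array [1D] Worker {
//       entry Worker(int n);
//       entry void compute();
//     };
//   };

#include <Kokkos_Core.hpp>
#include "worker.decl.h"

class Worker : public CBase_Worker {
  Kokkos::View<double*> data_;
 public:
  explicit Worker(int n) : data_("data", n) {}
  Worker(CkMigrateMessage*) {}   // needed for migratable array elements

  void compute() {
    auto data = data_;                               // capture the View by value
    const int n = static_cast<int>(data.extent(0));
    // Shared-memory (or GPU) parallelism handled by Kokkos within one chare;
    // inter-chare parallelism and messaging stay with Charm++.
    Kokkos::parallel_for("scale", n, KOKKOS_LAMBDA(const int i) {
      data(i) = 2.0 * i;
    });
    Kokkos::fence();
    contribute(CkCallback(CkCallback::ckExit));      // signal completion
  }
};

#include "worker.def.h"

// Kokkos::initialize()/Kokkos::finalize() are assumed to be called once per
// process, e.g. from the main chare; the details vary by setup.
```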
About Us
Workshop General Chair: Jaemin Choi