Authors are invited to submit abstracts or short papers describing research and development
in the broad themes of parallel processing emphasized by Charm++, AMPI, and Charm4Py, including
runtime adaptivity, message-driven execution, automated resource management, and application development.
Some specific topics of interest include:
Parallel algorithms and applications, especially those using Charm++, AMPI, and Charm4Py
Parallel libraries and frameworks, especially those supporting runtime adaptivity
Novel parallel programming models and abstractions
Adaptive runtime management, including communication, power, energy, and heat
Dynamic load balancing
Within-node parallelism, tasks, and accelerators
Fault tolerance and resilience
Tools and techniques for performance analysis, tuning, and debugging
Extensions and new features in Charm++
Massively parallel processing on future machines
Submissions should include the title of the proposed talk and a short abstract.
An extended abstract of 1-3 pages may be provided as additional material (not required).
The presentations will be posted at the workshop website, but there will be no published proceedings.
Authors are encouraged to publish their work at other conferences or in journals.
Abstracts should be submitted through EasyChair by the deadline, Oct 8, 2021 (AoE).
We ask that authors who submitted their abstracts by email in the spring resubmit them through EasyChair.
Early submission is welcome and will expedite the reviewing process.
Authors of early submissions will be notified of their acceptance as soon as possible.
Please contact Zane Fink if you have any questions.
Challenges of Programming Models for the Supercomputer "Fugaku" and Beyond
Mitsuhisa Sato, RIKEN Center for Computational Science (R-CCS)
The supercomputer "Fugaku" is an exascale system that has been in operation at R-CCS, Japan since March 2021. Fugaku is an ultra-scale "general-purpose" manycore-based system with 158,976 nodes and 7.6M cores in total. We developed a new manycore processor for Fugaku, A64FX, which supports the Arm instruction set with the Scalable Vector Extension (SVE) SIMD instructions and is equipped with HBM2 memory. While the hybrid OpenMP-MPI programming model is the standard programming model, new programming models such as task-based parallel programming models and parallel object-oriented programming models are expected to be useful for exploiting the several levels of parallelism in such a large-scale manycore-based system and the wide SIMD parallelism in each core. In this talk, the overview and performance of Fugaku will be presented, followed by challenges of programming models for Fugaku and beyond.
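As background for the hybrid model the talk contrasts with newer approaches, here is a minimal MPI+OpenMP sketch, illustrative only and not from the talk: ranks span nodes, threads share a node, and the inner loop is the kind a compiler targeting A64FX would vectorize with SVE.

```cpp
// Minimal MPI+OpenMP hybrid sketch: MPI ranks across nodes,
// OpenMP threads within each node. Illustrative only.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  int provided;
  // Request thread support so OpenMP threads may safely coexist with MPI.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1 << 20;
  std::vector<double> a(n, 1.0);
  double local = 0.0;
  // Within-node parallelism: the loop is shared among OpenMP threads,
  // and its body is the kind of code SVE would vectorize per core.
  #pragma omp parallel for reduction(+ : local)
  for (int i = 0; i < n; ++i) local += a[i] * a[i];

  double global = 0.0;
  MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("dot = %f on %d ranks\n", global, size);
  MPI_Finalize();
  return 0;
}
```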
Keynote
A Tale of Two Cultures: Challenges in Developing Tools for Writing Parallel Software
Alex Aiken, Stanford University
The scientific computing community has been developing parallel programming tools for decades, with an emphasis on performance. The software industry has only more recently focused on the development of tools for writing parallel software, with an emphasis on productivity. This talk will compare and contrast the two approaches, focusing on the research challenges the scientific community must address to achieve productivity comparable to that of widely used industry tools.
Adaptive Plasma Physics Simulations: Dealing with Load Imbalance using Charm++
Esteban Meneses, National High Technology Center in Costa Rica
High Performance Computing (HPC) is nearing the exascale era, and several challenges have to be addressed in terms of application development. Future parallel programming models should not only help developers take full advantage of the underlying machine but should also account for highly dynamic runtime conditions, including frequent hardware failures. In this paper, we analyze the porting of a plasma confinement simulator from a traditional MPI+OpenMP approach to a parallel-objects-based model, Charm++. The main driver for this effort is the existence of load-imbalanced input scenarios that pure OpenMP scheduling cannot solve. By using the Charm++ adaptive runtime and its integrated balancing strategies, we were able to increase total CPU usage from 45.2% to 80.2%, achieving a 1.64x acceleration, after load balancing, over the MPI+OpenMP implementation on a specific input scenario. Checkpointing was added to the simulator via the pack-unpack interface provided by Charm++, giving scientists fault tolerance and split-execution capabilities.
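For readers unfamiliar with the pack-unpack interface mentioned above, a minimal hedged sketch follows; the class and field names are invented, but the PUP::er pattern is standard Charm++.

```cpp
// Minimal sketch of Charm++'s pack-unpack (PUP) interface: a single pup()
// routine sizes, packs, and unpacks an object, which is what makes
// checkpointing (and migration) cheap to add. Field names are invented.
#include "pup.h"
#include "pup_stl.h"  // operator| overloads for STL containers
#include <vector>

class ParticleBlock {
  int step;
  std::vector<double> position;
  std::vector<double> velocity;
 public:
  void pup(PUP::er& p) {
    p | step;
    p | position;
    p | velocity;
  }
};
```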
Dynamically Load-balanced p-adaptive Discontinuous Galerkin Methods using Charm++
Weizhao Li, North Carolina State University
Aditya Pandare, Los Alamos National Laboratory
Hong Luo, North Carolina State University
Jozsef Bakosi, Los Alamos National Laboratory
In this presentation, a p-adaptive discontinuous Galerkin (DG) method for compressible flow on tetrahedral grids will be presented. Discontinuous Galerkin methods have recently become popular for the solution of systems of conservation laws and are now widely used in computational fluid dynamics, computational acoustics, and computational magnetohydrodynamics. However, compared to the continuous Galerkin (CG) finite element method and the finite volume method, the DG method requires solving systems of equations with more unknowns on the same grid. Consequently, it has been recognized as expensive in terms of both computational cost and storage requirements. One of the most straightforward ways to reduce computing cost is adaptivity. Since the DG method admits discontinuous solutions and treats each element separately, it can easily be extended to an adaptive algorithm. Adaptivity can be applied either by refining the mesh (h) or by raising the order of accuracy (p) of the discretization. Both h- and p-adaptation help reduce the total degrees of freedom across the computational domain, which may lead to a more efficient simulation. The focus of our presentation will be the p-adaptive DG method alone.
However, simply parallelizing the p-adaptive algorithm over data partitions does not by itself yield a significant reduction in computing cost. Because the adaptive procedure distributes degrees of freedom unevenly, the computational loads of the parallel partitions differ: some processors finish their computation and sit idle while others are still working through heavily loaded tasks, with a significant adverse effect on parallel efficiency. Maintaining dynamic load balance on massively parallel supercomputers is therefore one of the most challenging problems for p-adaptive DG methods.
In the current work, we introduce a novel parallel adaptive DG algorithm that combines the parallel p-adaptive DG method with the Charm++ parallel programming system. One distinguishing feature of the method is its use of task parallelism for DG. Charm++ allows us to express data parallelism as well as task parallelism in a single application: instead of just splitting the domain into partitions, one can also divide the simulation into small tasks. Tasks and their associated data are encapsulated in migratable objects, whose ordering and placement are determined automatically by the Charm++ runtime to maintain maximum efficiency. Another attractive feature is the combination of the adaptive DG method with the asynchronous programming model and the dynamic load-balancing strategies provided by Charm++. Since tasks and data are encapsulated in migratable objects, the task-parallel structure naturally enables dynamic load balancing. Numerical results from a variety of benchmark tests show that the developed p-adaptive method achieves high performance gains with the help of the automatic dynamic load-balancing strategies available in the Charm++ library. Details of the method and its parallel structure will be given in the presentation.
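As a concrete flavor of the migratable-object model described above, here is a minimal hedged sketch of a Charm++ chare array element that periodically hands control to the load balancer. All names (DGBlock, iterate, the .ci file) are invented for illustration; this is not the authors' code.

```cpp
// Hedged sketch: a migratable Charm++ chare that cooperates with the
// runtime's load balancer. Assumes a .ci interface file declaring
// "array [1D] DGBlock" with entry methods iterate() and ResumeFromSync().
#include "dgblock.decl.h"

class DGBlock : public CBase_DGBlock {
  int order;  // local polynomial order p, changed by the adaptive procedure
 public:
  DGBlock() : order(1) { usesAtSync = true; }  // opt in to AtSync balancing
  DGBlock(CkMigrateMessage* m) : CBase_DGBlock(m) {}

  void iterate() {
    // ... one DG step at the current order; heavier when order is high ...
    AtSync();  // pause here so the runtime may migrate this object
  }
  void ResumeFromSync() { thisProxy[thisIndex].iterate(); }  // continue

  // Serializes the object for migration between processors.
  void pup(PUP::er& p) {
    CBase_DGBlock::pup(p);
    p | order;
  }
};
#include "dgblock.def.h"
```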
Enzo-E/Cello astrophysics and cosmology: algorithmic improvements for robustness and scalability
James Bordner, University of California, San Diego
Michael L. Norman, University of California, San Diego
Enzo-E is a Charm++ computational astrophysics and cosmology application built on the highly scalable Cello adaptive mesh refinement framework. In this presentation, we describe our new and improved algorithms for I/O and adaptive remeshing, and present preliminary performance results for our new I/O approach run on 4096 nodes of the Frontera petascale computing system.
Improving the Performance of Charm++ Applications on GPU Systems
Jaemin Choi, University of Illinois Urbana-Champaign
Over the years, there have been many updates to GPU support in the Charm++ runtime system. In this talk, we review the newly developed features that allow the Charm++ programmer to reduce unnecessary synchronization, achieve automatic computation-communication overlap, and improve communication performance with direct GPU-GPU data transfers. An effort to implement a message-driven runtime system entirely on the GPU, CharminG, will also be discussed along with other potential research topics for improving the usability and performance of Charm++ applications on modern GPU-centric systems.
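The sketch below illustrates, in plain CUDA host code, the pattern such runtime support automates: each object gets its own stream, and a completion callback lets the scheduler keep processing messages while device work is in flight. The callback and tag are invented; this is a stand-in for, not the actual, Charm++ GPU (HAPI) API.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Host callback invoked when all preceding work on the stream finishes.
// In Charm++ this role is played by a CkCallback resuming the chare.
void CUDART_CB onDone(void* userData) {
  std::printf("work for object %ld finished\n", (long)(size_t)userData);
}

int main() {
  const size_t n = 1 << 20;
  double* x = nullptr;
  cudaMalloc(&x, n * sizeof(double));
  cudaStream_t stream;        // one stream per object avoids serializing
  cudaStreamCreate(&stream);  // kernels from independent objects
  cudaMemsetAsync(x, 0, n * sizeof(double), stream);  // stand-in for a kernel
  cudaLaunchHostFunc(stream, onDone, (void*)42);      // completion notification
  cudaStreamSynchronize(stream);  // a message-driven runtime would poll instead
  cudaStreamDestroy(stream);
  cudaFree(x);
  return 0;
}
```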
Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
Zane Fink, University of Illinois Urbana-Champaign
Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning have seen a productivity boost on many systems without a commensurate loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing is required to scale Python across nodes and make it truly competitive in large-scale high-performance computing. Many frameworks, such as Charm4Py, DaCe, Dask, Legate NumPy, mpi4py, and Ray, scale Python across nodes. However, little is known about these frameworks' relative strengths and weaknesses, leaving practitioners and scientists without enough information to judge which frameworks are suitable for their requirements. In this paper, we seek to narrow this knowledge gap by studying the relative performance of two such frameworks: Charm4Py and mpi4py.
We perform a comparative performance analysis of Charm4Py and mpi4py using CPU- and GPU-based microbenchmarks, including TaskBench and other representative mini-apps for scientific computing.
Communication Optimizations for Adaptive MPI
Sam White, University of Illinois Urbana-Champaign
Adaptive MPI is an MPI implementation on top of Charm++. AMPI virtualizes MPI ranks as user-level threads which are migratable at runtime between address spaces. This enables dynamic load balancing and automatic fault tolerance to be added to legacy MPI codes without major refactoring effort. In this talk, we examine communication overheads caused by the mismatch between AMPI's execution model and MPI's semantics. We find that message matching constraints in point-to-point communication and non-commutative reduction operations both incur unique performance costs in the context of AMPI's migratable overdecomposed ranks. We suggest semantic relaxations that can minimize some of the overheads when coupled with runtime optimizations, and also investigate algorithmic changes to AMPI's collective communication routines. We demonstrate the efficacy of our approach using microbenchmarks and mini-apps.
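To make the execution model concrete, here is a hedged sketch of legacy MPI code running under AMPI; the only AMPI-specific addition is a periodic AMPI_Migrate() call, with the AMPI_INFO_LB_SYNC hint described in the AMPI manual, which gives the runtime an opportunity to migrate this virtualized rank. The program would be built with AMPI's compiler wrappers and run overdecomposed.

```cpp
// Hedged sketch: unmodified MPI code plus one AMPI extension call.
// Assumes AMPI's extension API (AMPI_Migrate, AMPI_INFO_LB_SYNC).
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (int step = 0; step < 1000; ++step) {
    // ... regular MPI computation and point-to-point communication ...
    if (step % 100 == 0) {
      // Collective load-balancing point: the runtime may migrate this
      // rank (a user-level thread) to another address space here.
      AMPI_Migrate(AMPI_INFO_LB_SYNC);
    }
  }
  MPI_Finalize();
  return 0;
}
```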
Simulating neutron stars with discontinuous Galerkin methods and Charm++
Nils Deppe, California Institute of Technology
SpECTRE is a next-generation multiphysics code implemented using Charm++ and is designed to overcome the limitations of current codes used in relativistic astrophysics. I will present recent breakthroughs that enable us to perform the first simulations of magnetized neutron stars using high-order discontinuous Galerkin methods, and discuss what we plan to accomplish over the next year as we get closer to being able to simulate binary neutron star mergers with SpECTRE.
Vector Load Balancing in Charm++
Ronak Buch, University of Illinois Urbana-Champaign
Load balancing has proven critical to achieving high performance and scalability for modern parallel applications. However, existing load balancing techniques do not always properly capture the performance characteristics of programs, leading to suboptimal balancing decisions. To address this, we have constructed a vector load balancing system that measures multiple performance metrics and performs multi-objective optimization to make balancing decisions. This scheme can give load balancing strategies awareness of multiple computational phases, different hardware resources, and application-specific metrics, among others. We have also developed new load balancing strategies that take advantage of this paradigm, and we demonstrate their utility with various benchmarks and applications.
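A hypothetical sketch of the vector-load idea follows: objects report a vector of metrics rather than a single scalar, and the balancer scores placements across dimensions. The names and the particular scalarization are invented for illustration, not the system's actual design.

```cpp
#include <algorithm>
#include <array>

constexpr int kMetrics = 3;  // e.g., phase-1 time, phase-2 time, memory
using LoadVector = std::array<double, kMetrics>;

// One plausible scalarization (an assumption, not the talk's method):
// the bottleneck dimension after normalizing each metric by the
// processor's capacity in that dimension.
double score(const LoadVector& load, const LoadVector& capacity) {
  double worst = 0.0;
  for (int m = 0; m < kMetrics; ++m)
    worst = std::max(worst, load[m] / capacity[m]);
  return worst;
}
```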
ParaTreeT: A Fast, General Framework for Spatial Tree Traversal
Joseph Hutter, University of Illinois Urbana-Champaign
Tree-based algorithms for spatial domain applications scale poorly in the distributed setting without extensive experimentation and optimization. Reusability via well-designed parallel abstractions, supported by efficient parallel algorithms, is therefore desirable. We present ParaTreeT, a parallel tree toolkit for state-of-the-art performance and programmer productivity. ParaTreeT leverages a novel shared-memory software cache to reduce communication volume and idle time throughout traversal. By dividing particles and subtrees across processors independently, it improves decomposition and limits synchronization during the tree build. Tree-node states are extracted from the particle set with the Data abstraction, and traversal work and pruning are defined by the Visitor abstraction. ParaTreeT provides built-in trees, decompositions, and traversals that permit application-specific customization. We demonstrate ParaTreeT's improved computational performance over even specialized codes with multiple applications on CPUs, and we evaluate how several applications benefit from ParaTreeT's models while providing new insights into these workloads through experimentation.
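To illustrate the traversal abstractions described above, here is a hypothetical sketch of a Barnes-Hut-style visitor; the names and signatures are invented and are not ParaTreeT's actual API.

```cpp
// Hypothetical Visitor sketch: traversal work and pruning expressed as
// node/leaf handlers over state extracted by a Data-like abstraction.
#include <cmath>

struct Particle { double x, y, z, mass; };
struct NodeData { double cx, cy, cz, mass, radius; };  // aggregate per node

struct GravityVisitor {
  Particle target;
  double ax = 0, ay = 0, az = 0;
  static constexpr double theta = 0.5;  // opening-angle criterion

  // Return true to open (descend into) the node, false to prune and
  // approximate the subtree with its aggregate data.
  bool node(const NodeData& n) {
    double dx = n.cx - target.x, dy = n.cy - target.y, dz = n.cz - target.z;
    double r = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (n.radius / r < theta) {
      double f = n.mass / (r * r * r);
      ax += f * dx; ay += f * dy; az += f * dz;
      return false;  // pruned: far enough to approximate
    }
    return true;  // open the node and keep traversing
  }

  void leaf(const Particle&) {
    // ... direct particle-particle interaction ...
  }
};
```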
A synthetic tool for analysing adaptable workloads
Iker Martin-Alvarez, Computer Science and Engineering Department, Universitat Jaume I
Jose Ignacio Aliaga, Computer Science and Engineering Department, Universitat Jaume I
Maria Isabel Castillo, Computer Science and Engineering Department, Universitat Jaume I
Sergio Iserte, Dept. of Mechanical Engineering and Construction, Universitat Jaume I
Rafael Mayo, Computer Science and Engineering Department, Universitat Jaume I
This paper introduces a configurable synthetic iterative malleable application that can be resized according to different parameters. The main objective is a synthetic application that facilitates the creation of adaptable workloads, in order to analyse how these workloads perform on a system with a resource manager that is aware of their adaptability. Experiments showcase that this tool generates and tests different workloads and also allows gathering performance metrics for each individual application in the workload.
New Charm++ Features
PPL, University of Illinois Urbana-Champaign
A showcase and discussion of new features in Charm++.
Towards Performance Portability in NAMD with oneAPI
Tareq Malas, Intel
Jaemin Choi, University of Illinois Urbana-Champaign
NAMD, an award-winning molecular dynamics simulation framework built on Charm++, is continuously adapting to architectural changes in HPC. With the recent focus on GPUs, its CUDA implementation has demonstrated impressive scaling performance on today's largest GPU-accelerated supercomputers as well as on tightly-connected multi-GPU systems. In this talk, we discuss our efforts to port NAMD's CUDA code to oneAPI, in preparation for the upcoming exascale supercomputer Aurora and to improve the performance portability of NAMD.
Optimizing Distributed Load Balancing for Time-Varying Imbalance
Jonathan Lifflander, Sandia National Laboratories
Nicole Slattengren, Sandia National Laboratories
Philippe Pébay, NexGen Analytics, Inc.
Phil Miller, Intense Computing
Francesco Rizzi, NexGen Analytics, Inc.
Matthew Bettencourt, NexGen Analytics, Inc.
This paper explores dynamic load balancing algorithms used by asynchronous many-task (AMT), or 'task-based', programming models to optimize task placement for scientific applications with dynamic workload imbalances. AMT programming models use overdecomposition of the computational domain. Overdecomposition provides a natural mechanism for application developers to expose concurrency and break their computational domain into pieces that can be remapped to different hardware. This paper explores fully distributed load balancing strategies that have shown great promise for exascale-level computing but are challenging to reason about theoretically and to implement effectively. We present a novel theoretical analysis of a gossip-based load balancing protocol and use it to build an efficient implementation with fast convergence and high load balancing quality. We demonstrate our algorithm in a next-generation plasma physics application (EMPIRE) that induces time-varying workload imbalance due to spatial non-uniformity in particle density across the domain. Our highly scalable, novel load balancing algorithm achieves over a 3x speedup (in particle work) compared to a bulk-synchronous MPI implementation without load balancing.
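To convey the flavor of gossip-based balancing, the self-contained sketch below simulates the information-propagation phase: each processor forwards its view of underloaded peers to a few random targets per round, so knowledge spreads in roughly logarithmic rounds. This illustrates the general idea only, not the paper's analyzed protocol.

```cpp
#include <random>
#include <set>
#include <vector>

int main() {
  const int P = 64, rounds = 6, fanout = 2;
  std::mt19937 rng(0);
  std::uniform_int_distribution<int> pick(0, P - 1);
  std::vector<double> load(P);
  for (auto& l : load) l = std::uniform_real_distribution<>(0, 2)(rng);
  double avg = 0;
  for (double l : load) avg += l;
  avg /= P;

  // view[p] = set of processors p believes to be underloaded
  std::vector<std::set<int>> view(P);
  for (int p = 0; p < P; ++p)
    if (load[p] < avg) view[p].insert(p);

  for (int r = 0; r < rounds; ++r) {
    auto next = view;
    for (int p = 0; p < P; ++p)
      for (int k = 0; k < fanout; ++k) {
        int q = pick(rng);  // random gossip target
        next[q].insert(view[p].begin(), view[p].end());
      }
    view = next;
  }
  // An overloaded processor would now pick a migration target from its
  // (probabilistically near-complete) view of underloaded peers.
}
```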
What's New with Chapel: Applications, Aggregators, and Accelerators
Brad Chamberlain, Hewlett Packard Enterprise
Chapel is a modern programming language designed to support general-purpose parallel computing at scale. While Chapel and Charm++ differ in their approaches, particularly with respect to Chapel's explicit/imperative nature vs. Charm++'s adaptive/migration-based approach, the two languages have many themes in common, including a global namespace, object-oriented design, and asynchrony.
In this talk, I will give a very brief introduction to Chapel for those unfamiliar with the language, before going on to provide a brief tour of some of Chapel's latest use cases, features, and accomplishments. I'll start by touching on a pair of the most notable applications that have been developed in Chapel during the past few years: the Arkouda workbench for interactive data science at massive scales from Python; and CHAMPS, a framework for 3D unstructured Computational Fluid Dynamics.
From there, I will discuss recent work to productively and automatically aggregate communications in Chapel. These abstractions and optimizations address computational patterns that have become increasingly important to our users due to their presence in codes like Arkouda and Bale, and because of our increased focus on targeting InfiniBand-based systems as we move beyond the Cray XC's Aries network.
I will wrap up the talk with a quick preview of our nascent effort to generate code for GPUs from Chapel source, making good on trailblazing work done by Albert Sidelnik in his UIUC Ph.D. work a decade ago.
Recent developments in HPX and Octo-Tiger
Patrick Diehl, Louisiana State University
In this talk, we will briefly introduce the astrophysics application Octo-Tiger, which simulates the evolution of star systems using the fast multipole method on adaptive octrees. This application is the most advanced HPX application with support for CUDA and AMD accelerator cards. We will then discuss the pure CUDA integration and the recently added Kokkos integration, which supports portability across heterogeneous accelerator cards, especially AMD GPUs, and showcase scaling results on ORNL's Summit and CSCS's Piz Daint. Another aspect is performance profiling in asynchronous applications: we recently added CUDA profiling to HPX's performance framework, APEX, so we can collect combined distributed CPU and GPU profiles to analyze the performance of HPX and Octo-Tiger. We show runs on Summit and Piz Daint to compare performance across different architectures and GPUs. The final aspect is the overhead introduced by the performance framework itself: we study the overhead of CPU-only profiling and the much larger overhead added by CUDA profiling.
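For context, here is a minimal sketch of the futurized style that HPX applications such as Octo-Tiger build on, assuming recent HPX's hpx::async/future API; it is illustrative and not Octo-Tiger code.

```cpp
// Minimal HPX futures sketch: asynchronous tasks chained by
// continuations, without blocking the calling thread between steps.
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main() {
  hpx::future<int> f = hpx::async(square, 6);
  // The continuation runs when the previous task completes.
  hpx::future<int> g =
      f.then([](hpx::future<int>&& r) { return r.get() + 1; });
  std::cout << g.get() << std::endl;  // prints 37
  return 0;
}
```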
Using concurrent GPU streams to support independent progress of asynchronous objects
Phil Miller, Intense Computing
Jonathan Lifflander, Sandia National Laboratories
Nicole Slattengren, Sandia National Laboratories
Matthew Bettencourt, NexGen Analytics, Inc.
Roger Pawlowski, Sandia National Laboratories
In the EMPIRE plasma physics application, we have observed performance-critical circumstances in which asynchronous objects have their progress slowed by sequential execution of GPU kernels across each process. Microbenchmark studies of a proxy for the most severely affected code, essentially a particle sorting kernel, showed great potential for speedup through the use of independent GPU execution streams. We have therefore undertaken to implement concurrent execution on GPU streams throughout the EMPIRE code base.
This work uses a broad combination of technologies to achieve the potential for much faster and less sensitive execution with no added conceptual burden for developers. The existing code is written using asynchronous handlers in vt. These handlers are now run in cooperatively scheduled user-level threads, allowing them to be suspended and resumed. Each object is transparently allocated a distinct CUDA stream. Thin wrappers around the Kokkos parallel loop abstraction ensure that the appropriate stream is presented at kernel submission. GPU kernel synchronization operations, in the form of Kokkos 'fence()' calls, are re-implemented as insertion of stream-level events followed by thread suspension. The vt runtime system automatically polls these events and resumes the thread when they complete, allowing other operations to execute in the meantime.
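Below is a hedged sketch of the per-object stream pattern described above, using Kokkos's ability to wrap an existing CUDA stream in an execution-space instance. The function and names are invented, and the blocking fence stands in for the event-plus-suspension mechanism the talk describes.

```cpp
#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

// Called from an object's handler with that object's dedicated stream.
void object_step(cudaStream_t my_stream, Kokkos::View<double*> data) {
  // Wrap the stream in an execution-space instance so this object's
  // kernels never serialize against other objects' kernels.
  Kokkos::Cuda exec(my_stream);
  Kokkos::parallel_for(
      "object_step",
      Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, (int)data.extent(0)),
      KOKKOS_LAMBDA(const int i) { data(i) += 1.0; });
  // The runtime in the talk replaces this blocking fence with a
  // stream event plus suspension of the calling user-level thread.
  exec.fence();
}
```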
The vt runtime's termination detection system has been extended to recognize that suspended threads and pending CUDA kernels represent ongoing work in an epoch, and thus delay reporting termination. EMPIRE's control flow was previously adapted to make progress based on termination of preceding work, allowing its design to remain unchanged through this architecture-level performance effort.
Since we started this effort, the upstream Kokkos library has been extended to abstract the CUDA stream creation and management that our code currently performs. Switching to this abstraction will provide portability to GPUs and other hardware beyond that targeted by Nvidia CUDA, such as AMD's HIP, Intel's SYCL, and Kokkos's portable standards-based OpenMPTarget backend.
We additionally describe runtime instrumentation through the Kokkos Tools interface to detect instances in which the above abstractions were not applied or operated incorrectly. This has allowed us to rapidly make a complete transition to this new design without protracted debugging of edge cases, some of which we will describe.
ExaM2M: Scalable and Adaptive Mesh-to-Mesh Transfer
Eric Mikida, Charmworks, Inc.
CharmMPI: From Research Code to Production Workhorse
Evan Ramos, Charmworks, Inc.
Recent development on CharmMPI (formerly Adaptive MPI) has expanded compatibility with existing MPI programs and improved turnkey access to the benefits provided by Charm++. Advances will be showcased for the PIEglobals and TLSglobals privatization methods, the Isomalloc memory allocator, and the application build infrastructure. Successes with production MPI codes will be demonstrated.
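As a hedged illustration of the TLSglobals privatization method mentioned above: a mutable global is declared thread-local so that each AMPI rank (a user-level thread) gets a private copy. The variable is invented; per the AMPI manual this is combined with a link-time flag such as -tlsglobals.

```cpp
// Hedged sketch of TLSglobals-style privatization under AMPI.
#include <mpi.h>

// Before: "int iteration_count;" would be shared by all ranks that the
// runtime co-locates in one process. After: each rank sees its own copy.
thread_local int iteration_count = 0;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  ++iteration_count;  // safe: no interference between co-located ranks
  MPI_Finalize();
  return 0;
}
```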
Addressing the Challenges of Heterogeneous Computing in NAMD
David Hardy, University of Illinois Urbana-Champaign
NAMD is a parallel molecular dynamics package, made highly scalable through Charm++ and capable of simulating very large systems of biomolecules. As an early adopter of NVIDIA CUDA, NAMD has been GPU-accelerated for over a decade. More recent advances in GPU capabilities have required NAMD to implement a new GPU-resident code path to fully utilize the available computing resources. This presentation discusses these new developments in NAMD and the ongoing challenges of supporting new GPUs from AMD and Intel.
Advances in Charm-based Languages II
Justin Szaday, University of Illinois Urbana-Champaign
Compiled Charm-based languages underwent focused development efforts over the past year. In this talk, we will discuss the mixed-level approach that drove this progress, introducing some of the runtime-level and compiler-level optimizations it offers. Along with the enhanced expressiveness that a custom language brings, these optimizations form the backbone of our work, driving us to continue exploring how compilers can benefit the broader Charm++ ecosystem.
Large-scale Parallel Modeling for Storm Surge and Coastal Flooding with AMPI Dynamic Load Balancing
Dylan Wood, University of Notre Dame
Joannes Westerink, University of Notre Dame
Damrongsak Wirasaet, University of Notre Dame
Large-scale shallow-water circulation models (e.g., ADCIRC) are routinely used in the academic literature and in operational practice by federal agencies to study risks due to storm surge and other flooding sources in coastal areas. These models typically span large spatial domains (e.g., the entire western North Atlantic Ocean and its coastal areas), are run with a high degree of parallelization, and are decomposed into many subdomains across both the land and sea regions of the model. However, the models are typically executed using a standard Message Passing Interface (MPI) implementation, wherein each rank (subdomain) is evaluated on a single processor, and the presence of floodplains (i.e., areas that are initially dry and may become flooded) introduces a load imbalance, because the shallow-water equations need not be solved in non-inundated areas. Furthermore, due to the high spatial resolution of the unstructured grids used by these models, especially within the floodplains, the models can exhibit significant load imbalances, as frequently up to half of the computational elements may not actually be inundated. To address these challenges, recent developments in the Advanced Circulation (ADCIRC) model pursue an Adaptive MPI (AMPI) implementation and incorporate algorithms for tracking the time evolution of the wet or dry state of each computational element, in order to facilitate dynamic load balancing within the model's framework. We present the results of these developments, showing the performance gains recorded so far versus the standard MPI implementation, and comment on the further gains we expect to be attainable.
A Universe of ChaNGa Applications
Thomas Quinn, University of Washington
Scalable GW software for excited electrons using OpenAtom
Kayahan Saritas, Yale University
Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
OpenAtom is an open-source, massively parallel software application that performs ab initio molecular dynamics simulations as well as ground- and excited-state calculations using a planewave basis set, and relies on the Charm++ runtime system. We describe the status of the excited-state GW implementation in OpenAtom: the GW method is an accurate but computationally expensive method for describing the dynamics of excited electrons in solid-state systems. We briefly describe the O(N^4)-scaling GW method implemented in the public version of OpenAtom (where N is the number of atoms in the simulation cell). We then present our progress in implementing an O(N^3)-scaling GW method and diagonalization of the dielectric matrix in OpenAtom. In addition to the formalism and physical principles, our parallelization method and scaling results on medium and large datasets will be presented.
Distributed Load Balancing Strategies with Charm++
Simeng Liu, University of Illinois at Urbana-Champaign
Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
With the distributed load balancing infrastructure, load balancing strategies can scale well to thousands of cores. In this talk, we will compare DistributedPrefixLB and DistributedOrbLB with PrefixLB and OrbLB, and also present the distributed diffusion strategy.
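As a rough illustration of the idea behind prefix-sum-based balancing (not the Charm++ implementation): a prefix sum over object loads assigns each object to the processor whose equal-share load interval contains it, and prefix sums parallelize well in a distributed setting because they can be computed in a logarithmic number of steps.

```cpp
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> load = {4, 1, 1, 2, 8, 2, 3, 3};  // per-object loads
  const int procs = 4;
  double total = 0;
  for (double l : load) total += l;
  const double share = total / procs;  // ideal load per processor

  double prefix = 0;
  for (size_t i = 0; i < load.size(); ++i) {
    // The midpoint of this object's load interval decides its processor.
    int dest = static_cast<int>((prefix + load[i] / 2) / share);
    if (dest >= procs) dest = procs - 1;
    std::printf("object %zu -> proc %d\n", i, dest);
    prefix += load[i];
  }
}
```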