Time |
Type |
Description |
Slides |
Webcast |
|
1:00 pm - 5:00 pm |
Charm++ Tutorial
|
Hands-on tutorial in Room 2405
Presenters: Jonathan Lifflander and Eric Mikida
|
|
|
|
8:00 am - 8:30 am |
Continental Breakfast / Registration |
Morning |
|
8:30 am - 8:45 am |
Welcome
|
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
|
|
|
8:45 am - 9:30 am |
Keynote
|
The Romance and Reality of Task-Based Runtime Systems for Heterogeneous Systems: Experiences with the Uintah Software
Prof. Martin Berzins, University of Utah
Adaptive and asynchronous task-based runtime systems are seen as a strong candidate for making use of large post-petascale architectures. Their ability to adapt the calculation to the needs of the architecture makes it possible to run challenging engineering applications at very large scales. The Uintah computational framework is one example of such a system. The software uses a high-level graph-based model to provide a task set that is executed by the adaptive runtime system. The evolution of the runtime system is described, along with the challenges of achieving performance on both homogeneous and heterogeneous architectures. The challenges of working with communication-intensive components, such as thermal radiation, and with domain-specific languages are also described, particularly in the context of heterogeneous GPU architectures, and results are shown.
|
|
|
9:30 am - 10:00 am |
Submitted Paper
|
Load Balancing and Data Migration in a Hybrid Computational Fluid Dynamics Application
Dr. Esteban Meneses-Rojas, University of Pittsburgh
Turbulent flow simulations represent one of the most challenging types of computational fluid dynamics (CFD) programs. Their very nature involves a wide range of length and time scales, making this type of simulation computationally demanding and prone to developing load imbalances. We are exploring two major mechanisms for providing a turbulent flow simulation with dynamic load balancing capabilities. The two methods, Zoltan and Charm++, are compared when extending an already existing MPI application. The final implementation should be able to alleviate the load imbalance without requiring a restart of the execution. We present our methodology for finding the opportunities and limitations of each mechanism by first studying two benchmarks. The initial results of these two benchmarks point to a very close competition between the two approaches.
|
|
|
10:00 am - 10:15 am |
|
Morning |
Technical Session: Developing Charm++ Applications (Chair: Dr. Celso Mendes) |
10:15 am - 10:45 am |
Talk
|
ADHydro
Dr. Robert Steinke, University of Wyoming
|
|
|
10:45 am - 11:15 am |
Talk
|
Lessons Learned From Porting the Mini Aero application to Charm++
Dr. David Hollman, Sandia National Laboratories
Next generation platform architectures will require us to fundamentally rethink our
programming models due to a combination of factors including extreme parallelism,
data locality issues, and resilience. The asynchronous, many-task programming model is
emerging as a leading new paradigm, with many variants of this model being proposed,
including Charm++, DHARMA, HPX, Legion, OCR, STAPL, and Uintah. Given Sandia's
massive investment in its science and engineering codes, we are conducting a survey
comparing the programmability, performance, and mutability of leading candidate
runtime systems in the context of our representative workloads of interest. This talk
will highlight some initial results from our study, concentrating on lessons learned from
porting Mini Aero to Charm++. Mini Aero is a three-dimensional, unstructured, finite-
volume, computational fluid dynamics code that uses Runge-Kutta fourth-order time
marching. It has options for first or second order spatial discretization, and includes
inviscid Roe and Newtonian viscous fluxes. The baseline application is approximately
3800 lines of C++ code, written with MPI and Kokkos, a Sandia-based performance
portability layer. Our talk will focus on both the programmability and performance
results that have been obtained to date (this is ongoing work), including a review of the
application developers' observations regarding the porting process itself, Charm++
semantics, and a discussion of PUPing using Kokkos data structures.
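To make the PUP discussion above concrete, here is a minimal sketch of Charm++'s Pack/UnPack (PUP) interface. The CellState struct and its fields are hypothetical and for illustration only; PUP::er, operator|, and the pup_stl.h header are part of the Charm++ API.

```cpp
#include "pup.h"      // Charm++ PUP framework
#include "pup_stl.h"  // PUP support for STL containers
#include <vector>

// Hypothetical per-cell state; a single pup() routine serves serialization
// for migration, checkpointing, and parameter marshalling.
struct CellState {
  int step;
  std::vector<double> density;

  void pup(PUP::er &p) {
    p | step;     // built-in types are handled directly
    p | density;  // STL containers are handled via pup_stl.h
  }
};
```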
|
|
|
11:15 am - 11:45 am |
Talk
|
Introducing Over-decomposition to Existing Applications: A Case Study with PlasComCM and Adaptive MPI
Sam White, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Over-decomposition can potentially alleviate many challenges of today's and tomorrow's large-scale systems by enabling adaptive features such as dynamic load balancing. However, rewriting existing scientific applications with hundreds of thousands of lines of code in new programming paradigms is impractical.
In this work, we incorporate over-decomposition in the NNSA's PSAAP2 Center for Exascale Simulation of Plasma-Coupled Combustion (XPACC) code, PlasComCM, using Adaptive MPI (AMPI). AMPI, a portable implementation and extension of the MPI standard that encapsulates MPI ranks into arbitrary numbers of migratable user-level threads per core, is aimed at providing over-decomposition and adaptivity for existing codes. PlasComCM is a large-scale plasma-coupled combustion simulation program written in Fortran90 and MPI. We describe the practical challenges of enabling over-decomposition for such existing codes and our automated approaches to them. Furthermore, we present and analyze the performance improvement we achieve through the features enabled by over-decomposition--namely, message-driven scheduling, dynamic load balancing, and automatic fault tolerance.
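As an illustration of the AMPI approach described above, the sketch below is an ordinary MPI program (a hypothetical iterative kernel) that needs no source changes to gain over-decomposition: compiling it with AMPI and launching more virtual ranks than cores turns each rank into a migratable user-level thread. The build and run commands in the comments reflect common AMPI usage and should be checked against the installed Charm++ version.

```cpp
// Assumed AMPI workflow:
//   build:  ampicxx kernel.cpp -o kernel
//   run:    ./charmrun +p8 ./kernel +vp64 +balancer GreedyLB
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // under AMPI, "rank" is a virtual rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::vector<double> local(1000, rank);  // hypothetical per-rank work array
  for (int iter = 0; iter < 100; ++iter) {
    double sum = 0.0, global = 0.0;
    for (double v : local) sum += v;      // local computation
    MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // With AMPI, a migration/load-balancing call (e.g., AMPI_Migrate) could be
    // placed here at iteration boundaries; the exact name and signature depend
    // on the AMPI version, so it is omitted to keep this sketch plain MPI.
  }
  if (rank == 0) std::printf("done on %d ranks\n", size);
  MPI_Finalize();
  return 0;
}
```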
|
|
|
11:45 am - 12:15 pm |
Submitted Paper
|
SC_Tangram: a Parallel Framework based on Charm++ for Cosmological Simulations
Chen Meng, CNIC CAS, China
The "XMAPP" features of Charm++ enormously enhance parallel scalability and runtime adaptivity, while its object-oriented style and message-driven programming model raise the barrier to using it. Providing a high-level framework for building Charm++ applications would increase productivity. However, doing so faces several great challenges, especially the conflict between a message-driven execution model and a procedure-driven user interface. In this work, we developed a framework called SC_Tangram, which combines the advantages of Charm++, for its runtime, with those of Cactus, for its modularity and collaboration. It is organized entirely as components, which at runtime become parallel objects that interoperate through the underlying Charm++ mechanisms for high performance and adaptivity. It hides complex parallel technologies and provides a platform for composing algorithm and service components into a complex composite application. So far, SC_Tangram has been used in the field of cosmological simulation. We tested its key components, which target structured uniform meshes for solving partial differential equations (PDEs) in hydrodynamics. The overhead of the framework is less than 10%, which is acceptable in consideration of its benefits.
|
|
|
12:15 pm - 1:15 pm |
|
Afternoon |
Technical Session: Performance and Power Analysis (Chair: Dr. Todd Gamblin) |
1:15 pm - 1:45 pm |
Talk
|
Leveraging Hardware Address Sampling: Beyond Data Collection and Attribution
Prof. Xu Liu, College of William and Mary
Hardware address sampling is widely supported in modern Intel, AMD, and IBM processors. Tools based on address sampling provide deep insights into performance problems in memory subsystems at low cost. However, previous tools mainly focus on collecting and attributing address samples, leaving the analysis to programmers. In this talk, I will describe how we go one step further and analyze address samples based on the existing collection and attribution techniques. Experiments show that our analysis tool identifies performance bottlenecks and optimization methods that other tools based on address sampling do not provide. Guided by our tool, we can significantly speed up several well-known HPC programs by optimizing their memory accesses.
|
|
|
1:45 pm - 2:15 pm |
Talk
|
An Organized View of MPI and Charm++ Traces
Kate Isaacs, University of California, Davis and Lawrence Livermore National Laboratory
Traces are a rich source of data for understanding the behavior of parallel applications. Exact sequences of events leading to performance problems can be reconstructed using traces. Visualization is a powerful technique for exploring traces, but standard representations become difficult to interpret as the complexity and size of the data increase. In this talk, I will present trace visualizations that evoke the developers' intended organization of events in a parallel program. This depiction provides the context necessary to understand dependencies present in the trace. When paired with appropriate metrics, performance issues are more easily found. I will discuss the successes of this approach for MPI applications and then present initial work applying this methodology to Charm++.
|
|
|
2:15 pm - 2:45 pm |
Talk
|
Variation Aware Scheduling for Manycore Chips
Akhil Langer, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Variation in the CMOS manufacturing process causes the transistors on the chip to differ, which results in many-core chips being inherently heterogeneous. For example, frequency and power consumption profiles of cores can span a wide range. In this work, we study the impact of such variation on HPC applications and programming systems. Scheduling HPC applications under performance and power constraints on such chips requires finding an optimal chip configuration
(that specifies how many and which cores to use) from amongst billions of options, so search by explicit enumeration is intractable. We propose a scheduling framework called VAS (Variation Aware Scheduler), which finds an optimal chip configuration given a power budget for the chip. It uses novel models that predict the performance and power consumption of an application running on any configuration of the chip accurately. VAS finds configurations that provide 27% better performance (18% better energy efficiency) on average for a compute-bound application and 17% better performance (11% better energy efficiency) for a memory-bound application as compared to scheduling algorithms based on heuristics.
|
|
|
2:45 pm - 3:00 pm |
|
Afternoon |
Technical Session: Simulation and Networks (Chair: Phil Miller) |
3:00 pm - 3:30 pm |
Talk
|
Parallel Discrete Event Simulation Empowered by an Adaptive Runtime System
Eric Mikida, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Discrete event simulations (DES) are central to exploration of "what-if" scenarios in many domains including networks, storage devices, and chip design. Accurate simulations of dynamically varying behavior of large components in these domains require the DES engine to be scalable and adaptive in order to execute in reasonable time. This paper explores the development of such a simulation engine by porting ROSS, a scalable PDES engine, to Charm++, an adaptive runtime system based on asynchronous migratable objects. First, we show that the overdecomposition-based message-driven programming model of Charm++ is highly suitable for implementing a PDES engine such as ROSS. Next, we explore the use of asynchrony and adaptivity to improve the performance of ROSS over its current implementation. Finally, we show the performance impact of dynamic load balancing on models with irregular behavior at large core counts.
|
|
|
3:30 pm - 4:00 pm |
Talk
|
Understanding and Optimizing Communication Performance on HPC Networks
Nikhil Jain, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Communication is a necessary but overhead-inducing component of high performance computing. The communication performance of an application is impacted by several related aspects of a parallel job's execution: the network topology of the system used for the job, the placement of the job within the system, the message injection and reception mechanisms, the routing protocol, the suitability of the application's interaction pattern to the network, etc. The ever-decreasing flops-to-bandwidth ratio makes the network a scarce resource, thus demanding more attention to optimization of communication performance. In this talk, we will describe various streams of work at PPL that focus on addressing distinct research challenges relevant to communication optimization. The first part is on understanding the anatomy of communication performance and comparing different networks. The second part is focused on studying runtime configurations for optimizing performance. The last part is related to the development of software that can improve communication performance.
|
|
|
4:00 pm - 4:30 pm |
Talk
|
TraceR
Bilge Acun, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
This paper presents TraceR, a trace replay tool built upon the ROSS-based CODES simulation framework, for simulating messaging on interconnection networks. TraceR addresses two major shortcomings in state-of-the-art network simulators. First, it enables fast simulations of large-scale supercomputer networks by utilizing ROSS's scalable parallel discrete-event simulation engine. Second, it can simulate production HPC applications using the BigSim emulation framework. In this paper, we use TraceR to study the impact of many PDES-related parameters on the simulation performance. We also use TraceR to evaluate conservative and optimistic PDES for simulating HPC codes. Finally, we compare TraceR with other simulators such as SST and BigNetSim, and demonstrate its scalability using various case studies.
|
|
|
4:30 pm - 5:00 pm |
Talk
|
Identifying the Culprits behind Network Congestion
Dr. Abhinav Bhatele, Lawrence Livermore National Laboratory
Network congestion is one of the primary causes of performance degradation, performance variability and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behavior in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of nodes, or different input datasets, or even for an unknown code, identifying the best configuration parameters for an application, and finding the root causes of network congestion on different architectures.
|
|
|
5:00 pm - 7:00 pm |
Fun
|
Blue Waters Facility Tour
NCSA
|
|
|
7:00 pm onwards |
Workshop Banquet (for registered participants only), located at the 2nd floor Atrium outside 2405 Siebel Center |
|
8:00 am - 8:45 am |
Continental Breakfast / Registration |
8:45 am - 9:30 am |
Keynote
|
The OmpSs programming model and its runtime support
Prof. Jesús Labarta, Director Computational Sciences Department, Barcelona Supercomputing Center
The talk will present a vision of how parallel computer architectures are evolving and some of the research being done at the Barcelona Supercomputing Center (BSC) driven by such vision.
We consider that the evolution towards increasing complexity, scale and variability in our systems makes two technologies play a very important role in future parallel computing, which with the advent of multicores means computing in general. On one side, performance analysis tools with very detailed analytics capabilities are key to understanding the actual behavior of our systems. On the other, programming models that hide the actual complexity of the underlying hardware are needed to ensure the programming productivity and performance portability required for the economic sustainability of the programming effort.
We will present the OmpSs programming model and development at BSC, a task based model for homogeneous and heterogeneous systems which acts as a forerunner for OpenMP. OmpSs targets in a uniform way multicores, accelerators and clusters. We will describe features of the NANOS++ runtime on which OmpSs is implemented, focusing on the dynamic scheduling capabilities and load balance support features. We will also present the BSC tools environment, including trace visualization capabilities and specific features to understand the actual behavior of the NANOS runtime and OmpSs programs.
|
|
|
9:30 am - 9:50 am |
Talk
|
Parallel Runtime Environments with Cloud Database: A Performance Study for the Heterogeneous Multiscale Method with Adaptive Sampling
Dr. Christoph Junghans, Los Alamos National Laboratory
We present an adaptive sampling method for heterogeneous multiscale simulations with stochastic data. Within the Heterogeneous Multiscale Method, a finite-volume scheme integrates the macro-scale differential equations for elastodynamics, which are supplemented by momentum and energy fluxes evaluated at the micro-scale. Therefore, light-weight MD simulations have to be launched for every volume element. Our adaptive sampling scheme replaces costly micro-scale simulations with fast table lookup and prediction. The cloud database Redis serves as plain table lookup and with locality-aware hashing we gather input data for our prediction scheme. For the latter we use ordinary kriging, which estimates an unknown value at a certain location by using weighted averages of the neighboring points. As the independent tasks can be of very different runtime, we used four different parallel computing frameworks, i.e. OpenMP [1], Charm++ [2], Intel CnC [3] and MPI-only using libcircle [4] for the implementation and compared their performance.
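For reference, the ordinary kriging estimator mentioned above has the standard textbook form (added here for context, not taken from the talk):

```latex
\hat{Z}(x_0) = \sum_{i=1}^{n} \lambda_i \, Z(x_i), \qquad \text{subject to } \sum_{i=1}^{n} \lambda_i = 1,
```

where the weights \lambda_i are obtained by solving a linear system built from the variogram of the sampled points so that the estimation variance is minimized.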
|
|
|
9:50 am - 10:00 am |
|
Morning |
Technical Session: Large Applications (Chair: Dr. Abhinav Bhatele) |
10:00 am - 11:00 am |
Panel
|
Developing Sustainable Software in Academia
Panelists: Dr. Gabrielle Allen, Dr. Martin Berzins, Dr. Laxmikant Kale, Dr. Jesús Labarta, Dr. Jim Phillips; Moderator: Dr. Abhinav Bhatele
Research in the systems area of computer science involves developing software.
One can develop software for the purpose of demonstrating an idea, and then discard it.
An alternative approach is to develop usable software for the community,
while still pursuing research ideas.
This second approach can often lead to better, deeper insights
as the software is used for "real" applications and scenarios.
However, developing sustainable community software in academia poses many challenges.
This panel will discuss these challenges and obstacles, as well as techniques
and ideas that are useful for overcoming them.
|
|
|
11:00 am - 11:30 am |
Talk
|
ChaNGa: a Charm++ N-body Treecode
Prof. Tom Quinn, University of Washington
|
|
|
11:30 am - 12:00 pm |
Talk
|
Towards Process-Level Charm++ Programming in NAMD
Dr. Jim Phillips, University of Illinois at Urbana-Champaign
The biomolecular simulation program NAMD is based on Charm++ and therefore inherits a view of multithreaded processes as an aggregation of otherwise independent streams of execution. The Charm++ programming model is a flat machine in which chares are fixed to their assigned threads until migrated by the load balancer. The fact that multiple threads share a memory space in SMP and multicore builds is treated as an opportunity for optimization, e.g., avoiding duplicate read-only data structures and exploiting fast intra-process communication. As the number of both cores and hardware threads per node continues to increase it will become increasingly difficult to globally manage individual threads. Meanwhile, the sharing of caches by multiple cores presents new opportunities for optimization. In addition to caches, threads within a process also share interfaces to the network, filesystem, and any accelerator hardware such as a GPU or Xeon Phi coprocessor, all of which require aggregation or coordination to access efficiently. This talk will discuss examples of process-level Charm++ programming in NAMD, including recent experiments in this direction, and suggest Charm++ enhancements to better support recurring idioms of message-driven process-level programming.
|
|
|
12:00 pm - 1:00 pm |
|
Afternoon |
Technical Session: Scalable Applications (Chair: Dr. Esteban Meneses) |
1:00 pm - 1:30 pm |
Talk
|
Scalable Asynchronous Contact Mechanics using Charm++
Xiang Ni, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
In this talk we will discuss a scalable implementation of the Asynchronous Contact Mechanics (ACM) algorithm, a reliable method to simulate flexible material subject to complex collisions and contact geometries. As an example, we apply ACM to cloth simulation for animation. The parallelization of ACM is challenging due to its highly irregular communication pattern, its need for dynamic load balancing, and its extremely fine-grained computations. We utilize Charm++, an adaptive parallel runtime system, to address these challenges and show good strong scaling of ACM to 384 cores for problems with fewer than 100k vertices. By comparison, the previously published shared memory implementation only scales well to about 30 cores for the same examples. We demonstrate the scalability of our implementation through a number of examples which, to the best of our knowledge, are only feasible with the ACM algorithm. In particular, for a simulation of 3 seconds of a cylindrical rod twisting within a cloth sheet, the simulation time is reduced by 12× from 9 hours on 30 cores to 46 minutes using our implementation on 384 cores of a Cray XC30.
|
|
|
1:30 pm - 2:00 pm |
Talk
|
Spack: A flexible package manager for HPC
Dr. Todd Gamblin, Lawrence Livermore National Laboratory
HPC software developers and computing centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single software stack is infeasible. However, managing many configurations is extremely difficult because the software configuration space is exponential in size. We introduce Spack, a package manager used at Lawrence Livermore National Laboratory (LLNL) to manage this complexity. Spack provides a novel, recursive specification syntax to invoke parametric builds of packages and dependencies. It allows any number of builds to coexist on the same system, and it ensures that installed packages can find their dependencies.
|
|
|
2:00 pm - 3:00 pm |
Talk
|
Advances in OpenAtom
Eric Bohm, University of Illinois; Dr. Glenn Martyna, IBM; Dr. Sohrab Ismail-Beigi, Yale University
Complex interplay of tightly coupled but disparate computation and communication operations makes simulation of the dynamical motion of dual-natured atoms challenging, especially on multi-petaflops systems. OpenAtom addresses several of these challenges by exploiting overdecomposition and asynchrony in Charm++ and scales to hundreds of thousands of cores. At the same time, it enables simulation of several interesting ab-initio molecular dynamics simulation methods including the Car-Parrinello method, Born Oppenheimer method, parallel tempering, and path integrals. This paper showcases the diverse functionality as well as scalability of OpenAtom via several performance and scientific case studies. In particular, this talk will focus on a large Zinc Chloride system that consists of 426 atoms and 936 electronic states accounting for 22 billion grid points in total. This system is scaled to limits on the multi-petaflop Blue Waters and Mira supercomputers via Charm++ executing 10 million parallel entities asynchronously.
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Runtime System (Chair: Eric Bohm) |
3:30 pm - 4:00 pm |
Talk
|
[CoolName++]: A Graph Processing Framework in Charm++
Hassan Eslami, Advised by Prof. William Gropp, University of Illinois at Urbana-Champaign
The need to gain insights from large-scale graph-structured data has driven the development of new graph-parallel abstractions, such as Pregel and GraphLab. In this project, we evaluate one of these parallel abstractions, vertex-centric graph processing inspired by Pregel, in Charm++. We implemented CoolName++, a scalable vertex-centric graph processing framework on top of Charm++ and demonstrate the simplicity of implementing graph applications within our framework using its intuitive user APIs.
We introduced several optimizations critical to our implementation in Charm++, including message buffering, message combining, and communication-computation overlapping, which take advantage of the asynchronous nature of Charm++. We implemented three diverse graph applications using our proposed framework: Single-Source Shortest Path (SSSP), Approximate Diameter (AD), and Betweenness Centrality (BC). The scalability results show that our framework scales reasonably for communication-intensive applications (such as SSSP and AD) and almost perfectly for a more compute-intensive application such as BC. In addition, our results show that CoolName++ is 10-1000X faster than GraphLab, a state-of-the-art graph processing framework implemented in MPI and widely used in academia and industry.
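To illustrate the vertex-centric (Pregel-style) abstraction discussed above, here is a small sketch of what an SSSP vertex program might look like. The SsspVertex type and its compute interface are hypothetical names for illustration, not the actual CoolName++ API.

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical vertex-centric interface: each superstep, a vertex receives
// messages, updates its value, and optionally messages its neighbors.
struct SsspVertex {
  double dist = std::numeric_limits<double>::infinity();
  std::vector<std::pair<int, double>> edges;  // (neighbor id, edge weight)

  // Returns messages to send as (target vertex, candidate distance) pairs.
  std::vector<std::pair<int, double>> compute(const std::vector<double> &msgs) {
    double best = dist;
    for (double m : msgs) best = std::min(best, m);
    std::vector<std::pair<int, double>> out;
    if (best < dist) {                 // improved: relax outgoing edges
      dist = best;
      for (auto &e : edges) out.emplace_back(e.first, dist + e.second);
    }
    return out;                        // empty means this vertex votes to halt
  }
};
```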
|
|
|
4:00 pm - 4:30 pm |
Talk
|
Argobots: Lightweight Threading/Tasking Framework
Dr. Cyril Bordage, University of Illinois; Dr. Huiwei Lu, Argonne National Lab
This talk presents a holistic low-level threading and tasking model, called
Argobots, to support the massive parallelism required for applications on
exascale systems. Argobots' lightweight operations and controllable executions
enable high-level programming models to easily tune the performance by hiding
long-latency operations with low-cost context switching or by improving locality
with cooperative and deterministic scheduling. Often, complex applications
incorporate different modules or phases, each best described using distinct
programming models or executed using different runtime strategies. Argobots
addresses this challenge by exposing a common runtime with interoperable
capabilities, providing a shared space where programming models become
complementary. We provide an implementation of Argobots as a user-level library
and runtime system. We show that Argobots is indeed capable of bridging the gap
between different programming models, while maintaining or improving the
performance of applications written using a single programming model.
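A minimal usage sketch of the Argobots C API follows (callable from C++). It assumes the core calls ABT_init, ABT_xstream_get_main_pools, ABT_thread_create, ABT_thread_join, ABT_thread_free, and ABT_finalize as documented for Argobots; exact signatures should be verified against the installed headers.

```cpp
#include <abt.h>
#include <cstdio>

// A trivial user-level thread (ULT) body.
static void hello(void *arg) {
  std::printf("hello from ULT %d\n", *static_cast<int *>(arg));
}

int main(int argc, char **argv) {
  ABT_init(argc, argv);                        // initialize the Argobots runtime

  ABT_xstream xstream;
  ABT_pool pool;
  ABT_xstream_self(&xstream);                  // the primary execution stream
  ABT_xstream_get_main_pools(xstream, 1, &pool);

  const int n = 4;
  int ids[n];
  ABT_thread ults[n];
  for (int i = 0; i < n; ++i) {                // create lightweight ULTs
    ids[i] = i;
    ABT_thread_create(pool, hello, &ids[i], ABT_THREAD_ATTR_NULL, &ults[i]);
  }
  for (int i = 0; i < n; ++i) {
    ABT_thread_join(ults[i]);                  // wait for each ULT to finish
    ABT_thread_free(&ults[i]);                 // release its handle
  }

  ABT_finalize();
  return 0;
}
```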
|
|
|
4:30 pm - 5:00 pm |
Talk
|
64-bit ID
Phil Miller, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Chare array elements in Charm++ are named using the user-facing array index. In the implementation, this name has been carried all the way down through the RTS stack, from application code, through array management logic, to the network-layer wire format of the destination field of messages. The use of indices in message wire formats has required that it be a fixed size, while the use by applications requires that it be large enough to represent the abstract indexing scheme appropriate to the problem at hand. This is a suboptimal tradeoff, because it requires more non-payload bits to be transmitted as part of each message, and limits the length of name available to application code. The latter limitation is particularly burdensome to applications with deeply nested structures, such as AMR and tree structures, that wish to encode the structure directly in the name rather than maintaining an intermediate mapping.
The runtime can solve this by providing a scalable generic scheme for mapping arbitrary-length application-facing object names to shorter fixed-length internal object identities. In this talk, I describe such a scheme as implemented in Charm++. It adds no additional communication to the standard message-delivery path, only requiring an extra message when an object is to be constructed away from its home PE. It also fully preserves the existing mapping logic of Charm++ applications, and avoids hotspots in the ID generation mechanism through the same assumption of relative uniformity that existing name-to-PE mappings provide.
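The following is an illustrative sketch in the spirit of the scheme described above, not the actual Charm++ implementation: arbitrary-length application-level indices are mapped to fixed-width 64-bit IDs whose generation is distributed across PEs, so no central counter becomes a hotspot. All names here (IdRegistry, makeId, homePe) are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using ObjectId = std::uint64_t;  // fixed-width internal identity

class IdRegistry {
  std::unordered_map<ObjectId, std::vector<int>> idToIndex_;  // reverse lookup
  std::uint64_t nextLocal_ = 0;
  int myPe_, numPes_;
public:
  IdRegistry(int myPe, int numPes) : myPe_(myPe), numPes_(numPes) {}

  // Allocate a fixed-width ID: high bits encode the creating PE so that
  // generation is distributed; low bits are a local counter.
  ObjectId makeId(const std::vector<int> &index) {
    ObjectId id = (static_cast<ObjectId>(myPe_) << 40) | nextLocal_++;
    idToIndex_[id] = index;  // keep the arbitrary-length name for lookups
    return id;
  }

  // Home PE derived directly from the compact ID, analogous to the
  // uniform name-to-PE mappings mentioned in the abstract.
  int homePe(ObjectId id) const {
    return static_cast<int>(id % static_cast<ObjectId>(numPes_));
  }
};
```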
|
|
|
5:00 pm - 5:30 pm |
Talk
|
Malleable Jobs: Shrink and Expand
Bilge Acun, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing, at runtime, in response to an external command. Prior research has demonstrated that malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits in a real system, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system (ARTS). In this paper, our focus is on the ARTS and the resource manager. We present a novel mechanism which uses task migration and dynamic load balancing, checkpoint/restart, and Linux shared memory to enable fast and efficient resizing of parallel programs. Our technique performs a true shrink/expand, eliminating the need for any residual processes. We implement our techniques atop Charm++ and Adaptive MPI, and demonstrate malleability using benchmarks and three mini-applications. Performance results and analysis on the Stampede supercomputer show the efficacy and scalability of our approach. Next, we adapt the resource manager by establishing a two-way communication channel between the scheduler and the ARTS, and designing a novel split-phase protocol for managing malleable jobs. Finally, we demonstrate the utility of malleable jobs in nontraditional and emerging use cases, specifically power-constrained schedulers and HPC in the cloud.
|
|
|
6:30 pm |
|