Time | Type | Description | Slides | Webcast |
8:00 am - 8:45 am |
Continental Breakfast / Registration |
Morning |
|
8:45 am - 9:00 am |
Welcome
|
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
|
|
|
9:00 am - 9:45 am |
Keynote
|
MADNESS - parallel runtime and application use cases
Prof. Robert J. Harrison, Stony Brook University
MADNESS is a general purpose numerical environment that sits upon a
scalable runtime that consciously includes elements "borrowed" from
other projects including Charm++ and the HPCS programming languages
including Chapel, and is designed to be interoperable with "legacy"
code. However, our intent was never to support this long term; in addition to migrating large parts of our functionality to Intel TBB, we are casting around for options for the distributed-memory parallel computing environment. I'll give some flavor of what MADNESS does and how, with the objective of starting a conversation and seeding collaborations.
|
|
|
9:45 am - 10:15 am |
Talk
|
Cloth Simulation in Charm++
Rasmus Tamstorf, Disney Research & Xiang Ni, University of Illinois at Urbana-Champaign
Accurate simulation of the movement of cloth is an important aspect of a realistic animated movie. However, it is also one of the most challenging problems in terms of complexity and scalability. In this talk, we present an extension of Asynchronous Contact Mechanics (ACM) to distributed-memory clusters using Charm++. The ACM algorithm is known to produce correct results in finite time for cloth simulation. However, the dynamic behavior of cloth simulation makes it an extremely hard problem to parallelize with good performance. To achieve scalable parallelism, we have proposed and applied many techniques: overdecomposition-based collision detection, intra-node load balancing, inter-node load balancing, prioritized processing, and additional (not required, but good for performance) synchronization. This talk presents an overview and initial results of these methods in our implementation of ClothSim in Charm++.
|
|
|
10:15 am - 10:30 am |
|
Morning |
Technical Session: Run Time System (Chair: Dr. Gengbin Zheng) |
10:30 am - 11:00 am |
Talk
|
PICS: A Performance-Analysis-Based Introspective Control System to Steer Parallel Application
Yanhua Sun, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores for various applications is still quite challenging. Therefore, performance tuning has become even more important and challenging than ever before. In this paper, we describe the design and implementation of PICS: a Performance-analysis-based Introspective Control System used to tune parallel programs. PICS provides a generic set of abstractions that allow applications to expose application-specific knowledge to the runtime system. The abstractions are called control points, which are tunable parameters that affect application performance. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the analysis results and expert knowledge rules, program characteristics are extracted to assist the search for optimal configurations of the control points. We have implemented PICS in Charm++, an asynchronous message-driven parallel programming model. We demonstrate the utility of PICS with several benchmarks and a real-world application and show its effectiveness.
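A control point is essentially a named, tunable knob with a legal range that the application exposes and the runtime adjusts between iterations based on measured performance. The sketch below is a minimal, hypothetical illustration of that feedback loop in plain C++; the names (ControlPoint, Tuner, runIteration) are illustrative assumptions and not the actual PICS API.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical "control point": a named, tunable knob with a legal range,
// steered by measured iteration times. Not the real PICS interface.
struct ControlPoint {
  const char* name;
  int lo, hi;    // legal range
  int value;     // current setting
};

// Trivial tuner: sweep the range once, then lock in the fastest setting.
struct Tuner {
  double bestTime = 1e30;
  int bestValue = -1;
  bool done = false;
  void report(ControlPoint& cp, double seconds) {
    if (done) return;                            // tuning finished, keep best
    if (seconds < bestTime) { bestTime = seconds; bestValue = cp.value; }
    if (cp.value < cp.hi) ++cp.value;            // keep sweeping the range
    else { cp.value = bestValue; done = true; }  // sweep complete: use the best
  }
};

static void runIteration(int grainSize) { /* application work, omitted */ }

int main() {
  ControlPoint grain{"grain_size", 1, 8, 1};     // e.g., pieces of work per object
  Tuner tuner;
  for (int iter = 0; iter < 20; ++iter) {
    auto t0 = std::chrono::steady_clock::now();
    runIteration(grain.value);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    tuner.report(grain, dt.count());
    std::printf("iter %d: %s = %d\n", iter, grain.name, grain.value);
  }
}
```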
|
|
|
11:00 am - 11:30 am |
Talk
|
Speculative Load Balancing
Hassan Eslami, Advised by Prof. William Gropp, University of Illinois at Urbana-Champaign
Continuous dynamic load balancing is a crucial component of many applications that exploit irregular and nested parallelism. Reducing idle time is one of the biggest challenges any load balancing algorithm tries to address. In this talk, we introduce a speculative load balancing algorithm that approaches this challenge through speculative execution of tasks. In our method, each worker process, instead of sitting idle waiting to receive a task, speculatively starts working on some tasks in the hope that no one else has started processing them yet. We show that in up to 99% of cases speculation is successful and idle time is significantly reduced. We use the unbalanced tree search benchmark (UTS) to show the effect of speculation in two load balancing approaches: 1) work sharing using a centralized work queue, and 2) work stealing using explicit polling to service steal requests. We show that speculative execution in work sharing reduces up to 95% of the idle time and results in up to a 3-5X speedup in the total execution time of UTS compared to the baseline work-sharing implementation. Also, our approach is less sensitive to the load balancing parameters, hence eliminating the time needed to tune the algorithm for any input type. We also share our experience in applying speculative execution to work stealing.
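The core idea, speculatively claiming a task rather than idling, can be illustrated with a queue in which each task carries a claim flag that workers set atomically; speculation succeeds when the compare-and-swap wins. This is a minimal shared-memory sketch of the idea only, not the authors' MPI-based implementation, and all names are illustrative.

```cpp
#include <atomic>
#include <vector>

// Each task carries an atomic claim flag. A worker that would otherwise idle
// scans the queue and speculatively claims a task; the claim succeeds only if
// no other worker has already started it (the compare-and-swap wins).
struct Task {
  std::atomic<bool> claimed{false};
  int payload = 0;
};

// Returns the index of a claimed task, or -1 if everything is already taken.
int claimSpeculatively(std::vector<Task>& queue) {
  for (std::size_t i = 0; i < queue.size(); ++i) {
    bool expected = false;
    if (queue[i].claimed.compare_exchange_strong(expected, true))
      return static_cast<int>(i);   // speculation succeeded: we own this task
  }
  return -1;                        // speculation failed: genuinely no work left
}

int main() {
  std::vector<Task> queue(8);
  for (int i = 0; i < 8; ++i) queue[i].payload = i;
  int t = claimSpeculatively(queue);  // a real worker would now execute task t
  return t >= 0 ? 0 : 1;
}
```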
|
|
|
11:30 am - 12:00 pm |
Submitted Paper
|
Saving Energy by Exploiting Residual Imbalances on Iterative Applications
Laercio L. Pilla, Institute of Informatics, Federal University of Rio Grande do Sul
Parallel scientific applications have been influencing the way science is done over the last decades. These applications have ever-increasing demands in performance and resources due to their greater complexity and larger datasets. To meet these demands, the performance of supercomputers has been growing exponentially, which leads to an exponential growth in power consumption as well. In this context, saving power has become one of the main concerns of current HPC platform designs, as future exascale systems will need to consider power demand and energy consumption constraints. Whereas some scientific applications have regular designs that lead to well-balanced load distributions, others are more imbalanced because they have tasks with different processing demands, which makes it difficult to use the available hardware resources efficiently. In this case, a challenge lies in reducing the energy consumption of the application while maintaining similar performance. In our work, we focus on reducing the energy consumption of imbalanced applications through a combination of load balancing and Dynamic Voltage and Frequency Scaling (DVFS). Our strategy employs an Energy Daemon Tool to gather power information and a load balancing module that benefits from the load balancing framework available in the CHARM++ runtime system. Our approach differs from the one proposed by Sarood et al. in that we employ DVFS to decrease energy consumption after balancing the load, while the latter uses DVFS to regulate temperature and employs load balancing to correct the subsequent imbalance.
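The strategy's key arithmetic is simple: after load balancing, a core whose residual load is below the maximum can run at a proportionally lower frequency without delaying the iteration. Below is a minimal sketch of that calculation, assuming a discrete set of available DVFS levels; names and values are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <vector>

// After load balancing, scale each core's frequency in proportion to its
// residual load: cores with less work than the most-loaded core can slow
// down without lengthening the iteration. Frequencies are snapped up to the
// nearest available DVFS level.
std::vector<double> pickFrequencies(const std::vector<double>& load,
                                    const std::vector<double>& levels,  // sorted ascending, GHz
                                    double fMax) {
  double maxLoad = *std::max_element(load.begin(), load.end());
  std::vector<double> freq(load.size());
  for (std::size_t i = 0; i < load.size(); ++i) {
    double target = fMax * load[i] / maxLoad;            // ideal frequency
    auto it = std::lower_bound(levels.begin(), levels.end(), target);
    freq[i] = (it == levels.end()) ? fMax : *it;         // round up to a real level
  }
  return freq;
}
```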
|
|
|
12:00 pm - 12:30 pm |
Talk
|
QMPI: A library for multithreaded MPI applications
Alex B Brooks, Advised by Prof Marc Snir, University of Illinois at Urbana-Champaign
The increasing scale and ever-changing design of large supercomputers make efficient parallel programming difficult. Communication cost and load balancing continue to be major concerns for the performance of parallel applications. Many implementations of the Message Passing Interface (MPI) continue to have issues handling multiple threads communicating from a single process. This results in less-than-ideal performance for applications that attempt to exploit both intra-node and inter-node parallelism. We introduce QMPI, a parallel programming library on top of MPI, to address this problem. QMPI exploits lightweight task parallelism and smart communication progression, showing significant performance improvements over traditional multithreaded MPI techniques. This talk discusses this programming model with a focus on QMPI and its performance.
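One common way to sidestep poor multithreaded MPI performance, and roughly the spirit of such a library, is to funnel all communication through a single progress thread that drains a queue of send requests while worker threads compute. The sketch below is a generic illustration under that assumption, not the QMPI API; all names are my own.

```cpp
#include <mpi.h>
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Worker threads enqueue send requests; one progress thread issues the
// nonblocking MPI calls and drives them to completion, so only one thread
// ever calls MPI during the run (MPI_THREAD_SERIALIZED is sufficient).
struct SendReq { std::vector<double> buf; int dest, tag; };

std::mutex qMutex;
std::queue<SendReq> sendQueue;
std::atomic<bool> done{false};

void progressLoop() {
  while (!done.load()) {
    SendReq req;
    {
      std::lock_guard<std::mutex> lk(qMutex);
      if (sendQueue.empty()) continue;
      req = std::move(sendQueue.front());
      sendQueue.pop();
    }
    MPI_Request r;
    MPI_Isend(req.buf.data(), (int)req.buf.size(), MPI_DOUBLE,
              req.dest, req.tag, MPI_COMM_WORLD, &r);
    MPI_Wait(&r, MPI_STATUS_IGNORE);   // progress this message to completion
  }
}

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
  std::thread progress(progressLoop);
  // ... worker threads compute and push SendReq objects onto sendQueue ...
  done = true;
  progress.join();
  MPI_Finalize();
}
```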
|
|
|
12:30 pm - 01:30 pm |
|
Afternoon |
Technical Session: Adaptivity at Exascale (Chair: Phil Miller) |
1:30 pm - 2:00 pm |
Invited Talk
|
Moving Software to Exascale and Beyond
Dr. Robert Wisniewski, Intel Corporation
Thinking in the High Performance Computing software community about how to achieve exascale has tended to focus on either evolutionary or revolutionary approaches. The path to exascale and beyond is replete with a set of wide-ranging and intertwined difficulties. This presentation will identify some of the challenges facing the HPC community from a software perspective, with a focus on runtimes and programming models. The presentation suggests a plausible path forward that is neither solely evolutionary nor solely revolutionary, but a combination of the two. Runtime and execution models supporting this notion will be given as a demonstration.
|
|
|
2:00 pm - 2:30 pm |
Talk
|
Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget
Akhil Langer, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Building future-generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power-efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages the hardware-provided capability to constrain the power consumption of each node in order to judiciously allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions as new jobs arrive or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to a 5.2X improvement in job throughput compared with the SLURM baseline scheduling policy. We corroborate our results with real experiments at a relatively small scale, in which we obtain a 1.7X improvement.
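As a rough illustration of the kind of decision such a resource manager makes, consider jobs whose throughput gains diminish as their power allocation grows: a greedy allocator can hand out power in small slices to whichever job currently benefits most, until the machine-wide budget is spent. This sketch uses an assumed logarithmic speed model and made-up numbers; it is only a flavor of the tradeoff, not the paper's optimizer.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Greedy power allocation under a machine-wide budget: repeatedly give a
// small slice of power to the job whose (assumed, diminishing-returns) speed
// model gains the most from it.
struct Job {
  double minWatts;    // power below which the job cannot run
  double alpha;       // responsiveness in the assumed model below
  double watts = 0;   // power allocated so far
};

double speed(const Job& j, double w) { return j.alpha * std::log(w / j.minWatts + 1.0); }

void allocate(std::vector<Job>& jobs, double budget, double slice = 1000.0) {
  for (Job& j : jobs) { j.watts = j.minWatts; budget -= j.minWatts; }  // baseline power
  while (budget >= slice) {
    Job* best = nullptr;
    double bestGain = 0;
    for (Job& j : jobs) {
      double gain = speed(j, j.watts + slice) - speed(j, j.watts);
      if (gain > bestGain) { bestGain = gain; best = &j; }
    }
    if (!best) break;
    best->watts += slice;   // the most power-responsive job gets this slice
    budget -= slice;
  }
}

int main() {
  std::vector<Job> jobs = {{2000, 1.0}, {2000, 3.0}, {4000, 2.0}};
  allocate(jobs, 50000.0);  // 50 kW budget across three jobs
  for (const Job& j : jobs) std::printf("job gets %.0f W\n", j.watts);
}
```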
|
|
|
2:30 pm - 3:00 pm |
Submitted Paper
|
A Batch System for Malleable Adaptive Parallel Programs
Suraj Prabhakaran, German Research School for Simulation Sciences, RWTH Aachen
The performance of supercomputers depends not only on efficient job scheduling but also on the type of jobs that form the workload. Malleable jobs are the most scheduler-friendly, a property they owe to their ability to adapt to changing resource allocations during application execution. The batch system can adjust the allocation of a malleable job according to the system state so as to obtain the best system performance. However, due to the non-adaptive nature of most parallel programming models, today's supercomputers are dominated by rigid jobs that require a fixed resource allocation throughout the job execution. Therefore, malleable jobs and malleable batch job management are not realized in today's production systems. The Charm++ programming paradigm notably supports malleability through its adaptive runtime system and offers a practical possibility to improve system performance. In this paper, we present ongoing work towards enabling a malleable Torque/Maui batch job management system for Charm++ jobs. In particular, we propose a standard interface for malleability interactions between batch systems and parallel programming paradigms. Through this interface, the batch system allows a seamless expansion or shrinkage of the current resource allocation of a running Charm++ job. We discuss the implementation of the proposed interface and efficient malleable job scheduling strategies. We demonstrate the benefits of such a system, including improved system utilization and throughput, as well as improved response time and turnaround time for the user.
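The proposed interaction can be pictured as a narrow, two-sided contract: the batch system asks a running job to grow or shrink, and the job's runtime acknowledges once work has been migrated and processes released or acquired. The interface below is purely illustrative of such a contract; the class and method names are assumptions, not the authors' actual proposal.

```cpp
#include <string>
#include <vector>

// Illustrative two-sided malleability contract between a batch system and a
// runtime system (names are hypothetical, not the proposed standard itself).

// Implemented by the runtime system running the job (e.g., an adaptive RTS).
class MalleableJob {
public:
  virtual ~MalleableJob() = default;
  // The batch system offers extra nodes; the job rebalances onto them and
  // returns true once the expansion is complete.
  virtual bool expand(const std::vector<std::string>& newNodes) = 0;
  // The batch system asks for nodes back; the job evacuates work from them
  // and returns true once they are safe to reclaim.
  virtual bool shrink(const std::vector<std::string>& nodesToRelease) = 0;
};

// Implemented by the batch system (e.g., a Torque/Maui-style scheduler).
class BatchSystem {
public:
  virtual ~BatchSystem() = default;
  // The job notifies the scheduler that a resize finished, so freed or
  // granted nodes can be reflected in the scheduler's state.
  virtual void resizeCompleted(int jobId, int newNodeCount) = 0;
};
```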
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Exploratory Research (Chair: Dr. Celso Mendes) |
3:30 pm - 4:15 pm |
Talk
|
OpenAtom: Fast, fine grained parallel electronic structure software for materials science, chemistry and physics.
Prof. Sohrab Ismail-Beigi, Yale University & Dr. Glenn J. Martyna, IBM Research
We will discuss the OpenAtom project with a view to what it is capable of doing right now and what we want it to do in the near future and the next 5 years. The discussion will be framed in terms of modeling the physics and chemistry of useful and interesting materials and what type of real-world problems a well-performing and efficient highly parallel implementation will permit us to address. A brief overview will be given of the project's past, its present status and capabilities for the ground state of electrons and studies of atomistic dynamics at finite temperature on the ground state energy surface, and how we want to incorporate the description of excited electrons into OpenAtom.
|
|
|
4:15 pm - 4:45 pm |
Talk
|
Solvers for O(N) Electronic Structure in the Strong Scaling Limit
Nicolas Bock and Matt Challacombe, Los Alamos National Laboratory and Laxmikant Kale, University of Illinois at Urbana-Champaign
For accurate models, electronic structure theory is an extremely challenging software engineering problem, typically relying on diverse "fast" solvers with mixed programming models. We are developing a new, unified approach to fast electronic structure solvers based on the N-Body programming model. The N-Body model is emerging as an extremely general algorithm class finding increasingly broad application in many disciplines, spanning the information and physical sciences. Historically, the astrophysical N-Body solver has enabled significant computational cost reductions compared to conventional solvers through spatially informed trees and hierarchical approximations based on the range-query, and has been a textbook success for scalable irregular parallelism. In this talk, I will present a generalization of the N-Body programming model to hierarchical multiplication of matrices with decay (SpAMM) [Bock and Challacombe, SIAM Journal on Scientific Computing 35 (1), C72-C98], which uses occlusion based on the metric-query to achieve reduced complexity, and discuss our implementation within Charm++ to access the strong scaling regime. In particular, I will explain how the hierarchical N-Body framework enables simple and concise task-parallel implementations that yield fine-grained task decomposition without exposing complex message passing interfaces to the programmer, and how N-Body programming models can exploit data locality and persistence-based load balancing strategies.
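The heart of the SpAMM idea can be conveyed in a few lines: a quadtree matrix product recurses on sub-blocks but skips any product whose contribution, bounded by the product of sub-block norms, falls below a tolerance tau. The recursion below is my own simplified rendering of that occlusion test on a quadtree, not the SpAMM library itself.

```cpp
#include <array>
#include <memory>

// Quadtree matrix node: either a leaf block or four children, with a cached
// Frobenius norm used for the occlusion test.
struct Node {
  double norm = 0.0;
  std::array<std::unique_ptr<Node>, 4> child;   // row-major 2x2 quadrants
  // leaf payload omitted for brevity
};

// C += A * B, skipping any product whose bound ||A||*||B|| is below tau.
// This norm-based "occlusion" is what yields reduced complexity for
// matrices with decay.
void spammMultiply(const Node* A, const Node* B, Node* C, double tau) {
  if (!A || !B || !C || A->norm * B->norm < tau) return;  // occluded: contribution too small
  bool leaf = !A->child[0] || !B->child[0];
  if (leaf) { /* dense block multiply-accumulate into C's leaf, omitted */ return; }
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k)
        spammMultiply(A->child[2*i+k].get(), B->child[2*k+j].get(),
                      C->child[2*i+j].get(), tau);
}
```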
|
|
|
4:45 pm - 5:15 pm |
Talk
|
Cache Hierarchy Reconfiguration in Adaptive HPC Runtime Systems
Ehsan Totoni, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
The cache hierarchy consumes a large portion of the processor chip's power and energy, but a sizable fraction of that can be saved for common HPC applications. We propose an adaptive-runtime-system-based reconfiguration approach that turns off ways of set-associative caches. Our simple and practical solution exploits the common patterns and iterative structure of HPC applications, including the prevalent Single Program Multiple Data (SPMD) model of parallel codes, to find the best configuration and save a large fraction of cache power and energy. Our experiments using cycle-accurate simulations show that 67% of cache energy can be saved at the cost of just a 2.4% performance penalty (on average). We also show that an adaptive streaming cache strategy can improve performance by up to 30% and save 75% of cache energy in some cases.
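Because HPC iterations repeat, a runtime can time one iteration per candidate configuration and then lock in the smallest cache configuration that stays within a tolerated slowdown. Here is a toy sketch of that search loop; setActiveCacheWays is a placeholder for whatever hardware or simulator knob is actually available, and the whole routine is illustrative rather than the paper's mechanism.

```cpp
#include <chrono>

// Placeholder for the platform-specific knob that disables cache ways
// (e.g., a model-specific register or simulator hook); assumed, not real.
void setActiveCacheWays(int ways) { /* platform specific */ }

double timeIteration(void (*step)()) {
  auto t0 = std::chrono::steady_clock::now();
  step();
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
  return dt.count();
}

// Try each way count once on successive (nearly identical) iterations and
// keep the smallest configuration whose slowdown stays under `tolerance`.
int chooseWays(void (*step)(), int maxWays, double tolerance = 0.03) {
  setActiveCacheWays(maxWays);
  double base = timeIteration(step);
  int best = maxWays;
  for (int w = maxWays - 1; w >= 1; --w) {
    setActiveCacheWays(w);
    if (timeIteration(step) <= base * (1.0 + tolerance)) best = w;  // fewer ways, still fast enough
    else break;                                                     // too slow: stop shrinking
  }
  setActiveCacheWays(best);
  return best;
}
```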
|
|
|
5:15 pm - 6:00 pm |
Fun
|
Blue Waters Facility Tour
NCSA
|
|
|
6:00 pm - 7:00 pm |
|
07:00 pm onwards |
Workshop Banquet (for registered participants only), located in the 2nd-floor Atrium outside 2405 Siebel Center |
|
8:00 am - 9:00 am |
Continental Breakfast / Registration |
9:00 am - 9:45 am |
Keynote
|
Hybrid Programming Challenges for Extreme Scale Software
Prof. Vivek Sarkar, E.D. Butcher Chair in Engineering, Rice University
It is widely recognized that computer systems in the next decade will
be qualitatively different from current and past computer systems.
Specifically, they will be built using homogeneous and heterogeneous
many-core processors with 100's of cores per chip, their performance
will be driven by parallelism (million-way parallelism just for a
departmental server), and constrained by energy and data movement.
They will also be subject to frequent faults and failures. Unlike
previous generations of hardware evolution, these Extreme Scale
systems will have a profound impact on future software. The software
challenges are further compounded by the need to support new workloads
and application domains that have not traditionally had to worry about
parallel computing.
In general, a holistic redesign of the entire software stack is needed
to address the programmability and performance requirements of Extreme
Scale systems. This redesign will need to span programming models,
languages, compilers, runtime systems, and system software. A major
challenge in this redesign arises from the fact that current
programming systems have their roots in execution models that focused
on homogeneous models of parallelism, e.g., OpenMP's roots are in SMP
parallelism, MPI and SHMEM's roots are in cluster parallelism, and
CUDA and OpenCL's roots are in GPU parallelism. This in turn leads to
the "hybrid programming" challenge for application developers, as they
are forced to explore approaches to combine two or all three of these
models in the same application. Despite some early experiences and
attempts by some of the programming systems to broaden their scope
(e.g., addition of accelerator pragmas to OpenMP), hybrid programming
remains an open problem and a major obstacle for application
enablement on future systems.
In this talk, we summarize experiences with hybrid programming in the
Habanero Extreme Scale Software Research project [1] which targets a wide
range of homogeneous and heterogeneous manycore processors in both
single-node and cluster configurations. We focus on key primitives in
the Habanero execution model that simplify hybrid programming, while
also enabling a unified runtime system for heterogeneous hardware.
Some of these primitives are also being adopted by the new Open
Community Runtime (OCR) open source project [2]. These primitives
have been validated in a range of applications, including medical
imaging applications studied in the NSF Expeditions Center for
Domain-Specific Computing (CDSC) [3].
Background material for this talk will be drawn in part from the DARPA
Exascale Software Study report [4] led by the speaker. This talk will
also draw from a recent (March 2013) study led by the speaker on
Synergistic Challenges in Data-Intensive Science and Exascale
Computing [5] for the US Department of Energy's Office of Science. We
would like to acknowledge the contributions of all participants in
both studies, as well as the contributions of all members of the
Habanero, OCR, and CDSC projects.
REFERENCES:
[1] Habanero Extreme Scale Software Research project. http://habanero.rice.edu.
[2] Open Community Runtime (OCR) open source project. https://01.org/projects/open-community-runtime.
[3] Center for Domain-Specific Computing (CDSC). http://cdsc.ucla.edu.
[4] DARPA Exascale Software Study report, September 2009. http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm.
[5] DOE report on Synergistic Challenges in Data-Intensive Science and Exascale Computing, March 2013. http://science.energy.gov/~/media/ascr/ascac/pdf/reports/2013/ASCAC_Data_Intensive_Computing_report_final.pdf.
|
|
|
09:45 am - 10:15 am |
Talk
|
Parallel Programming with Migratable Objects: Charm++ in Practice
Harshitha Menon, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign,
The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failures) for programming scalable parallel applications. The increased complexity and dynamism of today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this talk, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented here spans many mini-applications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.
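Migratability in Charm++ rests on objects that can serialize themselves via a PUP routine, so the runtime can move them between processors for load balancing. Below is a minimal sketch of such a migratable chare-array element, assuming the charmc-generated headers block.decl.h/block.def.h from the interface file shown in the comment; names and sizes are illustrative only.

```cpp
// Sketch of a migratable Charm++ chare-array element. The companion .ci file
// (compiled by charmc, which generates block.decl.h / block.def.h) would contain:
//   module block { array [1D] Block { entry Block(); entry void step(); }; };
#include <vector>
#include "block.decl.h"

class Block : public CBase_Block {
  std::vector<double> state;                          // data that must move with the object
public:
  Block() : state(1024, 0.0) { usesAtSync = true; }   // opt in to measurement-based load balancing
  Block(CkMigrateMessage* m) : CBase_Block(m) {}      // constructor used when the RTS migrates the element
  void pup(PUP::er& p) {                              // serialize/deserialize for migration and checkpointing
    CBase_Block::pup(p);
    p | state;
  }
  void step() {
    // ... compute one iteration ...
    AtSync();                                         // tell the RTS this element is ready to be rebalanced
  }
};
#include "block.def.h"
```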
|
|
|
10:15 am - 10:30 am |
|
Morning |
Technical Session: Large Applications (Chair: Dr. Abhinav Bhatele) |
10:30 am - 11:00 am |
Talk
|
ChaNGa: a Charm++ N-body Treecode
Prof. Tom Quinn, University of Washington
Simulations of cosmological structure formation demand significant
computational resources because of the vast range of scales involved:
from the size of star formation regions to, literally, the size of the
Universe. I will describe the cosmology problems we are addressing
with our Blue Waters allocation. ChaNGa, a Charm++ N-body/Smooth
Particle Hydrodynamics tree-code, is the application we are running
with this allocation. I will describe the improvements in both
astrophysical modeling and in parallel performance we have made over
the past year in preparation for running these simulations.
|
|
|
11:00 am - 11:30 am |
Submitted Paper
|
Overcoming the Scalability Challenges of Epidemic Simulations on Petascale Platforms
Jae-Seung Yeom, Virginia Tech
With an increasingly urbanized and mobile population, the likelihood of a worldwide pandemic is increasing. With rising input sizes and strict deadlines for simulation results, e.g., for real-time planning during the outbreak of an epidemic, we must expand the use of high performance computing (HPC) approaches and, in particular, push the boundaries of scalability for this application area. EpiSimdemics simulates epidemic diffusion in extremely large and realistic social contact networks. It captures dynamics among co-evolving entities. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to irregular communication, load imbalance, and the evolutionary nature of their workload. In this talk, we present an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6 and IBM BlueGene/Q.
|
|
|
11:30 am - 12:00 pm |
Talk
|
Petascale Charm++ in Practice: Lessons from Scaling NAMD
Jim Phillips
The highly parallel molecular dynamics code NAMD was chosen in 2006 as a
target application for the NSF petascale supercomputer now known as Blue
Waters. NAMD was also one of the first codes to run on a GPU cluster when
CUDA was introduced in 2007. When Blue Waters entered production in 2013,
the first breakthrough it enabled was the complete atomic structure of the
HIV capsid through calculations using NAMD, featured on the cover of
Nature. This talk will cover lessons learned in taking NAMD and Charm++
from a million atoms on a few thousand cores to a hundred million atoms on
500,000 cores, and what changes would aid further progress.
|
|
|
12:30 pm - 01:30 pm |
|
01:30 pm - 01:50 pm |
Panel KickStart
|
Techniques for Effective HPC in the Cloud
Abhishek Gupta, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
The advantages of the pay-as-you-go model, elasticity, and the flexibility and customization offered by virtualization make cloud computing an attractive and economical option for meeting the needs of some HPC users. However, there is a mismatch between current cloud environments and HPC requirements: HPC is performance-oriented, whereas clouds are cost- and resource-utilization-oriented. The poor interconnect and I/O performance in the cloud, HPC-agnostic cloud schedulers, and the inherent heterogeneity and multi-tenancy of clouds are some of the bottlenecks for HPC in the cloud. This means that the tremendous potential of the cloud for both HPC users and providers remains underutilized. In this talk, I will go beyond the common research question "what is the performance of HPC in the cloud?" and present our research on "how can we perform cost-effective and efficient HPC in the cloud?" To this end, I will present the complementary approaches of making clouds HPC-aware and making the HPC runtime system cloud-aware. Through comprehensive HPC performance and cost analysis, HPC-aware VM placement, interference-aware VM consolidation, and cloud-aware HPC load balancing, we demonstrate significant benefits for both users and cloud providers in terms of cost (up to 60%), performance (up to 45%), and throughput (up to 32%).
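Cloud-aware load balancing here essentially means weighting each virtual core by its currently observed speed, which heterogeneity and multi-tenant interference make non-uniform, and assigning work in proportion. The tiny sketch below shows only that proportional assignment; it is illustrative and not the actual Charm++ strategy.

```cpp
#include <numeric>
#include <vector>

// Assign work units to VMs in proportion to each VM's measured speed, so a
// core slowed by multi-tenancy or weaker hardware receives less work.
std::vector<int> proportionalShares(int totalWork,
                                    const std::vector<double>& measuredSpeed) {
  double total = std::accumulate(measuredSpeed.begin(), measuredSpeed.end(), 0.0);
  std::vector<int> share(measuredSpeed.size());
  int assigned = 0;
  for (std::size_t i = 0; i < measuredSpeed.size(); ++i) {
    share[i] = static_cast<int>(totalWork * measuredSpeed[i] / total);
    assigned += share[i];
  }
  share[0] += totalWork - assigned;   // hand rounding leftovers to one VM
  return share;
}
```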
|
|
|
01:50 pm - 03:00 pm |
Panel
|
HPC in the cloud: how much water does it hold?
Panelists: Roy Campbell (Professor, University of Illinois at Urbana Champaign), Kate Keahey (Fellow, Computation Institute University of Chicago), Dejan S Milojicic (Senior Research Manager, HP Labs), Landon Curt Noll (Resident Astronomer & HPC Specialist, Cisco), Laxmikant Kale (Professor, University of Illinois at Urbana-Champaign)
High performance computing connotes science and engineering applications running on supercomputers. One imagines tightly coupled, latency-sensitive, jitter-sensitive applications in this space. On the other hand, cloud platforms create the promise of computation on demand, with a flexible infrastructure and a pay-as-you-go cost structure. Can the two really meet? Is it the case that only a subset of CSE applications can run on this platform? Can the increasing role of adaptive schemes in HPC work well with the need for adaptivity in cloud environments? Should national agencies like NSF fund computation time indirectly, and let CSE researchers rent time in the cloud? These are the kinds of questions this panel will address.
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Related Research (Chair: Eric Bohm) |
3:30 pm - 4:00 pm |
Invited Talk
|
Compression for Exascale: Always Right Around The Corner
Dr. Brian Van Straalen, Lawrence Berkeley National Laboratory
The basic hardware trend moving toward exascale can be summarized in a simple statement: higher concurrency on more vectorized cores with proportionately less memory and bandwidth. To trade off more computation for less data traffic, we propose to revisit compression techniques. In particular, we want to perform a systematic and scientific study of the compressibility of floating-point messages and memory data accessed in large-scale DOE HPC applications to determine the potential of data compression in current and future systems.
Real-time, on-line compression has largely failed in the past, especially for floating-point data. This is, in part, because one-size-fits-all compression methods were used. For example, some hardware methods compress all loads and stores in the same way. Moreover, at the hardware level, information about the data structure and dimensionality is typically lost. In other cases, compression is not addressing the most critical problem, like IBM's Active Memory Expansion, which creates a "larger" local DRAM than is physically installed at the cost of extra latency and energy for some off-processor data movement. Even when applied with great care, the energy and latency mismatch between computations and data movement in current-generation processors makes it difficult to successfully exploit compression. We believe this is about to change.
|
|
|
4:00 pm - 4:30 pm |
Talk
|
Parallel Branch-and-bound for Two-stage stochastic integer optimization
Akhil Langer, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Many real-world planning problems require searching for an optimal solution in the face of uncertain input. One approach is to express them as a two-stage stochastic optimization problem, where the search for an optimum in one stage is informed by the evaluation of multiple possible scenarios in the other stage. Applications of stochastic programming span diverse fields, ranging from production, financial modeling, transportation (road as well as air), supply chains, and scheduling to environmental and pollution control, telecommunications, and electricity. In this talk, we present the parallelization of a two-stage stochastic integer program solved using branch-and-bound. We present a range of factors that influence the parallel design for such problems. Unlike typical iterative scientific applications, we encounter several interesting characteristics that make it challenging to realize a scalable design. We present two design variations that navigate some of these challenges. Our designs seek to increase the exposed parallelism while delegating sequential linear program solves to existing libraries. We evaluate the scalability of our designs using sample aircraft allocation problems for the US air fleet. It is important that these problems be solved quickly while evaluating a large number of scenarios. Our attempts result in strong scaling to hundreds of cores for these datasets.
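Structurally, each branch-and-bound node's bound is the first-stage cost plus the expected cost over scenarios, and each scenario requires an independent LP solve; that per-scenario independence is where the exposed parallelism comes from. The skeleton below illustrates only that structure, with the LP solves delegated to a placeholder callable; all names are assumptions rather than the authors' design.

```cpp
#include <functional>
#include <vector>

// Bound for one branch-and-bound node of a two-stage stochastic program:
// first-stage cost plus probability-weighted second-stage costs. Each
// scenario evaluation is an independent LP solve, so in a parallel design
// they can be farmed out to workers; solveScenarioLP stands in for a call
// into an existing LP library.
struct Scenario { double probability; /* scenario data omitted */ };

double nodeBound(double firstStageCost,
                 const std::vector<Scenario>& scenarios,
                 const std::function<double(const Scenario&)>& solveScenarioLP) {
  double expected = 0.0;
  for (const Scenario& s : scenarios)        // embarrassingly parallel in practice
    expected += s.probability * solveScenarioLP(s);
  return firstStageCost + expected;
}
```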
|
|
|
4:30 pm - 5:00 pm |
Talk
|
Task Mapping, Job Placements, and Routing Strategies
Dr. Abhinav Bhatele, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
Communication optimizations continue to be important as we scale parallel applications to the largest parallel machines available. Task mapping has been shown to be an effective technique for optimizing the performance of communication-bound applications. In this talk, I will present an overview of our research directions on the Scalable Topology Aware Task Embedding (STATE) project at LLNL. I will present two task mapping tools, Rubik and Chizu, for structured and generic communication graphs respectively. I will also discuss our efforts on modeling congestion on supercomputer networks. Finally, I will present some initial work on evaluating the impact of job placements and routing strategies on application performance. LLNL-ABS-653685.
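As a flavor of what a mapping tool computes, here is a toy function that maps a 3D block-decomposed application grid onto a 3D torus by scaling each dimension, so communicating neighbors in the application land near each other on the network. This is purely illustrative and not how Rubik or Chizu work.

```cpp
#include <array>

// Map task (x, y, z) of an appX*appY*appZ application grid onto a network of
// shape netX*netY*netZ by scaling each dimension; neighboring tasks end up on
// the same or adjacent network coordinates. Toy example only.
std::array<int, 3> mapTask(int x, int y, int z,
                           int appX, int appY, int appZ,
                           int netX, int netY, int netZ) {
  return { x * netX / appX, y * netY / appY, z * netZ / appZ };
}
```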
|
|
|
5:00 pm - 5:30 pm |
Talk
|
Using Charm++ to Improve Extreme Parallel Discrete-Event Simulation (XPDES) Performance and Capability
Prof. Christopher Carothers, Rensselaer Polytechnic Institute
ROSS is an extreme parallel discrete-event simulation (XPDES) system that has demonstrated the ability to scale to millions of cores. In particular, using 120 racks (1.9M cores) of the Blue Gene/Q "Sequoia" system, ROSS was able to process over 500 billion events per second for a benchmark parallel discrete-event simulation model comprised of over 250 million logical processes (LPs) and 16 initial messages per LP. Thus, ROSS coupled with state-of-the-art supercomputer hardware is able to model "planetary"-scale systems with hundreds of millions to even billions of objects. However, there is still room for improvement on two fronts. First, ROSS is an all-MPI application and does not make use of the threading capabilities of modern supercomputer systems. Second, ROSS currently does not provide a load distribution mechanism that operates at massively parallel scales. In this talk, I will give an overview of ROSS and our plans to improve its performance and capabilities by addressing these limitations using Charm++.
|
|
|
6:30 PM |
Dinner at 301 Mongolia, 301 North Neil Street, Champaign, IL 61820.
Meet on the 1st floor of the Siebel Center for carpooling at 6:10 PM; call Nikhil at 217-979-0918. |
|
8:30 am - 12:30 pm |
Charm++ Tutorial
|
Hands-on tutorial in Room SC3405
Phil Miller, Harshitha Menon, Eric Mikida
|
|
|