Time | Type | Description | Slides | Webcast |
8:00 am - 8:45 am |
Continental Breakfast / Registration |
Morning |
|
8:45 am - 9:00 am |
Welcome
|
Opening Remarks
Prof. Laxmikant V. Kale
|
|
|
9:00 am - 9:45 am |
Invited Talk
|
X10 at Scale
Olivier Tardieu, IBM
X10 is an open-source imperative concurrent object-oriented programming language developed by IBM Research to ease the programming of scalable concurrent and distributed applications. In this talk, I will report and reflect on our experience running HPC application kernels and graph algorithms implemented in X10 on a Petaflop IBM Power 775 supercomputer (with up to 55,000 Power7 cores). I will discuss design and implementation decisions that make it possible to achieve competitive performance at scale while retaining X10's productivity. In particular, I'll describe our implementation of the Unbalanced Tree Search benchmark (UTS), which illustrates X10's handling of irregular parallelism.
|
|
|
9:45 am - 10:30 am |
Invited Talk
|
The Chapel Runtime
Greg Titus, Cray
The Chapel runtime provides a variety of services during Chapel program execution. Among these services are managing memory, performing remote communication, handling parallelism and synchronization, and many others. In this talk I'll give an overview of the runtime's architecture and implementation, describe its relationship with Chapel's built-in modules, and give several examples of how Chapel language constructs translate into runtime activities. I'll also discuss how we see the responsibilities of the runtime and its relationships with other parts of the Chapel software stack changing in the future.
|
|
|
10:30 am - 11:00 am |
|
Morning |
Technical Session: Run Time System (Chair: Gengbin Zheng) |
11:00 am - 11:30 am |
Talk
|
A Multi-Paradigm Approach to High-Performance Scientific Programming
Pritish Jetley, University of Illinois at Urbana-Champaign
In this talk, we will consider the following key questions:
(1) Is it possible to write parallel programs in a modular manner, so that independently developed modules (e.g. a finite element code and a computational fluid dynamics code, or even a chemical kinetics module and a cortical neuron simulator) can be composed in a noninvasive and efficient manner?
(2) Is it possible for non-expert programmers to write parallel code in abstract specifications without losing performance? That is to say, can we reconcile the two opposing forces of performance and productivity?
These questions are of immediate importance to the developers of complex multi-physics codes intended to scale to hundreds of thousands of processors. Our approach to this problem is characterized by the following salient features:
1. Specialization: programs are written using a set of abstract and specialized, but individually incomplete frameworks/languages (collectively, programming paradigms) to engender productivity of programming.
2. Interoperability: we provide a common runtime substrate for the communication of data between modules written in different frameworks/languages, thereby enabling completeness of expression.
3. Runtime Management: our abstractions, frameworks and languages are based on an adaptive run time system (ARTS), namely Charm++, that dynamically optimizes the execution of a running program.
We will discuss these inter-related ideas, and examine a few languages and frameworks (specialized and otherwise) in the Charm++ ecosystem.
|
|
|
11:30 am - 12:00 pm |
Talk
|
LRTS: a Portable High Performance Lower-level Communication Interface
Yanhua Sun, University of Illinois at Urbana-Champaign
As modern interconnection networks grow increasingly complex, it is challenging to obtain good performance for a variety of parallel applications across different supercomputers. Over the years, the asynchronous message-driven, runtime-based parallel programming model has proven to be scalable and productive, with Charm++ as one major instance. To better exploit the performance and portability of Charm++ on modern supercomputers, we define a set of functions that form a universal lower-level communication interface. In this talk, I will present this set of functions and how they support the asynchronous runtime system. Features such as persistent communication and intra-node communication are also discussed. The interface has been implemented on top of MPI, Cray uGNI, and IBM PAMI. I will present results for small benchmarks and real applications on state-of-the-art supercomputers.
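To give a flavor of what a machine-independent lower-level communication layer can look like, here is a minimal C++ sketch of such an interface; the names and signatures below are illustrative assumptions, not the actual LRTS API.

```cpp
// Illustrative sketch of a minimal machine-layer communication interface.
// All names below are hypothetical, not the actual LRTS functions.
#include <cstddef>

// Handler the runtime registers to consume messages delivered by the machine layer.
typedef void (*recv_handler_t)(void *msg, std::size_t len);

struct machine_layer_ops {
  void (*init)(int *argc, char ***argv);                  // bring up the network layer
  void (*send)(int dest_pe, void *msg, std::size_t len);  // asynchronous send to a processor
  void (*advance)();                                      // poll/progress pending communication
  void (*barrier)();                                      // machine-level barrier
  void (*exit_layer)();                                   // tear down
};

// A portable runtime selects one implementation (MPI, uGNI, PAMI, ...) at build
// time and registers its message handler with it.
void register_recv_handler(recv_handler_t handler);
```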
|
|
|
12:00 pm - 12:30 pm |
Talk
|
Scalable in-memory checkpoint for hard and soft error protection with automatic restart on failure
Xiang Ni, University of Illinois at Urbana-Champaign
As the scale of machines increases, the HPC community has seen a steady decrease in system reliability, and hence an increase in down time. Moreover, soft errors such as bit flips do not prevent execution but generate incorrect results. Checkpoint/restart is by far the most commonly used fault tolerance method for hard errors, and its efficiency and scalability have been improved with recent research. In this talk, we will discuss the asynchronous double in-memory checkpoint scheme, which can significantly hide the checkpoint overhead by overlapping checkpointing with application execution. Soft errors are becoming more important even at terrestrial altitudes because of shrinking feature sizes in processor manufacturing. Long-running programs with only traditional fault tolerance support, such as checkpointing, have a high chance of ending up with incorrect results due to soft-error corruption. We will also talk about how replication can enhance the checkpoint/restart technique to provide soft error protection for applications. At the same time, replication can increase program efficiency in environments where the fail-stop failure rate is high. We evaluate our approach using multiple benchmarks written in Charm++ and MPI, including stencil codes and molecular dynamics mini-applications. The benchmarks show minimal overhead when scaled to 32768 cores.
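The core of the double in-memory checkpoint idea is that each process keeps one checkpoint copy locally and sends a second copy to a buddy process, so the data survives a single node failure. The MPI-based sketch below illustrates only that structure; it is not the Charm++ implementation, and the ring buddy mapping and equal state sizes are simplifying assumptions.

```cpp
// Sketch of double in-memory checkpointing with a buddy process (illustration only).
#include <mpi.h>
#include <vector>

struct Checkpoint {
  std::vector<char> local_copy;   // checkpoint of my own state, kept in memory
  std::vector<char> buddy_copy;   // checkpoint received from my buddy
};

// Store one checkpoint copy locally and start sending a second copy to a buddy
// rank; nonblocking calls let the exchange overlap with continued execution.
void double_checkpoint(const std::vector<char> &state, Checkpoint &ckpt,
                       MPI_Comm comm, MPI_Request reqs[2]) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int buddy = (rank + 1) % size;            // simple ring buddy mapping (assumption)
  int prev  = (rank - 1 + size) % size;

  ckpt.local_copy = state;                  // first copy stays in local memory
  ckpt.buddy_copy.resize(state.size());     // assumes all ranks have equal state sizes

  MPI_Isend(ckpt.local_copy.data(), (int)ckpt.local_copy.size(), MPI_BYTE,
            buddy, /*tag=*/7, comm, &reqs[0]);
  MPI_Irecv(ckpt.buddy_copy.data(), (int)ckpt.buddy_copy.size(), MPI_BYTE,
            prev, /*tag=*/7, comm, &reqs[1]);
  // The caller waits on reqs (MPI_Waitall) before taking the next checkpoint.
}
```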
|
|
|
12:30 pm - 01:30 pm |
|
Afternoon |
Technical Session: Performance at Scale (Chair: Pritish Jetley) |
1:30 pm - 2:00 pm |
Invited Talk
|
Intuitive Visualizations for Performance Analysis at Scale
Todd Gamblin, Lawrence Livermore National Laboratory
Performance data is highly complex and often difficult to interpret for even a single process. As concurrency levels and on-node complexity rise in modern supercomputers, the difficulty of understanding performance problems also increases. Traditional performance analysis tools can provide some insight into where a code spends its time, but they typically leave to the programmer the task of ascribing meaning to the numbers.
At LLNL, the PAVE project is developing techniques to map performance data to more intuitive domains, and to visualize the result. For example, to analyze a data-dependent problem in a physics simulation, we may want to know how performance measurements relate to the physics mesh. Or, to optimize communication within a parallel load balancer, we need to know more about communication patterns to understand which processes are waiting. In this talk, we give an overview of the tools and techniques we have developed for performance measurement and visualization, and we describe the insights we have gained from this approach.
|
|
|
2:00 pm - 2:30 pm |
Talk
|
Distributed Load Balancing
Harshitha Menon, University of Illinois at Urbana-Champaign
As we move towards the exascale era, dynamic load balancing will become critical for achieving good system utilization. With a large number of cores, centralized load balancing schemes, which collect global information and compute decisions at a central location, are not scalable. In contrast, fully distributed strategies are scalable but do not produce balanced work distributions because they tend to consider only local information. We will talk about a fully distributed algorithm for persistence-based load balancing which achieves good load balance with low overhead, and compare it with other load balancing strategies.
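To illustrate the flavor of a fully distributed, persistence-based scheme, the sketch below shows one possible per-processor decision step: using the average load learned through gossip and a sample of underloaded processors, an overloaded processor greedily plans migrations. This is an assumption about how such a step could look, not the algorithm presented in the talk.

```cpp
// Sketch of one distributed load-balancing decision step (illustration only).
#include <random>
#include <utility>
#include <vector>

struct Task { int id; double load; };

// Given my measured load, the average load learned via gossip, and a sample of
// underloaded processors, greedily pick tasks to migrate until I am near average.
std::vector<std::pair<int, int>>  // (task id, destination processor)
plan_migrations(const std::vector<Task> &my_tasks, double my_load, double avg_load,
                const std::vector<int> &underloaded_pes, std::mt19937 &rng) {
  std::vector<std::pair<int, int>> plan;
  if (underloaded_pes.empty() || my_load <= avg_load) return plan;

  std::uniform_int_distribution<std::size_t> pick(0, underloaded_pes.size() - 1);
  double excess = my_load - avg_load;
  // Persistence assumption: a task's recent load predicts its near-future load.
  for (const Task &t : my_tasks) {
    if (excess <= 0.0) break;
    if (t.load <= excess) {
      plan.emplace_back(t.id, underloaded_pes[pick(rng)]);
      excess -= t.load;
    }
  }
  return plan;
}
```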
|
|
|
2:30 pm - 3:00 pm |
Talk
|
Load Balancing for Cloud Environments
Abhishek Gupta, University of Illinois at Urbana-Champaign
Driven by the benefits of elasticity and pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, heterogeneous and dynamic environment, and interference by other virtual machines (VMs) are some bottlenecks for efficient HPC in cloud. For tightly-coupled iterative applications, one slow processor slows down the entire application, resulting in poor CPU utilisation.
In this talk, we present a dynamic load balancer for tightly-coupled iterative HPC applications in cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by the interference arising due to multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of task distribution to VMs, our load balancer adapts to the dynamic variations in cloud resources. Through experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, we demonstrate performance benefits up to 45%. Finally, we analyse the effect of load balancing frequency, problem size and computational granularity (problem decomposition) on the performance and scalability of our techniques.
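One simple way to turn measured heterogeneity into a new task distribution is to give each VM a share of tasks proportional to its observed speed over the last monitoring window. The sketch below illustrates such a refinement step under simplifying assumptions (uniform task cost, speeds already measured); it is not the balancer described in the talk.

```cpp
// Sketch: recompute per-VM task shares from measured speeds (illustration only).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// speeds[i]: iterations (or tasks) per second observed on VM i in the last window.
// Returns how many of total_tasks each VM should own after the refinement step.
std::vector<int> proportional_shares(const std::vector<double> &speeds, int total_tasks) {
  std::vector<int> shares(speeds.size(), 0);
  double total_speed = std::accumulate(speeds.begin(), speeds.end(), 0.0);
  if (speeds.empty() || total_speed <= 0.0) return shares;

  int assigned = 0;
  for (std::size_t i = 0; i < speeds.size(); ++i) {
    shares[i] = static_cast<int>(total_tasks * speeds[i] / total_speed);
    assigned += shares[i];
  }
  // Hand any rounding leftovers to the fastest VM.
  std::size_t fastest = std::max_element(speeds.begin(), speeds.end()) - speeds.begin();
  shares[fastest] += total_tasks - assigned;
  return shares;
}
```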
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Interoperability (Chair: Ramprasad Venkataraman) |
3:30 pm - 4:00 pm |
Talk
|
Charm++ Interoperability
Nikhil Jain, University of Illinois at Urbana-Champaign
Charm++ is a unique parallel programming paradigm based on message-driven execution powered by a runtime system. Overdecomposition into migratable work units by application writers and (almost) total control of execution by the runtime system allow Charm++ to improve application performance along with programmer productivity. This is in stark contrast with MPI, which generally follows a bulk-synchronous programming model with all decisions taken by the programmer. Given the wide range of applications and their characteristics, either of these two styles of programming may be the more suitable one for a given application. In fact, given the complexity of present-day applications, different programming paradigms may suit different parts of an application. This talk explores interoperability of Charm++ with other programming paradigms (with a focus on MPI). The topics touched upon include Adaptive MPI, hybrid programming in Charm++ and MPI, and interoperability with OpenMP.
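The sketch below shows the general shape of such a hybrid program: an MPI driver keeps overall control and temporarily hands execution to a message-driven runtime for the phases that benefit from it. The charm_* functions are hypothetical placeholders (stubbed out here so the sketch compiles), not the actual Charm++ interoperability API.

```cpp
// Sketch of a hybrid driver that time-shares control between MPI and a
// message-driven runtime (illustration only).
#include <mpi.h>

void charm_library_init(MPI_Comm, int, char **) { /* stand-in for runtime startup */ }
void charm_run_module(const char *)             { /* stand-in for a runtime-managed module */ }
void charm_library_exit()                       { /* stand-in for runtime shutdown */ }

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Phase 1: bulk-synchronous MPI work, fully controlled by the programmer.
  MPI_Barrier(MPI_COMM_WORLD);

  // Phase 2: hand control to the message-driven runtime for the part of the
  // application that benefits from overdecomposition and load balancing.
  charm_library_init(MPI_COMM_WORLD, argc, argv);
  charm_run_module("overdecomposed_solver");
  charm_library_exit();

  // Phase 3: control returns to MPI, which remains in charge overall.
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
```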
|
|
|
4:00 pm - 4:30 pm |
Invited Talk
|
Charm++ Implementation of a Detonation Shock Dynamics Algorithm
Brian McCandless, Lawrence Livermore National Laboratory
This work presents a Charm++ implementation of a narrow-band algorithm for the Detonation Shock Dynamics (DSD) model to simulate the propagation of detonations. This algorithm is regarded as a fast and accurate method; however, there are significant challenges to implementing it in a scalable way. In this algorithm, nearly all of the computational work is located in the small region near the propagation front. If the algorithm is implemented with a simple domain decomposition, the problem becomes load imbalanced, with only a small fraction of the processors having work to do at any given time step. One approach to solving the load balancing problem, using over-decomposition with Charm++, has been investigated. The second focus of this work is MPI interoperability. This work explores the relatively new Charm++/MPI interoperability feature. The DSD algorithm is a small part of a much larger MPI-based application. It is not practical to rewrite the entire application in Charm++, but it is possible to incrementally rewrite certain algorithms, provided that they work well with MPI in control. Initial exploration of the interoperability will be discussed.
|
|
|
4:30 pm - 5:00 pm |
Submitted Paper
|
Optimizing Charm++ over MPI
Ralf Gunter, Argonne National Laboratory
Charm++ may employ any of a myriad of network-specific APIs for handling communication, which are usually promoted as being faster than its catch-all MPI module. Such a performance difference not only causes development effort to be spent on tuning vendor-specific APIs, but also discourages hybrid Charm++/MPI applications. We investigate this disparity across several machines and applications, ranging from small InfiniBand clusters to Blue Gene/Q supercomputers, and from synthetic benchmarks to large-scale biochemistry codes. Finally, we demonstrate how one feature from the recent MPI-3 standard can be used to bridge this gap where applicable, and discuss what can be done today.
|
|
|
5:00 pm - 6:00 pm |
Fun
|
Blue Waters Facility Tour
NCSA
|
|
|
6:00 pm - 7:00 pm |
|
07:00 pm onwards |
Workshop Banquet (for registered participants only). Located in the 2nd floor Atrium outside 2405 Siebel Center |
|
8:00 am - 9:00 am |
Continental Breakfast / Registration |
9:00 am - 10:00 am |
Keynote Talk
|
Towards an Ecosystem for Heterogeneous Parallel Computing
Wu-chun Feng, Virginia Tech
With processor core counts doubling every 18-24 months and penetrating all markets from high-end servers in supercomputers to desktops and laptops down to even mobile phones, we sit at the dawn of a world of ubiquitous parallelism, one where extracting performance via parallelism is paramount. That is, the "free lunch" to better performance, where programmers could rely on substantial increases in single-threaded performance to improve software, is over. The burden falls on developers to exploit parallel hardware for performance gains. But how do we lower the cost of extracting such parallelism, particularly in the face of the increasing heterogeneity of processing cores? To address this issue, this talk will present a vision for an ecosystem for delivering accessible and personalized supercomputing to the masses, one with a heterogeneity of (hardware) processing cores on a die or in a package, coupled with enabling software that tunes the parameters of the processing cores with respect to performance, power, and portability via a benchmark suite of computational dwarfs and applications.
|
|
|
10:00 am - 10:30 am |
Talk
|
Dynamic Power Management in Charm++
Osman Sarood, University of Illinois at Urbana-Champaign
As we move to exascale machines, both peak power demand and total energy consumption have become prominent challenges. A significant portion of that power and energy consumption is devoted to cooling, which is overlooked by most HPC researchers. It is possible to reduce cooling energy consumption provided that processor cores do not overheat. Due to the exponential relationship between core temperature and fault rate, saving cooling energy at the expense of higher core temperatures might imply an increased number of faults. In our work, we propose a scheme which leverages DVFS to constrain core temperatures, which allows us to reduce cooling energy consumption while reducing the fault rate at the same time.
Our approach is particularly designed for parallel applications, which are typically tightly coupled, and tries to minimize the timing penalty associated with temperature control. We formulate a model that captures the expected reduction in execution time that can result from the better reliability brought by temperature control. We demonstrate the use of our scheme with five different applications on a 32-node cluster with a dedicated Computer Room Air-conditioning Unit (CRAC).
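A minimal sketch of the control idea follows: periodically compare the hottest core's temperature against a threshold and step the DVFS level down or up accordingly. The thresholds, frequency levels, and hysteresis band below are illustrative assumptions, not the scheme evaluated in the talk.

```cpp
// Sketch: choose the next DVFS level from the hottest core's temperature
// (illustrative control logic; thresholds and levels are assumptions).
#include <cstddef>
#include <vector>

// Available frequency levels in MHz, highest first.
static const std::vector<int> kFreqLevelsMHz = {2400, 2100, 1800, 1500, 1200};

// Step down one level when the hottest core exceeds the threshold, step back up
// once there is enough headroom, otherwise hold the current frequency.
std::size_t next_freq_level(double max_core_temp_c, std::size_t current_level,
                            double threshold_c = 70.0, double headroom_c = 5.0) {
  if (max_core_temp_c > threshold_c && current_level + 1 < kFreqLevelsMHz.size())
    return current_level + 1;   // too hot: slow down
  if (max_core_temp_c < threshold_c - headroom_c && current_level > 0)
    return current_level - 1;   // comfortably cool: speed back up
  return current_level;         // inside the band: no change
}
```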
|
|
|
10:30 am - 11:00 am |
|
Morning |
Technical Session: Power and Energy (Chair: Osman Sarood) |
11:00 am - 11:30 am |
Invited Talk
|
Power-performance modeling, analyses and challenges
Kirk W. Cameron, Virginia Tech
The power consumption of supercomputers ultimately limits their performance. The current challenge is not whether we can build an exaflop system by 2018, but whether we can do it in less than 20 megawatts. The SCAPE Laboratory at Virginia Tech has been studying the tradeoffs between performance and power for over a decade. We've developed an extensive tool chain for monitoring and managing power and performance in supercomputers. We will discuss our power-performance modeling efforts and the implications of our findings for exascale systems, as well as some research directions ripe for innovation.
|
|
|
11:30 am - 12:00 pm |
Talk
|
Toward Runtime Power Management of Exascale Networks by On/Off Control of Links
Ehsan Totoni, University of Illinois at Urbana-Champaign
Higher-radix networks, such as high-dimensional tori and multi-level directly connected networks, are being used for supercomputers as they become larger but need lower diameter. These networks have more resources (e.g. links) in order to provide good performance for a range of applications. We observe that a sizeable fraction of the links in the interconnect are never used or are underutilized during execution of common parallel applications. Thus, in order to save power, we propose the addition of hardware support for on/off control of links in software and their management using adaptive runtime systems. We study the effectiveness of our approach using real applications (NAMD, MILC) and application benchmarks (NAS Parallel Benchmarks, Jacobi), simulated on representative networks such as a 6-D torus and IBM PERCS (similar to Dragonfly). For common applications, our approach can save up to 20% of the machine's total power and energy, without any performance penalty.
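As a rough illustration of the runtime's decision, the sketch below flags links whose observed utilization over a sampling window falls below a threshold as candidates for powering off; the data layout and threshold are assumptions for illustration, not the mechanism studied in the talk.

```cpp
// Sketch: flag underutilized links as candidates for powering off
// (illustration only; data layout and threshold are assumptions).
#include <vector>

struct LinkStats {
  int id;
  double bytes_sent;       // traffic observed during the sampling window
  double capacity_bytes;   // link bandwidth multiplied by the window length
};

// Links whose utilization stays below the threshold can be switched off by the
// runtime until the application's communication pattern changes.
std::vector<int> links_to_power_off(const std::vector<LinkStats> &links,
                                    double utilization_threshold = 0.01) {
  std::vector<int> off;
  for (const LinkStats &l : links)
    if (l.capacity_bytes > 0.0 &&
        l.bytes_sent / l.capacity_bytes < utilization_threshold)
      off.push_back(l.id);
  return off;
}
```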
|
|
|
12:00 pm - 12:30 pm |
Talk
|
Energy Profile of Fault Tolerance Methods
Esteban Meneses, University of Illinois at Urbana-Champaign
An exascale machine is expected to be delivered in the 2018-2022 time frame. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large number of components it will encompass. Some form of fault tolerance has to be incorporated in the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This talk presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message logging, and parallel recovery. Using programs from different programming models, we show that parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.
|
|
|
12:30 pm - 01:30 pm |
|
01:30 pm - 03:00 pm |
Panel
|
Temperature, Power, Time and Energy: Can software control it all?
Panelists: Kirk W. Cameron (Virginia Tech), Martin Schulz (Lawrence Livermore National Laboratory), Wu-chun Feng (Virginia Tech), Mitsuhisa Sato (University of Tsukuba), Laxmikant V. Kale (University of Illinois at Urbana-Champaign). Moderator: William Gropp (University of Illinois at Urbana-Champaign)
As we move from Petascale to Exascale, these factors are becoming increasingly important. To avoid overheating the chips, frequencies have stopped increasing.
Instantaneous power needs to be kept within the facility's limits. The energy per FLOP needs to be kept within bounds for attaining exaFLOP/s rates in a practical manner. Architecture innovations are clearly needed. But the question is: given a particular machine's hardware, can software do something so as to (a) keep instantaneous power within pre-defined limits, (b) keep chip temperatures within pre-set thresholds, (c) minimize execution time for a given job, and (d) minimize the energy used per job? This panel will seek answers from leading researchers in this area.
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Applications (Chair: Celso L. Mendes) |
3:30 pm - 4:00 pm |
Talk
|
Scaling Agent-based Simulation of Contagion Diffusion over Dynamic Networks on Petascale Machines
Keith Bisset, Virginia Tech
Applications that model dynamical systems involve large-scale, irregular graph processing. These applications are difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics implements a graph-based system that captures dynamics among co-evolving entities, while simulating contagion diffusion in extremely large and realistic social contact networks. EpiSimdemics relies on individual-based models, thus allowing studies in great detail. This paper presents a novel implementation of EpiSimdemics in Charm++, which enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific decomposition of graph data and predictive dynamic load migration, and demonstrate the effectiveness of these methods on Cray XE6/XK7 and IBM Blue Gene/Q.
|
|
|
4:00 pm - 4:30 pm |
Talk
|
ChaNGa: a Charm++ N-body Treecode
Tom Quinn, University of Washington
Simulations of cosmological structure formation demand significant computational resources because of the vast range of scales involved: from the size of star formation regions to, literally, the size of the Universe. I will describe the cosmology problems we plan to address with our Blue Waters allocation. ChaNGa, a Charm++ N-body/Smoothed Particle Hydrodynamics tree-code, is the application we will be running with this allocation. I will describe the improvements in both astrophysical modeling and parallel performance we have made over the past year in preparation for running these simulations.
|
|
|
4:30 pm - 5:00 pm |
Talk
|
Scalable Algorithms for Structured Adaptive Mesh Refinement
Akhil Langer, University of Illinois at Urbana-Champaign
We present scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for these computations that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a distributed load balancer, in contrast to traditional linear numbering schemes, which incur unnecessary synchronization for load balancing. In contrast to existing approaches, which take O(P) time and storage per process, our approach takes only constant time and has a very small memory footprint. With these optimizations, our algorithm is scalable and suitable for large, highly refined meshes. We present strong-scaling experiments up to 16k ranks on IBM Blue Gene/Q.
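One common way to avoid global renumbering during refinement is to identify each block by its refinement path, so a block can compute its children's indices locally. The sketch below illustrates that idea for a quadtree; it is an assumption for illustration, not necessarily the exact indexing used in this work.

```cpp
// Sketch: identify an AMR block by its refinement path so that children indices
// are computed locally, with no global renumbering (illustration only).
#include <array>
#include <cstdint>

struct BlockId {
  std::uint64_t path;  // 2 bits per level of a quadtree: which child at each level
  int depth;           // number of levels below the root block
};

// Refining a block yields four children whose ids simply extend the parent's path.
std::array<BlockId, 4> refine(const BlockId &parent) {
  std::array<BlockId, 4> children;
  for (int quadrant = 0; quadrant < 4; ++quadrant) {
    children[quadrant].path  = (parent.path << 2) | static_cast<std::uint64_t>(quadrant);
    children[quadrant].depth = parent.depth + 1;
  }
  return children;
}
```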
|
|
|
5:00 pm - 5:30 pm |
Talk
|
TRAM: Improving Fine-grained Communication Performance with Topological Routing and Aggregation of Messages
Lukasz Wesolowski, University of Illinois at Urbana-Champaign
Fine-grained communication in supercomputing applications often limits performance due to high communication overhead and saturation of network bandwidth. In this talk I describe how to optimize fine-grained communication performance using the Topological Routing and Aggregation Module (TRAM). TRAM collects units of fine-grained communication from application code and combines them into aggregate messages with a common intermediate destination. It routes these messages along a virtual mesh topology mapped onto the physical topology of the network, recombining message fragments at intermediate destinations. TRAM leads to improved network bandwidth utilization and reduced message overhead. It is particularly effective in improving performance of patterns with global communication and a large number of messages, such as all-to-all and many-to-many paradigms. This will be demonstrated with performance results from two scientific applications: EpiSimdemics, a simulator of the spread of contagion, and ChaNGa, an N-Body cosmological simulator.
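The sketch below illustrates the aggregation side of this idea: fine-grained items that share an intermediate destination are buffered and flushed as a single combined message. The buffering policy and interfaces are illustrative assumptions, not the TRAM implementation; in a mesh topology, for instance, the intermediate destination could be the peer that shares the sender's row.

```cpp
// Sketch of a message aggregation buffer keyed by intermediate destination
// (illustration only, not the TRAM implementation).
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

template <typename Item>
class Aggregator {
 public:
  Aggregator(std::size_t flush_threshold,
             std::function<void(int, std::vector<Item> &&)> send)
      : threshold_(flush_threshold), send_(std::move(send)) {}

  // Buffer an item; once enough items share an intermediate destination,
  // they leave this node as one combined message.
  void submit(int intermediate_dest, const Item &item) {
    std::vector<Item> &buf = buffers_[intermediate_dest];
    buf.push_back(item);
    if (buf.size() >= threshold_) flush(intermediate_dest);
  }

  // Flush everything, e.g. at the end of a communication phase.
  void flush_all() {
    for (auto &entry : buffers_)
      if (!entry.second.empty()) flush(entry.first);
  }

 private:
  void flush(int dest) {
    send_(dest, std::move(buffers_[dest]));
    buffers_[dest].clear();
  }

  std::size_t threshold_;
  std::function<void(int, std::vector<Item> &&)> send_;
  std::unordered_map<int, std::vector<Item>> buffers_;
};
```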
|
|
|
5:30 pm - 6:00 pm |
Fun
|
Annual PPL Photograph
Jonathan Lifflander
|
|
|