Location: 2405 Siebel Center
The schedule below is tentative and subject to change.
8:15 am - 8:30 am
Morning
8:30 am - 9:00 am
8:45 am - 09:30 am | Keynote
Architecture-aware Algorithms and Software for Peta and Exascale Computing
Prof. Jack Dongarra, University Distinguished Professor of Electrical Engineering and Computer Science, University of Tennessee
In this talk we examine how high performance computing has changed over the
last ten years and look toward the future in terms of trends. These changes have
had and will continue to have a major impact on our software. Some of the
software and algorithm challenges have already been encountered, such as
management of communication and memory hierarchies through a combination of
compile-time and run-time techniques, but the increased scale of computation,
depth of memory hierarchies, range of latencies, and increased run-time
environment variability will make these problems much harder.
We will look at five areas of research that will have an important impact on
the development of software and algorithms. We will focus on the following themes:
- Redesign of software to fit multicore and hybrid architectures
- Automatically tuned application software
- Exploiting mixed precision for performance
- The importance of fault tolerance
- Communication avoiding algorithms
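As a hedged illustration of the mixed-precision theme (not material from the keynote), the classic approach is iterative refinement: factor and solve in single precision, then compute residuals and apply corrections in double precision. The toy diagonally dominant system, names, and iteration count below are purely illustrative.

```cpp
// Minimal sketch of mixed-precision iterative refinement (illustrative only):
// solve A x = b by factoring/solving in float, refining the residual in double.
#include <cstdio>
#include <cmath>
#include <vector>

// Solve A x = b in single precision with Gaussian elimination (no pivoting;
// acceptable here because the toy matrix below is diagonally dominant).
static std::vector<float> solve_single(std::vector<float> A, std::vector<float> b, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            float m = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; ++j) A[i*n + j] -= m * A[k*n + j];
            b[i] -= m * b[k];
        }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return x;
}

int main() {
    const int n = 4;
    // Toy diagonally dominant system, kept in double precision.
    std::vector<double> A = { 10,1,2,0,  1,12,0,3,  2,0,9,1,  0,3,1,11 };
    std::vector<double> b = { 13, 16, 12, 15 };

    std::vector<float> Af(A.begin(), A.end()), bf(b.begin(), b.end());
    std::vector<float> xf = solve_single(Af, bf, n);
    std::vector<double> x(xf.begin(), xf.end());

    // Refinement loop: residual and update in double, correction solve in float.
    for (int it = 0; it < 5; ++it) {
        std::vector<double> r(n);
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < n; ++j) s -= A[i*n + j] * x[j];
            r[i] = s;
        }
        std::vector<float> rf(r.begin(), r.end());
        std::vector<float> d = solve_single(Af, rf, n);
        double norm = 0.0;
        for (int i = 0; i < n; ++i) { x[i] += d[i]; norm += r[i] * r[i]; }
        std::printf("iter %d, residual norm %.3e\n", it, std::sqrt(norm));
    }
    return 0;
}
```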
9:30 am - 10:00 am | Talk
Charm++ Research Agenda: Recent Developments and Plans
Prof. Laxmikant V. Kale, Professor of Computer Science, University of Illinois
Charm++ and the rich research agenda engendered by its idea of object-based
over-decomposition made significant progress during the past year. I will
review the basic concepts that have been the foundation of our approach to
parallel programming, and highlight specific achievements of the past year.
These include progress on our production-quality, collaboratively developed
science and engineering applications, including NAMD (biophysics), OpenAtom
(quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the
progress and challenges in our agenda of developing higher level parallel
languages.
10:00 am - 10:15 am
Morning Technical Session: Load Balancing (Chair: Dr. Gengbin Zheng)
10:15 am - 10:45 am | Submitted Paper
Improving Charm++ Performance with a NUMA-aware Load Balancer
Laercio Lima Pilla, Federal University of Rio Grande do Sul, Brazil
The importance of Non-Uniform Memory Access (NUMA) machines has been increasing
as a solution to ease the memory wall problem and to provide better scalability
for multi-core machines. On such machines, the shared memory is physically
distributed into memory banks connected by a network. Due to this, memory
access costs may vary depending on the distance between the desired memory bank
and the requesting processing unit. We propose a NUMA-aware load balancer that
combines the information about the NUMA topology with the statistics captured
by the Charm++ RTS. We present improvements of up to 10% over existing load
balancing strategies in benchmark performance. In addition, our algorithm
incurs up to seven times less overhead than the other strategies by
avoiding unnecessary migrations.
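As a rough illustration of the idea (not the paper's algorithm and not the Charm++ load balancing API), a NUMA-aware strategy can fold a topology-distance penalty into the cost used to pick each task's destination core. The toy topology, loads, and weight below are made up.

```cpp
// Toy, framework-agnostic sketch of a NUMA-aware placement heuristic: greedily
// assign each task to the core minimizing (core load + task load + a penalty
// proportional to the NUMA distance from the task's current core).
#include <cstdio>
#include <vector>

int main() {
    const int cores = 4;
    // Hypothetical NUMA distance between cores (same socket = 1, remote = 2).
    int dist[cores][cores] = { {0,1,2,2}, {1,0,2,2}, {2,2,0,1}, {2,2,1,0} };
    // Measured per-task loads (e.g., wall times from the RTS) and current placement.
    std::vector<double> taskLoad = { 5.0, 3.0, 4.0, 2.0, 6.0, 1.0 };
    std::vector<int>    taskHome = { 0,   0,   1,   2,   3,   3   };

    std::vector<double> coreLoad(cores, 0.0);
    const double migrationWeight = 0.5;   // how much NUMA distance costs us

    for (size_t t = 0; t < taskLoad.size(); ++t) {
        int best = 0; double bestCost = 1e300;
        for (int c = 0; c < cores; ++c) {
            double cost = coreLoad[c] + taskLoad[t]
                        + migrationWeight * dist[taskHome[t]][c];
            if (cost < bestCost) { bestCost = cost; best = c; }
        }
        coreLoad[best] += taskLoad[t];
        std::printf("task %zu: core %d -> core %d\n", t, taskHome[t], best);
    }
    for (int c = 0; c < cores; ++c)
        std::printf("core %d load %.1f\n", c, coreLoad[c]);
    return 0;
}
```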
10:45 am - 11:15 am | Talk
Temperature aware Load Balancing for Parallel Applications
Osman Sarood
Increasing numbers of cores and clock speeds on a smaller chip area imply more heat dissipation and an ever-increasing heat density. This increased heat, in turn, leads to higher cooling cost and occurrence of hot spots. Effective use of dynamic voltage and frequency scaling (DVFS) can help us alleviate this problem. But there is an associated execution time penalty which can get amplified in parallel applications. In high performance computing, applications are typically tightly coupled and even a single overloaded core can adversely affect the execution time of the entire application. This makes load balancing of utmost value. In this paper, we outline a temperature-aware load balancing scheme, which uses DVFS to keep core temperatures below a user-defined threshold with minimum timing penalty. While doing so, it also reduces the possibility of hot spots. We apply our scheme to three parallel applications with different energy consumption profiles. Results from our technique show that we save up to 14% in execution time and 12% in machine energy consumption as compared to frequency scaling without using load balancing. We are also able to bound the average temperature of all the cores and reduce the temperature deviation amongst the cores by a factor of 3.
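A minimal sketch of the control idea only, assuming a Linux-style sysfs interface and the userspace cpufreq governor; this is not the paper's implementation, and the paths, threshold, core count, and frequency table are placeholders.

```cpp
// Periodically read each core's temperature and lower its frequency via DVFS
// when it exceeds a user-defined threshold, raising it again once it cools.
// Sysfs paths below are typical Linux locations but vary by system.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

static long readTempMilliC(int zone) {
    std::ifstream f("/sys/class/thermal/thermal_zone" + std::to_string(zone) + "/temp");
    long t = 0; f >> t; return t;                 // millidegrees Celsius
}

static void setFreqKHz(int cpu, long khz) {       // requires the "userspace" governor
    std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_setspeed");
    f << khz;
}

int main() {
    const long thresholdMilliC = 60000;           // example 60 C threshold
    const std::vector<long> freqsKHz = {1600000, 2000000, 2400000};
    std::vector<size_t> level(4, freqsKHz.size() - 1);   // 4 cores, start at max

    for (int step = 0; step < 10; ++step) {       // one balancing period per step
        for (int c = 0; c < 4; ++c) {
            long t = readTempMilliC(c);
            if (t > thresholdMilliC && level[c] > 0) --level[c];
            else if (t < thresholdMilliC - 5000 && level[c] + 1 < freqsKHz.size()) ++level[c];
            setFreqKHz(c, freqsKHz[level[c]]);
            std::cout << "core " << c << ": " << t / 1000.0 << " C, "
                      << freqsKHz[level[c]] / 1000 << " MHz\n";
        }
        // A load balancer would now migrate work away from the slowed-down cores.
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
    return 0;
}
```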
11:15 am - 11:45 am | Talk
New Developments in the Charm++ Load Balancing Framework and its Applications
Dr. Abhinav Bhatele
The theme of this year's workshop is load balancing. Over the past year,
several improvements have been made to the Charm++ load balancing framework and
new strategies have been added. We will highlight new developments and also
discuss new load balancers and their performance. Some new load balancing
strategies are based on recursive bi-partitioning and Scotch (a graph
partitioning library).
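For readers new to the framework, the fragment below sketches how an application opts into measurement-based load balancing; the module name, decl/def headers, and balancing period are assumptions, and the strategy (for example GreedyLB) is selected at job launch with the +balancer option rather than in code.

```cpp
// Schematic fragment: array elements declare usesAtSync, periodically call
// AtSync(), and the runtime invokes ResumeFromSync() after migration.
// Strategy chosen at launch, e.g.:  ./charmrun +p8 ./app +balancer GreedyLB
// (Assumes a matching .ci file declaring the Worker array; the decl/def
// headers are generated by charmc and omitted here.)
#include "worker.decl.h"

class Worker : public CBase_Worker {
  int step = 0;
public:
  Worker() { usesAtSync = true; }            // enable measurement-based LB
  Worker(CkMigrateMessage *m) {}
  void pup(PUP::er &p) {                     // element state must be serializable
    CBase_Worker::pup(p);
    p | step;
  }
  void doStep() {
    // ... one iteration of application work ...
    if (++step % 20 == 0) AtSync();          // hand control to the balancer
    else thisProxy[thisIndex].doStep();
  }
  void ResumeFromSync() {                    // called when balancing completes
    thisProxy[thisIndex].doStep();
  }
};

#include "worker.def.h"
```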
11:45 am - 12:15 pm | Talk
Impact of Type Ia Supernova Ejecta on the Binary Companion
Kuo-Chuan Pan
Type Ia supernovae are thought to be due to thermonuclear explosions of
carbon-oxygen white dwarfs in close binary systems.
In the single-degenerate scenario, the companion star is non-degenerate and
can be significantly altered by the explosion. We explore this interaction
by means of three-dimensional adaptive mesh refinement (AMR) simulations
using the FLASH code. Since the simulations are computationally expensive,
we optimize FLASH in two ways: (1) using an automatic MPI-to-AMPI conversion
tool to benefit from object migration in Charm++, and (2) redesigning the AMR
framework in Charm++ for dynamic scheduling and processor virtualization.
We consider several different companion types,
including red giants, main-sequence-like stars, and helium stars, and we
include the symmetry-breaking effects of orbital motion, Roche-lobe
overflow, and pre-supernova mass loss.
Our analysis focuses on mass loss by the companion, contamination of the companion's atmosphere by
supernova ejecta, and post-supernova motion of the companion relative to
the ejecta. We discuss the implications of our results for variation in
Type Ia supernova properties and searches for remnant companion stars.
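To illustrate the MPI-to-AMPI route mentioned above: the source stays ordinary MPI, and recompiling against AMPI turns each rank into a migratable user-level thread. The wrapper name and run-time flags in the comments are typical but should be treated as assumptions about the local Charm++ build.

```cpp
// Illustrative sketch of running an unmodified MPI code under AMPI.
// Typical usage (treat as assumptions):
//   ampicxx -o jacobi jacobi.cpp
//   ./charmrun +p4 ./jacobi +vp32 +balancer GreedyLB
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // under AMPI, "size" is the number of
                                             // virtual processors (+vp), not cores
    double local = rank + 1.0, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum over %d virtual ranks = %g\n", size, sum);
    MPI_Finalize();
    return 0;
}
```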
12:15 pm - 01:15 pm
Afternoon Technical Session: Exascale Efforts (Chair: Ryan Mokos)
01:15 pm - 01:45 pm | Talk
Fault Tolerance Support for Supercomputers with Multicore Nodes
Esteban Meneses and Xiang Ni
The widespread adoption of multicore chips as the basis to build supercomputers brings new design options for fault tolerance strategies. In this talk, we will describe how we evolved our two major fault tolerance techniques, checkpoint/restart and message logging, to take full advantage of the opportunities offered by this architecture.
01:45 pm - 02:15 pm | Talk
Architectural constraints to attain 1 Exaflop/s for molecular dynamics and cosmology simulations
Dr. Abhinav Bhatele and Pritish Jetley
The first Teraflop/s computer, the ASCI Red, became
operational in 1997, and it took more than 11 years for a Petaflop/s
performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts
have begun to study the hardware and software challenges for building an
exascale machine. It is important to understand and meet these challenges in
order to reach Exaflop/s performance. In this talk, we will present a
feasibility study of two important application classes to formulate the
constraints that these classes will impose on the machine architecture for
achieving a sustained performance of 1 Exaflop/s. The application classes are
classical molecular dynamics and cosmological simulations. We will analyze the
problem sizes required for representative applications to achieve 1 Exaflop/s
and the hardware requirements in terms of the network and memory. Based on the
analysis for achieving an Exaflop/s, we will also discuss the performance of
these applications for much smaller problem sizes.
02:15 pm - 02:45 pm | Talk
Large scale simulations enabled by BigSim
Dr. Gengbin Zheng and Ehsan Totoni
Petaflop/s class computers are currently being deployed and even larger exascale
computers are being planned. Our BigSim project is aimed at developing tools
that allow one to develop, debug and tune/scale/predict the performance of
applications before such machines are available so that the applications can be
ready when the machine first becomes operational. It also allows easier
"offline" experimentation of parallel performance tuning strategies --- without
using the full parallel computer. This talk will focus on our recent
developments in the BigSim project, including work on simulating petascale machines
running NAMD with improved accuracy, a new, faster implementation of a sequential
network simulator, and several case studies of using BigSim.
02:45 pm - 03:00 pm
Afternoon Technical Session: Applications (Chair: Eric Bohm)
03:00 pm - 03:30 pm | Submitted Paper
Large-Scale Computational Epidemiology Simulations using Charm++
Keith Bisset, Virginia Bioinformatics Institute at Virginia Tech
Contagion (or diffusion) models are pervasive in the social and physical
sciences; an example is a potential pandemic caused by avian
influenza. Developing computational models to reason about these
systems is complicated and scientifically challenging. The size and
scale of these systems can be extremely large (e.g., pandemic planning
at a global scale requires models with 6 billion agents). Developing
scientific foundations for practical global-scale problems requires
one to model systems comprised of multiple interacting behaviors,
networks, and contagions. In our example of epidemiology, we are
interested in at least two separate contagion processes: spread of
disease through the population, and spread of fear, influence, and
information in response to the epidemic.
We present preliminary results from our work applying the Charm++
framework to the domain of Computational Epidemiology. Computational
Epidemiology involves a type of computation that is different from that of
typical HPC codes. It is characterized by unstructured, dynamically
changing communication patterns. Our initial results are promising,
showing significant improvements over the original MPI-based
implementation. In addition, the task-based decomposition dictated by
Charm++ is a natural fit for these types of problems.
03:30 pm - 04:00 pm | Talk
Scaling Dense LU Factorization in Charm++
Jonathan Lifflander
We describe a new implementation of LU factorization of dense matrices
in Charm++. This is a popular supercomputer benchmark, whose
best-known implementation, HPL, is used to generate the Top500
rankings. Our implementation focuses on maximising dynamic execution
flexibility, based on the arrival of input data. We will describe the
tools and techniques this application uses, and present scaling
results obtained on Cray XT5 and Blue Gene/P systems.
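For context, here is a tiny sequential sketch (not the Charm++ code) of the kernel being parallelized: right-looking LU with partial pivoting, the computation at the heart of HPL. The block decomposition and data-driven updates of the actual implementation are not shown.

```cpp
// Sequential right-looking LU factorization with partial pivoting.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Factor A (n x n, row-major) in place into L and U; piv records row swaps.
bool lu_factor(std::vector<double> &A, std::vector<int> &piv, int n) {
    for (int k = 0; k < n; ++k) {
        int p = k;                                   // find the pivot row
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i*n + k]) > std::fabs(A[p*n + k])) p = i;
        if (A[p*n + k] == 0.0) return false;         // singular
        piv[k] = p;
        if (p != k)
            for (int j = 0; j < n; ++j) std::swap(A[k*n + j], A[p*n + j]);
        for (int i = k + 1; i < n; ++i) {            // multipliers (column of L)
            A[i*n + k] /= A[k*n + k];
            for (int j = k + 1; j < n; ++j)          // trailing-submatrix update
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
    return true;
}

int main() {
    int n = 3;
    std::vector<double> A = { 2, 1, 1,   4, 3, 3,   8, 7, 9 };
    std::vector<int> piv(n);
    if (lu_factor(A, piv, n))
        std::printf("U diagonal: %g %g %g\n", A[0], A[4], A[8]);
    return 0;
}
```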
04:00 pm - 05:00 pm | CS Distinguished Lecture Series (1404 SC)
Constraint-based Synchronization for Strong Scaled Execution
Prof. Thomas Sterling, Arnaud and Edwards Professor of Computer Science, Louisiana State University
Weak-scaled parallel execution has demonstrated dramatic advances in the last
two decades through the combination of MPP and commodity cluster architectures
with communicating sequential processes programming methods (e.g., MPI).
Increased performance in that time frame for suitable applications (e.g.,
Linpack) has been observed in excess of four to five orders of magnitude.
However, many applications requiring reduction of execution time for fixed
sized data sets have exhibited less favorable progress. These strong scaled
applications are constrained by limited parallelism, high overheads, delays due
to remote access latencies, and contention for shared resources. Conventional
practices mitigate such challenges through regular, coarse-grained tasks that
avoid significant effects of these sources of performance degradation. This
presentation will discuss recent results of experiments in strong scaling using
lightweight synchronization constructs including dataflow and futures objects
combined with message-driven computation and dynamic thread scheduling. The
ParalleX execution model is an experimental synthesis of these and other
methods that expose and exploit parallelism inherent to the meta-data of
dynamic graph-based applications. A proof-of-concept reference implementation
of ParalleX, HPX-3, is a runtime system that runs on conventional SMPs and
commodity clusters with Unix-like operating system interfaces. The goal of this
research has been to investigate the potential of constraint-based
synchronization for detection and dynamic scheduling of this form of
parallelism for strong scaling. An advanced Adaptive Mesh Refinement code used
for numerical relativity studies was originally developed in MPI and has been
ported to the HPX library. Initial measurements suggest that such methods may
exhibit significant improvements of scalability compared to conventional
practices. This talk will present these findings and conclude with requirements
for future work.
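A minimal, standard C++ illustration of the futures idea (not HPX or ParalleX code): the consumer is constrained only by the availability of the value it needs, so independent work can overlap with its production. The function name and workload are placeholders.

```cpp
// Futures as constraint-based synchronization: block only when the value is needed.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

double expensive_stencil_sweep() {                   // stand-in for remote work
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return 42.0;
}

int main() {
    // Launch the producer; the future is the synchronization object.
    std::future<double> f = std::async(std::launch::async, expensive_stencil_sweep);

    double local = 0.0;
    for (int i = 0; i < 1000000; ++i) local += 1e-6;  // overlapped local work

    double remote = f.get();                          // wait only here, if at all
    std::printf("local=%g remote=%g\n", local, remote);
    return 0;
}
```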
05:00 pm - 06:30 pm
06:30 pm onwards | Workshop Banquet (for registered participants only)
8:15 am - 8:30 am
Morning Technical Session: Invited Talks (Chair: Prof. Laxmikant Kale)
08:30 am - 09:15 am | Keynote
TSUBAME2.0, or the long road from tiny clusters to petascale, and its possible contributions to high-resolution natural disaster simulations
Prof. Satoshi Matsuoka, Professor, Tokyo Institute of Technology and National Institute of Informatics, Japan
TSUBAME2.0 is the latest incarnation of the series of clusters that have been
built at Tokyo Institute of Technology, and has become the first supercomputer
in Japan to reach the Petaflops plateau. TSUBAME 2.0 embodies many unique
features derived from years of research into HPC, especially retaining or
improving bandwidth scalability, fault tolerance, and energy efficiency, using the
latest hardware components such as GPUs and SSDs, as well as employing some of
the latest software research results from labs at Tokyo Tech. I will also touch
upon its possible use in simulations of natural disasters that have hit Japan
recently, demonstrating that, despite its relatively small size and its adoption
of a hybrid architecture, it scales well to thousands of GPUs and delivers
performance topping the largest machines such as the ORNL Jaguar.
09:15 am - 09:45 am | Talk
Towards a Usable Programming Model for GPGPUs
Prof. Orion Sky Lawlor, University of Alaska at Fairbanks
The enormous performance potential of the modern Graphics Processing Unit (GPU)
for general purpose programming (GPGPU) is matched by the enormous difficulty
of writing correct and maintainable high-performance GPGPU software. In
particular, today's mainstream GPGPU languages like CUDA and OpenCL manage to
combine many of the worst features of both shared memory programming, such as
locking and race conditions, with the worst features of distributed memory
parallel programming, such as explicit byte-level memory copies. As an
alternative, we present a simple restricted programming model for GPGPU with
clean syntax, guaranteed race condition free memory access, and excellent
performance.
We compare this new model against Charm++'s accelerator interface, Charm++
Arrays, SDAG, and MSA. We conclude by exploring methods by which Charm++'s
flagship dynamic migration and automatic load balancing capabilities could be
more naturally extended to the GPGPU era.
09:45 am - 10:15 am | Talk
Exploring Novel Parallel Implementations of Stochastic Programs
Prof. Udatta Palekar, Professor of Business Administration, University of Illinois
Real-life stochastic programs typically involve integer programs, which are
hard to solve. The problem we are interested in involves the scheduling of
aircraft for transportation of passengers and cargo. Typical real-life
applications involve millions of equations and constraints.
We present interesting parallelization challenges, both inherent to the
algorithms and imposed by the use of popular numeric libraries. We will also
discuss a design that should significantly enhance the scalability of a
parallel implementation. Decomposing the stochastic program into multi-stage
linear programs and adopting a branch-and-bound technique can enhance the
scalability.
10:15 am - 10:30 am
Morning Technical Session: Languages (Chair: Dr. Celso Mendes)
10:30 am - 11:00 am | Talk
Adventures in Load-Balancing at Large Scale: Successes, Fizzles, and Next Steps
Ewing "Rusty" Lusk Mathematics and Computer Science Division, Argonne National Laboratory
Click here to expand description
This talk will describe an ongoing project to scale simple load-balancing
approaches to hundreds of thousands of processes. The project has developed
the Asynchronous, Dynamic Load-Balancing Library (ADLB) interface, and
experimented with multiple implementations. The API is small and easy to use,
yet flexible enough to serve both as a high-level manager/worker programming
model in its own right and as a low-level execution model for higher-level
approaches. This talk will describe a sophisticated application that has used
ADLB to scale to today's largest machines while simplifying its programming
approach, an alternate implementation that has many good properties but scales
less well, and plans for future improvements.
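As a toy, shared-memory rendering of the manager/worker pattern the library generalizes (this is explicitly not the ADLB API, which provides an analogous put/get work-pool abstraction across MPI processes): workers repeatedly take work units from a shared pool, may put new ones back, and stop once the pool drains.

```cpp
// Toy work-pool illustration; all numbers and the splitting rule are made up.
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex mtx;
std::queue<int> pool;          // shared pool of work units
int outstanding = 0;           // units taken but not yet finished

bool take(int &w) {
    std::lock_guard<std::mutex> lk(mtx);
    if (pool.empty()) return false;
    w = pool.front(); pool.pop(); ++outstanding;
    return true;
}
void put(int w) { std::lock_guard<std::mutex> lk(mtx); pool.push(w); }
void done()     { std::lock_guard<std::mutex> lk(mtx); --outstanding; }
bool finished() { std::lock_guard<std::mutex> lk(mtx); return pool.empty() && outstanding == 0; }

void worker(int id) {
    int w;
    while (!finished()) {
        if (!take(w)) { std::this_thread::yield(); continue; }
        if (w > 1) { put(w / 2); put(w - w / 2); }   // split large units
        else        std::printf("worker %d did a unit\n", id);
        done();
    }
}

int main() {
    for (int i = 0; i < 4; ++i) put(5);              // seed the pool
    std::vector<std::thread> ts;
    for (int i = 0; i < 3; ++i) ts.emplace_back(worker, i);
    for (auto &t : ts) t.join();
    return 0;
}
```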
11:00 am - 11:30 am | Talk
Using distributed shared-array abstractions in a virtualized message-driven execution environment
Phil Miller
The strength of the Charm++ programming model is the flexibility it
affords to remap computational objects for load balance and
locality. However, programs using the Multiphase Shared Arrays (MSA)
library have not been able to easily exploit this flexibility due to
restrictions of the implementation. This talk will outline the work
done to lift those limitations, and present some preliminary results
on the performance improvements that can be achieved.
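As a toy, single-process illustration of the phase discipline that MSA enforces (this is not the MSA API): accesses alternate between write/accumulate and read phases separated by an explicit sync, which is what gives the runtime freedom to place and migrate the data between phases. The class and method names below are invented for illustration.

```cpp
// Toy phased array: writes and reads may not be mixed within a phase.
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

class PhasedArray {
    std::vector<double> data;
    enum Phase { WRITE, READ } phase = WRITE;
public:
    explicit PhasedArray(std::size_t n) : data(n, 0.0) {}
    void accumulate(std::size_t i, double v) { assert(phase == WRITE); data[i] += v; }
    double read(std::size_t i) const         { assert(phase == READ);  return data[i]; }
    void syncToRead()  { phase = READ;  }    // in MSA this is a collective sync
    void syncToWrite() { phase = WRITE; }
};

int main() {
    PhasedArray hist(4);
    for (int worker = 0; worker < 3; ++worker)       // "workers" contribute
        for (std::size_t i = 0; i < 4; ++i) hist.accumulate(i, worker + 1);
    hist.syncToRead();                               // phase change
    for (std::size_t i = 0; i < 4; ++i) std::printf("bin %zu = %g\n", i, hist.read(i));
    return 0;
}
```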
11:30 am - 12:00 pm | Talk
Charj: A Language for Writing Better Charm Programs More Easily
Aaron Becker
Charm is a powerful system, but using it effectively is not always easy
or intuitive. Because Charm is implemented as a C++ library and associated
translator, it suffers from difficult-to-interpret error messages, duplication
of code, and the inability to perform optimizations. This talk describes
ongoing work on Charj, a language and compiler targeting the Charm runtime,
which aims to address these problems.
12:00 pm - 12:30 pm | Talk
Asynchronous message-driven programming in a shared-memory context
Pritish Jetley
In the context of shared-memory programs, application performance is typically
confounded by the following interrelated issues: dynamic task creation and
placement, synchronization and data movement costs, and critical path delays.
Charm++ offers a solution to these problems in the form of an object-based,
message-driven approach to the programming of irregular, shared-memory
programs. The defining features of this approach are:
- Overdecomposition of problems into medium-grained tasks, which engenders data locality;
- Implicit synchronization between objects, which is realized through the exchange of messages;
- Asynchrony of communication, which allows communication latency to be overlapped with useful computation;
- Automatic scheduling of dynamically generated tasks, which removes load imbalance;
- Prioritization of tasks, which prevents the critical path from being delayed.
We outline the productivity and performance benefits of this paradigm in the
context of two tree-based applications, namely, N-body computations using
the Barnes-Hut method, and the construction of SAH-balanced kd-trees for
efficient rendering of three-dimensional scenes.
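A minimal Charm++ fragment illustrating this message-driven style: invoking an entry method on a (possibly remote, possibly migrated) chare enqueues a message, and the runtime schedules the method when the message is delivered. The interface file, header names, and proxy usage in the comments are schematic assumptions.

```cpp
// Assumes a matching .ci interface file, e.g.
//   chare Counter { entry Counter(); entry void add(int value); };
// whose charmc-generated decl/def headers are included below.
#include "counter.decl.h"

class Counter : public CBase_Counter {
  int total = 0;
public:
  Counter() {}
  void add(int value) {                    // runs whenever an 'add' message is delivered
    total += value;
    CkPrintf("total is now %d on PE %d\n", total, CkMyPe());
  }
};

// Elsewhere in the program, possibly on another processor:
//   CProxy_Counter c = CProxy_Counter::ckNew();
//   c.add(5);      // asynchronous: enqueue a message, do not wait
//   c.add(7);      // the sender continues immediately with other work

#include "counter.def.h"
```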
12:30 pm - 01:30 pm
01:30 pm - 03:00 pm | Panel
Message-driven execution and migratability: niche or necessity?
Panelists: William D. Gropp, Laxmikant V. Kale, Ewing 'Rusty' Lusk, Satoshi Matsuoka, Thomas Sterling. Moderator: Orion Lawlor
These two ideas have been at the core of the Charm++ approach
to parallel programming. Some or all of the elements of these ideas (especially MDE)
have also occurred in the actor model, the J-machine, macro-dataflow, Earth,
ParalleX, and active messages.
Message-driven execution (MDE) is the idea that computation on a processor
should be scheduled based on the availability of data, typically arriving from
remote processors. It can also be called data-driven execution, except that
that phrase has been overloaded in different contexts. Benefits of MDE
include adaptive overlap of communication and computation, compositionality,
and prefetching enabled by the concomitant scheduler's queue.
Migratability is the notion that the work units and data units of
parallel programs should not be tied to processors by the programming model,
but rather allowed to migrate across processors under the control of an
adaptive runtime system. Migratability is useful for automated resource
management and fault tolerance, for example. Migratability requires
overdecomposition (aka virtualization, but that's another overloaded word),
i.e., the number of work units and data units should be larger than the number of
processors.
The question before the panel is: Will these features remain
in a small niche of parallel programming approaches, or are they so essential
in parallel programming of the future that they must be part of every
approach?
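As a sketch of what migratability asks of Charm++ application code (schematic, with assumed header and module names): an over-decomposed array element exposes its state through a PUP routine, after which the runtime can serialize it and move it to another processor for load balance or fault tolerance, without the programmer pinning it to a PE.

```cpp
// Migratable chare array element with a PUP routine for its state.
#include "particles.decl.h"
#include "pup_stl.h"      // PUP support for STL containers
#include <vector>

class Patch : public CBase_Patch {
  std::vector<double> positions;   // element state
  int step = 0;
public:
  Patch() {}
  Patch(CkMigrateMessage *m) {}    // constructor used on the destination PE

  void pup(PUP::er &p) {           // called for both checkpointing and migration
    CBase_Patch::pup(p);
    p | positions;
    p | step;
  }
};

#include "particles.def.h"
```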
03:00 pm - 03:15 pm
Afternoon Technical Session: Applications (Chair: Ramprasad Venkataraman)
03:15 pm - 03:45 pm | Talk
ChaNGa
Prof. Thomas Quinn, Professor, Department of Astronomy, University of Washington at Seattle
Astrophysical N-body simulation is a uniquely challenging application for
parallel performance. The large dynamic range in space calls for irregular
data structures, while the large dynamic range in time poses challenges for
load balancing. The successes and pitfalls of developing the parallel N-body
code, ChaNGa, in Charm++ will be presented.
03:45 pm - 04:15 pm | Talk
Asynchronous Message-driven Runtime Innovations to Enable Biomolecular Simulations of 100 Million Atoms on Petascale Machines
Dr. James C. Phillips and Chao Mei
A 100-million-atom biomolecular simulation with NAMD is one of the three
scientific benchmarks for the Blue Waters machine, the NSF-funded
sustained-petascale machine at the University of Illinois. To simulate such a huge molecular
system on a machine with hundreds of thousands of cores presents great
challenges. These issues include not only the traditional optimization problem
of achieving good strong-scaling results, but also new problems
due to the problem size, such as loading input data into the simulation and
outputting trajectory data to the file system. Unlike other application
optimizations, we adopted a holistic approach to optimizing both the
application itself (NAMD) and its underlying asynchronous message-driven
runtime (Charm++) for this enablement. In this paper, we examine the issues
one by one, and explore the techniques employed to overcome them. In
particular, we designed and optimized a new mode in the runtime to take
advantage of the wide multicore nodes installed on petascale machines, then
demonstrated how this new mode improves the performance of this simulation by
about 20%. In addition, a new hybrid scheme is introduced in the runtime to handle
the large memory footprint required when doing load balancing for the
simulation. Without the techniques described in this paper, the 100M-atom
simulation in NAMD would not be able to run, not to mention scale on the
petascale machine. By using those techniques, we were able to achieve very good
performance for the 100M-atom simulation, scaling up to 100K cores, on the
Jaguar Cray XT5 machine at NCCS.
04:15 pm - 04:45 pm | Talk
OpenAtom
Dr. Glenn Martyna (IBM Research) and Eric Bohm
The goal of simulation studies is to provide insight into important systems of scientific and technical interest. Today, approaching these systems involves accurately treating complex heterogeneous interfaces. The modeling of nanostructures is reviewed with application to problems in engineering, physics, and biochemistry. In particular, computer models of phase change memories and transparent electrodes for solar cells are described along with the novel parallel algorithms underlying the computations. Of particular interest is the discovery chemistry underlying the doping of graphene sheets for use in photovoltaic cells.
04:45 pm - 05:15 pm | Talk
Towards Message-driven Mixed-Quantum/Classical Dynamics in Atomic Simulation
Dr. Chris Harrison
Massively parallel simulations of complex atomic and molecular processes using
mixed quantum/classical dynamics remain an important goal of computational
physics and chemistry. Mixed quantum/classical simulations promise new insight
into the coupling between large scale dynamics, treated classically, and key
quantum events, such as chemical reactions, excited state transitions and
similar electronic phenomena treated quantum mechanically. Example problems to
benefit include the conversion of light into energy in photosynthetic and
solar-cell systems, or the coupling between chemical reactions and protein
dynamics in enzyme performance. Previous quantum/classical simulation codes
using leapfrog algorithms offered limited parallelism and concurrency. We
present the early evolution of a message-driven mixed-quantum/classical
dynamics interface using multiphase shared arrays, and some preliminary
results.
5:15 pm - 5:45 pm | Fun
Annual PPL Photograph
David Kunzman
8:45 am - 09:00 am
Morning Technical Session: Tutorial
09:00 am - 10:30 am | Tutorial
Charm++
Coordinator: Eric Bohm
12:00 pm - 01:00 pm
Afternoon Technical Session: Tutorial
01:00 pm - 02:00 pm | Fun
Tour of the Blue Waters Facility
Organized by NCSA
02:00 pm - 05:00 pm | Tutorial
BigSim Simulation Framework
Dr. Celso Mendes and Ryan Mokos
BigSim is a simulation system that allows application programmers to
develop, debug and tune/scale/predict the performance of applications
on large machines. One of its main advantages is to allow the study of
application behavior on future machines, so that those applications
can be ready when the machine first becomes operational. It also allows
easy "offline" experimentation of parallel performance tuning strategies,
without using the full parallel computer.
This tutorial will cover the main aspects involved in using BigSim,
including presentation of its major components: emulation and simulation.
It will also show how to prepare MPI applications for use in BigSim,
how to analyze the emulation traces used in the simulation, how to
obtain and visualize performance data for the simulated code, and
how to produce various statistics about the network of the machine
being simulated. To illustrate how different network models can be
used in BigSim, some examples will be presented covering both simple
models and more advanced models capable of modeling network congestion,
such as the model for the Blue Waters interconnect.