Time | Type | Description | Slides | Webcast
8:30 am - 9:00 am |
|
Morning |
|
9:00 am - 9:15 am |
Talk
|
Opening Remarks
Prof. Laxmikant V. Kale
|
|
|
9:15 am - 10:00 am |
Talk
|
Charm++ Research Agenda: Recent Developments and Plans
Prof. Laxmikant V. Kale
Charm++ and the rich research agenda engendered by its idea of object-based over-decomposition made significant progress during the past year. I will review the basic concepts that have been the foundation of our approach to parallel programming, and highlight specific achievements of the past year. These include progress on our production-quality, collaboratively developed science and engineering applications, including NAMD (biophysics), OpenAtom (quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the progress and challenges in our agenda of developing higher-level parallel languages.
|
|
|
10:00 am - 10:15 am |
|
Morning |
Technical Session: Charm++ on Blue Waters (Chair: Eric Bohm) |
10:15 am - 10:45 am |
Talk
|
Adaptive MPI
Celso Mendes
In this talk, we discuss Adaptive MPI (AMPI), an implementation of the popular MPI standard.
AMPI is based on Charm++ and implements traditional MPI processes as user-level migratable threads.
Thus, AMPI brings advanced features such as dynamic load balancing and automatic overlap of
computation and communication to traditional MPI codes. Porting legacy MPI codes to AMPI typically
involves no change to the sources. We will review AMPI's basic features and discuss its current status.
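As a concrete illustration (not taken from the talk), the kind of code AMPI targets is a plain MPI program such as the sketch below; under AMPI each rank would execute as a user-level, migratable thread, so the rank count can exceed the physical processor count.

```cpp
// Minimal MPI program; under AMPI each rank would run as a user-level,
// migratable thread rather than an OS process (unmodified MPI code).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each (virtual) rank reports in; with AMPI, 'size' can exceed the
    // number of physical cores because ranks are virtualized.
    std::printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Hypothetically, such a program would be built with AMPI's compiler wrapper and launched through charmrun, with a runtime option (commonly +vp) selecting how many virtual ranks to place on the available processors; the exact wrapper and option names here are assumptions about typical AMPI usage, not details from the talk.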
|
|
|
10:45 am - 11:15 am |
Talk
|
The BigSim Parallel Simulation System
Gengbin Zheng and Ryan Mokos
PetaFLOPS-class computers are currently being developed and even larger computers are being planned. Our BigSim project is aimed at developing tools that allow one to develop, debug, and tune/scale/predict the performance of applications before such machines are available, so that the applications can be ready when the machine first becomes operational. It also allows easier "offline" experimentation with parallel performance tuning strategies, without using the full parallel computer. For machine architects, BigSim provides a method for modeling the impact of architectural choices (including the communication network) on actual, full-scale applications. In this talk, we will present our simulation framework, which consists of an emulator and a simulator; we will focus on recent progress in integrating instruction-level simulation into our framework and on out-of-core emulation support.
|
|
|
11:15 am - 11:45 am |
Submitted Paper
|
Automatic MPI to AMPI Conversion using Photran
Stas Negara
Adaptive MPI (AMPI) is an implementation of the Message Passing Interface (MPI)
standard. AMPI provides MPI programs with features such as dynamic load
balancing, virtualization, and checkpointing. AMPI runs each MPI process in a
user-level thread, which causes problems when an MPI program uses global
variables. Manually removing the global variables from a program is tedious and
error-prone. In this talk, we present a tool that automates this task with a
source-to-source transformation for Fortran. We evaluate our tool on the
real-world, large-scale FLASH code and present preliminary results of running
FLASH on AMPI. Our results demonstrate that the tool makes it easier to use
AMPI.
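To illustrate the issue the tool addresses (the actual transformation operates on Fortran sources via Photran; the snippet below is only a C++ analogue of the idea): a global variable is shared by all user-level threads within a process, so each virtual MPI rank must receive its own copy, for instance by moving the global into per-rank state that is passed explicitly.

```cpp
// Sketch: why globals break user-level-threaded MPI ranks, and one fix.
// (Illustrative C++ analogue of the Fortran transformation described above.)
#include <cstdio>

// BEFORE: a global is shared by every virtual rank in the same process,
// so concurrent ranks overwrite each other's value.
// int iteration_count;

// AFTER: the former global lives in per-rank state passed explicitly.
struct RankState {
    int iteration_count = 0;  // was a global variable
};

void do_step(RankState& s) {
    s.iteration_count++;  // each virtual rank updates only its own copy
}

int main() {
    RankState rank0, rank1;   // stand-ins for two virtual ranks
    do_step(rank0);
    do_step(rank0);
    do_step(rank1);
    std::printf("rank0: %d, rank1: %d\n",
                rank0.iteration_count, rank1.iteration_count);  // prints 2, 1
    return 0;
}
```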
|
|
|
11:45 am - 12:15 pm |
Submitted Paper
|
BigDFT with AMPI: Preliminary Results
Jean-Francois Mehaut
In this paper, we show what we have done to adapt BigDFT, an atomistic simulation code, to AMPI. We compare the performance
of the MPI and AMPI versions of BigDFT. We also evaluate the impact of GPUs on this performance.
|
|
|
12:15 pm - 1:00 pm |
|
Afternoon |
Technical Session: Large scale MD Simulations (Chair: Celso Mendes) |
1:00 pm - 1:45 pm |
Talk
|
NAMD Preparation for Blue Waters
Eric Bohm
Blue Waters offers exciting new opportunities for the simulation of
large biological molecular systems. This talk will cover some of the
ways we are extending NAMD to shine on the Blue Waters architecture,
such as support for molecular systems larger than 100 million atoms,
parallel startup, parallel I/O, performance tuning for Power 7,
conversion of SSE routines to VSX, uses of SMT and SMP, and more.
|
|
|
1:45 pm - 2:30 pm |
Talk
|
Charm++ Hits and Misses - A NAMD Perspective
Jim Phillips, Beckman Institute, University of Illinois
NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique now used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond, developed by IBM and D. E. Shaw respectively. NAMD is developed using Charm++ and benefits from its adaptive communication-computation overlap and dynamic load balancing.
|
|
|
2:30 pm - 3:00 pm |
Talk
|
Hierarchical Load Balancing for Large Scale Supercomputers
Gengbin Zheng
Parallel machines with hundreds of thousands of processors are already in use.
Ensuring good load balance is critical for
scaling certain classes of parallel applications on these machines.
Centralized load balancing algorithms face scalability problems,
especially on machines with a relatively small amount of memory. Fully
distributed load balancing algorithms, on the other hand, tend to yield poor
load balance on very large machines. In this talk, we present an automatic
adaptive hierarchical load balancing method that overcomes the scalability
challenges of centralized schemes and the poor solutions of traditional distributed
schemes. This is done by creating multiple levels of aggressive load balancing
domains which form a tree.
We show performance data of the hierarchical load balancing method
on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also
demonstrate the successful deployment of the method in NAMD.
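As a rough illustration of balancing within one domain of such a tree (a sketch with an assumed greedy heuristic, not the algorithm presented in the talk): processors are grouped into domains, load is balanced inside each domain, and the same step is repeated one level up across domain totals.

```cpp
// Sketch: one level of a hierarchical load balancer. Within a domain of
// numProcs processors, object loads are assigned greedily to the currently
// least-loaded processor (longest-processing-time heuristic). A hierarchical
// scheme repeats this across domain roots, forming a tree of domains.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> balanceDomain(const std::vector<double>& objLoads, int numProcs) {
    std::vector<int> order(objLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    // Heaviest objects first.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoads[a] > objLoads[b]; });

    // Min-heap of (current load, processor id).
    using P = std::pair<double, int>;
    std::priority_queue<P, std::vector<P>, std::greater<P>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (int obj : order) {
        auto [load, p] = procs.top();   // least-loaded processor so far
        procs.pop();
        assignment[obj] = p;
        procs.push({load + objLoads[obj], p});
    }
    return assignment;
}

int main() {
    // Toy example: 8 objects balanced over a domain of 4 processors.
    std::vector<double> loads = {5, 1, 3, 2, 8, 1, 4, 2};
    std::vector<int> a = balanceDomain(loads, 4);
    for (size_t i = 0; i < a.size(); ++i)
        std::printf("object %zu -> proc %d\n", i, a[i]);
    return 0;
}
```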
|
|
|
3:00 pm - 3:15 pm |
|
Afternoon |
Technical Session (Chair: Gengbin Zheng) |
3:15 pm - 3:45 pm |
Talk
|
Scalable Fault Tolerance with Charm++
Esteban Meneses
In the next few years we will witness the deployment of unprecedentedly large supercomputers, comprising hundreds of thousands of cores. Rough estimates predict that the mean time between failures on those machines will be shorter than one day. For an application that runs for a long time and uses a large core count, the only way to survive in this environment is to use fault tolerance mechanisms. One possibility is to rely on a runtime system that can recover from failures automatically, like Charm++. In this talk, we will present recent developments in the Charm++ infrastructure that enable us to deal with failures at high core counts.
|
|
|
3:45 pm - 4:30 pm |
Talk
|
ChaNGa: Charm N-body GrAvity solver
Thomas R. Quinn, Professor, Department of Astronomy, University of Washington
Simulations of galaxies forming in their cosmological context pose a
number of challenges to performance on large parallel machines. The
first is the very non-local nature of gravitational forces: galaxies
are influenced by gravitational forces originating tens of
megaparsecs away, requiring significant communication in the force
solver. Second is the enormous spatial dynamic range involved, from
megaparsecs to sub-parsec scales, requiring dynamic hierarchical data
structures. Third is the vast range of time scales involved, from less than one
million years to the age of the Universe, posing significant
challenges for load balancing. This talk will present how these
challenges have been addressed in the design of ChaNGa, the Charm
N-body GrAvity solver.
|
|
|
4:30 pm - 6:00 pm |
Panel Discussion
|
Exascale by 2018. Really?
Your desktop can probably do a few billion operations per second. If you multiply that a million-fold, you get a petaFLOP/s machine. At Illinois, we will be deploying a multi-petaFLOP/s machine during the next year or so. Now imagine a machine a thousand times more powerful than that! That is an exascale machine, and scientists and funding agencies have been discussing the development of such a machine in eight years' time, by 2018. Can we build such a powerful machine? What could we do with it? Can society afford it? How much electricity will it consume, and can we reduce that to a practical number? What kinds of software innovations are needed to program it? We will discuss these questions with experts.
|
|
|
6:00 pm - 7:00 pm |
|
7:00 pm onwards |
Workshop Banquet (for registered participants only) |
|
8:30 am - 9:00 am |
|
Morning |
Technical Session (Chair: L. V. Kale) |
9:00 am - 10:00 am |
Keynote
|
An Off-The-Wall, Possibly CHARMing View of Future Parallel Applications
James C. Browne, Regents Chair in Computer Sciences, University of Texas at Austin
Development methods for HPC applications change slowly and will continue to change slowly.
It is thus safe to suggest radical changes, because the chance they will be adopted quickly
is low. This talk will sketch a few possible futures for HPC application development which
are considerably different from current practice. The first part of the talk will sketch
possible influences on development practices, and the second some responses to these influences,
including components, self-management, a merger of grid and HPC developments, and tools based
on expert systems technology.
|
|
|
10:00 am - 10:45 am |
Talk
|
Processor Virtualization in Weather Models
Eduardo Rodrigues
In this work, we investigate the usefulness of processor virtualization in
weather models as a tool for load balancing this type of application. This
strategy can address both issues raised by Xue et al.: (1) it
simplifies the implementation of a load balancing scheme, and (2) it can hide
the overhead of migrating load by overlapping it with computation. In our experiments we
used the weather model BRAMS.
|
|
|
10:45 am - 11:00 am |
|
11:00 am - 11:30 am |
Submitted Paper
|
NUMA Support for Charm++
Christiane Pousa Ribeiro
This paper presents the work we have done on Charm++ to provide transparent NUMA support. This support
is based on three parts: a command-line option to bind application data to memory banks, an
interleaved heap, and a NUMA-aware memory allocator.
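For background on the building blocks named above (this sketch uses libnuma directly and is not the paper's Charm++ implementation): on Linux, interleaved and node-bound allocations can be obtained as follows.

```cpp
// Sketch of NUMA-aware allocation with libnuma (link with -lnuma).
// Interleaving spreads pages round-robin across memory banks, which is one
// of the mechanisms an interleaved heap can build on.
#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA not available on this system\n");
        return 1;
    }

    const std::size_t bytes = 1 << 20;  // 1 MiB

    // Pages interleaved across all allowed NUMA nodes.
    void* interleaved = numa_alloc_interleaved(bytes);

    // Pages bound to a specific node (node 0 here).
    void* on_node0 = numa_alloc_onnode(bytes, 0);

    // ... application data structures would be placed accordingly ...

    numa_free(interleaved, bytes);
    numa_free(on_node0, bytes);
    return 0;
}
```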
|
|
|
11:30 am - 12:00 pm |
Talk
|
Parallel Sorting
Edgar Solomonik
Efficiently scaling parallel sorting on modern supercomputers is inhibited by the communication-intensive problem of migrating large amounts of data between processors. The challenge is to design a highly scalable sorting algorithm that uses minimal communication, maximizes overlap between computation and communication, and uses memory efficiently. This talk presents a scalable extension of the Histogram Sorting method which modifies the original algorithm to minimize message contention and exploit overlap. We compare the performance of Histogram Sort, Sample Sort, and Radix Sort, all implemented in Charm++. The choice of algorithm as well as the importance of the optimizations are validated by performance tests on two predominant modern supercomputer architectures: the Cray XT4 at ORNL (Jaguar) and the Blue Gene/P at ANL (Intrepid).
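As a rough sketch of the histogramming idea behind Histogram Sort (a serial stand-in, not the Charm++ implementation from the talk): each processor counts how many of its local keys fall below each candidate splitter; a reduction sums these counts globally, and the splitters are refined until the division of keys across processors is sufficiently even.

```cpp
// Sketch: one histogramming step of Histogram Sort. This is the per-processor
// local counting; in the parallel algorithm these counts are summed with a
// reduction and used to refine the candidate splitters.
#include <algorithm>
#include <cstdio>
#include <vector>

// For each candidate splitter, count local keys strictly below it.
std::vector<long> localHistogram(const std::vector<int>& localKeys,
                                 const std::vector<int>& splitters) {
    std::vector<int> sorted = localKeys;
    std::sort(sorted.begin(), sorted.end());
    std::vector<long> counts;
    counts.reserve(splitters.size());
    for (int s : splitters) {
        counts.push_back(std::lower_bound(sorted.begin(), sorted.end(), s) -
                         sorted.begin());
    }
    return counts;
}

int main() {
    std::vector<int> keys = {42, 7, 19, 88, 3, 55, 61, 27};
    std::vector<int> splitters = {20, 50, 80};  // candidate cut points
    std::vector<long> h = localHistogram(keys, splitters);
    for (size_t i = 0; i < h.size(); ++i)
        std::printf("keys below %d: %ld\n", splitters[i], h[i]);
    return 0;
}
```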
|
|
|
Noon |
|
Afternoon |
Technical Session (Chair: Ryan Mokos) |
1:30 pm - 2:15 pm |
Talk
|
David Kunzman and Lukasz Wesolowski
Accelerators such as Graphics Processing Units (GPUs) and specialized
cores, such as the Synergistic Processing Elements (SPEs) on the Cell
processor, are being used with increasing frequency in parallel
computing to speed up computationally heavy portions of code. These systems
comprise multiple types of processing elements, each with unique
characteristics, strengths, weaknesses, and programming paradigms. Developing
applications for them can be challenging since many architectural details must be taken
into account. In this talk we will summarize ongoing efforts to allow the
Charm++ runtime system to utilize accelerators while abstracting away as
many architectural details as possible. Specifically, we will cover work
related to the Cell processor and GPUs.
|
|
|
2:15 pm - 3:00 pm |
Talk
|
Pritish Jetley
We study the use of clusters of general purpose graphics processors for tree-based N-body simulations.
We investigate key performance issues in the context of clusters of GPUs. These include
kernel organization and efficiency, the balance between tree traversal and force computation
work, grain size selection through the tuning of offloaded work request sizes, and the
reduction of sequential bottlenecks. The effects of various application parameters are
studied and experiments are carried out to quantify gains in performance. Our studies
are carried out in the context of a production-quality parallel cosmological simulator
called ChaNGa. We highlight the re-engineering of the application to make it more suitable
for GPU-based environments. Finally, we present scaling performance results from experiments
on NCSA's Lincoln GPU cluster.
|
|
|
3:00 pm - 3:15 pm |
|
3:15 pm - 4:00 pm |
Talk
|
Debugging Large Scale Parallel Applications
Filippo Gioachin
In this talk, I will present recent research in the field of parallel debugging. The main question discussed will be: how can we debug an application on thousands of processors without burning our entire allocation on the machine? The two techniques I will present are Virtualized Debugging and Processor Extraction.
|
|
|
4:00 pm - 4:45 pm |
Talk
|
Automating Topology Aware Mapping for Supercomputers
Abhinav Bhatele
Parallel computing is entering the era of petascale machines. This era brings
enormous computing power to us and new challenges to harness this power
efficiently. Machines with hundreds of thousands of processors already exist,
connected by complex interconnect topologies. Network contention is becoming an
increasingly important factor affecting overall performance. The farther
messages travel on the network, the greater the chance of resource
sharing between messages and hence of contention. Recent studies on IBM Blue
Gene and Cray XT machines have shown that under contention, message latencies
can be severely affected.
Mapping communicating tasks onto nearby processors can minimize contention and
lead to better application performance. In this talk, I will propose algorithms
and techniques for automatic mapping of parallel applications, to relieve
application developers of this burden. I will first demonstrate the effect of
contention on message latencies and use these studies to guide the design of
mapping algorithms. I will introduce the hop-bytes metric for the evaluation of
mapping algorithms and argue that it is a better metric than the previously
used maximum dilation metric. I will then discuss in some detail the mapping
framework, which comprises topology-aware mapping algorithms for parallel
applications with regular and irregular communication patterns.
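For reference, a minimal sketch of how the hop-bytes metric could be computed for a given mapping (the data structures and 3D-mesh assumption here are illustrative, not the talk's framework): hop-bytes weights each message by the number of network hops it traverses.

```cpp
// Sketch: computing hop-bytes for a task-to-processor mapping on a 3D mesh.
// hop-bytes = sum over messages of (message size in bytes) * (hops traveled),
// where hops is taken here as the Manhattan distance between the sender's and
// receiver's processor coordinates (torus wraparound ignored for simplicity).
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Coord { int x, y, z; };

struct Message {
    int src, dst;   // task ids
    long bytes;     // message size
};

long long hopBytes(const std::vector<Message>& msgs,
                   const std::vector<Coord>& taskToProc) {
    long long total = 0;
    for (const Message& m : msgs) {
        const Coord& a = taskToProc[m.src];
        const Coord& b = taskToProc[m.dst];
        int hops = std::abs(a.x - b.x) + std::abs(a.y - b.y) + std::abs(a.z - b.z);
        total += (long long)m.bytes * hops;
    }
    return total;
}

int main() {
    // Two tasks mapped to nearby processors, one mapped far away.
    std::vector<Coord> mapping = {{0, 0, 0}, {1, 0, 0}, {3, 2, 1}};
    std::vector<Message> msgs = {{0, 1, 1024}, {0, 2, 1024}};
    std::printf("hop-bytes = %lld\n", hopBytes(msgs, mapping));  // 1024 + 6144
    return 0;
}
```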
|
|
|
4:45 pm - 5:30 pm |
Event
|
Annual PPL Group Photograph
|
|
|
|
9:30 am - 10:00 am |
|
Morning |
Technical Session (Chair: Ramprasad Venkataraman) |
10:00 am - 10:30 am |
Talk
|
Implementing Dense LU Factorizations in Parallel
Isaac Dooley
This talk will give an overview of how dense matrix LU factorizations are performed in parallel. LU factorization is an important problem because it is used to test the speed of supercomputers for the Top500 list. A new Charm++ implementation will be discussed, along with the common MPI implementation, HPL, and a UPC implementation.
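As background (a minimal serial sketch without pivoting, not the parallel Charm++ implementation discussed in the talk): LU factorization rewrites a matrix A as the product of a unit lower triangular L and an upper triangular U; production codes such as HPL add partial pivoting and operate on distributed blocks.

```cpp
// Sketch: in-place LU factorization without pivoting (Doolittle style).
// After the call, the strictly lower triangle of A holds L (unit diagonal
// implied) and the upper triangle holds U. Parallel implementations
// distribute the matrix in blocks and pipeline these updates.
#include <cstdio>
#include <vector>

void luFactorize(std::vector<std::vector<double>>& A) {
    const size_t n = A.size();
    for (size_t k = 0; k < n; ++k) {
        for (size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];               // multiplier l_ik
            for (size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j]; // trailing-submatrix update
        }
    }
}

int main() {
    std::vector<std::vector<double>> A = {{4, 3}, {6, 3}};
    luFactorize(A);
    // Expected: L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
    std::printf("l21=%.2f  u11=%.2f u12=%.2f u22=%.2f\n",
                A[1][0], A[0][0], A[0][1], A[1][1]);
    return 0;
}
```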
|
|
|
10:30 am - 11:00 am |
Talk
|
Stochastic Programming : Aircraft Allocation Problem
Gagan Gupta
We present our initial results on the parallelization of a classic
example of two-stage stochastic programming with linear recourse. This
problem, inspired by the Air Mobility Command's operational problem,
involves the assignment of aircraft to various bases for a
period of one month so that the subsequent disruptions due to
emergencies and variable demands are minimized. We discuss some of the
peculiar aspects of the problem (coarse-grain computation, dependency
in the execution times, large message sizes) and the approaches we
plan to take to achieve good speedup.
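For readers unfamiliar with the terminology, the generic form of a two-stage stochastic linear program with recourse (a textbook formulation, not the talk's specific aircraft-allocation model) is:

```latex
% First stage: choose x (e.g., the allocation) before uncertainty is revealed.
\min_{x \ge 0} \; c^{T} x + \mathbb{E}_{\xi}\!\left[ Q(x, \xi) \right]
\quad \text{subject to} \quad A x = b

% Second stage (linear recourse): after scenario \xi = (q, h, T, W) is
% revealed, corrective action y is taken at minimum cost.
Q(x, \xi) \;=\; \min_{y \ge 0} \; q^{T} y
\quad \text{subject to} \quad W y = h - T x
```

Each scenario's second-stage problem is an independent linear program, which is what gives the problem the coarse-grain structure mentioned above.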
|
|
|
11:00 am - 11:30 am |
Talk
|
OpenAtom
Glenn Martyna, IBM Research
The goal of simulation studies is to provide insight into important systems of scientific and technical interest. Today, approaching these systems involves treating complex heterogeneous interfaces accurately. The modeling of nanostructures is reviewed, with applications to problems in engineering, physics, and biochemistry. In particular, computer models of phase-change memories and transparent electrodes for solar cells are described, along with the novel parallel algorithms underlying the computations. Of particular interest is the discovery of the chemistry underlying the doping of graphene sheets for use in photovoltaic cells.
|
|
|
11:30 am - 12:00 pm |
|
12:00 pm - 1:00 pm |
Tour
|
NCSA Blue Waters Facility Tour
|
|
|