Time | Type | Description | Slides | Webcast |
8:00 am - 8:45 am |
Continental Breakfast / Registration |
Morning |
|
8:45 am - 9:00 am |
Welcome
|
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
|
|
|
9:00 am - 9:45 am |
Keynote
|
MADNESS - parallel runtime and application use cases
Prof. Robert J. Harrison, Stony Brook University
MADNESS is a general purpose numerical environment that sits upon a
scalable runtime that consciously includes elements "borrowed" from
other projects including Charm++ and the HPCS programming languages
including Chapel, and is designed to be interoperable with "legacy"
code. However, our intent was never to support this long term; in addition to migrating large parts of our functionality to Intel TBB, we are casting around for options for the distributed-memory parallel computing environment. I'll give some flavor of what MADNESS does and how, with the objective of starting a conversation and seeding collaborations.
|
|
|
9:45 am - 10:15 am |
Talk
|
Cloth Simulation in Charm++
Rasmus Tamstorf, Disney Research & Xiang Ni, University of Illinois at Urbana-Champaign
Accurate simulation of the movement of cloth is an important aspect of a realistic animated movie. However, it is also one of the most challenging problems in terms of complexity and scalability. In this talk, we present an extension of Asynchronous Contact Mechanics (ACM) to distributed-memory clusters using Charm++. The ACM algorithm is known to produce correct results in finite time for cloth simulation. However, the dynamic behavior of cloth simulation makes it an extremely hard problem to parallelize with good performance. To achieve scalable parallelism, we have proposed and applied many techniques: overdecomposition-based collision detection, intra-node load balancing, inter-node load balancing, prioritized processing, and additional (not required, but good for performance) synchronization. This talk presents an overview and initial results of these methods in our implementation of ClothSim in Charm++.
|
|
|
10:15 am - 10:30 am |
|
Morning |
Technical Session: Run Time System (Chair: Dr. Gengbin Zheng) |
10:30 am - 11:00 am |
Talk
|
PICS: A Performance-Analysis-Based Introspective Control System to Steer Parallel Application
Yanhua Sun, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores for various applications is still quite challenging. Therefore, performance tuning has become even more important and challenging than ever before. In this paper, we describe the design and implementation of PICS: a Performance-analysis-based Introspective Control System used to tune parallel programs. PICS provides a generic set of abstractions that allow applications to expose application-specific knowledge to the runtime system. The abstractions are called control points, which are tunable parameters that affect application performance. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the analysis results and expert knowledge rules, program characteristics are extracted to assist the search for optimal configurations of the control points. We have implemented PICS in Charm++, an asynchronous message-driven parallel programming model. We demonstrate the utility of PICS with several benchmarks and a real-world application and show its effectiveness.
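A control point is essentially a named, tunable knob with a legal range that the application exposes and the runtime adjusts between iterations based on measured performance. The sketch below is a minimal, hypothetical illustration of that feedback loop in plain C++; the names (ControlPoint, Tuner, runIteration) are illustrative assumptions and not the actual PICS API.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical "control point": a named, tunable knob with a legal range,
// steered by measured iteration times. Not the real PICS interface.
struct ControlPoint {
  const char* name;
  int lo, hi;    // legal range
  int value;     // current setting
};

// Trivial tuner: sweep the range once, then lock in the fastest setting.
struct Tuner {
  double bestTime = 1e30;
  int bestValue = -1;
  bool done = false;
  void report(ControlPoint& cp, double seconds) {
    if (done) return;                            // tuning finished, keep best
    if (seconds < bestTime) { bestTime = seconds; bestValue = cp.value; }
    if (cp.value < cp.hi) ++cp.value;            // keep sweeping the range
    else { cp.value = bestValue; done = true; }  // sweep complete: use the best
  }
};

static void runIteration(int grainSize) { /* application work, omitted */ }

int main() {
  ControlPoint grain{"grain_size", 1, 8, 1};     // e.g., pieces of work per object
  Tuner tuner;
  for (int iter = 0; iter < 20; ++iter) {
    auto t0 = std::chrono::steady_clock::now();
    runIteration(grain.value);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    tuner.report(grain, dt.count());
    std::printf("iter %d: %s = %d\n", iter, grain.name, grain.value);
  }
}
```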
|
|
|
11:00 am - 11:30 am |
Talk
|
Speculative Load Balancing
Hassan Eslami, Advised by Prof. William Gropp, University of Illinois at Urbana-Champaign
Continuous dynamic load balancing is a crucial component of many applications that exploit irregular and nested parallelism. Reducing idle time is one of the biggest challenges any load balancing algorithm tries to address. In this talk, we introduce a speculative load balancing algorithm that approaches this challenge through speculative execution of tasks. In our method, each worker process, instead of sitting idle waiting to receive a task, speculatively starts working on some tasks in the hope that no one else has started processing them yet. We show that in up to 99% of cases speculation is successful and idle time is significantly reduced. We use the unbalanced tree search benchmark (UTS) to show the effect of speculation in two load balancing approaches: 1) work sharing using a centralized work queue, and 2) work stealing using explicit polling to service steal requests. We show that speculative execution in work sharing reduces up to 95% of the idle time and results in up to a 3-5X speedup in the total execution time of UTS compared to the baseline work-sharing implementation. Also, our approach is less sensitive to the load balancing parameters, hence eliminating the time needed to tune the algorithm for any input type. We also share our experience in applying speculative execution to work stealing.
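The core idea, speculatively claiming a task rather than idling, can be illustrated with a queue in which each task carries a claim flag that workers set atomically; speculation succeeds when the compare-and-swap wins. This is a minimal shared-memory sketch of the idea only, not the authors' MPI-based implementation, and all names are illustrative.

```cpp
#include <atomic>
#include <vector>

// Each task carries an atomic claim flag. A worker that would otherwise idle
// scans the queue and speculatively claims a task; the claim succeeds only if
// no other worker has already started it (the compare-and-swap wins).
struct Task {
  std::atomic<bool> claimed{false};
  int payload = 0;
};

// Returns the index of a claimed task, or -1 if everything is already taken.
int claimSpeculatively(std::vector<Task>& queue) {
  for (std::size_t i = 0; i < queue.size(); ++i) {
    bool expected = false;
    if (queue[i].claimed.compare_exchange_strong(expected, true))
      return static_cast<int>(i);   // speculation succeeded: we own this task
  }
  return -1;                        // speculation failed: genuinely no work left
}

int main() {
  std::vector<Task> queue(8);
  for (int i = 0; i < 8; ++i) queue[i].payload = i;
  int t = claimSpeculatively(queue);  // a real worker would now execute task t
  return t >= 0 ? 0 : 1;
}
```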
|
|
|
11:30 am - 12:00 pm |
Submitted Paper
|
Saving Energy by Exploiting Residual Imbalances on Iterative Applications
Laercio L. Pilla, Institute of Informatics, Federal University of Rio Grande do Sul
Parallel scientific applications have been influencing the way science is done over the last decades. These applications have ever-increasing demands in performance and resources due to their greater complexity and larger datasets. To meet these demands, the performance of supercomputers has been growing exponentially, which leads to an exponential growth in power consumption as well. In this context, saving power has become one of the main concerns of current HPC platform designs, as future exascale systems will need to consider power demand and energy consumption constraints. Whereas some scientific applications have regular designs that lead to well-balanced load distributions, others are more imbalanced because they have tasks with different processing demands, which makes it difficult to use the available hardware resources efficiently. In this case, a challenge lies in reducing the energy consumption of the application while maintaining similar performance. In our work, we focus on reducing the energy consumption of imbalanced applications through a combination of load balancing and Dynamic Voltage and Frequency Scaling (DVFS). Our strategy employs an Energy Daemon Tool to gather power information and a load balancing module that benefits from the load balancing framework available in the CHARM++ runtime system. Our approach differs from the one proposed by Sarood et al. in that we employ DVFS to decrease energy consumption after balancing the load, while the latter uses DVFS to regulate temperature and employs load balancing to correct the subsequent imbalance.
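The strategy's key arithmetic is simple: after load balancing, a core whose residual load is below the maximum can run at a proportionally lower frequency without delaying the iteration. Below is a minimal sketch of that calculation, assuming a discrete set of available DVFS levels; names and values are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <vector>

// After load balancing, scale each core's frequency in proportion to its
// residual load: cores with less work than the most-loaded core can slow
// down without lengthening the iteration. Frequencies are snapped up to the
// nearest available DVFS level.
std::vector<double> pickFrequencies(const std::vector<double>& load,
                                    const std::vector<double>& levels,  // sorted ascending, GHz
                                    double fMax) {
  double maxLoad = *std::max_element(load.begin(), load.end());
  std::vector<double> freq(load.size());
  for (std::size_t i = 0; i < load.size(); ++i) {
    double target = fMax * load[i] / maxLoad;            // ideal frequency
    auto it = std::lower_bound(levels.begin(), levels.end(), target);
    freq[i] = (it == levels.end()) ? fMax : *it;         // round up to a real level
  }
  return freq;
}
```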
|
|
|
12:00 pm - 12:30 pm |
Talk
|
QMPI: A library for multithreaded MPI applications
Alex B Brooks, Advised by Prof Marc Snir, University of Illinois at Urbana-Champaign
The increasing scale and ever-changing design of large supercomputers make efficient parallel programming difficult. Communication cost and load balancing continue to be major concerns for the performance of parallel applications. Many implementations of the Message Passing Interface (MPI) continue to have issues handling multiple threads communicating from a single process. This results in less-than-ideal performance for applications that attempt to exploit both intra-node and inter-node parallelism. We introduce QMPI, a parallel programming library on top of MPI, to address this problem. QMPI exploits lightweight task parallelism and smart communication progression, showing significant performance improvements over traditional multithreaded MPI techniques. This talk discusses this programming model with a focus on QMPI and its performance.
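One common way to sidestep poor multithreaded MPI performance, and roughly the spirit of such a library, is to funnel all communication through a single progress thread that drains a queue of send requests while worker threads compute. The sketch below is a generic illustration under that assumption, not the QMPI API; all names are my own.

```cpp
#include <mpi.h>
#include <atomic>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Worker threads enqueue send requests; one progress thread issues the
// nonblocking MPI calls and drives them to completion, so only one thread
// ever calls MPI during the run (MPI_THREAD_SERIALIZED is sufficient).
struct SendReq { std::vector<double> buf; int dest, tag; };

std::mutex qMutex;
std::queue<SendReq> sendQueue;
std::atomic<bool> done{false};

void progressLoop() {
  while (!done.load()) {
    SendReq req;
    {
      std::lock_guard<std::mutex> lk(qMutex);
      if (sendQueue.empty()) continue;
      req = std::move(sendQueue.front());
      sendQueue.pop();
    }
    MPI_Request r;
    MPI_Isend(req.buf.data(), (int)req.buf.size(), MPI_DOUBLE,
              req.dest, req.tag, MPI_COMM_WORLD, &r);
    MPI_Wait(&r, MPI_STATUS_IGNORE);   // progress this message to completion
  }
}

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
  std::thread progress(progressLoop);
  // ... worker threads compute and push SendReq objects onto sendQueue ...
  done = true;
  progress.join();
  MPI_Finalize();
}
```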
|
|
|
12:30 pm - 01:30 pm |
|
Afternoon |
Technical Session: Adaptivity at Exascale (Chair: Phil Miller) |
1:30 pm - 2:00 pm |
Invited Talk
|
Moving Software to Exascale and Beyond
Dr. Robert Wisniewski, Intel Corporation
Thinking in the High Performance Computing software community about how to achieve exascale has tended to focus on either evolutionary or revolutionary approaches. The path to exascale and beyond is replete with a set of wide-ranging and intertwined difficulties. This presentation will identify some of the challenges facing the HPC community from a software perspective, with a focus on runtimes and programming models. The presentation suggests a plausible path forward that is neither solely evolutionary nor solely revolutionary, but a combination of the two. Runtime and execution models supporting this notion will be given as a demonstration.
|
|
|
2:00 pm - 2:30 pm |
Talk
|
Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget
Akhil Langer, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Building future-generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power-efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages the hardware-provided capability to constrain the power consumption of each node in order to judiciously allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions as new jobs arrive or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to a 5.2X improvement in job throughput compared with the SLURM baseline scheduling policy. We corroborate our results with real experiments at a relatively small scale, in which we obtain a 1.7X improvement.
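As a rough illustration of the kind of decision such a resource manager makes, consider jobs whose throughput gains diminish as their power allocation grows: a greedy allocator can hand out power in small slices to whichever job currently benefits most, until the machine-wide budget is spent. This sketch uses an assumed logarithmic speed model and made-up numbers; it is only a flavor of the tradeoff, not the paper's optimizer.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Greedy power allocation under a machine-wide budget: repeatedly give a
// small slice of power to the job whose (assumed, diminishing-returns) speed
// model gains the most from it.
struct Job {
  double minWatts;    // power below which the job cannot run
  double alpha;       // responsiveness in the assumed model below
  double watts = 0;   // power allocated so far
};

double speed(const Job& j, double w) { return j.alpha * std::log(w / j.minWatts + 1.0); }

void allocate(std::vector<Job>& jobs, double budget, double slice = 1000.0) {
  for (Job& j : jobs) { j.watts = j.minWatts; budget -= j.minWatts; }  // baseline power
  while (budget >= slice) {
    Job* best = nullptr;
    double bestGain = 0;
    for (Job& j : jobs) {
      double gain = speed(j, j.watts + slice) - speed(j, j.watts);
      if (gain > bestGain) { bestGain = gain; best = &j; }
    }
    if (!best) break;
    best->watts += slice;   // the most power-responsive job gets this slice
    budget -= slice;
  }
}

int main() {
  std::vector<Job> jobs = {{2000, 1.0}, {2000, 3.0}, {4000, 2.0}};
  allocate(jobs, 50000.0);  // 50 kW budget across three jobs
  for (const Job& j : jobs) std::printf("job gets %.0f W\n", j.watts);
}
```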
|
|
|
2:30 pm - 3:00 pm |
Submitted Paper
|
A Batch System for Malleable Adaptive Parallel Programs
Suraj Prabhakaran, German Research School for Simulation Sciences, RWTH Aachen
The performance of supercomputers depends not only on efficient job scheduling but also on the type of jobs that form the workload. Malleable jobs are the most scheduler-friendly, a property they owe to their ability to adapt to changing resource allocations during application execution. The batch system can adjust the allocation of a malleable job according to the system state so as to obtain the best system performance. However, due to the non-adaptive nature of most parallel programming models, today's supercomputers are dominated by rigid jobs that require a fixed resource allocation throughout the job execution. Therefore, malleable jobs and malleable batch job management are not realized in today's production systems. The Charm++ programming paradigm notably supports malleability through its adaptive runtime system and offers a practical possibility to improve system performance. In this paper, we present ongoing work towards enabling a malleable Torque/Maui batch job management system for Charm++ jobs. In particular, we propose a standard interface for malleability interactions between batch systems and parallel programming paradigms. Through this interface, the batch system allows a seamless expansion or shrinkage of the current resource allocation of a running Charm++ job. We discuss the implementation of the proposed interface and efficient malleable job scheduling strategies. We demonstrate the benefits of such a system, including improved system utilization and throughput, as well as improved response time and turnaround time for the user.
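The proposed interaction can be pictured as a narrow, two-sided contract: the batch system asks a running job to grow or shrink, and the job's runtime acknowledges once work has been migrated and processes released or acquired. The interface below is purely illustrative of such a contract; the class and method names are assumptions, not the authors' actual proposal.

```cpp
#include <string>
#include <vector>

// Illustrative two-sided malleability contract between a batch system and a
// runtime system (names are hypothetical, not the proposed standard itself).

// Implemented by the runtime system running the job (e.g., an adaptive RTS).
class MalleableJob {
public:
  virtual ~MalleableJob() = default;
  // The batch system offers extra nodes; the job rebalances onto them and
  // returns true once the expansion is complete.
  virtual bool expand(const std::vector<std::string>& newNodes) = 0;
  // The batch system asks for nodes back; the job evacuates work from them
  // and returns true once they are safe to reclaim.
  virtual bool shrink(const std::vector<std::string>& nodesToRelease) = 0;
};

// Implemented by the batch system (e.g., a Torque/Maui-style scheduler).
class BatchSystem {
public:
  virtual ~BatchSystem() = default;
  // The job notifies the scheduler that a resize finished, so freed or
  // granted nodes can be reflected in the scheduler's state.
  virtual void resizeCompleted(int jobId, int newNodeCount) = 0;
};
```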
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Exploratory Research (Chair: Dr. Celso Mendes) |
3:30 pm - 4:15 pm |
Talk
|
OpenAtom: Fast, fine grained parallel electronic structure software for materials science, chemistry and physics.
Prof. Sohrab Ismail-Beigi, Yale University & Dr. Glenn J. Martyna, IBM Research
We will discuss the OpenAtom project with a view to what it is capable of doing right now and what we want it to do in the near future and the next 5 years. The discussion will be framed in terms of modeling the physics and chemistry of useful and interesting materials and what type of real-world problems a well-performing and efficient highly parallel implementation will permit us to address. A brief overview will be given of the project's past, its present status and capabilities for the ground state of electrons and studies of atomistic dynamics at finite temperature on the ground state energy surface, and how we want to incorporate the description of excited electrons into OpenAtom.
|
|
|
4:15 pm - 4:45 pm |
Talk
|
Solvers for O(N) Electronic Structure in the Strong Scaling Limit
Nicolas Bock and Matt Challacombe, Los Alamos National Laboratory and Laxmikant Kale, University of Illinois at Urbana-Champaign
For accurate models, electronic structure theory is an extremely challenging software engineering problem, typically relying on diverse "fast" solvers with mixed programming models. We are developing a new, unified approach to fast electronic structure solvers based on the N-Body programming model. The N-Body model is emerging as an extremely general algorithm class finding increasingly broad application in many disciplines, spanning the information and physical sciences. Historically, the astrophysical N-Body solver has enabled significant computational cost reductions compared to conventional solvers through spatially informed trees and hierarchical approximations based on the range-query, and has been a textbook success for scalable irregular parallelism. In this talk, I will present a generalization of the N-Body programming model to hierarchical multiplication of matrices with decay (SpAMM) [Bock and Challacombe, SIAM Journal on Scientific Computing 35 (1), C72-C98], which uses occlusion based on the metric-query to achieve reduced complexity, and discuss our implementation within Charm++ to access the strong scaling regime. In particular, I will explain how the hierarchical N-Body framework enables simple and concise task-parallel implementations that yield fine-grained task decomposition without exposing complex message passing interfaces to the programmer, and how N-Body programming models can exploit data locality and persistence-based load balancing strategies.
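The heart of the SpAMM idea can be conveyed in a few lines: a quadtree matrix product recurses on sub-blocks but skips any product whose contribution, bounded by the product of sub-block norms, falls below a tolerance tau. The recursion below is my own simplified rendering of that occlusion test on a quadtree, not the SpAMM library itself.

```cpp
#include <array>
#include <memory>

// Quadtree matrix node: either a leaf block or four children, with a cached
// Frobenius norm used for the occlusion test.
struct Node {
  double norm = 0.0;
  std::array<std::unique_ptr<Node>, 4> child;   // row-major 2x2 quadrants
  // leaf payload omitted for brevity
};

// C += A * B, skipping any product whose bound ||A||*||B|| is below tau.
// This norm-based "occlusion" is what yields reduced complexity for
// matrices with decay.
void spammMultiply(const Node* A, const Node* B, Node* C, double tau) {
  if (!A || !B || !C || A->norm * B->norm < tau) return;  // occluded: contribution too small
  bool leaf = !A->child[0] || !B->child[0];
  if (leaf) { /* dense block multiply-accumulate into C's leaf, omitted */ return; }
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k)
        spammMultiply(A->child[2*i+k].get(), B->child[2*k+j].get(),
                      C->child[2*i+j].get(), tau);
}
```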
|
|
|
4:45 pm - 5:15 pm |
Talk
|
Cache Hierarchy Reconfiguration in Adaptive HPC Runtime Systems
Ehsan Totoni, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
The cache hierarchy consumes a large portion of the processor chip's power and energy, but a sizable fraction of that can be saved for common HPC applications. We propose an adaptive-runtime-system-based reconfiguration approach that turns off ways of set-associative caches. Our simple and practical solution exploits the common patterns and iterative structure of HPC applications, including the prevalent Single Program Multiple Data (SPMD) model of parallel codes, to find the best configuration and save a large fraction of cache power and energy. Our experiments using cycle-accurate simulations show that 67% of cache energy can be saved at the cost of just a 2.4% performance penalty (on average). We also show that an adaptive streaming cache strategy can improve performance by up to 30% and save 75% of cache energy in some cases.
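Because HPC iterations repeat, a runtime can time one iteration per candidate configuration and then lock in the smallest cache configuration that stays within a tolerated slowdown. Here is a toy sketch of that search loop; setActiveCacheWays is a placeholder for whatever hardware or simulator knob is actually available, and the whole routine is illustrative rather than the paper's mechanism.

```cpp
#include <chrono>

// Placeholder for the platform-specific knob that disables cache ways
// (e.g., a model-specific register or simulator hook); assumed, not real.
void setActiveCacheWays(int ways) { /* platform specific */ }

double timeIteration(void (*step)()) {
  auto t0 = std::chrono::steady_clock::now();
  step();
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
  return dt.count();
}

// Try each way count once on successive (nearly identical) iterations and
// keep the smallest configuration whose slowdown stays under `tolerance`.
int chooseWays(void (*step)(), int maxWays, double tolerance = 0.03) {
  setActiveCacheWays(maxWays);
  double base = timeIteration(step);
  int best = maxWays;
  for (int w = maxWays - 1; w >= 1; --w) {
    setActiveCacheWays(w);
    if (timeIteration(step) <= base * (1.0 + tolerance)) best = w;  // fewer ways, still fast enough
    else break;                                                     // too slow: stop shrinking
  }
  setActiveCacheWays(best);
  return best;
}
```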
|
|
|
5:15 pm - 6:00 pm |
Fun
|
Blue Waters Facility Tour
NCSA
|
|
|
6:00 pm - 7:00 pm |
|
07:00 pm onwards |
Workshop Banquet (for registered participants only), located in the 2nd-floor Atrium outside 2405 Siebel Center |
|
8:00 am - 9:00 am |
Continental Breakfast / Registration |
9:00 am - 9:45 am |
Keynote
|
Hybrid Programming Challenges for Extreme Scale Software
Prof. Vivek Sarkar, E.D. Butcher Chair in Engineering, Rice University
It is widely recognized that computer systems in the next decade will
be qualitatively different from current and past computer systems.
Specifically, they will be built using homogeneous and heterogeneous
many-core processors with 100's of cores per chip, their performance
will be driven by parallelism (million-way parallelism just for a
departmental server), and constrained by energy and data movement.
They will also be subject to frequent faults and failures. Unlike
previous generations of hardware evolution, these Extreme Scale
systems will have a profound impact on future software. The software
challenges are further compounded by the need to support new workloads
and application domains that have not traditionally had to worry about
parallel computing.
In general, a holistic redesign of the entire software stack is needed
to address the programmability and performance requirements of Extreme
Scale systems. This redesign will need to span programming models,
languages, compilers, runtime systems, and system software. A major
challenge in this redesign arises from the fact that current
programming systems have their roots in execution models that focused
on homogeneous models of parallelism, e.g., OpenMP's roots are in SMP
parallelism, MPI and SHMEM's roots are in cluster parallelism, and
CUDA and OpenCL's roots are in GPU parallelism. This in turn leads to
the "hybrid programming" challenge for application developers, as they
are forced to explore approaches to combine two or all three of these
models in the same application. Despite some early experiences and
attempts by some of the programming systems to broaden their scope
(e.g., addition of accelerator pragmas to OpenMP), hybrid programming
remains an open problem and a major obstacle for application
enablement on future systems.
In this talk, we summarize experiences with hybrid programming in the
Habanero Extreme Scale Software Research project [1] which targets a wide
range of homogeneous and heterogeneous manycore processors in both
single-node and cluster configurations. We focus on key primitives in
the Habanero execution model that simplify hybrid programming, while
also enabling a unified runtime system for heterogeneous hardware.
Some of these primitives are also being adopted by the new Open
Community Runtime (OCR) open source project [2]. These primitives
have been validated in a range of applications, including medical
imaging applications studied in the NSF Expeditions Center for
Domain-Specific Computing (CDSC) [3].
Background material for this talk will be drawn in part from the DARPA
Exascale Software Study report [4] led by the speaker. This talk will
also draw from a recent (March 2013) study led by the speaker on
Synergistic Challenges in Data-Intensive Science and Exascale
Computing [5] for the US Department of Energy's Office of Science. We
would like to acknowledge the contributions of all participants in
both studies, as well as the contributions of all members of the
Habanero, OCR, and CDSC projects.
REFERENCES:
[1] Habanero Extreme Scale Software Research project. http://habanero.rice.edu.
[2] Open Community Runtime (OCR) open source project. https://01.org/projects/open-community-runtime.
[3] Center for Domain-Specific Computing (CDSC). http://cdsc.ucla.edu.
[4] DARPA Exascale Software Study report, September 2009. http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm.
[5] DOE report on Synergistic Challenges in Data-Intensive Science and Exascale Computing, March 2013. http://science.energy.gov/~/media/ascr/ascac/pdf/reports/2013/ASCAC_Data_Intensive_Computing_report_final.pdf.
|
|
|
09:45 am - 10:15 am |
Talk
|
Parallel Programming with Migratable Objects: Charm++ in Practice
Harshitha Menon, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign,
The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failures) for programming scalable parallel applications. The increased complexity and dynamism of today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this talk, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented here spans many mini-applications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.
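Migratability in Charm++ rests on objects that can serialize themselves via a PUP routine, so the runtime can move them between processors for load balancing. Below is a minimal sketch of such a migratable chare-array element, assuming the charmc-generated headers block.decl.h/block.def.h from the interface file shown in the comment; names and sizes are illustrative only.

```cpp
// Sketch of a migratable Charm++ chare-array element. The companion .ci file
// (compiled by charmc, which generates block.decl.h / block.def.h) would contain:
//   module block { array [1D] Block { entry Block(); entry void step(); }; };
#include <vector>
#include "block.decl.h"

class Block : public CBase_Block {
  std::vector<double> state;                          // data that must move with the object
public:
  Block() : state(1024, 0.0) { usesAtSync = true; }   // opt in to measurement-based load balancing
  Block(CkMigrateMessage* m) : CBase_Block(m) {}      // constructor used when the RTS migrates the element
  void pup(PUP::er& p) {                              // serialize/deserialize for migration and checkpointing
    CBase_Block::pup(p);
    p | state;
  }
  void step() {
    // ... compute one iteration ...
    AtSync();                                         // tell the RTS this element is ready to be rebalanced
  }
};
#include "block.def.h"
```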
|
|
|
10:15 am - 10:30 am |
|
Morning |
Technical Session: Large Applications (Chair: Dr. Abhinav Bhatele) |
10:30 am - 11:00 am |
Talk
|
ChaNGa: a Charm++ N-body Treecode
Prof. Tom Quinn, University of Washington
Simulations of cosmological structure formation demand significant
computational resources because of the vast range of scales involved:
from the size of star formation regions to, literally, the size of the
Universe. I will describe the cosmology problems we are addressing
with our Blue Waters allocation. ChaNGa, a Charm++ N-body/Smooth
Particle Hydrodynamics tree-code, is the application we are running
with this allocation. I will describe the improvements in both
astrophysical modeling and in parallel performance we have made over
the past year in preparation for running these simulations.
|
|
|
11:00 am - 11:30 am |
Submitted Paper
|
Overcoming the Scalability Challenges of Epidemic Simulations on Petascale Platforms
Jae-Seung Yeom, Virginia Tech
With an increasingly urbanized and mobile population, the likelihood of a worldwide pandemic is increasing. With rising input sizes and strict deadlines for simulation results, e.g., for real-time planning during the outbreak of an epidemic, we must expand the use of high performance computing (HPC) approaches and, in particular, push the boundaries of scalability for this application area. EpiSimdemics simulates epidemic diffusion in extremely large and realistic social contact networks. It captures dynamics among co-evolving entities. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to irregular communication, load imbalance, and the evolutionary nature of their workload. In this talk, we present an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6 and IBM BlueGene/Q.
|
|
|
11:30 am - 12:00 pm |
Talk
|
Petascale Charm++ in Practice: Lessons from Scaling NAMD
Jim Phillips
The highly parallel molecular dynamics code NAMD was chosen in 2006 as a
target application for the NSF petascale supercomputer now known as Blue
Waters. NAMD was also one of the first codes to run on a GPU cluster when
CUDA was introduced in 2007. When Blue Waters entered production in 2013,
the first breakthrough it enabled was the complete atomic structure of the
HIV capsid through calculations using NAMD, featured on the cover of
Nature. This talk will cover lessons learned in taking NAMD and Charm++
from a million atoms on a few thousand cores to a hundred million atoms on
500,000 cores, and what changes would aid further progress.
|
|
|
12:30 pm - 01:30 pm |
|
01:30 pm - 01:50 pm |
Panel KickStart
|
Techniques for Effective HPC in the Cloud
Abhishek Gupta, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
The advantages of the pay-as-you-go model, elasticity, and the flexibility and customization offered by virtualization make cloud computing an attractive and economical option for meeting the needs of some HPC users. However, there is a mismatch between current cloud environments and HPC requirements: HPC is performance-oriented, whereas clouds are cost- and resource-utilization-oriented. The poor interconnect and I/O performance in the cloud, HPC-agnostic cloud schedulers, and the inherent heterogeneity and multi-tenancy of clouds are some of the bottlenecks for HPC in the cloud. This means that the tremendous potential of the cloud for both HPC users and providers remains underutilized. In this talk, I will go beyond the common research question "what is the performance of HPC in the cloud?" and present our research on "how can we perform cost-effective and efficient HPC in the cloud?" To this end, I will present the complementary approaches of making clouds HPC-aware and making the HPC runtime system cloud-aware. Through comprehensive HPC performance and cost analysis, HPC-aware VM placement, interference-aware VM consolidation, and cloud-aware HPC load balancing, we demonstrate significant benefits for both users and cloud providers in terms of cost (up to 60%), performance (up to 45%), and throughput (up to 32%).
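Cloud-aware load balancing here essentially means weighting each virtual core by its currently observed speed, which heterogeneity and multi-tenant interference make non-uniform, and assigning work in proportion. The tiny sketch below shows only that proportional assignment; it is illustrative and not the actual Charm++ strategy.

```cpp
#include <numeric>
#include <vector>

// Assign work units to VMs in proportion to each VM's measured speed, so a
// core slowed by multi-tenancy or weaker hardware receives less work.
std::vector<int> proportionalShares(int totalWork,
                                    const std::vector<double>& measuredSpeed) {
  double total = std::accumulate(measuredSpeed.begin(), measuredSpeed.end(), 0.0);
  std::vector<int> share(measuredSpeed.size());
  int assigned = 0;
  for (std::size_t i = 0; i < measuredSpeed.size(); ++i) {
    share[i] = static_cast<int>(totalWork * measuredSpeed[i] / total);
    assigned += share[i];
  }
  share[0] += totalWork - assigned;   // hand rounding leftovers to one VM
  return share;
}
```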
|
|
|
01:50 pm - 03:00 pm |
Panel
|
HPC in the cloud: how much water does it hold?
Panelists: Roy Campbell (Professor, University of Illinois at Urbana Champaign), Kate Keahey (Fellow, Computation Institute University of Chicago), Dejan S Milojicic (Senior Research Manager, HP Labs), Landon Curt Noll (Resident Astronomer & HPC Specialist, Cisco), Laxmikant Kale (Professor, University of Illinois at Urbana-Champaign)
High performance computing connotes science and engineering applications running on supercomputers. One imagines tightly coupled, latency-sensitive, jitter-sensitive applications in this space. On the other hand, cloud platforms create the promise of computation on demand, with a flexible infrastructure and a pay-as-you-go cost structure. Can the two really meet? Is it the case that only a subset of CSE applications can run on this platform? Can the increasing role of adaptive schemes in HPC work well with the need for adaptivity in cloud environments? Should national agencies like NSF fund computation time indirectly, and let CSE researchers rent time in the cloud? These are the kinds of questions this panel will address.
|
|
|
3:00 pm - 3:30 pm |
|
Afternoon |
Technical Session: Related Research (Chair: Eric Bohm) |
3:30 pm - 4:00 pm |
Invited Talk
|
Compression for Exascale: Always Right Around The Corner
Dr. Brian Van Straalen, Lawrence Berkeley National Laboratory
The basic hardware trend moving toward exascale can be summarized in a simple statement: higher concurrency on more vectorized cores with proportionately less memory and bandwidth. To trade off more computation for less data traffic, we propose to revisit compression techniques. In particular, we want to perform a systematic and scientific study of the compressibility of floating-point messages and memory data accessed in large-scale DOE HPC applications to determine the potential of data compression in current and future systems.
Real-time, on-line compression has largely failed in the past, especially for floating-point data. This is, in part, because one-size-fits-all compression methods were used. For example, some hardware methods compress all loads and stores in the same way. Moreover, at the hardware level, information about the data structure and dimensionality is typically lost. In other cases, compression is not addressing the most critical problem, like IBM's Active Memory Expansion, which creates a "larger" local DRAM than is physically installed at the cost of extra latency and energy for some off-processor data movement. Even when applied with great care, the energy and latency mismatch between computations and data movement in current-generation processors makes it difficult to successfully exploit compression. We believe this is about to change.
|
|
|
4:00 pm - 4:30 pm |
Talk
|
Parallel Branch-and-bound for Two-stage stochastic integer optimization
Akhil Langer, Advised by Prof. Laxmikant Kale, University of Illinois at Urbana-Champaign
Many real-world planning problems require searching for an optimal solution in the face of uncertain input. One approach is to express them as a two-stage stochastic optimization problem, where the search for an optimum in one stage is informed by the evaluation of multiple possible scenarios in the other stage. Applications of stochastic programming span diverse fields, ranging from production, financial modeling, transportation (road as well as air), supply chains, and scheduling to environmental and pollution control, telecommunications, and electricity. In this talk, we present the parallelization of a two-stage stochastic integer program solved using branch-and-bound. We present a range of factors that influence the parallel design for such problems. Unlike typical iterative scientific applications, we encounter several interesting characteristics that make it challenging to realize a scalable design. We present two design variations that navigate some of these challenges. Our designs seek to increase the exposed parallelism while delegating sequential linear program solves to existing libraries. We evaluate the scalability of our designs using sample aircraft allocation problems for the US air fleet. It is important that these problems be solved quickly while evaluating a large number of scenarios. Our attempts result in strong scaling to hundreds of cores for these datasets.
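Structurally, each branch-and-bound node's bound is the first-stage cost plus the expected cost over scenarios, and each scenario requires an independent LP solve; that per-scenario independence is where the exposed parallelism comes from. The skeleton below illustrates only that structure, with the LP solves delegated to a placeholder callable; all names are assumptions rather than the authors' design.

```cpp
#include <functional>
#include <vector>

// Bound for one branch-and-bound node of a two-stage stochastic program:
// first-stage cost plus probability-weighted second-stage costs. Each
// scenario evaluation is an independent LP solve, so in a parallel design
// they can be farmed out to workers; solveScenarioLP stands in for a call
// into an existing LP library.
struct Scenario { double probability; /* scenario data omitted */ };

double nodeBound(double firstStageCost,
                 const std::vector<Scenario>& scenarios,
                 const std::function<double(const Scenario&)>& solveScenarioLP) {
  double expected = 0.0;
  for (const Scenario& s : scenarios)        // embarrassingly parallel in practice
    expected += s.probability * solveScenarioLP(s);
  return firstStageCost + expected;
}
```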
|
|
|
4:30 pm - 5:00 pm |
Talk
|
Task Mapping, Job Placements, and Routing Strategies
Dr. Abhinav Bhatele, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
Communication optimizations continue to be important as we scale parallel applications to the largest parallel machines available. Task mapping has been shown to be an effective technique for optimizing the performance of communication-bound applications. In this talk, I will present an overview of our research directions on the Scalable Topology Aware Task Embedding (STATE) project at LLNL. I will present two task mapping tools, Rubik and Chizu, for structured and generic communication graphs respectively. I will also discuss our efforts on modeling congestion on supercomputer networks. Finally, I will present some initial work on evaluating the impact of job placements and routing strategies on application performance. LLNL-ABS-653685.
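As a flavor of what a mapping tool computes, here is a toy function that maps a 3D block-decomposed application grid onto a 3D torus by scaling each dimension, so communicating neighbors in the application land near each other on the network. This is purely illustrative and not how Rubik or Chizu work.

```cpp
#include <array>

// Map task (x, y, z) of an appX*appY*appZ application grid onto a network of
// shape netX*netY*netZ by scaling each dimension; neighboring tasks end up on
// the same or adjacent network coordinates. Toy example only.
std::array<int, 3> mapTask(int x, int y, int z,
                           int appX, int appY, int appZ,
                           int netX, int netY, int netZ) {
  return { x * netX / appX, y * netY / appY, z * netZ / appZ };
}
```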
|
|
|
5:00 pm - 5:30 pm |
Talk
|
Using Charm++ to Improve Extreme Parallel Discrete-Event Simulation (XPDES) Performance and Capability
Prof. Christopher Carothers, Rensselaer Polytechnic Institute
ROSS is an extreme parallel discrete-event simulation (XPDES) system that has demonstrated the ability to scale to millions of cores. In particular, using 120 racks (1.9M cores) of the Blue Gene/Q "Sequoia" system, ROSS was able to process over 500 billion events per second for a benchmark parallel discrete-event simulation model comprised of over 250 million logical processes (LPs) and 16 initial messages per LP. Thus, ROSS coupled with state-of-the-art supercomputer hardware is able to model "planetary"-scale systems with hundreds of millions to even billions of objects. However, there is still room for improvement on two fronts. First, ROSS is an all-MPI application and does not make use of the threading capabilities of modern supercomputer systems. Second, ROSS currently does not provide a load distribution mechanism that operates at massively parallel scales. In this talk, I will give an overview of ROSS and our plans to improve its performance and capabilities by addressing these limitations using Charm++.
|
|
|
6:30 PM |
Dinner at 301 Mongolia, 301 North Neil Street, Champaign, IL 61820.
Meet on the 1st floor of the Siebel Center for carpooling at 6:10 PM; call Nikhil at 217-979-0918. |
|
8:30 am - 12:30 pm |
Charm++ Tutorial
|
Hands-on tutorial in Room SC3405
Phil Miller, Harshitha Menon, Eric Mikida
|
|
|