Power and energy efficiency are major challenges for the High Performance Computing (HPC) community. Excessive power consumption is a primary limitation on further scaling of HPC systems, and researchers believe that current technology trends will not deliver Exascale performance within a reasonable power budget in the near future. Hardware innovations such as the proposed Exascale architectures and Near-Threshold Computing are expected to improve power efficiency significantly, but more innovation is required in this domain to make Exascale possible.
To help shrink this power efficiency gap, we argue that adaptive runtime systems can be exploited. The runtime system (RTS) can save significant power because it is aware of both the hardware properties and the application behavior.
Adaptive Runtime Systems. We use application-centric analysis of different architectures to design automatic, adaptive RTS techniques that save significant power in different system components with only minor hardware support. In a nutshell, we analyze modern architectures and common HPC applications and show that some system components, such as caches and network links, consume a disproportionately large share of power for common HPC applications. We demonstrate how a large fraction of the power consumed in caches and networks can be saved automatically using our approach. In these cases, the only hardware support the RTS needs is the ability to turn off individual ways of set-associative caches and individual network links.
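As a minimal sketch of the cache-side idea, the snippet below shows how an RTS might pick the number of cache ways to keep powered from a per-application miss-rate profile. The function name, the tolerance, and all profile numbers are illustrative assumptions, not values from our actual system.

```python
# Hypothetical sketch: an adaptive RTS choosing how many ways of a
# set-associative cache to keep powered, based on profiled miss rates.
# All names and numbers are illustrative, not from a real system.

def choose_active_ways(miss_rate_by_ways, tolerance=0.02):
    """Pick the smallest number of active cache ways whose miss rate
    stays within `tolerance` of the fully-enabled configuration."""
    full_ways = max(miss_rate_by_ways)
    baseline = miss_rate_by_ways[full_ways]
    for ways in sorted(miss_rate_by_ways):
        if miss_rate_by_ways[ways] - baseline <= tolerance:
            return ways  # remaining ways can be power-gated
    return full_ways

# Profiled miss rates (fraction of accesses) for an application whose
# working set fits in a few ways:
profile = {1: 0.150, 2: 0.060, 4: 0.031, 8: 0.030}
print(choose_active_ways(profile))  # -> 4: ways beyond 4 can be turned off
```

The key design point is that the decision is driven by the application's measured behavior, so a streaming kernel with little reuse and a cache-friendly solver can each get the cheapest configuration that preserves their performance.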
Optimizing Energy Consumption. Furthermore, cooling energy must be considered for large-scale systems. To date, most research has focused on reducing machine energy consumption, leaving aside the energy spent on cooling, which accounts for about 40% of a datacenter's total energy consumption. Our focus is to extend energy optimization beyond machine energy savings so that we also reduce cooling energy. Most datacenters overcool in order to avoid hotspots (areas in the machine room at a much higher temperature than the rest of the room). We are working on a runtime system that uses Dynamic Voltage and Frequency Scaling (DVFS) to minimize the occurrence of hotspots by keeping core temperatures in check. One of our schemes reduces the timing penalty of using DVFS alone by migrating chares to load balance the application. Our results show that this temperature-aware load balancing can save considerable cooling energy. Part of our recent research explores placing 'less-frequency-sensitive' chares on hotter cores so that we can further reduce the DVFS-induced slowdown.
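The control loop above can be sketched in two steps: cap the frequency of cores above a temperature threshold, then rebalance work toward the cores left running fast. This is an illustrative simplification, not the actual Charm++ implementation; the threshold, frequency levels, and greedy balancer are all assumptions.

```python
# Illustrative sketch of temperature-aware DVFS plus load balancing.
# All thresholds, frequency levels, and loads are made-up examples.

THRESHOLD_C = 70.0                 # assumed target core temperature
FREQS_GHZ = [1.2, 1.6, 2.0, 2.4]   # assumed available DVFS levels

def apply_dvfs(core_temps, core_freqs):
    """Step frequency down on hot cores and back up on cool ones."""
    new_freqs = []
    for temp, freq in zip(core_temps, core_freqs):
        i = FREQS_GHZ.index(freq)
        if temp > THRESHOLD_C and i > 0:
            i -= 1                 # hot core: reduce frequency
        elif temp < THRESHOLD_C - 5 and i < len(FREQS_GHZ) - 1:
            i += 1                 # cool core: restore frequency
        new_freqs.append(FREQS_GHZ[i])
    return new_freqs

def rebalance(chare_loads, core_freqs):
    """Greedy balance: place the heaviest chares where the predicted
    finish time (accumulated load / frequency) stays smallest."""
    assignment = {c: [] for c in range(len(core_freqs))}
    used = [0.0] * len(core_freqs)
    for chare, load in sorted(chare_loads.items(), key=lambda x: -x[1]):
        core = min(range(len(core_freqs)),
                   key=lambda c: (used[c] + load) / core_freqs[c])
        assignment[core].append(chare)
        used[core] += load
    return assignment

freqs = apply_dvfs([75.0, 62.0], [2.4, 2.4])   # core 0 is hot
print(freqs)                                    # -> [2.0, 2.4]
print(rebalance({"A": 4.0, "B": 2.0, "C": 1.0}, freqs))
```

The rebalancing step is what offsets the timing penalty: without it, the slowed-down core would hold back the whole application at each synchronization point.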
Performance Optimization Under a Power Budget. Recent advances in processor and memory hardware have made it possible to control the power consumption of the CPU and memory through software; for example, the power consumption of Intel's Sandy Bridge family of processors can be controlled through the Running Average Power Limit (RAPL) interface. It has been shown that an increase in the power allotted to the processor (and/or memory) does not yield a proportional increase in application performance. As a result, for a given power budget, it can be better to run an application on a larger number of nodes with each node capped at lower power than on fewer nodes each running at its TDP. This is known as overprovisioning. The optimal resource configuration for an application can be determined by profiling its performance for varying numbers of nodes, CPU power caps, and memory power caps, and then selecting the best-performing configuration within the given power budget. In our recent work, we propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. Our online resource manager uses these characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. With a power budget of 4.75 MW, we obtain up to a 5.2X improvement in job throughput compared with SLURM's power-unaware scheduling policy. In real experiments on a relatively small cluster, we obtained a 1.7X improvement. An adaptive runtime system allows further improvement by letting already-running jobs shrink and expand for optimal resource allocation.
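The configuration-selection step described above can be sketched as a simple search over a profile table. The profile numbers below are invented for illustration; they only encode the qualitative trend that moderate overprovisioning helps until power caps become too aggressive.

```python
# Sketch of overprovisioned configuration selection: given profiled
# execution times for (nodes, per-node CPU power cap) pairs, pick the
# fastest configuration whose total power fits the budget.

def best_config(profile, power_budget_w):
    """profile: {(nodes, cap_w): exec_time_s}. Return the (nodes, cap)
    pair minimizing time subject to nodes * cap <= power_budget_w."""
    feasible = {cfg: t for cfg, t in profile.items()
                if cfg[0] * cfg[1] <= power_budget_w}
    return min(feasible, key=feasible.get)

# Illustrative profile of one application:
profile = {
    (8, 115): 100.0,   # few nodes, each at TDP
    (10, 90):  88.0,   # overprovisioned: more nodes, capped power
    (12, 75):  84.0,   # still better
    (14, 65):  86.0,   # too capped: performance drops again
}
print(best_config(profile, power_budget_w=920))  # -> (12, 75)
```

In practice exhaustively profiling every configuration is expensive, which is exactly why the performance-modeling scheme mentioned above estimates these characteristics instead of measuring them all.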
Several of our new online software tools and methods, such as the Power Aware Resource Manager [14-15] and the Variation Aware Scheduler [15-01], use linear/integer programming to obtain superior solutions compared with those produced by suboptimal heuristics.
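To make the optimization concrete, the toy below states the kind of 0/1 selection problem such a scheduler solves: choose which queued jobs to run so total value (a throughput proxy) is maximized under the machine power budget. A real resource manager would hand this to an ILP solver; brute-force enumeration is used here only because it is self-contained, and all job data is invented.

```python
# Toy 0/1 job-selection problem of the kind a power-aware resource
# manager formulates as an integer program. Brute force for clarity.
from itertools import combinations

def schedule(jobs, power_budget):
    """jobs: {name: (power_w, value)}. Find the subset maximizing
    total value with total power <= power_budget."""
    best, best_value = set(), 0.0
    names = list(jobs)
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            power = sum(jobs[j][0] for j in subset)
            value = sum(jobs[j][1] for j in subset)
            if power <= power_budget and value > best_value:
                best, best_value = set(subset), value
    return best

jobs = {"A": (300, 5.0), "B": (250, 4.0), "C": (400, 6.0), "D": (150, 3.0)}
print(schedule(jobs, power_budget=700))  # -> {'A', 'B', 'D'}
```

The contrast with a heuristic is easy to see here: a greedy pass by value alone would grab job C first (power 400) and then be unable to fit both A and B, ending below the optimum the exact formulation finds.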