Power-aware and Temperature Restrain Modeling for Maximizing Performance and Reliability

Laxmikant V. Kale, Akhil Langer†, and Osman Sarood
Department of Computer Science
University of Illinois at Urbana-Champaign
†langer@illinois.edu

ABSTRACT
The ability to constrain power consumption in recent hardware architectures is a powerful capability that can be leveraged for efficient utilization of available power. We propose to develop power-aware performance models that predict job performance for a given resource configuration, that is, the CPU/memory power cap, the number of nodes, etc. In addition to optimizing performance under a fixed power budget, our proposed model reduces the differences in thermal profiles among processors in order to balance the overall temperature distribution of the data center. A lower operating temperature improves the reliability of the system and saves cooling energy, while the overall execution time of the jobs is minimized. The power-aware performance model can be used to determine the optimal resource configurations for a job or for a set of jobs, with the aim of efficient utilization of power.

1. POWER-AWARE PERFORMANCE MODELING
Power requirements of a data center are computed using the Thermal Design Power (TDP) of its subsystems. However, an individual processor rarely reaches its TDP limit in normal operation. Nonetheless, TDP amount of power has to be allocated to each subsystem in order to avoid circuit trips on the rare occasions when the power draw reaches TDP. Clearly, this is an excessive and wasteful allocation of power. Recent microprocessor architectures such as Intel Sandy Bridge [1], IBM Power6 [2], IBM Power7 [3], and AMD Bulldozer [4] allow constraining the CPU and memory power consumption to below their TDP limit. This feature can be used to constrain the power consumption of nodes, and the saved power can be used to add more nodes to the data center. This is also called overprovisioning [5, 6, 7].
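The arithmetic behind overprovisioning can be sketched as follows. The total budget is an assumed value chosen only for illustration, the 85W and 50W figures are used as example TDP and cap levels, and power drawn by components other than the CPU is ignored.

```python
def nodes_within_budget(power_budget_w: float, per_node_power_w: float) -> int:
    """Number of nodes that can be powered when every node is allocated
    per_node_power_w watts out of a total budget of power_budget_w watts."""
    if per_node_power_w <= 0:
        raise ValueError("per-node power must be positive")
    return int(power_budget_w // per_node_power_w)


# Illustrative values (assumptions, not measurements).
BUDGET_W = 8500.0  # total power available to the machine
TDP_W = 85.0       # worst-case (TDP) allocation per node
CAP_W = 50.0       # software-enforced power cap per node

nodes_at_tdp = nodes_within_budget(BUDGET_W, TDP_W)
nodes_capped = nodes_within_budget(BUDGET_W, CAP_W)
print(f"nodes at TDP: {nodes_at_tdp}, nodes when capped: {nodes_capped}")
```

Whether the larger, capped machine is actually faster depends on how the application responds to the cap, which is what the performance model has to predict.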

Applications do not yield proportionate improvements in performance as the power allocated to the CPU and/or memory is increased. For a given power budget, it might be more beneficial to run an application on a larger number of nodes, each capped at a power level below its TDP, than on fewer nodes each allocated TDP amount of power [5, 6]. In addition to cores and memory, caches constitute a significant portion of the node power consumption. However, the benefit of each cache level to application performance may not be proportional to its power consumption. Hardware architectures are also beginning to support dynamically enabling and disabling caches at various levels through software [8, 9]. As with capping CPU/memory power, judiciously turning caches on and off can save power, which can be used to add more nodes. Different applications respond differently to changes in CPU/memory power and/or availability of caches. In order to allocate resources (i.e., CPU/memory power caps, number of nodes, etc.) to a job or to a set of jobs, the Power-aware Strong Scaling (PASS) performance model of the jobs is required [7]. This is where modeling can be used to predict an application's performance for any given resource configuration.
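To illustrate why more capped nodes can beat fewer nodes at TDP, the sketch below searches over power caps for a single job under a fixed budget. Both `node_speed` (a cube-root response to power above an assumed base power) and `job_time` (an Amdahl-style scaling law) are hypothetical stand-ins for a fitted PASS model, and all constants are assumptions.

```python
import math


def node_speed(cap_w, base_w=30.0, tdp_w=85.0):
    """Hypothetical relative speed of one node under a CPU power cap.
    Speed grows with the cube root of the power available above base_w,
    so extra watts give diminishing returns; speed is 1.0 at TDP."""
    usable = max(0.0, min(cap_w, tdp_w) - base_w)
    return (usable / (tdp_w - base_w)) ** (1.0 / 3.0)


def job_time(nodes, cap_w, work=1000.0, serial_fraction=0.02):
    """Hypothetical Amdahl-style execution time on `nodes` capped nodes."""
    speed = node_speed(cap_w)
    if speed == 0.0:
        return math.inf
    return work * (serial_fraction + (1.0 - serial_fraction) / nodes) / speed


def best_configuration(budget_w, caps_w):
    """Try every cap, use as many nodes as the budget allows at that cap,
    and return the (nodes, cap_w, time) tuple with the smallest time."""
    best = None
    for cap in caps_w:
        nodes = int(budget_w // cap)
        if nodes < 1:
            continue
        t = job_time(nodes, cap)
        if best is None or t < best[2]:
            best = (nodes, cap, t)
    return best


CAPS_W = [40.0, 50.0, 60.0, 70.0, 85.0]
nodes, cap, t = best_configuration(1700.0, CAPS_W)
print(f"best: {nodes} nodes at {cap:.0f} W, predicted time {t:.1f}")
```

With these assumed curves the search settles on a cap below TDP; a different application model could just as well favor fewer, faster nodes.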

Application performance modeling using DVFS has been extensively studied [10, 11, 12, 13]. Because applications differ in their CPU/memory characteristics, different applications running under the same CPU power cap can run at different CPU frequencies. The user-specified CPU power cap is enforced using a mixture of DVFS and CPU throttling, based on the on-chip activity of the application [1, 14]. Power consumption of the chip can be modeled as a function of the leakage power of the chip, the cache and memory access rates of the application, and the fixed idle/base power of the chip [15, 16]. Both the leakage power and the cache and memory access rates can be modeled as functions of chip frequency. Chip frequency, in turn, can be used to model the execution time of the application [10, 11, 12, 17]. These models can be combined with a strong-scaling model (e.g., Downey’s strong scaling model [18]) to obtain a holistic model that can predict an application’s performance for any resource configuration.
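A minimal sketch of how a frequency-based time model might be composed with Downey's speedup model is given below. The speedup formula follows the piecewise form in which Downey's model is usually stated, with average parallelism `A` and variance parameter `sigma`. The function `time_at_frequency` is a simple assumed model in which only the CPU-bound fraction of the run scales with frequency, and all parameter names are illustrative.

```python
def downey_speedup(n, A, sigma):
    """Speedup on n processors under Downey's model.
    A is the average parallelism, sigma the variance in parallelism."""
    if n < 1 or A < 1 or sigma < 0:
        raise ValueError("require n >= 1, A >= 1, sigma >= 0")
    if sigma <= 1.0:
        # Low-variance branch.
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2.0)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1.0 - sigma / 2.0))
        return float(A)
    # High-variance branch.
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1.0) / (sigma * (n + A - 1) + A)
    return float(A)


def time_at_frequency(t_ref, f, f_ref, cpu_fraction):
    """Assumed model: the CPU-bound fraction of the reference time t_ref
    scales with f_ref / f, the remainder (memory-bound) is unaffected."""
    return t_ref * (cpu_fraction * f_ref / f + (1.0 - cpu_fraction))


def predicted_time(t1_ref, n, f, f_ref, cpu_fraction, A, sigma):
    """Single-node time at frequency f divided by the n-node speedup."""
    return time_at_frequency(t1_ref, f, f_ref, cpu_fraction) / downey_speedup(n, A, sigma)


# Example: halving the frequency of a job that is 70% CPU-bound.
print(predicted_time(t1_ref=100.0, n=32, f=1.2, f_ref=2.4,
                     cpu_fraction=0.7, A=64, sigma=0.5))
```

In a complete model, the frequency `f` would itself be predicted from the power cap through the chip power model described above; that step is omitted here.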

2. MODELING CHIP TEMPERATURES
The temperature of the data center is maintained such that the cooling is sufficient to cool down the processors with hot spots, whose temperature can be up to 30°C higher than that of the coolest processor [19]. This is done for fear of increased node failures at higher temperatures, because the Mean Time Between Failures (MTBF) of a processor decreases exponentially with its temperature [20, 21, 22]. It has also been reported that the fault rate doubles for every 10°C increase in temperature [20, 23, 24, 25]. Therefore, besides reducing the cooling energy of the data center, restraining the temperature of the processors also increases the MTBF of the system. Temperature control is achieved by reducing the frequency of the chip (using Dynamic Voltage and Frequency Scaling, DVFS) when the temperature rises above a threshold, and by increasing the frequency when the temperature falls below a certain limit [26, 27]. Since the optimum checkpoint frequency for fault tolerance is computed based on the MTBF of the system, the increase in MTBF due to temperature control reduces the checkpoint frequency. This reduces the overhead of checkpoint/restart in the overall execution time of the job. Temperature control thus brings a trade-off between the checkpoint/restart overhead, which decreases, and the computation time of the job, which increases because the frequency is lowered. The optimal temperature, at which the overall execution time of the job is the least, varies from job to job [19].
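The link between temperature, MTBF, and checkpoint interval can be sketched numerically. The doubling of the fault rate per 10°C is taken from the text above; the use of Young's first-order approximation for the checkpoint interval is our assumption for illustration, since the text does not name a specific formula, and the reference MTBF and checkpoint cost are made-up values.

```python
import math


def mtbf_at_temperature(mtbf_ref_s, temp_c, temp_ref_c):
    """MTBF at temp_c, given the MTBF at temp_ref_c, assuming the fault
    rate doubles (MTBF halves) for every 10 degrees C of increase."""
    return mtbf_ref_s / (2.0 ** ((temp_c - temp_ref_c) / 10.0))


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimum time between
    checkpoints: sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Illustrative values (assumptions): MTBF of one day at 60 C, 60 s checkpoints.
MTBF_REF_S = 24 * 3600.0
CHECKPOINT_COST_S = 60.0

for temp_c in (60.0, 50.0):
    mtbf = mtbf_at_temperature(MTBF_REF_S, temp_c, temp_ref_c=60.0)
    interval = young_checkpoint_interval(CHECKPOINT_COST_S, mtbf)
    print(f"{temp_c:.0f} C: MTBF {mtbf / 3600.0:.0f} h, "
          f"checkpoint every {interval / 60.0:.1f} min")
```

Under these assumptions, cooling the processor by 10°C doubles the MTBF and lengthens the checkpoint interval by a factor of √2, which has to be weighed against the slowdown from running at a lower frequency.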

Contemporary research has focused on reducing the energy consumption of applications, which is achieved by using DVFS. However, the focus is shifting towards efficient utilization of available power, as power is becoming a limiting factor. We propose modeling the maximum temperature of a processor as a function of the power cap of the CPU and the cooling temperature of the data center. Identical chips exhibit significant variation in their temperatures even when running under identical settings. This is attributed to limited chip-to-chip fabrication precision during manufacturing. This effect will be even more pronounced as new chip technologies are developed to reach exascale. For example, the 2014 DoE report on the top ten exascale research challenges [28] shows that with Near Threshold Voltage (NTV) operation, the variability in circuit speeds increases dramatically to 50%. This implies that temperature will have to be modeled individually for each processor. However, temperature modeling is a one-time cost and hence negligible compared to the overall operations of the data center.
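One possible form for the per-processor temperature model is a linear fit of the maximum temperature against the power cap and the cooling temperature. The linear form is an assumption made here for illustration; the proposal does not fix the functional form. The samples below are synthetic, generated from known coefficients purely to check that the fit recovers them, and are not measurements.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: samples do not span the model")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_temperature_model(samples):
    """Least-squares fit of max_temp = a + b * power_cap + c * cooling_temp.
    samples: list of (power_cap_w, cooling_temp_c, max_temp_c) tuples
    collected for ONE processor. Returns the tuple (a, b, c)."""
    rows = [(1.0, cap, cool) for cap, cool, _ in samples]
    ys = [y for _, _, y in samples]
    # Normal equations: (X^T X) coeffs = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


def predict_max_temp(coeffs, power_cap_w, cooling_temp_c):
    a, b, c = coeffs
    return a + b * power_cap_w + c * cooling_temp_c


# Synthetic samples from assumed coefficients a=10, b=0.5, c=1.0.
samples = [(cap, cool, 10.0 + 0.5 * cap + 1.0 * cool)
           for cap in (40.0, 50.0, 60.0, 70.0)
           for cool in (18.0, 22.0, 26.0)]
coeffs = fit_temperature_model(samples)
print(predict_max_temp(coeffs, power_cap_w=55.0, cooling_temp_c=20.0))
```

Because of chip-to-chip variation, one such set of coefficients would be fitted and stored for each processor.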

3. USING THE MODELS FOR IMPROVED PERFORMANCE AND RELIABILITY

Figure 1 shows the overall block diagram of an online resource manager that makes resource allocation decisions while taking into account the power-aware strong-scaling performance of the applications and the temperature response of the processors under a given power cap and machine room temperature setting. The power-aware performance of the jobs can be used to determine the optimal allocation of resources to the set of jobs being scheduled by the data center. The use of online integer linear program optimization for optimal allocation of power to jobs submitted to a data center has been shown in [7]. In order to improve the reliability of the system, additional temperature constraints are added to the linear program to keep processor temperatures from exceeding a threshold.
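The structure of the allocation problem, including the added temperature constraint, can be illustrated on a toy instance. The sketch below uses exhaustive search in place of the integer linear program of [7], which is practical only for a handful of jobs. The candidate configurations, their predicted times, and the linear temperature predictor are all hypothetical.

```python
from itertools import product


def allocate(jobs, budget_w, temp_limit_c, predict_temp):
    """Pick one configuration per job to minimize the sum of predicted
    execution times, subject to a total power budget and a per-processor
    temperature limit.

    jobs: dict mapping job name -> list of (nodes, cap_w, predicted_time_s).
    predict_temp: function mapping a power cap (W) to a predicted
        maximum processor temperature (degrees C).
    Returns (choice, cost); choice is None if nothing is feasible."""
    names = list(jobs)
    best_choice, best_cost = None, float("inf")
    for choice in product(*(jobs[name] for name in names)):
        total_power = sum(nodes * cap for nodes, cap, _ in choice)
        if total_power > budget_w:
            continue  # violates the power budget
        if any(predict_temp(cap) > temp_limit_c for _, cap, _ in choice):
            continue  # violates the temperature constraint
        cost = sum(t for _, _, t in choice)
        if cost < best_cost:
            best_choice, best_cost = dict(zip(names, choice)), cost
    return best_choice, best_cost


def toy_temperature(cap_w):
    """Hypothetical temperature response to the power cap."""
    return 30.0 + 0.5 * cap_w


# Hypothetical candidate configurations: (nodes, cap in W, predicted time in s).
JOBS = {
    "A": [(4, 85, 100), (6, 60, 90), (8, 45, 95)],
    "B": [(2, 85, 50), (4, 50, 40)],
}
choice, cost = allocate(JOBS, budget_w=600, temp_limit_c=70.0,
                        predict_temp=toy_temperature)
print(choice, cost)
```

In this instance the temperature limit rules out every configuration that runs at the 85W cap, even where the power budget alone would have allowed it.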

We also plan to address the important challenge of handling speed variability across identical nodes. Even in current architectures, this variability becomes pronounced when the processors are power capped or temperature restrained. For example, Rountree et al. [1] show a variation of 8% in performance across 64 processors when the CPU is power capped at 50W (where the TDP is 85W). We have observed a further increase in variation with CPU power caps below 50W. In our earlier work [19], we have shown that the variation in temperatures across processors can be up to 30°C. Hence, temperature restraint through DVFS leads to different processor speeds. These variations cause load imbalance and hence synchronization issues in HPC applications. We propose the use of over-decomposition and subsequent dynamic load balancing through object migration to achieve load balance. Over-decomposition and object migration also allow a job to be dynamically shrunk or expanded to a different number of processors during its execution. This gives an additional degree of freedom to the online resource manager, which can recompute optimal resource allocation decisions for the current set of jobs by changing the configuration of running jobs as new jobs arrive and/or running jobs terminate.
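The idea of balancing over-decomposed work across processors of unequal speed can be sketched with a greedy heuristic. This is an illustrative stand-alone function, not the Charm++ load balancing interface; the object loads and relative processor speeds are assumed to be known from measurement.

```python
def speed_aware_greedy(object_loads, processor_speeds):
    """Assign objects to processors so that predicted finish times are even.

    object_loads: work per object, measured at unit processor speed.
    processor_speeds: relative speed of each processor (1.0 = nominal).
    Objects are taken heaviest first, and each goes to the processor on
    which it would finish earliest given the work already placed there.
    Returns (assignment, finish): assignment[i] is the processor index
    of object i, finish[p] the predicted finish time of processor p."""
    finish = [0.0] * len(processor_speeds)
    assignment = [None] * len(object_loads)
    order = sorted(range(len(object_loads)), key=lambda i: -object_loads[i])
    for i in order:
        p = min(range(len(processor_speeds)),
                key=lambda q: finish[q] + object_loads[i] / processor_speeds[q])
        finish[p] += object_loads[i] / processor_speeds[p]
        assignment[i] = p
    return assignment, finish


# Twelve equal objects on two processors, one throttled to half speed.
assignment, finish = speed_aware_greedy([1.0] * 12, [1.0, 0.5])
print(assignment.count(0), assignment.count(1), finish)
```

The finer the decomposition (more objects per processor), the more closely such a balancer can match the work on each processor to its speed.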

4. EFFORT

The work requires developing performance models and validating them empirically in a data center that supports power capping and control of the room temperature. The Charm++ [29, 30] runtime system provides support for writing custom load balancers. Existing load balancers will have to be adapted to take into account the heterogeneity of nodes under the proposed settings. Charm++ features can be easily realized for legacy MPI codes by using the Adaptive MPI (AMPI) framework provided by Charm++. AMPI [31, 32] is built on top of the Charm++ framework and uses lightweight user-level threads instead of processes. This allows us to virtualize several MPI ranks on a single physical core, which brings the benefits of over-decomposition. The ranks can then be migrated to realize benefits such as load balancing and fault tolerance. Some effort will be required towards development of a tool for automated conversion of MPI programs to AMPI.

Empirical validation on a relatively small-scale data center will be followed by large-scale projections for exascale through simulation. We estimate that an effort with 1 FTE and 2 graduate research assistants over a period of two years will be needed to carry out the proposed research program.

5. REFERENCES


[29] Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil

