Thermal Profiling of Cluster Systems (Summer Research 2015)

This research investigates the performance, thermal behavior, and energy consumption of cluster systems. We used benchmark tools such as Whetstone, Dhrystone, and Postmark, which are widely used in industry and the research community, to simulate CPU-intensive and I/O-intensive workloads on our cluster. Our experimental results showed that I/O operations had an insignificant effect on the inlet and outlet temperatures of the cluster. However, CPU temperature increased by around 40 degrees Celsius when all CPU cores were busy. It took around half an hour to an hour for the CPU cores to reach their maximum steady temperature, and around 4 to 16 minutes for them to cool down to room temperature. We also observed that tasks completed faster when they were spread evenly across the cluster than when they were all assigned to one node. In addition, we developed a real-time monitoring system that lets users track the performance and temperature of cluster systems under a variety of workloads and scheduling strategies. The monitoring system collects data from each node in the cluster every second and visualizes the data for further analysis.

Experiments under CPU-intensive Workload

Whetstone and Dhrystone were used to generate floating point operations and integer operations, respectively, in our experiments. The experimental data showed that CPU temperature increased by around 40 degrees Celsius when all CPU cores were running under heavy workload. It took around half an hour to an hour for the CPU cores to reach their maximum steady temperature, and around 4 to 16 minutes to cool down to room temperature.
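As an illustration of how a warm-up time like the one above could be read off a temperature trace, here is a minimal Python sketch. The function name time_to_steady, the 1-degree tolerance, and the sample readings are all assumptions made for this example; they are not the measured data or the analysis method used in the project.

```python
from typing import Optional, Sequence, Tuple


def time_to_steady(
    samples: Sequence[Tuple[float, float]], tolerance: float = 1.0
) -> Optional[float]:
    """Return the seconds elapsed until the temperature settles.

    samples: (seconds_since_start, temperature_celsius) pairs in time order.
    The last reading is treated as the steady temperature (an assumption).
    The settle time is the first timestamp after which every reading stays
    within `tolerance` degrees of that steady value.
    """
    if not samples:
        return None
    steady = samples[-1][1]
    settle_index = 0
    # Walk backwards to find the last reading outside the tolerance band.
    for i in range(len(samples) - 1, -1, -1):
        if abs(samples[i][1] - steady) > tolerance:
            settle_index = i + 1
            break
    return samples[settle_index][0] - samples[0][0]


# Hypothetical trace: one reading every 10 minutes, rising by 40 degrees.
trace = [(0, 30.0), (600, 50.0), (1200, 62.0),
         (1800, 69.0), (2400, 70.0), (3000, 70.0)]
warm_up_seconds = time_to_steady(trace)
temperature_rise = trace[-1][1] - trace[0][1]
```

The same function can be applied to a cool-down trace, since it only looks at how far each reading is from the final value.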

In particular, we first compared the impact of integer operations and floating point operations on CPU temperature. We set up two groups of experiments, one executing Whetstone and the other executing Dhrystone. In each group, the benchmark tool was executed first as a single-thread process and then as an eight-thread process. We found that running integer computation (Dhrystone) in a single thread produced a larger increase in CPU temperature than running floating point computation (Whetstone) in a single thread. Similar results were observed when running Dhrystone and Whetstone in eight threads.

(These eight figures were created by Wilson Lin.)

We also observed that the thread migrated among CPU cores when Whetstone ran in a single thread. To study thread migration further, we conducted one more group of experiments by running Whetstone in more threads, with each thread driving a CPU core to full utilization. The results of these experiments are available for the following cases: two threads, three threads, five threads, eight threads, and nine threads.

In our cluster, each processor has 8 cores in two core groups. From the experimental results, we found that workload was first assigned to any core in a particular group with relatively lower temperature, and that the workload could migrate among the cores of that group at any time. Once all the cores in the group were fully utilized, new tasks were assigned to cores in other groups.
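The placement pattern described above can be sketched in Python as follows. This is only one reading of the observation, not the operating system's actual scheduler: the function pick_core, the rule of choosing the coolest idle core inside the preferred group, the core numbering, and the temperatures are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def pick_core(
    groups: List[List[int]], temps: Dict[int, float], busy: Set[int]
) -> Optional[int]:
    """Pick a core for a new task.

    groups: core IDs per core group, listed in order of preference.
    temps:  current temperature of each core in degrees Celsius.
    busy:   cores that are already fully utilized.
    The first group with an idle core is used; within it, the coolest
    idle core is chosen. Returns None when every core is busy.
    """
    for group in groups:
        idle = [core for core in group if core not in busy]
        if idle:
            return min(idle, key=lambda core: temps[core])
    return None


# Hypothetical 8-core processor with two groups of four cores each.
core_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
core_temps = {0: 45.0, 1: 41.0, 2: 43.0, 3: 44.0,
              4: 38.0, 5: 39.0, 6: 40.0, 7: 37.0}

first_choice = pick_core(core_groups, core_temps, busy=set())
spill_choice = pick_core(core_groups, core_temps, busy={0, 1, 2, 3})
```

With the first group idle, the task lands on its coolest core; once that group is saturated, the next task spills over to the second group.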

We also conducted three groups of experiments in which CPU utilization was kept at a low level, between low and moderate levels, or between moderate and high levels. The experimental results are shown in: temperature under low CPU utilization, temperature under low or moderate CPU utilization, and temperature under moderate or high CPU utilization.

Experiments under I/O-intensive Workload

To study the thermal behavior of cluster systems under I/O-intensive workload, Postmark was used as a benchmark tool to generate I/O workload in our experiments. Unlike CPU-intensive workload, which raises CPU temperature, I/O-intensive workload can raise disk temperature. However, its impact on disk temperature was insignificant: we observed an increase of only 2 to 3 degrees in our experiments. To explore the relationship between disk activity and disk temperature, we recorded disk temperature while varying disk utilization. The results are shown below.

(These two figures were created by Tuguldur Baigalmaa.)

Experiments using PBS scheduling

The temperature and utilization patterns of cluster systems may vary greatly under different task scheduling strategies. We compared two scheduling strategies in two groups of settings, in which eight threads ran on either one server or four servers. The results showed that the majority of threads completed in a shorter time when they were spread across multiple servers than when they were all assigned to one server. The following are the comparison results of the two groups of experiments under the same workload.
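The two placements compared above can be written down as a small Python sketch. It only models how many threads each server receives; it does not call PBS, and the function assign_threads and the node names are hypothetical.

```python
from typing import Dict, List


def assign_threads(num_threads: int, nodes: List[str], spread: bool) -> Dict[str, int]:
    """Return how many threads each node receives.

    spread=True  distributes threads round-robin across all nodes.
    spread=False packs every thread onto the first node.
    """
    counts = {node: 0 for node in nodes}
    for i in range(num_threads):
        target = nodes[i % len(nodes)] if spread else nodes[0]
        counts[target] += 1
    return counts


# Hypothetical node names; eight threads on one server versus four servers.
servers = ["node1", "node2", "node3", "node4"]
packed = assign_threads(8, servers, spread=False)
spread_out = assign_threads(8, servers, spread=True)
```

In the packed case one server runs all eight threads, while in the spread case each of the four servers runs two.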

Development of Thermal Profiling System

To simplify data collection, retrieval, and visualization, we developed a monitoring system that collects real-time data from each node in a cluster, including CPU and disk utilization and CPU, disk, inlet, and outlet temperatures. The data can be visualized as figures and is also stored in a database, so that all experimental data can be easily retrieved after the experiments are complete. Users can download the data as XML or CSV files for further analysis or backup. The system also provides a filtering function, which allows users to visualize or download the data for a particular time period.
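The store, filter, and export steps described above might look roughly like the following Python sketch. It uses SQLite and the standard csv module purely for illustration; the actual database, schema, and export code of the monitoring system are not described here, so the table name, column names, and functions are assumptions.

```python
import csv
import io
import sqlite3

# Assumed column layout: one row per node per second.
COLUMNS = ["node", "ts", "cpu_util", "disk_util",
           "cpu_temp", "disk_temp", "inlet_temp", "outlet_temp"]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the sample database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "node TEXT, ts INTEGER, cpu_util REAL, disk_util REAL, "
        "cpu_temp REAL, disk_temp REAL, inlet_temp REAL, outlet_temp REAL)"
    )
    return conn


def store_sample(conn: sqlite3.Connection, sample: dict) -> None:
    """Insert one reading; `sample` must contain every key in COLUMNS."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO samples VALUES ({placeholders})",
        [sample[name] for name in COLUMNS],
    )
    conn.commit()


def export_csv(conn: sqlite3.Connection, start_ts: int, end_ts: int) -> str:
    """Return readings with start_ts <= ts <= end_ts as CSV text."""
    rows = conn.execute(
        "SELECT " + ", ".join(COLUMNS) +
        " FROM samples WHERE ts BETWEEN ? AND ? ORDER BY ts, node",
        (start_ts, end_ts),
    ).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical readings taken one second apart on a single node.
store = open_store()
for second in (100, 101, 102):
    store_sample(store, {
        "node": "node1", "ts": second, "cpu_util": 95.0, "disk_util": 5.0,
        "cpu_temp": 70.0, "disk_temp": 33.0,
        "inlet_temp": 22.0, "outlet_temp": 30.0,
    })
csv_text = export_csv(store, start_ts=101, end_ts=102)
```

Keeping the time filter in the SQL query means the same function serves both a full download and a download restricted to a particular time period.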

Future Work

In this research, we have characterized the performance and thermal behavior of cluster systems under a variety of workloads. This information will ultimately serve as the basis for studying energy efficiency and energy saving in cluster systems. In the future, we will conduct more experiments to study the energy consumption of cluster systems under a variety of task scheduling strategies. The monitoring system will also be extended to collect information about memory, GPUs, and network adapters. Monitoring the total energy consumption of cluster systems is another important function planned for the monitoring system. With knowledge of the energy consumption of cluster systems and a full-fledged monitoring system, we will propose new energy-efficient data management and task scheduling approaches for cluster systems, and compare them with state-of-the-art solutions.