# A Case Study of Communication Optimizations on 3D Mesh Interconnects

Abhinav Bhatelé, Eric Bohm, and Laxmikant V. Kalé

Department of Computer Science University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA {bhatele, ebohm, kale}@illinois.edu

**Abstract.** Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, **no-load** latencies do not increase significantly with number of hops traveled. Yet, we, and others have recently shown that in presence of **contention**, message latencies can grow substantially large. Hence task mapping strategies should take the topology of the machine into account on large machines. In this paper, we present topology aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance.

Our methodology builds upon the idea of object-based decomposition used in Charm++ which separates the processes of decomposition from mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom to improve overall performance and scaling. OpenAtom presents complex communication scenarios of interaction between multiple groups of objects and makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.

# 1 Introduction

A significant number of the largest supercomputers in use today, including IBM's Blue Gene family and Cray's XT family, employ a 3D mesh or torus topology. With tens of thousands of nodes, messages may have to travel many tens of hops before arriving at their destinations. With the advances in communication technology, especially wormhole routing, it was observed that the *latency* of communication was almost unaffected by the number of hops traveled by a message [1, 2]. However, the fraction of *bandwidth* occupied by the message is proportional to the number of hops (links) traversed. Increased contention for bandwidth might result in longer latencies.

Consider a computation in which each processor sends one message to some other processor, in a data permutation pattern (so each processor also receives one message). For a thousand nodes organized in a  $10 \times 10 \times 10$  3D torus, the

**Table 1.** Execution time per step of OPENATOM on Blue Gene/L (CO mode) without topology aware mapping (System: WATER\_32M\_70Ry)

| Cores       | 512   | 1024  | 2048  | 4096  | 8192  |
|-------------|-------|-------|-------|-------|-------|
| Time (secs) | 0.274 | 0.189 | 0.219 | 0.167 | 0.129 |

average number of hops traveled by a random message (i.e. the average internode distance) is 7.5. In a torus, a message needs to travel at most 5 hops in each dimension, leading to an average of 2.5 per dimension. So, if we organize our computation so that each message travels only one hop, we will use 7.5 times smaller global bandwidth than in the case of a random communication pattern. This was not as significant an issue when the number of nodes was relatively small and the processors were slower <sup>1</sup>. But for today's large machines with faster processors, the issue becomes much more significant: on a machine with 64,000 nodes organized in a  $40 \times 40 \times 40$  3D torus, the average inter-node distance is 30, and that is the ratio of bandwidth used by the two communication patterns. Further, with faster processors, the need for delivered bandwidth is higher. As the communication starts to occupy a large fraction of the available bandwidth, the contention in the network increases and message delivery gets delayed [3].

In this context, it is important to map computation to the processors to not just minimize the overall communication volume, but also the average number of hops traveled by the bytes communicated. Even though the utility of doing this may be apparent to programmers, the significance of the impact is probably more than most programmers expect. Our methodology builds upon object-based decomposition used in CHARM++ [4] and related programming models, including Adaptive MPI (AMPI) [5]. This separates the processes of decomposition from mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects.

In this paper, we first present the abstraction of object-based decomposition in CHARM++ and an API which provides a uniform interface for obtaining topology information at runtime on four different machines – Cray XT3, Cray XT4, IBM Blue Gene/L and IBM Blue Gene/P. This API can be used by user-level codes for task mapping and is independent of the programming model being used (MPI, OpenMP, CHARM++ or something else). We then demonstrate topology aware mapping techniques for a communication intensive application: OPENATOM, a production Car-Parrinello *ab initio* Molecular Dynamics (CPAIMD) code used by scientists to study properties of materials and nano-scale molecular structures for biomimetic engineering, molecular electronics, surface catalysis, and many other areas [6–9].

Initial scaling studies of OPENATOM on Blue Gene/L uncovered inadequate parallel efficiency for a small system with 32 atoms (Table 1). Further analysis

<sup>&</sup>lt;sup>1</sup> This fact had made early work on topology aware mapping obsolete. But with large machines, bandwidth effects have again become important.

isolated the cause to poor communication performance, which naturally led us to consider the network topology. We consider 3D torus topologies in this paper but not irregular networks or flat topologies. For logarithmic topologies (such as fat trees), the need to pay attention to topology may be smaller because the maximum number of hops between nodes tends to be small. Also, there is no support for user level derivation of topology for most fat-tree networks so any implementation would be specific to an individual cluster layout.

# 2 Previous Work

There has been considerable research on the task mapping problem. It has been proven that the problem is NP-complete and computationally equivalent to the general graph embedding problem. Heuristic techniques like pairwise exchange were developed in the 80s by Bokhari [10] and Aggarwal [11]. These schemes, however, are not scalable when mapping millions of objects to thousands of processors. This problem has been handled by others using recursive partitioning [12] and graph contraction [13] by mapping sub-graphs to a subset of processors. Physical optimization techniques like simulated annealing [14] and genetic algorithms [15] are very effective but can take very long to arrive at optimal results. Results in the 80s and 90s were not demonstrated on real machines and even when they were, they were targeted towards small sized machines. They also did not consider real applications. With the emergence of large parallel machines, we need to revisit these techniques and build upon them, on a foundation of real machines, in the context of real applications.

A lot of work until the 90s was focused on hypercube networks [11, 14]. The development of large parallel machines like Blue Gene/L, XT3, XT4 and Blue Gene/P has led to the re-emergence of mapping issues. Both application and system developers have evolved mapping techniques for Blue Gene/L [16–20]. Yu [20] and Smith [19] discuss embedding techniques for graphs onto the 3D torus of BG/L which can be used by the MPI Topology functions. Weisser et al. [21] present an analysis of topology aware job placement techniques for XT3. However, our work is a one of the first for task mapping on XT3. This work builds on our previous work [22] demonstrating the effectiveness of topology aware task mapping for a 3D Stencil kernel. It presents a case study for OPENATOM, a production quantum chemistry code, and demonstrates high returns using topology aware schemes. Our earlier publication on OPENATOM [23] demonstrates the effectiveness of such schemes on BG/L. In this paper, we present results for XT3, BG/L and BG/P for multiple systems including a non-benchmark simulation.

On machines like Blue Gene/L and Blue Gene/P, obtaining topology information is simple and an interface is available to the programmers. The API described in this paper provides a wrapper for these and additional functionality as we shall see in Section 3.1. However, on Cray XT machines, there is no interface for topology information, in accordance with the widespread, albeit mistaken idea, that topology mapping is not important on fast Cray machines. For XT machines, our API uses lower level system calls to obtain information about allocated partitions at runtime. To the best of our knowledge, there is no published work describing such functionality for the Cray machines. We believe that this information will be useful to programmers running on Cray machines. Also, the API provides a uniform interface which works on all these machines which hides architecture specific details from the application programmer. This API can be used as a library for CHARM++, MPI or any other parallel program.

# 3 Charm++ Arrays: A Useful Abstraction for Mapping

Parallelizing an application consists of two tasks: 1. decomposition of the problem into a large number of sub-problems to facilitate efficient parallelization to thousands of processors, 2. mapping of these sub-problems to physical processors to ensure load balance and minimum communication. Object-based decomposition separates the two tasks and gives independent control over both of them. In this paper, we use the CHARM++ runtime which allows the application developer to decompose the problem into objects and the CHARM++ runtime does a default mapping of objects to processors.

The basic unit of computation in CHARM++ is called a *chare* (simply referred to as an "object" in this paper) which can be invoked through remote function calls. The application developer decomposes the problem into chares or objects and the CHARM++ runtime does a default mapping of objects to processors. Each processor can have multiple objects which facilitates overlap of computation and communication. This default mapping does not have any information about the topology of the machine. The user can override the default mapping with more intelligent schemes that take the topology of the machine into account.

#### 3.1 Topology Manager API: Runtime Information

Mapping of communication graphs onto the processor graph requires information about the machine topology at runtime. The application should be able to query the runtime to get information like the dimensions of the allocated processor partition, mapping of ranks to physical nodes etc. However, the mapping interface should be simple and should hide machine-specific details from the application. The Topology Manager API in CHARM++ provides a uniform interface to the application developer and hence the application just knows that the job partition is a 3D torus or mesh topology. Application specific task mapping decisions require no architecture or machine specific knowledge (BG/L or XT3 for example).

The Topology Manager API in CHARM++ provides different functions which can be grouped into the following categories:

1. Size and properties of the allocated partition: At runtime, the application needs to know the dimensions of the allocated partition (getDimNX, getDimNY, getDimNZ), number of cores per node (getDimNT) and whether we have a torus or mesh in each dimension (isTorusX, isTorusY, istorusZ).

- 2. Properties of an individual node: The interface also provides calls to convert from ranks to physical coordinates and vice-versa (rankToCoordinates, coordinatesToRank).
- 3. Additional Functionality: Mapping algorithms often need to calculate number of hops between two ranks or pick the closest rank to a given rank from a list. Hence, the API provides functions like getHopsBetweenRanks, pickClosestRank and sortRanksByHops to facilitate mapping algorithms.

We now discuss the process of extracting this information from the system at runtime and why is it useful to use the Topology Manager API on different machines:

**IBM Blue Gene machines:** On Blue Gene/L and Blue Gene/P [24], topology information is available through system calls to the "BGLPersonality" and "BGPPersonality" data structures, respectively. It is useful to use the Topology Manager API instead of the system calls for two reasons. First, these system calls can be expensive (especially on Blue Gene/L) and so it is advisable to avoid doing too many of them. The API does a few system calls to obtain enough information so that it can construct the topology information itself. It is useful to use the API instead of expensive system calls throughout the execution.

**Cray XT machines:** Cray machines have been designed with a significant overall bandwidth, and possibly for this reason, documentation for topology information was not readily available at the installations we used. We hope that the information provided here will be useful to other application programmers.

Obtaining topology information on XT machines is a two step process: 1. Getting the node ID (nid) corresponding to a given MPI rank (pid) which tells us which physical node a given MPI rank is on. This can be done through different system calls on XT3 and XT4: cnos\_get\_nidpid\_map available through "catamount/cnos\_mpi\_os.h" and PMI\_Portals\_get\_nidpid\_map available from "pmi.h". These calls provide a map for all ranks in the current job and their corresponding node IDs. 2. The second step is obtaining the physical coordinates for a given node ID. This can be done by using the system call rca\_get\_meshcoord from "rca\_lib.h". Once we have information about the physical coordinates for all ranks in the job, the API derives information such as the extent of the allocated partition by itself (this assumes that the machine has been reserved and we have a contiguous partition).

The API provides a uniform interface which works on all the above mentioned machines which hides architecture specific details from the application programmer. This API can be used as a library for CHARM++, MPI or any other parallel program. The next section describes the use of object-based decomposition and the Topology Manager API in a production code.



**Fig. 1.** Decomposition of the physical system into chare arrays (only important ones shown for simplicity) in OPENATOM

# 4 OpenAtom: A Case Study

An accurate understanding of phenomena occurring at the quantum scale can be achieved by considering a model representing the electronic structure of the atoms involved. The CPAIMD method [25] is one such algorithm which has been widely used to study systems containing  $10-10^3$  atoms. To achieve a fine-grained parallelization of CPAIMD, computation in OPENATOM [23] is divided into a large number of objects, enabling scaling to tens of thousands of processors. We will look at the parallel implementation of OPENATOM, explain the communication involved and then analyze the benefit from topology aware mapping of its objects.

In an *ab initio* approach, the system is driven by electrostatic interactions between the nuclei and electrons. Calculating the electrostatic energy involves computing several terms. Hence, CPAIMD computations involve a large number of phases with high inter-processor communication: (1) quantum mechanical kinetic energy of non-interacting electrons, (2) Coulomb interaction between electrons or the Hartree energy, (3) correction of the Hartree energy to account for the quantum nature of the electrons or the exchange-correlation energy, and (4) interaction of electrons with atoms in the system or the external energy. These phases are discretized into a large number of objects which generate a lot of communication, but ensures efficient interleaving of work. The entire computation is divided into ten phases which are parallelized by decomposing the physical system into fifteen *chare arrays*. For a detailed description of this algorithm please refer to [23].

#### 4.1 Communication Dependencies

The ten phases referred to in the previous section are parallelized by decomposing the physical system into fifteen *chare arrays* of different dimensions (ranging between one and four). A simplified description of five of these arrays (those most relevant to the mapping) follows:

1. **GSpace and RealSpace:** These represent the g-space and real-space representations of each of the electronic states [25]. Each electronic state is represented by a 3D array of "complex numbers". OPENATOM decomposes this

data into a 2D chare array of objects. Each object holds a plane of one of the states (see Figure 1). The chare arrays are represented by G(s,p)  $[n_s \times N_g]$  and R(s,p)  $[n_s \times N]$  respectively. GSpace and RealSpace interact through transpose operations (as part of a Fast Fourier Transform) in Phase I and hence all planes of one state of GSpace interact with all planes of the same state of RealSpace.

- 2. **RhoG and RhoR:** They are the g-space and real-space representations of electron density and are decomposed into 1D and 2D *chare arrays* respectively. They are represented as  $G_{\rho}(p)$  and  $R_{\rho}(p, p')$ . RealSpace interacts with RhoR through reductions in Phase II. RhoG is obtained from RhoR in Phase III through two transposes.
- 3. **PairCalculators:** These 3D chare arrays are used in phase IV. They communicate with GSpace through multicasts and reductions. They are represented as  $P_c(s, s', p) [n_s \times n_s \times N_g]$ . All elements of the GSpace array with a given state index interact with all elements of the PairCalculator array with the same state in one of their first two dimensions.

### 4.2 Mapping

OPENATOM provides us with a scenario where the load on each object is static (under the CPAIMD method) and the communication is regular and clearly understood. Hence, it should be possible to intelligently map the arrays in this application to minimize inter-processor communication and maintain load balance. OPENATOM has a default mapping scheme, but it should be noted that the default mapping is far from random. It is the mapping scheme used on standard fat-tree networks, wherein objects which communicate frequently are co-located on processors within the constraints of even distribution. This reduces the total communication volume. It only lacks a model for considering the relative distance between processors in its mapping considerations. We can do better than the default mapping by using the communication and topology information at runtime. We now describe how a complex interplay (of communication dependencies) between five of the *chare arrays* is handled by our mapping scheme.

GSpace and RealSpace are 2D chare arrays with states in one dimension and planes in the other. These arrays interact with each other through transpose operations where all planes of one state in GSpace, G(s, \*) talk to all planes of the same state, R(s, \*) in RealSpace (state-wise communication). The number of planes in GSpace is different from that in RealSpace. GSpace also interacts with the PairCalculator arrays. Each plane of GSpace, G(\*, p) interacts with the corresponding plane, P(\*, \*, p) of the PairCalculators (plane-wise communication) through multicasts and reductions. So, GSpace interacts state-wise with RealSpace and plane-wise with PairCalculators. If all planes of GSpace are placed together, then the transpose operation is favored, but if all states of GSpace are placed together, the multicasts/reductions are favored. To strike a balance between the two extremes, a hybrid map is built, where a subset of planes and states of these three arrays are placed on one processor.



Fig. 2. Mapping of different arrays to the 3D torus of the machine

**Mapping GSpace and RealSpace Arrays:** Initially, the GSpace array is placed on the torus and other objects are mapped relative to GSpace's mapping. The 3D torus is divided into rectangular boxes (which will be referred to as "prisms") such that the number of prisms is equal to the number of the planes in GSpace. The longest dimension of the prism is chosen to be same as one dimension of the torus. Each prism is used for all states of one plane of GSpace. Within each prism for a specific plane, the states in G(\*, p) are laid out in increasing order along the long axis of the prism. Figure 2 shows the GSpace prisms (box at the bottom) being mapped along the long dimension of the torus (box in the center). Once GSpace is mapped, the RealSpace objects are placed. Prisms perpendicular to the GSpace prisms are created which are formed by including processors holding all planes for a particular state of GSpace, G(s,\*). These prisms (box on the left) are perpendicular to the GSpace prisms and the corresponding states of RealSpace, R(s,\*) are mapped on to these prisms.

Mapping of Density Arrays: RhoR objects communicate with RealSpace plane-wise and hence  $R_{\rho}(p, *)$  have to be placed close to R(\*, p). To achieve this, we start with the centroid of the prism used by R(\*, p) and place RhoR objects in proximity to it. RhoG objects,  $G_{\rho}(p)$  are mapped near RhoR objects,  $R_{\rho}(p, *)$  but not on the same processors as RhoR to maximize overlap. The density computation is inherently smaller and hence occupies the center of the torus (box on the top in Figure 2).

**Mapping PairCalculator Arrays:** Since PairCalculator and GSpace objects interact plane-wise, the aim is to place G(\*, p) and P(\*, \*, p) nearby. Chares

with indices P(s1, s2, p) are placed around the centroid of  $G(s1, p), ..., G(s1 + block\_size, p)$  and  $G(s2, p), ..., G(s2 + block\_size, p)$ . This minimizes the hopcount for the multicast and reduction operations. The result of this mapping co-locates each plane of PairCalculators (box on the right in Figure 2) with its corresponding plane of GSpace objects within the GSpace prisms.

The mapping schemes discussed above substantially reduce the hop-count for different phases. They also restrict different communication patterns to specific prisms within the torus, thereby reducing contention and ensuring balanced communication throughout the torus. State-wise and plane-wise communication is confined to different (orthogonal) prisms. This helps avoid scaling bottlenecks as we will see in Section 4.3. These maps perform no better (and generally slightly worse) than the default maps on architectures which have more uniform network performance, such as Ethernet or Infiniband.

**Time Complexity:** Although maps are created only once during application start-up, they must still be efficient in terms of their space and time requirements. The memory cost of these maps grows linearly (3 integers per object) with the number of objects, which is a few megabytes in the largest system studied. The runtime cost of creating the most complex of these maps is  $O(pn^2log(n))$  where n is the number of objects and p the number of processors. Despite this complexity, this time is sufficiently small that generating the maps for even the largest systems requires only a few minutes. As an optimization, once created, maps can be stored and reloaded in subsequent runs to minimize restart time. Offline creation of maps using even more sophisticated techniques and adapting these ideas to other topologies is an area of future work.

### 4.3 Comparative Analysis of OpenAtom

To analyze the effects of topology aware mapping in a production science code we studied the strong scaling (fixed problem size) performance of OPENATOM with and without topology aware mapping. Two benchmarks commonly used in the CPMD community: the minimization of WATER\_32M\_70Ry and WA-TER\_256M\_70Ry were used. The benchmarks simulate the electronic structure of 32 molecules and 256 molecules of water, respectively, with a standard g-space spherical cutoff radius of  $|\mathbf{g}|_{cut}^2 = 70$  Rydberg (Ry) on the states. To illustrate that the performance improvements extend beyond benchmarks to production science systems, we also present results for GST\_BIG, which is a system being studied by our collaborator, Dr Glenn J. Martyna. GST\_BIG consists of 64 molecules of Germanium, 128 molecules of Antimony and 256 molecules of Tellurium at cutoff radius of  $|\mathbf{g}|_{cut}^2 = 20$  Ry on the states.

Blue Gene/L (IBM T. J. Watson) runs are done in co-processor (CO) mode to use a single core per node. Single core per node runs were chosen to highlight interconnect performance and to facilitate fair comparisons between the two machines. Blue Gene/P (Intrepid at ANL) runs were done in VN mode using all

| $WATER_{32}M_{70}Ry WATER_{256}M_{70}Ry$ |         |                  |       |                         | GST_BIG |        |
|------------------------------------------|---------|------------------|-------|-------------------------|---------|--------|
| Cores                                    | Default | Topology Default |       | Topology Default Topolo |         | pology |
| 512                                      | 0.274   | 0.259            | -     | -                       | -       | -      |
| 1024                                     | 0.189   | 0.150            | 19.10 | 16.4                    | 10.12   | 8.83   |
| 2048                                     | 0.219   | 0.112            | 13.88 | 8.14                    | 7.14    | 6.18   |
| 4096                                     | 0.167   | 0.082            | 9.13  | 4.83                    | 5.38    | 3.35   |
| 8192                                     | 0.129   | 0.063            | 4.83  | 2.75                    | 3.13    | 1.89   |
| 16384                                    | -       | -                | 3.40  | 1.71                    | 1.82    | 1.20   |

Table 2. Execution time per step (in secs) of OPENATOM on Blue Gene/L (CO mode)

Table 3. Execution time per step (in secs) of OPENATOM on Blue Gene/P (VN mode)

| WATER_32M_70Ry WATER_256M_70Ry |         |                  |       |          |  |  |
|--------------------------------|---------|------------------|-------|----------|--|--|
| Cores                          | Default | Topology Default |       | Topology |  |  |
| 256                            | 0.395   | 0.324            | -     | -        |  |  |
| 512                            | 0.248   | 0.205            | -     | -        |  |  |
| 1024                           | 0.188   | 0.127            | 10.78 | 6.70     |  |  |
| 2048                           | 0.129   | 0.095            | 6.85  | 3.77     |  |  |
| 4096                           | 0.114   | 0.067            | 4.21  | 2.17     |  |  |
| 8192                           | -       | -                | 3.52  | 1.77     |  |  |

four cores per node. Cray XT3 (BigBen at PSC) runs are done in two modes: single core per node (SN) and two cores per node (VN).

As shown in Table 2, performance improvements from topology aware mapping for Blue Gene/L (BG/L) can be quite significant. As the number of cores and likewise, the diameter of the torus grows, the performance impact increases until it is a factor of two faster for WATER\_32M\_70Ry at 2048 and for WATER\_256M\_70Ry at 16384 cores. There is a maximum improvement of 40% for GST\_BIG. The effect is not as strong in GST\_BIG due to the fact that the time step in this system is dominated by a subset of the orthonormalization process which has not been optimized extensively, but even a 40% improvement represents a substantial improvement in time to solution.

Performance improvements on Blue Gene/P (Table 3) are similar to those observed on BG/L. The improvement for WATER\_32M\_70Ry is not as remarkable as on BG/L but for WATER\_256M\_70Ry, we see a factor of 2 improvement starting at 2048 cores. The absolute numbers on BG/P are much better than on BG/L partially because of the increase in processor speeds but more due to the better interconnect (higher bandwidth and DMA engine). The performance for WATER\_256M\_70Ry at 1024 cores is 2.5 times better on BG/P than on BG/L. This is when comparing the VN mode on BG/P to the CO mode on BG/L. If we use only one core per node on BG/P, the performance difference is even greater,

|       | WATER                | _32M_70Ry V      | VATER_   | $256 M_{-}70 Ry$ | GS        | T_BIG  |  |
|-------|----------------------|------------------|----------|------------------|-----------|--------|--|
| Cores | Default              | Topology Default |          | Topology D       | efault To | pology |  |
|       | Single core per node |                  |          |                  |           |        |  |
| 512   | 0.124                | 0.123            | 5.90     | 5.37             | 4.82      | 3.86   |  |
| 1024  | 0.095                | 0.078            | 4.08     | 3.24             | 2.49      | 2.02   |  |
|       |                      | T                | vo cores | per node         |           |        |  |
| 256   | 0.226                | 0.196            | -        | -                | -         | -      |  |
| 512   | 0.179                | 0.161            | 7.50     | 6.58             | 6.28      | 5.06   |  |
| 1024  | 0.144                | 0.114            | 5.70     | 4.14             | 3.51      | 2.76   |  |
| 2048  | 0.135                | 0.095            | 3.94     | 2.43             | 2.90      | 2.31   |  |

Table 4. Performance (time per step in secs) of OPENATOM on XT3.



Fig. 3. Comparison of benefit by topology aware mapping (for WATER\_256M\_70Ry)

but the higher core per node count, combined with the DMA engine and faster network make single core per node use less interesting on BG/P.

The improvements from topology awareness on Cray XT3, presented in Table 4 are comparable to those on BG/L and BG/P. The improvement of 20% and 18.8% on XT3 for WATER\_256\_70Ry and GST\_BIG at 1024 cores is greater than the improvement of 14% and 12% respectively on BG/L at 1024 cores in spite of a much faster interconnect.

The improvement trends plotted in Figure 3 lead us to project that topology aware mapping should yield improvements proportional to torus size on larger Cray XT installations. The difference in processor speeds is approximately a factor of 4 (XT3 2.6 Ghz, BG/L 700 Mhz), which is reflected in the performance for the larger grained OPENATOM results on XT3 when comparing single core per node performance. The difference in network performance is approximately a factor of 7 (XT3 1.1 GB/sec, BG/L 150 MB/sec), when considering delivered bandwidth as measured by HPC Challenge [26] ping pong. This significant dif-



Fig. 4. Effect of topology aware mapping on latency and bandwidth in OPENATOM

ference in absolute speed and computation/bandwidth ratios does not shield the XT3 from performance penalties from topology ignorant placement schemes.

As discussed in prior sections, OPENATOM is highly communication bound. Although CHARM++ facilitates the exploitation of the available overlap and latency tolerance across phases, the amount of latency tolerance inevitably drops as the computation grain size is decreased by the finer decomposition required for larger parallel runs. It is important to consider the reasons for these performance improvements in more detail. Figure 4 compares idle time as captured by the Projections profiling system in CHARM++ for OPENATOM on BG/L for the default mapping, versus the topology aware mapping. A processor is idle whenever it is waiting for messages to arrive. It is clear from Figure 4 that the factor of two speed increase from topology awareness is reflected directly in relative idle time and that the maximum speed increase which can be obtained from topology aware mapping is a reduction in the existing idle time.

It is illuminating to study the exact cause for this reduction in idle time. To that end, we ported IBM's High Performance Monitor library [27] for Blue Gene/P's Universal Performance Counters to Charm++, and enabled performance counters for a single time step in WATER\_256M\_70Ry in both topology aware and non-topology aware runs. We summed the per node torus counters (BGP\_TORUS\_\*\_32BCHUNKS), to produce the aggregate bandwidth consumed by one step across all nodes to obtain the results in Figure 4. It is clear from the figure, that topology aware mapping results in a significant reduction, by up to a factor of two, in the total bandwidth consumed by the application. This more efficient use of the network is directly responsible for the reduction in latency due to contention and decreased idle time.

### 5 Conclusion and Future Work

In this paper we demonstrated that topology aware mapping can substantially improve performance for communication intensive applications on 3D torus networks. Significant improvements were shown for the OPENATOM code and the effectiveness of topology aware mapping was shown for both IBM Blue Gene and Cray XT architectures. Mapping was facilitated by the object-based virtualization in CHARM++ and the availability of the Topology Manager API.

OPENATOM has complex but relatively regular or structured communication. We think that it may be possible to develop general methodologies that deal with such structured communication. Unstructured static communication patterns, as represented by unstructured-mesh computations might need somewhat different mapping techniques. Work in the future would be on the lines of an automatic mapping framework which can work in tandem with the topology manager interface. This would also require a study of a more diverse set of applications with different communication patterns. Further study will be given to characterizing network resource usage patterns with respect to those which are most affected by topology aware task mapping.

# Acknowledgments

This work was supported in part by a DOE Grant B341494 funded by Center for Simulation of Advanced Rockets, DOE grant DE-FG05-080R23332 through ORNL LCF, and a NSF Grant ITR 0121357 for Quantum Chemistry. This research was supported in part by NSF through TeraGrid [28] resources provided by NCSA and PSC through grants ASC050040N and MCA93S028. We thank Shawn T. Brown and Chad Vizino from PSC for help with system reservations and runs on BigBen. We also thank Fred Mintzer, Glenn Martyna and Sameer Kumar from IBM for access and assistance in running on the Watson Blue Gene/L. We also used running time on the Blue Gene/P at Argonne National Laboratory, which is supported by DOE under contract DE-AC02-06CH11357.

# References

- Greenberg, R.I., Oh, H.C.: Universal wormhole routing. IEEE Transactions on Parallel and Distributed Systems 08(3) (1997) 254–262
- Ni, L.M., McKinley, P.K.: A survey of wormhole routing techniques in direct networks. Computer 26(2) (1993) 62–76
- Bhatele, A., Kale, L.V.: An Evaluation of the Effect of Interconnect Topologies on Message Latencies in Large Supercomputers. In: Proceedings of Workshop on Large-Scale Parallel Processing (IPDPS '09). (May 2009)
- Kalé, L., Krishnan, S.: CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Paepcke, A., ed.: Proceedings of OOPSLA'93, ACM Press (September 1993) 91–108
- Bhandarkar, M., Kale, L.V., de Sturler, E., Hoeflinger, J.: Object-Based Adaptive Load Balancing for MPI Programs. In: Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS 2074. (May 2001) 108–117
- A, P., MS, H., R, C.: Interface structure between silicon and its oxide by firstprinciples molecular dynamics. Nature **396** (1998) 58
- L, D.S., P, C.: Serine proteases: An ab initio molecular dynamics study. Proteins 37 (1999) 611

- 8. Saitta, A.M., Soper, P.D., Wasserman, E., Klein, M.L.: Influence of a knot on the strength of a polymer strand. Nature **399** (1999) 46
- U, R., P, C., K, D., M, P.: A comparative study of galactose oxidase and active site analogs based on QM/MM Car Parrinello simulations. J. Biol. Inorg. Chem. 5 (2000) 236
- Bokhari, S.H.: On the mapping problem. IEEE Trans. Computers 30(3) (1981) 207–214
- 11. Lee, S.Y., Aggarwal, J.K.: A mapping strategy for parallel processing. IEEE Trans. Computers **36**(4) (1987) 433–442
- Ercal, F., Ramanujam, J., Sadayappan, P.: Task allocation onto a hypercube by recursive mincut bipartitioning. In: Proceedings of the 3rd conference on Hypercube concurrent computers and applications, ACM Press (1988) 210–221
- Berman, F., Snyder, L.: On mapping parallel algorithms into parallel architectures. Journal of Parallel and Distributed Computing 4(5) (1987) 439–458
- 14. Bollinger, S.W., Midkiff, S.F.: Processor and link assignment in multicomputers using simulated annealing. In: ICPP (1). (1988) 1–7
- Arunkumar, S., Chockalingam, T.: Randomized heuristics for the mapping problem. International Journal of High Speed Computing (IJHSC) 4(4) (December 1992) 289–300
- Bhanot, G., Gara, A., Heidelberger, P., Lawless, E., Sexton, J.C., Walkup, R.: Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development 49(2/3) (2005) 489–500
- Gygi, F., Draeger, E.W., Schulz, M., Supinski, B.R.D., Gunnels, J.A., Austel, V., Sexton, J.C., Franchetti, F., Kral, S., Ueberhuber, C., Lorenz, J.: Large-Scale Electronic Structure Calculations of High-Z Metals on the Blue Gene/L Platform. In: Proceedings of the International Conference in Supercomputing, ACM Press (2006)
- Bhatelé, A., Kalé, L.V., Kumar, S.: Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications. In: 23rd ACM International Conference on Supercomputing. (2009)
- Smith, B.E., Bode, B.: Performance Effects of Node Mappings on the IBM Blue Gene/L Machine. In: Euro-Par. (2005) 1005–1013
- Yu, H., Chung, I.H., Moreira, J.: Topology mapping for Blue Gene/L supercomputer. In: SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, New York, NY, USA, ACM (2006) 116
- Deborah Weisser, Nick Nystrom, Chad Vizino, Shawn T. Brown, and John Urbanic: Optimizing Job Placement on the Cray XT3. 48th Cray User Group Proceedings (2006)
- Bhatelé, A., Kalé, L.V.: Benefits of Topology Aware Mapping for Mesh Interconnects. Parallel Processing Letters (Special issue on Large-Scale Parallel Processing) 18(4) (2008) 549–566
- Bohm, E., Bhatele, A., Kale, L.V., Tuckerman, M.E., Kumar, S., Gunnels, J.A., Martyna, G.J.: Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM Journal of Research and Development: Applications of Massively Parallel Systems 52(1/2) (2008) 159–174
- 24. IBM Blue Gene Team: Overview of the IBM Blue Gene/P project. IBM Journal of Research and Development **52**(1/2) (2008)
- Tuckerman, M.E.: Ab initio molecular dynamics: Basic concepts, current trends and novel applications. J. Phys. Condensed Matter 14 (2002) R1297

- Dongarra, J., Luszczek, P.: Introduction to the HPC Challenge Benchmark Suite. Technical Report UT-CS-05-544, University of Tennessee, Dept. of Computer Science (2005)
- Salapura, V., Ganesan, K., Gara, A., Gschwind, M., Sexton, J., Walkup, R.: Next-Generation Performance Counters: Towards Monitoring Over Thousand Concurrent Events. In: IEEE International Symposium on Performance Analysis of Systems and Software. (April 2008) 139 – 146
- 28. Catlett, C., et. al.: TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications. In Grandinetti, L., ed.: HPC and Grids in Action, Amsterdam, IOS Press (2007)