9<sup>th</sup> Annual Workshop on CHARM++ and its Applications

## Improving CHARM++ Performance with a NUMA-aware Load Balancer

Laércio Lima Pilla<sup>1,2</sup>, Christiane Pousa<sup>2</sup>, Daniel Cordeiro<sup>2,3</sup>, Abhinav Bhatele<sup>4</sup>, Philippe O. A. Navaux<sup>1</sup>, Jean-François Méhaut<sup>2</sup>, Laxmikant V. Kale<sup>4</sup>

<sup>1</sup>Federal University of Rio Grande do Sul – Porto Alegre, Brazil
 <sup>2</sup>Grenoble University – Grenoble, France
 <sup>3</sup>University of São Paulo – São Paulo, Brazil
 <sup>4</sup>University of Illinois at Urbana-Champaign – Urbana, IL, USA













### Summary

How we used NUMA architectural information to build a CHARM++ load balancer and improve overall performance.



- <u>NUMA</u>
- Our Load Balancer: NUMALB
- Experimental Setup
- Results
- Concluding Remarks

## UMA vs. NUMA

#### **Uniform Memory Access**

- Centralized shared memory
  - Uniform latencies
- Data placement does not matter



#### **Non-Uniform Memory Access**

- Distributed shared memory
  - Non-uniform latencies
- Data placement matters



#### **Reduce latencies**



#### Reduce contention/improve bandwidth



#### CHARM++ does not consider these characteristics

#### **Physical organization**



#### CHARM++'s view (UMA)

No memory hierarchy





#### **Our Load Balancer: NUMALB**


- Application data (from the CHARM++ LB framework)
  - Processor load: execution time
  - Chare load: execution time
  - Communication graph: size and number of messages
- NUMA topology (from archTopology, our library)
  - Core to NUMA node (socket) hierarchy mapping
  - NUMA factor

#### NUMA factor

$$\text{NUMA factor}(i, j) = \frac{\text{read latency from } i \text{ to } j}{\text{read latency on } i}$$
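As a rough illustration of this definition, the sketch below derives a NUMA factor matrix from a matrix of read latencies. The function name `numa_factors` and the latency values are hypothetical; they are not taken from archTopology or from the machines used in this work.

```python
def numa_factors(latency):
    """Return factor[i][j] = latency[i][j] / latency[i][i].

    latency[i][j] is the read latency seen from NUMA node i when
    accessing memory on NUMA node j (any consistent time unit).
    """
    n = len(latency)
    return [[latency[i][j] / latency[i][i] for j in range(n)]
            for i in range(n)]


# Hypothetical latencies (in nanoseconds) for a two-node machine.
latency_ns = [[100.0, 150.0],
              [140.0, 100.0]]
factors = numa_factors(latency_ns)
print(factors)
```

Local accesses always get a factor of 1, so larger values directly express how much slower a remote node is than the local one.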

#### Heuristic

- Task mapping is NP-Hard
- No initial assumptions about the application
- List scheduling
  - Put tasks on a priority list by load
  - Assign tasks to the processor with the smallest cost in a greedy fashion
- Improve performance
  - by reducing load imbalance
  - by reducing remote communication costs
  - while avoiding migrations (data movement costs)

Cost function

$$\mathrm{cost}(c,p) = \mathrm{load}(p) + a \times \left(r_{comm}(c,p) \times \text{NUMA factor} - l_{comm}(c,p)\right)$$

#### Where

- $c$: chare
- $p$: core
- $\mathrm{load}(p)$: load (execution time) on core $p$
- $r_{comm}(c,p)$: number of messages sent by chare $c$ to chares on other NUMA nodes
- $l_{comm}(c,p)$: number of messages sent by chare $c$ to chares on the same NUMA node
- $a$: communication weight
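A small worked example may help in reading the cost function. The sketch below evaluates it for one chare and one candidate core; the function signature and all numeric values are illustrative assumptions, not values from the experiments.

```python
def cost(load_p, remote_msgs, local_msgs, numa_factor, a):
    """cost(c, p) = load(p) + a * (r_comm * NUMA factor - l_comm)."""
    return load_p + a * (remote_msgs * numa_factor - local_msgs)


# Hypothetical candidate core far from most of the chare's peers:
# 10 + 0.5 * (4 * 1.5 - 2) = 12.0
far = cost(load_p=10.0, remote_msgs=4, local_msgs=2, numa_factor=1.5, a=0.5)

# Hypothetical candidate core on the node that holds most peers:
# 10 + 0.5 * (2 * 1.5 - 4) = 9.5
near = cost(load_p=10.0, remote_msgs=2, local_msgs=4, numa_factor=1.5, a=0.5)
print(far, near)
```

With equal core loads, the core whose NUMA node hosts more of the chare's communication partners gets the lower cost, because local messages reduce the cost while remote messages are penalized by the NUMA factor.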

**Input**: $C$ set of chares, $P$ set of cores, $M$ mapping

**Output**: $M'$ mapping of chares to cores

Steps 3 to 10 form the body of the loop.

1. $M' \leftarrow M$
2. **while** $C \neq \emptyset$ **do** (for the number of chares)
3. $c \leftarrow v \mid v \in \arg\max_{u \in C} \mathrm{load}(u)$ (take heaviest chare)
4. $C \leftarrow C \setminus \{c\}$
5. $p \leftarrow q \mid q \in P \land (c,q) \in M$ (get its core)
6. $\mathrm{load}(p) \leftarrow \mathrm{load}(p) - \mathrm{load}(c)$ (remove its load from its core)
7. $M' \leftarrow M' \setminus \{(c,p)\}$ (remove from mapping)
8. $p' \leftarrow q \mid q \in \arg\min_{r \in P} \mathrm{cost}(c, r)$ (find core with smallest cost)
9. $\mathrm{load}(p') \leftarrow \mathrm{load}(p') + \mathrm{load}(c)$ (add chare load to new core)
10. $M' \leftarrow M' \cup \{(c,p')\}$ (map to new core)
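The steps above can be sketched in Python as follows. This is not the CHARM++ implementation: the data structures (`loads`, `mapping`, `msgs`, `node_of`) are hypothetical, and the NUMA factor is reduced to a single scalar for the whole machine, which is a simplifying assumption.

```python
def numalb_sketch(loads, mapping, msgs, node_of, numa_factor, a):
    """Greedy remapping of chares to cores, heaviest chare first.

    loads: chare -> load; mapping: chare -> core (current mapping M);
    msgs: (src, dst) -> message count; node_of: core -> NUMA node.
    numa_factor is one scalar for the machine (assumption).
    """
    new_map = dict(mapping)                      # M' <- M
    core_load = {p: 0.0 for p in node_of}
    for c, p in mapping.items():
        core_load[p] += loads[c]

    def cost(c, p):
        remote = local = 0
        for (src, dst), count in msgs.items():
            if src != c or dst == c:
                continue
            if node_of[new_map[dst]] == node_of[p]:
                local += count                   # peer on the same NUMA node
            else:
                remote += count                  # peer on another NUMA node
        return core_load[p] + a * (remote * numa_factor - local)

    for c in sorted(loads, key=loads.get, reverse=True):
        p = mapping[c]                           # get its core
        core_load[p] -= loads[c]                 # remove its load
        del new_map[c]                           # remove from mapping
        best = min(sorted(node_of), key=lambda q: cost(c, q))
        core_load[best] += loads[c]              # add load to new core
        new_map[c] = best                        # map to new core
    return new_map


# Hypothetical input: 4 cores on 2 NUMA nodes, all chares start on core 0.
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
loads = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
mapping = {c: 0 for c in loads}
msgs = {("A", "B"): 10, ("B", "A"): 10}
result = numalb_sketch(loads, mapping, msgs, node_of, numa_factor=1.5, a=0.2)
print(result)
```

In this toy input the two communicating chares, A and B, end up on different cores of the same NUMA node, while the chares that do not communicate move to the idle node.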




## **Experimental Setup**

- 2 NUMA machines
- 3 CHARM++ benchmarks
- 4 other CHARM++ load balancers
- Statistical confidence of 95%
  - 5% relative error
  - Student's t-distribution
  - Minimum of 25 executions
- Performance
  - Gains: Average iteration time (baseline = no LB)
  - Costs: Load balancing overhead
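One possible reading of the statistical criteria above is to check, after at least 25 executions, whether the half-width of the 95% confidence interval stays within 5% of the mean. The sketch below shows that check; the helper name, the hard-coded critical value for 24 degrees of freedom, and the sample times are assumptions for illustration only.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 24 degrees of freedom
# (25 samples); hard-coded to keep the sketch dependency-free.
T_975_DF24 = 2.064


def relative_error(samples, t_crit):
    """Half-width of the confidence interval divided by the sample mean."""
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / mean


# Hypothetical iteration times in milliseconds from 25 executions.
times_ms = [100.0 + (i % 5) for i in range(25)]
err = relative_error(times_ms, T_975_DF24)
print(f"relative error: {err:.4f}")
```

If the relative error were above 0.05, more executions would be needed under this reading of the criteria.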

## **Experimental Setup: Machines**

- NUMA16
  - AMD Opteron
  - 8×2 cores @ 2.2 GHz
  - 1 MB private L2 cache
  - 32 GB main memory
  - Low latency for memory access
  - Crossbar
  - NUMA factor: 1.1–1.5



## **Experimental Setup: Machines**

- NUMA32
  - Intel Xeon X7560
  - 4×8 cores @ 2.27 GHz
  - 256 KB private L2
  - 24 MB shared L3
  - 64 GB main memory
  - QuickPath
  - NUMA factor: 1.36–3.6



## **Experimental Setup: Benchmarks**

- kNeighbor
  - Synthetic iterative benchmark where a chare communicates with other k chares at each step
  - Completely I/O bound
  - 200 chares, 16 KB messages, k = 8
- lb\_test
  - Synthetic unbalanced benchmark with different possible communication patterns
  - 200 chares, random communication graph, load between 50 and 200 ms
- jacobi2D
  - Unbalanced two-dimensional five-point stencil
  - 100 chares, 32<sup>2</sup> data array

4/18/2011

## **Experimental Setup: LBs**

- GREEDYLB
  - Iteratively maps the most loaded chares to the least loaded cores
- RECBIPARTLB
  - Recursive bipartition of the communication graph
  - Breadth-first traversal until each group reaches the required load
- MetisLB
  - Graph partitioning algorithms from METIS
- SCOTCHLB
  - Graph partitioning algorithms from SCOTCH
- None of them considers the current chare mapping
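To make the contrast with a mapping-aware strategy concrete, the sketch below shows the greedy idea described for GREEDYLB: the most loaded chares go to the least loaded cores, ignoring where each chare currently runs. It is an illustrative sketch with hypothetical names and data, not the CHARM++ source code.

```python
import heapq


def greedy_map(loads, cores):
    """Assign chares, heaviest first, to the currently least loaded core."""
    heap = [(0.0, p) for p in cores]     # (accumulated load, core)
    heapq.heapify(heap)
    mapping = {}
    for c in sorted(loads, key=loads.get, reverse=True):
        core_load, p = heapq.heappop(heap)
        mapping[c] = p
        heapq.heappush(heap, (core_load + loads[c], p))
    return mapping


# Hypothetical chare loads and a two-core machine.
loads = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
mapping = greedy_map(loads, cores=[0, 1])
print(mapping)
```

Because the previous mapping is never consulted, almost every chare can end up on a different core, which is consistent with the high migration counts such strategies show.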





## Results: kNeighbor




Homogeneous distribution



## Results: lb\_test



## Results: jacobi2D



## Results: jacobi2D - Projections

- jacobi2D on NUMA16
  - 2 steps before LB
  - 4 steps after LB
- The smaller the idle parts, the higher the efficiency

#### **METISLB: 75% efficiency**



#### NUMALB: 93.5% efficiency

[Figure: Projections timeline for PE 0 to PE 15; horizontal axis: time in microseconds.]

### Results: overheads

#### Average number of chares migrated

| Benchmark | Machine | NUMALB | GREEDYLB | METISLB | RECBIPARTLB | SCOTCHLB |
|-----------|---------|--------|----------|---------|-------------|----------|
| kNeighbor | NUMA16  | 25     | 189      | 188     | 176         | 185      |
| kNeighbor | NUMA32  | 57     | 194      | 195     | 185         | 194      |
| lb_test   | NUMA16  | 40     | 188      | 187     | 184         | 184      |
| lb_test   | NUMA32  | 48     | 194      | 194     | 192         | 192      |
| jacobi2D  | NUMA16  | 26     | 94       | 94      | 91          | 93       |
| jacobi2D  | NUMA32  | 33     | 97       | 96      | 93          | 98       |

NUMALB: maximum migrations = 33% of the chares

Other load balancers: minimum migrations = 88% of the chares
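The two percentages can be checked from the table and the chare counts given in the benchmark setup (200 chares for kNeighbor and lb_test, 100 for jacobi2D). The short sketch below redoes that arithmetic; the variable names are only illustrative.

```python
chares = {"kNeighbor": 200, "lb_test": 200, "jacobi2D": 100}

# (benchmark, machine): [NUMALB, GREEDYLB, METISLB, RECBIPARTLB, SCOTCHLB]
migrated = {
    ("kNeighbor", "NUMA16"): [25, 189, 188, 176, 185],
    ("kNeighbor", "NUMA32"): [57, 194, 195, 185, 194],
    ("lb_test", "NUMA16"): [40, 188, 187, 184, 184],
    ("lb_test", "NUMA32"): [48, 194, 194, 192, 192],
    ("jacobi2D", "NUMA16"): [26, 94, 94, 91, 93],
    ("jacobi2D", "NUMA32"): [33, 97, 96, 93, 98],
}

# Largest fraction of chares migrated by NUMALB (first column).
numalb_max = max(row[0] / chares[b] for (b, _), row in migrated.items())

# Smallest fraction of chares migrated by any other load balancer.
others_min = min(v / chares[b]
                 for (b, _), row in migrated.items() for v in row[1:])

print(f"NUMALB max: {numalb_max:.0%}, others min: {others_min:.0%}")
```

The maximum for NUMALB comes from jacobi2D on NUMA32 (33 of 100 chares), and the minimum for the other load balancers comes from RECBIPARTLB with kNeighbor on NUMA16 (176 of 200 chares).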

All load balancers took less than 7 ms to run their algorithms.

#### **Results: migration times for NUMA16**






## Conclusions

- Multi-core machines with NUMA design introduce new challenges for their efficient use
- CHARM++ does not consider NUMA asymmetries
- With our NUMA-aware LB we obtained
  - An average speedup of 1.51 over the baseline
    - Transparent to the user, no previous knowledge
  - 10% improvement over most LBs
  - Migration overheads up to 7 times smaller
    - Migrating at most 33% of all chares

## Future Work

- Multi-core load balancer
  - UMA and NUMA machines
  - Communication latencies among cores
  - Use HWLOC representation of cache hierarchy
- Distributed multi-core load balancer
  - For clusters of multi-core machines
- Gather and organize communication information
  - Latencies, bandwidth
  - Provide this data to other libraries (like SCOTCH)


# Thank you.

#### Laércio Lima Pilla

Contact: llpilla@inf.ufrgs.br













