Analyzing network health and congestion in dragonfly-based systems
IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016
Publication Type: Paper
Abstract
The dragonfly topology is a popular choice for building high-radix, low-diameter, hierarchical networks with high-bandwidth links. On Cray installations of the dragonfly network, job placement policies and routing inefficiencies can lead to significant network congestion for both single-job and multi-job workloads. In this paper, we explore the effects of job placement, parallel workloads, and network configurations on network health to develop a better understanding of inter-job interference. We have developed a functional network simulator, Damselfly, to model the network behavior of Cray Cascade, and a visual analytics tool, DragonView, to analyze the simulation output. We simulate several parallel workloads based on five representative communication patterns on up to 131,072 cores. Our simulations and visualizations provide unique insight into the buildup of network congestion and present a trade-off between deployment dollar costs and performance of the network.
People
- Abhinav Bhatele
- Nikhil Jain
- Yarden Livnat
- Valerio Pascucci
- Peer-Timo Bremer
Research Areas