A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Pith reviewed 2026-05-24 22:51 UTC · model grok-4.3
The pith
This paper provides an end-to-end framework for long-term monitoring of network congestion and uses it to study real conditions in two different petascale interconnects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies and applies it to an empirical study of network congestion in petascale systems across Cray Gemini 3-D torus and Cray Aries DragonFly interconnect technologies.
What carries the argument
End-to-end framework for monitoring and analysis of network congestion in high-speed interconnects.
If this is right
- Congestion control at the network level can be informed by real field data.
- Application placement, mapping, and scheduling at the system level can use actual congestion characteristics.
- Long-term studies of congestion become possible with the provided framework.
- Comparisons between different topologies, such as 3-D torus and DragonFly, can be based on production workloads.
Where Pith is reading between the lines
- The framework might enable similar studies on other interconnect technologies beyond the two examined.
- Real congestion data could lead to revised models of performance variation in supercomputing applications.
- Future interconnect designs could incorporate lessons from the observed patterns in these systems.
Load-bearing premise
Proxy applications and benchmarks are not representative of the congestion characteristics observed in actual high-speed interconnects during field use.
What would settle it
If measurements using the framework show that congestion patterns match those from proxy applications and benchmarks, the motivation for the new approach would be undermined.
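One way to make this test concrete is to compare the distribution of a congestion metric measured in the field with the same metric measured under a proxy workload. The sketch below is illustrative only and is not taken from the paper: it computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs), and the sample values and the names `field_stall_pct` and `proxy_stall_pct` are hypothetical placeholders.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the two-sample Kolmogorov-Smirnov statistic.

    This is the maximum absolute difference between the empirical CDFs
    of the two samples, evaluated at every observed data point.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x (empirical CDF at x).
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical per-link congestion samples (percent of time stalled).
# These numbers are invented for illustration, not measured data.
field_stall_pct = [5.0, 12.0, 30.0, 45.0]
proxy_stall_pct = [1.0, 2.0, 3.0, 4.0]

print(ks_statistic(field_stall_pct, proxy_stall_pct))
```

A value near 0 would mean the two distributions look alike, which is the outcome that would weaken the paper's motivation. A value near 1 would mean the proxy does not reproduce field behavior. A real comparison would also need a significance test and far larger samples.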
Original abstract
Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents (a) an end-to-end framework for monitoring and analysis to enable long-term field studies of network congestion in high-speed interconnects and (b) an empirical study of congestion characteristics on petascale systems using two Cray interconnects: Gemini (3-D torus topology) and Aries (DragonFly topology). The work is motivated by the claim that prior studies rely on proxy applications and benchmarks that fail to capture real field-congestion behavior.
Significance. If the framework and empirical findings hold, the contribution is significant for systems research in high-performance computing. It supplies actual field data from production petascale machines rather than proxies, directly addressing a stated gap in the literature on congestion control and application mapping. The dual-technology comparison (torus vs. DragonFly) provides concrete topology-specific observations that can inform future scheduling and routing work. The empirical focus and provision of a reusable monitoring framework are explicit strengths.
minor comments (2)
- Abstract: the description of the two interconnect technologies could include the specific machine names or node counts to allow readers to assess scale immediately.
- The framework description would benefit from an explicit statement of measurement overhead and intrusiveness, as this directly affects suitability for long-term field studies.
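As a rough illustration of the second comment, the data volume that a counter-based monitor generates can be estimated from its sampling configuration. This is a back-of-envelope sketch, not the paper's method, and every parameter value below is a hypothetical placeholder rather than a figure from the paper.

```python
SECONDS_PER_DAY = 86_400


def daily_monitoring_volume_bytes(num_links, counters_per_link,
                                  bytes_per_sample, interval_s):
    """Estimate raw bytes collected per day by periodic counter sampling."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    samples_per_day = SECONDS_PER_DAY // interval_s
    return num_links * counters_per_link * bytes_per_sample * samples_per_day


# Hypothetical configuration, chosen only to show the arithmetic.
volume = daily_monitoring_volume_bytes(
    num_links=1000,        # assumed number of monitored links
    counters_per_link=4,   # assumed counters read per link
    bytes_per_sample=8,    # assumed 64-bit counter values
    interval_s=60,         # assumed one sample per minute
)
print(f"{volume / 1e6:.2f} MB per day")
```

An estimate of this kind covers storage volume only. Intrusiveness on the network and on compute nodes would have to be measured separately.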
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The provided summary accurately reflects the paper's focus on an end-to-end monitoring framework and the empirical characterization of congestion on production petascale systems using Gemini and Aries interconnects.
Circularity Check
Empirical framework and field study with no derivation chain
The paper presents an end-to-end monitoring framework and an empirical characterization of congestion on Gemini and Aries interconnects. No equations, fitted parameters, predictions, or mathematical derivations appear in the abstract or described contributions. The central claim is the delivery of real-world data collection and analysis to address the stated motivation that proxies are unrepresentative; this is a direct empirical contribution rather than a reduction of any result to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The derivation chain is empty, consistent with an observational systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Proxy applications and benchmarks are not representative of field-congestion characteristics of high-speed interconnects.