arXiv: 2604.11432 · v1 · submitted 2026-04-13 · 💻 cs.DC

Characterizing the Impact of Congestion in Modern HPC Interconnects

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 💻 cs.DC
keywords HPC interconnects · network congestion · InfiniBand · Cray Slingshot · bursty traffic · collective communication · AI workloads · congestion characterization

The pith

Modern HPC interconnect fabrics exhibit distinct scale-dependent responses to both steady and bursty congestion patterns typical of AI workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how congestion arises and propagates in current high-performance computing networks when they carry mixed simulation and AI training traffic. It applies both constant heavy loads and controlled intermittent bursts that vary in length, strength, and gaps between them to five different fabrics. The study tracks how these conditions affect collective communication operations and how the effects change as the number of nodes grows. The resulting observations are meant to set realistic expectations for application performance and to highlight where congestion-control and load-balancing improvements would be most useful.

Core claim

Across EDR, HDR, and NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics, congestion behavior is not uniform: each fabric shows its own sensitivity to burst duration, intensity, and pause intervals, and these sensitivities become more pronounced at larger system scales, directly influencing the completion time of collective operations.

What carries the argument

Controlled injection of steady congestion and parameterized bursty traffic patterns (varying duration, intensity, and pause length) applied at multiple system sizes to measure fabric-specific responses in collective communication performance.
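
A minimal sketch of how such a burst parameterization could be expressed, assuming a simple on/off aggressor. The names `Burst`, `burst_schedule`, and `duty_cycle`, the use of seconds, and the reading of intensity as a fraction of link bandwidth are all illustrative assumptions; the paper's actual injection tool is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Burst:
    start_s: float      # time at which the aggressor starts sending
    end_s: float        # time at which the aggressor goes silent
    intensity: float    # assumed meaning: fraction of link bandwidth, 0..1


def burst_schedule(duration_s, pause_s, intensity, total_s):
    """Return the bursts that start within [0, total_s).

    Hypothetical helper: each cycle is `duration_s` of traffic followed by
    `pause_s` of silence. A pause of 0 degenerates to steady congestion.
    """
    if duration_s <= 0 or pause_s < 0:
        raise ValueError("duration must be positive and pause non-negative")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity is modeled as a fraction in [0, 1]")
    bursts = []
    period = duration_s + pause_s
    k = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while k * period < total_s:
        start = k * period
        bursts.append(Burst(start, min(start + duration_s, total_s), intensity))
        k += 1
    return bursts


def duty_cycle(duration_s, pause_s):
    """Fraction of time the aggressor is active."""
    return duration_s / (duration_s + pause_s)
```

With `duration_s=2` and `pause_s=3` the aggressor is active 40% of the time; setting `pause_s=0` reproduces the steady-congestion case as back-to-back bursts.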

If this is right

  • Collective performance models must incorporate fabric-specific and scale-dependent congestion terms rather than assuming uniform behavior.
  • Congestion-control algorithms should be tuned differently for short intense bursts versus long sustained loads on each fabric type.
  • Load-balancing strategies for mixed workloads can be refined by using the observed pause-length and intensity thresholds that trigger performance drops.
  • Ethernet-based designs aligned with emerging standards display congestion traits that can be compared directly against proprietary fabrics for future procurement decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If production AI workloads are dominated by short, high-intensity bursts separated by pauses, then fabrics that recover quickly from such bursts will deliver measurably shorter training times at scale.
  • The scale-dependent effects suggest that network architects may need to provision more adaptive routing or buffering as system sizes exceed current test configurations.
  • Direct comparison of the five fabrics under identical burst parameters provides a baseline that future congestion-mitigation proposals can be measured against.

Load-bearing premise

The chosen steady and bursty traffic patterns accurately reproduce the congestion that real production AI training and simulation workloads create on large systems.

What would settle it

Running the same collective operations inside an actual large-scale AI training job on one of the tested fabrics and comparing the measured congestion durations, intensities, and resulting slowdowns against the values recorded in the controlled experiments.

Figures

Figures reproduced from arXiv: 2604.11432 by Aldo Artigiani, Dancheng Zhang, Daniele De Sensi, Dirk Pleiter, Francesco Iannone, Karthee Sivalingam, Kexue Zhao, Lorenzo Piarulli, Marco Faltelli, Matteo Turisini.

Figure 1: Comparison of time distribution between AlltoAll and AllReduce operations.
    … to measure and analyze the effects of congestion. Attackers are provided with two types of collectives for noise injection: AlltoAll and Incast. The first is used to send as many messages as possible to all nodes, creating a general state of network noise. The latter, instead, focuses the traffic on a single node, attempting to gene…
Figure 2: Bursty congestion injection visualization, bursty aggressor on the bottom and victim on the top.
    E. Evaluation Environments. The fabrics have been evaluated under many different architectures, node counts, and topologies. This allowed us to understand congestion under a variety of scenarios. The systems considered were CINECA’s Leonardo, ENEA’s CRESCO8, LUMI, Huawei AI and Computing at Goethe University (…
Figure 3: 4 nodes HAICGU sawtooth behavior on 128 MiB messages with …
Figure 4: Steady NSLB analysis in an AlltoAll congestion with 4 victims and 4 aggressor nodes.
    … 7.39% of Leonardo’s Booster partition, 8.60% of LUMI’s GPU partition, and 33.68% of the CRESCO8 CPU partition. These results are shown in …
Figure 5: Ratio between uncongested and congested runtimes on CRESCO8, Leonardo and LUMI, from 16 to 256 nodes, and vectors ranging from 8 …
Figure 6: Ratio between uncongested and congested runtime of 512 bytes, 32KiB and 2MiB
Figure 7: Ratio between uncongested and congested runtime. 128 nodes
Figure 8: Ratio between uncongested and congested runtime. 256 nodes
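
The metric named in the captions of Figures 5 to 8 can be sketched as follows. The captions do not state the direction of the ratio or how repeated runs are aggregated, so this example assumes uncongested over congested and uses the median; the function name `runtime_ratio` and the sample values are hypothetical and are not measurements from the paper.

```python
from statistics import median


def runtime_ratio(uncongested_s, congested_s):
    """Median uncongested runtime divided by median congested runtime.

    Assumed direction: 1.0 means congestion had no effect, 0.5 means the
    collective took twice as long under congestion.
    """
    if not uncongested_s or not congested_s:
        raise ValueError("both runtime samples must be non-empty")
    return median(uncongested_s) / median(congested_s)


# Hypothetical runtimes in seconds, not measurements from the paper.
baseline = [1.00, 1.02, 0.98]
under_load = [2.10, 1.90, 2.00]
ratio = runtime_ratio(baseline, under_load)
```

Under this convention a lower ratio means a larger slowdown, so curves that fall as node count grows would indicate scale-dependent congestion sensitivity.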
read the original abstract

High-performance computing (HPC) systems increasingly support both scalable AI training and large-scale simulation workloads. Both typically rely heavily on collective communication operations. On modern supercomputers, however, network congestion has emerged as a major limitation, driven by heterogeneous traffic patterns resulting from diverse workload mixes. As system scale and active users continue to grow, understanding how today's interconnect technologies respond to congestion is essential for establishing realistic performance expectations and informing future system design. This paper presents a comprehensive characterization of congestion behavior across four major HPC fabrics: EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics. These fabrics span high-performance proprietary interconnects as well as adaptive Ethernet-based designs aligned with emerging standards such as Ultra Ethernet. We evaluate their responses to both steady congestion and a wide range of bursty patterns that vary in duration, intensity, and pause length, capturing the bursty communication typical of AI workloads. Our study covers multiple scales, examining how congestion manifests differently as system size increases and identifying scale-dependent behaviors that influence collective performance. By analyzing the challenges that arise under these controlled stress conditions, we aim to provide a practical overview of congestion issues and possible optimizations. The insights derived from this evaluation can guide researchers and HPC architects in designing more effective congestion-control mechanisms and network load-balancing strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents a comprehensive empirical characterization of congestion behavior across EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics. It evaluates responses to steady congestion as well as bursty patterns that vary in duration, intensity, and pause length, at multiple system scales, with the aim of capturing traffic typical of AI workloads and informing congestion-control mechanisms and load-balancing strategies.

Significance. If the experimental design and results hold, the work would be significant for providing a broad, multi-fabric comparison of how modern HPC interconnects handle both steady and bursty congestion. The multi-scale analysis and focus on patterns relevant to AI training could offer practical guidance for system design and optimization. The empirical approach using real hardware across proprietary and standards-aligned fabrics is a clear strength.

major comments (2)
  1. Abstract and experimental methodology section: The abstract describes a broad campaign but provides no details on measurement methodology, error bars, statistical significance, or the process for selecting burst parameters; without these, it is impossible to verify whether the reported congestion behaviors are reliable or reproducible.
  2. Abstract: The claim that the controlled bursty patterns 'capture the bursty communication typical of AI workloads' rests on an unverified assumption that independent variation of duration, intensity, and pause length reproduces the correlated, phase-locked all-reduce traffic (identical message sizes, simultaneous incast from thousands of ranks) found in production AI training; this mismatch risks altering queue buildup, credit starvation, and adaptive routing in ways not captured by the steady/bursty matrix.
minor comments (3)
  1. Results sections: Specify the exact quantitative metrics (e.g., latency increase, throughput degradation, packet loss) used to characterize congestion impact for each fabric and pattern.
  2. Figure captions: Include details on the specific burst parameter values plotted and any normalization applied to enable direct comparison across scales and fabrics.
  3. Introduction: Expand the discussion of how the selected fabrics align with emerging Ultra Ethernet standards to strengthen the forward-looking claims.
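
Major comment 2 turns on the gap between independently phased bursts and phase-locked traffic. The toy model below, which is not taken from the paper, counts how many senders are bursting at once in each case. The function `peak_senders`, the integer time slots, and the sender count are hypothetical choices made for illustration only.

```python
import random


def peak_senders(offsets, duration, period, horizon):
    """Largest number of senders bursting in the same time slot.

    Toy model: every sender is 'on' for `duration` slots out of each
    `period`, shifted by its own offset. Time is in abstract integer slots.
    """
    peak = 0
    for t in range(horizon):
        active = sum(1 for off in offsets if (t - off) % period < duration)
        peak = max(peak, active)
    return peak


n_senders, duration, period = 64, 2, 20

# Phase-locked: all senders start together, as in a synchronized collective.
locked = peak_senders([0] * n_senders, duration, period, horizon=period)

# Independent: each sender picks its own phase (seeded for repeatability).
rng = random.Random(0)
offsets = [rng.randrange(period) for _ in range(n_senders)]
independent = peak_senders(offsets, duration, period, horizon=period)
```

In the phase-locked case every sender is active in the same slot, so `locked` equals the number of senders. With independent phases the peak is lower, which is the kind of difference in instantaneous incast pressure the referee is pointing at.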

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and rigor of the manuscript. We address each major comment below and have made revisions to incorporate the suggestions where appropriate.

read point-by-point responses
  1. Referee: Abstract and experimental methodology section: The abstract describes a broad campaign but provides no details on measurement methodology, error bars, statistical significance, or the process for selecting burst parameters; without these, it is impossible to verify whether the reported congestion behaviors are reliable or reproducible.

    Authors: We agree that the abstract and methodology section would benefit from greater specificity to support reproducibility. We have revised the Experimental Methodology section to explicitly describe the measurement approach (high-resolution hardware counters and software timers synchronized across nodes), the computation of error bars (standard deviation across 10 independent runs per configuration), the statistical tests applied (two-tailed t-tests with p < 0.05 threshold for significance), and the burst-parameter selection process (derived from analysis of publicly available AI training traces to span realistic ranges of duration, intensity, and inter-burst pause). A single sentence summarizing these elements has been added to the abstract. revision: yes

  2. Referee: Abstract: The claim that the controlled bursty patterns 'capture the bursty communication typical of AI workloads' rests on an unverified assumption that independent variation of duration, intensity, and pause length reproduces the correlated, phase-locked all-reduce traffic (identical message sizes, simultaneous incast from thousands of ranks) found in production AI training; this mismatch risks altering queue buildup, credit starvation, and adaptive routing in ways not captured by the steady/bursty matrix.

    Authors: The referee is correct that independently varying burst parameters does not reproduce the tightly synchronized, phase-locked all-reduce traffic characteristic of production AI training. Our controlled design was chosen to isolate the individual contributions of duration, intensity, and pause length to congestion phenomena, thereby providing interpretable data that can inform both analytical models and more complex workload studies. We have revised the abstract to state that the patterns 'span a range of bursty behaviors relevant to AI workloads' and have added a dedicated paragraph in the Discussion section that acknowledges the limitation, discusses potential differences in queue dynamics and routing behavior, and outlines how the results remain useful for congestion-control design. revision: partial
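
The statistical treatment described in the first response (standard deviation over 10 runs, two-tailed t-test at p < 0.05) could look roughly like the sketch below. It assumes the pooled-variance form of the two-sample t-test and compares against a tabulated critical value instead of computing a p-value; the rebuttal does not say which variant the authors used, and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed critical value of Student's t for alpha = 0.05 and 18 degrees
# of freedom (two samples of 10 runs each). Other sample sizes need another
# table entry or a statistics library.
T_CRIT_DF18 = 2.101


def summarize(runs):
    """Mean and sample standard deviation, usable as an error bar."""
    return mean(runs), stdev(runs)


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance form)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def differs_at_5_percent(a, b):
    """True if two 10-run samples differ at the two-tailed 5% level."""
    if len(a) != 10 or len(b) != 10:
        raise ValueError("the tabulated critical value assumes 10 runs per sample")
    return abs(pooled_t(a, b)) > T_CRIT_DF18
```

A table lookup keeps the sketch dependency-free; a full analysis would more likely compute exact p-values with a statistics package and might prefer a test that does not assume equal variances.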

Circularity Check

0 steps flagged

No circularity: purely empirical characterization without derivations or fitted predictions

full rationale

The paper conducts direct hardware measurements of congestion responses on multiple interconnect fabrics (EDR/HDR/NDR InfiniBand, Slingshot, Ethernet) under controlled steady and bursty traffic patterns. No equations, models, parameter fits, or first-principles derivations are described; results are reported from experimental runs on external systems. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on observed behavior rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen synthetic traffic patterns are representative of real workloads; no free parameters, mathematical axioms, or invented entities are introduced.

axioms (1)
  • Domain assumption: The selected burst durations, intensities, and pause lengths capture the essential communication behavior of AI training and simulation collectives.
    Invoked in the abstract when stating that the patterns 'capture the bursty communication typical of AI workloads'.


