arxiv: 2508.16809 · v2 · submitted 2025-08-22 · 💻 cs.DC · cs.PF

PICO: Performance Insights for Collective Operations

Pith reviewed 2026-05-18 20:42 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords: collective operations · performance benchmarking · MPI · NCCL · HPC · distributed AI training · reproducible experiments

The pith

Default collective algorithms and transport settings can be up to 5× slower than the best available choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PICO, an open-source framework that separates portable experiment setup from platform-specific execution to enable systematic benchmarking of collective operations. It supplies adaptive interfaces for MPI and NCCL, plain-MPI reference implementations, optional instrumentation, and full system configuration recording for reproducibility. Evaluations across three supercomputers establish large performance gaps between defaults and tuned selections, while simulator replays of open-source LLM training traces quantify the downstream impact of those tuned profiles. A sympathetic reader would care because collective operations underpin scaling in both HPC and large-scale AI, so exposing better configurations offers a practical route to faster execution without new hardware.

Core claim

PICO decouples experiment setup from execution, offers backend-adaptive parameter selection across MPI and NCCL, supplies instrumentable plain-MPI references, and records the full runtime environment. Applied on three major supercomputers, this machinery shows that default collective algorithms and transport settings can be up to 5× slower than the best available choice. When the resulting optimized profiles are replayed in the ATLAHS simulator on open-source LLM training traces, simulated training times drop by up to 44%.

What carries the argument

PICO's decoupling of portable experiment setup from platform execution together with its backend-adaptive parameter selection interface for MPI and NCCL.
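
One way to picture backend-adaptive selection is a table that turns a portable algorithm label into the environment variables each backend reads. The sketch below is an illustration under stated assumptions, not PICO's interface: the function `backend_env` and the portable labels are hypothetical. The variable names are existing Open MPI `coll_tuned` MCA parameters and the NCCL `NCCL_ALGO` variable, but the integer algorithm IDs are assumed from common Open MPI releases and should be verified on the target system.

```python
# Hypothetical mapping from a portable algorithm label to backend-specific
# environment variables. Not PICO's API; names below are for illustration.

# Assumed Open MPI allreduce algorithm IDs; verify them locally with
# `ompi_info --param coll tuned --level 9`.
OMPI_ALLREDUCE_IDS = {
    "recursive_doubling": 3,
    "ring": 4,
    "rabenseifner": 6,
}

# NCCL_ALGO values accepted in its simple form.
NCCL_ALGOS = {"ring": "Ring", "tree": "Tree"}


def backend_env(backend, collective, algorithm):
    """Return the env vars that ask `backend` to use `algorithm`."""
    if backend == "openmpi":
        if collective != "allreduce":
            raise ValueError("this sketch only covers allreduce for Open MPI")
        algo_id = OMPI_ALLREDUCE_IDS[algorithm]
        return {
            # Forced algorithm choices are only honored with dynamic rules on.
            "OMPI_MCA_coll_tuned_use_dynamic_rules": "1",
            "OMPI_MCA_coll_tuned_allreduce_algorithm": str(algo_id),
        }
    if backend == "nccl":
        # In its simple form NCCL_ALGO is not specific to one collective.
        return {"NCCL_ALGO": NCCL_ALGOS[algorithm]}
    raise ValueError(f"unknown backend: {backend}")


print(backend_env("openmpi", "allreduce", "ring"))
print(backend_env("nccl", "allreduce", "ring"))
```

The same portable request ("allreduce with ring") yields different variables per backend, which is the kind of translation a portable experiment description would need before platform-specific execution.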

If this is right

  • Default collective choices in production systems are often far from optimal and can be diagnosed by isolating topology-sensitive algorithmic decisions.
  • Instrumentation inside PICO yields per-algorithm breakdowns that explain where time is lost.
  • Simulator replay of real training traces converts micro-benchmark speedups into concrete application-level time savings.
  • Portable experiment definitions plus configuration capture make performance comparisons reproducible across platforms.
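
Configuration capture, as described in the last bullet, can be approximated by snapshotting the variables that steer the communication stack next to each result. The following is a minimal sketch, not PICO's recorder: the function `capture_environment`, the output field names, and the choice of prefixes are assumptions made for illustration.

```python
import json
import os
import platform
from datetime import datetime, timezone

# Assumption: these prefixes cover the settings most likely to change
# collective behavior (Open MPI MCA parameters, UCX, NCCL, Slurm allocation).
TRACKED_PREFIXES = ("OMPI_MCA_", "UCX_", "NCCL_", "SLURM_")


def capture_environment(environ=None):
    """Return a JSON-serializable snapshot of the runtime environment."""
    environ = os.environ if environ is None else environ
    tracked = {
        key: value
        for key, value in sorted(environ.items())
        if key.startswith(TRACKED_PREFIXES)
    }
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "kernel": platform.release(),
        "env": tracked,
    }


# Example with an explicit environment so the output is predictable.
snapshot = capture_environment({"UCX_MAX_RNDV_RAILS": "2", "HOME": "/home/user"})
print(json.dumps(snapshot["env"], indent=2))
```

Storing such a snapshot with every run makes it possible to tell afterwards whether two results differ because of the algorithm or because a transport setting such as `UCX_MAX_RNDV_RAILS` changed between runs.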

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many running HPC and AI jobs are likely leaving substantial performance on the table simply by accepting library defaults.
  • Embedding PICO-style selection logic into runtime systems could automate better collective choices without user intervention.
  • The same measurement approach could be applied to other communication patterns such as point-to-point or one-sided operations.

Load-bearing premise

The ATLAHS simulator accurately converts measured collective performance gains into end-to-end training time predictions for real LLM workloads.
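
A back-of-the-envelope model shows why this premise carries so much weight. Suppose a fraction f of each training step is communication that is not hidden behind compute, and collectives become s times faster; the step then shrinks by f · (1 − 1/s). The sketch below assumes this simple exposed-fraction model, which is not how ATLAHS works internally, and the values of f and s are illustrative inputs rather than measurements from the paper.

```python
def time_reduction(exposed_comm_fraction, collective_speedup):
    """Fractional step-time reduction under an exposed-communication model."""
    if not 0.0 <= exposed_comm_fraction <= 1.0:
        raise ValueError("exposed_comm_fraction must be in [0, 1]")
    if collective_speedup <= 0.0:
        raise ValueError("collective_speedup must be positive")
    return exposed_comm_fraction * (1.0 - 1.0 / collective_speedup)


def required_exposed_fraction(target_reduction, collective_speedup):
    """Exposed fraction needed to reach `target_reduction` at a given speedup."""
    return target_reduction / (1.0 - 1.0 / collective_speedup)


# Illustrative sweep: a uniform 5x collective speedup at different overlaps.
for f in (0.2, 0.4, 0.6):
    print(f"exposed fraction {f:.1f}: {time_reduction(f, 5.0):.0%} shorter step")
```

Under this toy model, a uniform 5× speedup produces a 44% reduction only if roughly 55% of step time is exposed communication (0.44 / 0.8 = 0.55). The paper's "up to 5×" and "up to 44%" figures come from different experiments, so the combination is illustrative only, but it shows how sensitive the end-to-end number is to how much overlap the simulator assumes.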

What would settle it

Run the same open-source LLM training traces on actual hardware once with default collectives and once with the PICO-identified optimized profiles, then compare measured wall-clock training times against the simulator predictions.

Figures

Figures reproduced from arXiv: 2508.16809 by Daniele De Sensi, Lorenzo Piarulli, Marco Canini, Saverio Pasqualoni, Tommaso Bonato, Torsten Hoefler.

Figure 2
Figure 2 shows one of the required configuration files, in this case the one describing the communication libraries. As shown, the environment configuration files contain part of the metadata relevant to a benchmark, which will be stored alongside the tests' results themselves to improve reproducibility.
Figure 1. High-level architecture and data flow in …
Figure 3. TUI flow across the creation of the test JSON.
Adjacent paper text (3.1.3 Benchmark Orchestration): The execution of benchmarks in PICO is managed by a dedicated orchestration script that automates all steps from configuration parsing to result collection. The process begins by reading the benchmark configuration from the test.json file and the environment description from the env.json files. Based on these inputs, the orchest…
Figure 4. results/ directory structure.
Adjacent paper text (3.3.1 Directory Structure Overview): Each benchmark execution produces a dedicated output directory under the root results/ folder. As illustrated in …
Figure 5. Tracer estimation for a 128 nodes Leonardo job.
Figure 7. Median performance gain of the default algorithm …
Figure 9. Reduce-scatter benchmark run results on Leonardo comparing distance halving and distance doubling algorithms with a 128 node allocation.
Adjacent paper text (4.2 Algorithmic Differences): Cost models are a common tool for reasoning about the expected performance of collective communication algorithms. The most widely used cost model for collective operations is the $\alpha, \beta, \gamma$ model: $A\alpha + B\beta + C\gamma$ (1)
Figure 10. Impact of UCX_MAX_RNDV_RAILS on large-scale 2048 nodes Allreduce benchmark. Rabenseifner algorithm.
Adjacent paper text (4.3 Network Layer Parameters): Many factors in network communication libraries are often overlooked when running collective benchmarks, yet they can have a decisive impact on results and, if uncontrolled, can hinder test reproducibility. One example is UCX_MAX_RNDV_RAILS, a UCX runtime parameter that contro…
Figure 11. Performance impact of data movement and reduc…
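
The α, β, γ cost model quoted in the Figure 9 excerpt can be made concrete with a short calculation. In the usual reading, α is the per-message latency, β the per-byte transfer time, and γ the per-byte reduction cost, while A, B, and C count messages, bytes sent, and bytes reduced. The sketch below uses textbook allreduce coefficients for a power-of-two number of ranks; these coefficients and the parameter values are assumptions for illustration and are not taken from the paper.

```python
import math


def allreduce_coefficients(algorithm, p, n):
    """Textbook (A, B, C) for an allreduce of n bytes over p ranks (p = 2^k)."""
    steps = math.log2(p)
    frac = (p - 1) / p
    if algorithm == "recursive_doubling":
        return steps, n * steps, n * steps
    if algorithm == "ring":
        return 2 * (p - 1), 2 * frac * n, frac * n
    if algorithm == "rabenseifner":
        return 2 * steps, 2 * frac * n, frac * n
    raise ValueError(f"unknown algorithm: {algorithm}")


def predicted_time(algorithm, p, n, alpha, beta, gamma):
    """Evaluate A*alpha + B*beta + C*gamma for one algorithm."""
    a, b, c = allreduce_coefficients(algorithm, p, n)
    return a * alpha + b * beta + c * gamma


def best_algorithm(p, n, alpha, beta, gamma):
    """Name of the algorithm with the lowest predicted time."""
    names = ("recursive_doubling", "ring", "rabenseifner")
    return min(names, key=lambda name: predicted_time(name, p, n, alpha, beta, gamma))


# Illustrative parameters: 1 us latency, ~10 GB/s links, cheap reduction.
ALPHA, BETA, GAMMA = 1e-6, 1e-10, 1e-11
for size in (8, 64 * 1024, 64 * 1024 * 1024):
    print(size, best_algorithm(128, size, ALPHA, BETA, GAMMA))
```

With these assumed parameters the model favors the latency-optimal algorithm for tiny messages and a bandwidth-optimal one for large messages. The model has no notion of topology or locality, which is one reason measured rankings can diverge from what the model predicts.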
Original abstract

Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies plain-MPI reference collective implementations, optionally instrumentable, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to $5\times$ slower than the best available choice. It provides diagnostic evidence by isolating topology sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to $44\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PICO, an open-source framework that decouples portable experiment setup from platform execution for benchmarking collective operations. It supports MPI and NCCL backends, provides plain-MPI reference implementations, records system configurations for reproducibility, and offers diagnostic profiling. Evaluations on three supercomputers show default collective algorithms and transport settings can be up to 5× slower than optimized choices, with evidence on topology sensitivity. To assess end-to-end impact, the authors replay open-source LLM training traces in the ATLAHS simulator after substituting PICO-identified optimized collective profiles, reporting training time reductions of up to 44%.

Significance. If the simulator-based end-to-end results are validated, the work provides concrete evidence that collective tuning can yield substantial gains in large-scale AI training, complementing the direct benchmarking data from multiple production systems. The framework's emphasis on portability, instrumentation, and reproducibility is a practical contribution to the HPC and distributed systems community. The open-source release and focus on diagnostic breakdowns strengthen the utility for practitioners.

major comments (1)
  1. [Evaluation with simulator replays] In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.
minor comments (2)
  1. [Abstract and Evaluation] The abstract and evaluation sections should report the number of runs averaged, presence of error bars or statistical measures, and how post-hoc algorithm selection was performed to support the 'up to 5×' and 'up to 44%' claims.
  2. [Framework description] Clarify the exact interface and automation level of the 'backend-adaptive parameter selection' for MPI versus NCCL to aid reproducibility by readers.
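
The reporting requested in minor comment 1 could take the following shape: a median-based speedup, a bootstrap confidence interval over repeated runs, and an explicit post-hoc selection step. This is a sketch using only the Python standard library; the function names and the timing values are synthetic placeholders, not data or methodology from the paper.

```python
import random
import statistics


def median_speedup(default_times, tuned_times):
    """Ratio of median runtimes; values above 1 mean the tuned choice is faster."""
    return statistics.median(default_times) / statistics.median(tuned_times)


def bootstrap_ci(default_times, tuned_times, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median speedup."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        d = rng.choices(default_times, k=len(default_times))
        t = rng.choices(tuned_times, k=len(tuned_times))
        ratios.append(statistics.median(d) / statistics.median(t))
    ratios.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return ratios[lo_idx], ratios[hi_idx]


def select_best(times_by_algorithm):
    """Post-hoc selection: the algorithm with the lowest median runtime."""
    return min(times_by_algorithm, key=lambda name: statistics.median(times_by_algorithm[name]))


# Synthetic runtimes in milliseconds, for illustration only.
runs = {
    "default": [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.6, 5.0],
    "ring": [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.95, 1.05],
}
best = select_best(runs)
low, high = bootstrap_ci(runs["default"], runs[best])
print(best, round(median_speedup(runs["default"], runs[best]), 2), (round(low, 2), round(high, 2)))
```

Picking the winner and reporting its speedup from the same runs tends to bias the speedup upward, so a held-out set of repetitions for the final number would be a more conservative design.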

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the clarity of our evaluation. We address the major comment below.

Point-by-point responses
  1. Referee: In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.

    Authors: We agree that the simulator-based results rely on the fidelity of ATLAHS in propagating isolated collective improvements. The simulator replays are intended to provide indicative estimates of end-to-end impact rather than definitive measurements, using publicly available traces and PICO-derived performance profiles. We did not conduct direct cross-validation against full-scale LLM training runs on the target systems, as such experiments would require prohibitive resources and coordinated access beyond the scope of this benchmarking-focused study. The ATLAHS framework is documented in its reference publication to model communication-computation overlap and trace replay with reasonable accuracy for distributed training workloads. In revision we will add an explicit limitations paragraph in the evaluation section that (1) states the absence of direct hardware cross-validation, (2) summarizes the simulator's modeling assumptions and known limitations with respect to hardware-specific effects not present in micro-benchmarks, and (3) qualifies the reported 44% figure as an upper-bound estimate under the simulator's replay model. These changes will make the scope of the claim transparent without altering the core contribution of the PICO framework itself.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking and simulator replay

The paper presents PICO as an empirical benchmarking framework that measures collective performance on real supercomputers and then replays traces in the external ATLAHS simulator using the measured profiles. No mathematical derivation, closed-form prediction, or first-principles result is claimed; the 5× slowdown and 44% training-time figures are obtained directly from instrumentation and trace replay rather than by fitting parameters to the target quantity and renaming the fit as a prediction. The evaluation chain therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on standard assumptions about collective operation semantics and simulator fidelity rather than introducing new fitted parameters or invented physical entities.

axioms (1)
  • Domain assumption: collective operations produce consistent timing behavior across repeated runs under controlled conditions.
    Required for reproducible benchmarking and diagnostic isolation of algorithmic choices.

pith-pipeline@v0.9.0 · 5752 in / 1231 out tokens · 29702 ms · 2026-05-18T20:42:15.788360+00:00 · methodology
