PICO: Performance Insights for Collective Operations

Daniele De Sensi; Lorenzo Piarulli; Marco Canini; Saverio Pasqualoni; Tommaso Bonato; Torsten Hoefler

arXiv: 2508.16809 · v2 · submitted 2025-08-22 · 💻 cs.DC · cs.PF

Saverio Pasqualoni, Tommaso Bonato, Lorenzo Piarulli, Torsten Hoefler, Marco Canini, Daniele De Sensi

Pith reviewed 2026-05-18 20:42 UTC · model grok-4.3

keywords: collective operations · performance benchmarking · MPI · NCCL · HPC · distributed AI training · reproducible experiments

The pith

Default collective algorithms and transport settings can be up to 5× slower than the best available choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PICO, an open-source framework that separates portable experiment setup from platform-specific execution to enable systematic benchmarking of collective operations. It supplies adaptive interfaces for MPI and NCCL, plain-MPI reference implementations, optional instrumentation, and full system configuration recording for reproducibility. Evaluations across three supercomputers establish large performance gaps between defaults and tuned selections, while simulator replays of open-source LLM training traces quantify the downstream impact of those tuned profiles. A sympathetic reader would care because collective operations underpin scaling in both HPC and large-scale AI, so exposing better configurations offers a practical route to faster execution without new hardware.

Core claim

PICO decouples experiment setup from execution, offers backend-adaptive parameter selection across MPI and NCCL, supplies instrumentable plain-MPI references, and records the full runtime environment. Applied on three major supercomputers, this machinery shows that default collective algorithms and transport settings can be up to 5× slower than the best available choice. Replaying the resulting optimized profiles in the ATLAHS simulator on open-source LLM training traces yields reductions in training time of up to 44%.

What carries the argument

PICO's decoupling of portable experiment setup from platform execution together with its backend-adaptive parameter selection interface for MPI and NCCL.

If this is right

Default collective choices in production systems are often far from optimal and can be diagnosed by isolating topology-sensitive algorithmic decisions.
Instrumentation inside PICO yields per-algorithm breakdowns that explain where time is lost.
Simulator replay of real training traces converts micro-benchmark speedups into concrete application-level time savings.
Portable experiment definitions plus configuration capture make performance comparisons reproducible across platforms.
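
The configuration-capture point can be illustrated with a minimal sketch. This is not PICO's recorder or file format: the function name `capture_environment`, the prefix list, and the JSON layout are assumptions chosen for illustration. The sketch snapshots a few platform facts plus any environment variables whose names start with the given prefixes, so that a result file can carry the settings it was produced under.

```python
import json
import os
import platform

# Hypothetical prefixes: variables that commonly steer communication libraries.
DEFAULT_PREFIXES = ("UCX_", "NCCL_", "OMPI_")


def capture_environment(environ=None, prefixes=DEFAULT_PREFIXES):
    """Return a JSON-serializable snapshot of the runtime environment."""
    environ = os.environ if environ is None else environ
    # Keep only variables whose names start with one of the prefixes.
    tuned = {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}
    return {
        "hostname": platform.node(),
        "system": platform.system(),
        "python": platform.python_version(),
        "library_env": tuned,
    }


if __name__ == "__main__":
    # Illustrative input; a real run would pass nothing and read os.environ.
    snapshot = capture_environment({"UCX_MAX_RNDV_RAILS": "2", "HOME": "/home/user"})
    print(json.dumps(snapshot["library_env"], indent=2))
```

Storing such a snapshot next to each result is one simple way to make two runs comparable after the fact; what a real framework records is likely broader (library versions, modules, job allocation).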

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many running HPC and AI jobs are likely leaving substantial performance on the table simply by accepting library defaults.
Embedding PICO-style selection logic into runtime systems could automate better collective choices without user intervention.
The same measurement approach could be applied to other communication patterns such as point-to-point or one-sided operations.

Load-bearing premise

The ATLAHS simulator accurately converts measured collective performance gains into end-to-end training time predictions for real LLM workloads.

What would settle it

Run the same open-source LLM training traces on actual hardware once with default collectives and once with the PICO-identified optimized profiles, then compare measured wall-clock training times against the simulator predictions.
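
The comparison described above reduces to two small calculations, sketched here under stated assumptions. The function names are hypothetical, the wall-clock times in the example are invented for illustration and are not results from the paper; only the 44% figure comes from the source, where it is an upper bound reported by the simulator.

```python
def relative_reduction(baseline_s, tuned_s):
    """Fraction of time saved by the tuned run relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_s - tuned_s) / baseline_s


def prediction_gap(predicted_reduction, measured_reduction):
    """Signed gap between simulated and measured savings (positive = over-estimate)."""
    return predicted_reduction - measured_reduction


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    measured = relative_reduction(baseline_s=1000.0, tuned_s=700.0)
    gap = prediction_gap(predicted_reduction=0.44, measured_reduction=measured)
    print(f"measured reduction: {measured:.2%}, simulator gap: {gap:.2%}")
```

A gap near zero across several traces would support the simulator's predictions; a consistently positive gap would suggest that effects such as compute-communication overlap are not fully captured.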

Figures

Figures reproduced from arXiv: 2508.16809 by Daniele De Sensi, Lorenzo Piarulli, Marco Canini, Saverio Pasqualoni, Tommaso Bonato, Torsten Hoefler.

**Figure 2.** One of the required configuration files, in this case the one describing the communication libraries. The environment configuration files contain part of the metadata relevant to a benchmark, which is stored alongside the test results to improve reproducibility.

**Figure 1.** High-level architecture and data flow in…

**Figure 3.** TUI flow across the creation of the test JSON.

3.1.3 Benchmark Orchestration. The execution of benchmarks in PICO is managed by a dedicated orchestration script that automates all steps from configuration parsing to result collection. The process begins by reading the benchmark configuration from the test.json file and the environment description from the env.json files. Based on these inputs, the orchest…

**Figure 4.** results/ directory structure.

3.3.1 Directory Structure Overview. Each benchmark execution produces a dedicated output directory under the root results/ folder. As illustrated in…

**Figure 5.** Tracer estimation for a 128-node Leonardo job.

**Figure 7.** Median performance gain of the default algorithm…

**Figure 9.** Reduce-scatter benchmark results on Leonardo comparing distance-halving and distance-doubling algorithms with a 128-node allocation.

4.2 Algorithmic Differences. Cost models are a common tool for reasoning about the expected performance of collective communication algorithms. The most widely used cost model for collective operations is the $\alpha$-$\beta$-$\gamma$ model:

$$A\alpha + B\beta + C\gamma \tag{1}$$
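
The cost model quoted above can be made concrete with a short sketch. The excerpt is cut off before it defines the coefficients, so the reading used here is the common textbook one and should be treated as an assumption: $\alpha$ is the per-message latency, $\beta$ the per-byte transfer time, $\gamma$ the per-byte reduction cost, and $A$, $B$, $C$ count messages, bytes sent, and bytes reduced. The two allreduce formulas are the standard ones for ring and recursive doubling with a power-of-two process count. The numeric parameter values are hypothetical and not taken from the paper.

```python
import math


def cost(a_terms, b_terms, c_terms, alpha, beta, gamma):
    """Evaluate A*alpha + B*beta + C*gamma."""
    return a_terms * alpha + b_terms * beta + c_terms * gamma


def ring_allreduce(p, n, alpha, beta, gamma):
    """Ring allreduce on p ranks, n bytes: 2(p-1) steps of n/p bytes each."""
    frac = (p - 1) / p
    return cost(2 * (p - 1), 2 * frac * n, frac * n, alpha, beta, gamma)


def recursive_doubling_allreduce(p, n, alpha, beta, gamma):
    """Recursive doubling on p ranks (p a power of two): log2(p) steps of n bytes."""
    steps = math.log2(p)
    return cost(steps, steps * n, steps * n, alpha, beta, gamma)


if __name__ == "__main__":
    # Hypothetical machine parameters (seconds, seconds/byte, seconds/byte).
    alpha, beta, gamma = 1e-6, 1e-10, 1e-11
    for n in (8, 10**9):
        ring = ring_allreduce(128, n, alpha, beta, gamma)
        rd = recursive_doubling_allreduce(128, n, alpha, beta, gamma)
        print(f"n={n} bytes: ring={ring:.6f}s recursive_doubling={rd:.6f}s")
```

Under these assumed parameters the latency term dominates for small messages, which favors the algorithm with fewer steps, while the bandwidth term dominates for large messages, which favors the one that moves fewer bytes. That crossover is why the best algorithm depends on message size and scale.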

**Figure 10.** Impact of UCX_MAX_RNDV_RAILS on a large-scale 2048-node Allreduce benchmark using the Rabenseifner algorithm.

4.3 Network Layer Parameters. Many factors in network communication libraries are often overlooked when running collective benchmarks, yet they can have a decisive impact on results and, if uncontrolled, can hinder test reproducibility. One example is UCX_MAX_RNDV_RAILS, a UCX runtime parameter that contro…
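
Controlling such a parameter explicitly is straightforward to sketch. The helper below is hypothetical and is not part of PICO: it builds one environment per candidate value of a named variable so that every run states its setting instead of inheriting whatever the shell provides. The candidate values and the stand-in command are assumptions for illustration; the excerpt is truncated before it says which values were evaluated.

```python
import os
import subprocess
import sys


def sweep_environments(name, values, base_env=None):
    """Yield (value, env) pairs; each env is a copy of base_env with name set."""
    base = dict(os.environ if base_env is None else base_env)
    for value in values:
        env = dict(base)
        env[name] = str(value)  # environment values must be strings
        yield value, env


if __name__ == "__main__":
    # Stand-in command that echoes the variable; replace with a real launcher.
    command = [
        sys.executable,
        "-c",
        "import os; print('rails =', os.environ['UCX_MAX_RNDV_RAILS'])",
    ]
    for rails, env in sweep_environments("UCX_MAX_RNDV_RAILS", [1, 2, 4]):
        subprocess.run(command, env=env, check=True)
```

Recording the swept value together with each result keeps the setting from becoming a hidden variable when runs are compared later.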

**Figure 11.** Performance impact of data movement and reduc…

Original abstract

Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies plain-MPI reference collective implementations, optionally instrumentable, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to $5\times$ slower than the best available choice. It provides diagnostic evidence by isolating topology sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to $44\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PICO provides a practical open-source benchmarking framework for collectives with concrete hardware evidence of suboptimal defaults, but its end-to-end training time claims rely on unvalidated simulator replays.

Hi colleague,

The punchline on this one is that PICO gives you a better way to benchmark and select collective algorithms and configs than what most people do by default. It shows real speedups on actual machines, but the big training time savings number comes from a simulator.

What the paper does well is build a framework that separates the portable experiment setup from the platform-specific execution. This should make it easier to run the same tests across different systems. They support both MPI and NCCL backends with adaptive parameter selection, include plain MPI reference implementations that can be instrumented, and capture the full system configuration for reproducibility. That's more than just another timing script.

On the evaluation side, running on three major supercomputers and finding that defaults can be up to 5 times slower is solid evidence. The diagnostic part, isolating topology-sensitive choices and using instrumentation to break down the algorithms, adds real insight into why some configs perform better.

The soft spot is in the application-level impact section. They take open-source LLM training traces and replay them in the ATLAHS simulator after swapping in the better collective profiles found by PICO. This leads to the claim of up to 44% reduction in training times. The issue is that there's no direct check of whether the simulator's predictions match what would happen if you actually ran the training with those tuned collectives on the hardware. Things like compute-communication overlap and hardware-specific behaviors might not translate perfectly from the microbenchmarks.

For readers, this is targeted at folks in the distributed systems and HPC performance community who care about optimizing communication in large-scale training or scientific computing. If you're building tools for collective tuning or just want data on current systems, the framework and results are worth a look.

I think it deserves to go through peer review. The core contribution is practical and the hardware data is there to evaluate. Recommendation: send it for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces PICO, an open-source framework that decouples portable experiment setup from platform execution for benchmarking collective operations. It supports MPI and NCCL backends, provides plain-MPI reference implementations, records system configurations for reproducibility, and offers diagnostic profiling. Evaluations on three supercomputers show default collective algorithms and transport settings can be up to 5× slower than optimized choices, with evidence on topology sensitivity. To assess end-to-end impact, the authors replay open-source LLM training traces in the ATLAHS simulator after substituting PICO-identified optimized collective profiles, reporting training time reductions of up to 44%.

Significance. If the simulator-based end-to-end results are validated, the work provides concrete evidence that collective tuning can yield substantial gains in large-scale AI training, complementing the direct benchmarking data from multiple production systems. The framework's emphasis on portability, instrumentation, and reproducibility is a practical contribution to the HPC and distributed systems community. The open-source release and focus on diagnostic breakdowns strengthen the utility for practitioners.

major comments (1)

[Evaluation with simulator replays] In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.

minor comments (2)

[Abstract and Evaluation] The abstract and evaluation sections should report the number of runs averaged, presence of error bars or statistical measures, and how post-hoc algorithm selection was performed to support the 'up to 5×' and 'up to 44%' claims.
[Framework description] Clarify the exact interface and automation level of the 'backend-adaptive parameter selection' for MPI versus NCCL to aid reproducibility by readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to improve the clarity of our evaluation. We address the major comment below.

Referee: In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.

Authors: We agree that the simulator-based results rely on the fidelity of ATLAHS in propagating isolated collective improvements. The simulator replays are intended to provide indicative estimates of end-to-end impact rather than definitive measurements, using publicly available traces and PICO-derived performance profiles. We did not conduct direct cross-validation against full-scale LLM training runs on the target systems, as such experiments would require prohibitive resources and coordinated access beyond the scope of this benchmarking-focused study. The ATLAHS framework is documented in its reference publication to model communication-computation overlap and trace replay with reasonable accuracy for distributed training workloads. In the revision, we will add an explicit limitations paragraph to the evaluation section that (1) states the absence of direct hardware cross-validation, (2) summarizes the simulator's modeling assumptions and known limitations with respect to hardware-specific effects not present in micro-benchmarks, and (3) qualifies the reported 44% figure as an upper-bound estimate under the simulator's replay model. These changes will make the scope of the claim transparent without altering the core contribution of the PICO framework itself.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking and simulator replay

The paper presents PICO as an empirical benchmarking framework that measures collective performance on real supercomputers and then replays traces in the external ATLAHS simulator using the measured profiles. No mathematical derivation, closed-form prediction, or first-principles result is claimed; the 5× slowdown and 44% training-time figures are obtained directly from instrumentation and trace replay rather than by fitting parameters to the target quantity and renaming the fit as a prediction. The evaluation chain therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on standard assumptions about collective operation semantics and simulator fidelity rather than introducing new fitted parameters or invented physical entities.

axioms (1)

domain assumption: Collective operations produce consistent timing behavior across repeated runs under controlled conditions.
Required for reproducible benchmarking and diagnostic isolation of algorithmic choices.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

[1]

Advanced Micro Devices, Inc. 2025. ROCm Communication Collectives Library (RCCL) Documentation, v2.22.3. Online documentation. https://rocm.docs.amd. com/projects/rccl Accessed Jul. 23, 2025

work page 2025
[2]

Jon Ames and Ron Lowman. 2025. How Ultra Ethernet and UALink Enable High- Performance, Scalable AI Networks. Synopsys Blog. https://www.synopsys. com/articles/ultra-ethernet-ualink-ai-networks.html Accessed Aug. 11, 2025

work page 2025
[3]

Rogers, Evan Schneider, Jean-Luc Vay, and P

Scott Atchley, Christopher Zimmer, John Lange, David Bernholdt, Veronica Me- lesse Vergara, Thomas Beck, Michael Brim, Reuben Budiardja, Sunita Chan- drasekaran, Markus Eisenbach, Thomas Evans, Matthew Ezell, Nicholas Frontiere, Antigoni Georgiadou, Joe Glenski, Philipp Grete, Steven Hamilton, John Hol- men, Axel Huebl, Daniel Jacobson, Wayne Joubert, Kim...

work page
[4]

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, CO, USA) (SC ’23)

Frontier: Exploring Exascale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, CO, USA) (SC ’23). Association for Computing Machinery, New York, NY, USA, Article 52, 16 pages. doi:10.1145/3581784.3607089

work page doi:10.1145/3581784.3607089
[5]

Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Sameer Kumar, Ewing Lusk, Rajeev Thakur, and Jesper Larsson Träff. 2009. MPI on a Million Processors. In Proceedings of the 16th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface (Espoo, Finland). Springer-Verlag, Berlin, Heidelber...

work page doi:10.1007/978-3- 2009
[6]

Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani, Joan Vinyals, Josep Pocurull, David Vicente, Beatriz Eguzkitza, Flavio C. C. Galeazzo, Mario C. Acosta, and Sergi Girona. 2025. Introducing MareNostrum5: A European pre-exascale energy-efficient system designed to serve a broad spectrum of scientific work- loads. arXiv:2503.09917 [cs.DC] https://ar...

work page arXiv 2025
[7]

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, and Joud Khoury. 2024. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies. arXiv:2309.13541 [cs.DC] https://arxiv.org/abs/ 2309.13541

work page arXiv 2024
[8]

Amanda Bienz, Shreeman Gautam, and Amun Kharel. 2022. A Locality-Aware Bruck Allgather. arXiv:2206.03564 [cs.DC] https://arxiv.org/abs/2206.03564

work page arXiv 2022
[9]

Broadcom Inc. 2025. Scale-Up Ethernet (SUE) Framework Specification. Tech- nical specification (PDF). https://docs.broadcom.com/doc/scale-up-ethernet- framework Accessed Aug. 11, 2025

work page 2025
[10]

2002.An Introduction to the InfiniBand Architecture

Rajkumar Buyya, Toni Cortes, and Hai Jin. 2002.An Introduction to the InfiniBand Architecture. 616–632. doi:10.1109/9780470544839.ch42

work page doi:10.1109/9780470544839.ch42 2002
[11]

Generalized Slow Roll for Tensors

Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler. 2020. An In-Depth Analysis of the Slingshot Interconnect. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14. doi:10.1109/sc41405.2020.00039

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41405.2020.00039 2020
[12]

Daniele De Sensi, Saverio Pasqualoni, Lorenzo Piarulli, Tommaso Bonato, Seydou Ba, Matteo Turisini, Jens Domke, and Torsten Hoefler. 2025. Bine Trees: Enhanc- ing Collective Operations by Optimizing Communication Locality. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25). doi:ToAppear

work page 2025
[13]

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi, Dun- can Roweth, Filippo Spiga, Salvatore Di Girolamo, and Torsten Hoefler. 2024. Exploring GPU-to-GPU Communication: Insights into Supercomputer Intercon- nects. In Proceedings of the International C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00039 2024
[14]

Forschungszentrum Jülich, Jülich Supercomputing Centre. 2025. JUPITER Tech- nical Overview. Technical overview page. https://www.fz-juelich.de/en/ias/jsc/ jupiter/tech Last modified Jan 7, 2025; accessed Aug 11, 2025

work page 2025
[15]

Zhenhao He, Daniele Parravicini, Lucian Petrica, Kenneth O’Brien, Gustavo Alonso, and Michaela Blott. 2021. ACCL: FPGA-Accelerated Collectives over 100 Gbps TCP-IP. In 2021 IEEE/ACM International Workshop on Heterogeneous High- performance Reconfigurable Computing (H2RC) . 33–43. doi:10.1109/H2RC54759. 2021.00009

work page doi:10.1109/h2rc54759 2021
[16]

Mert Hidayetoglu, Simon Garcia De Gonzalo, Elliott Slaughter, Yu Li, Christopher Zimmer, Tekin Bicer, Bin Ren, William Gropp, Wen-Mei Hwu, and Alex Aiken

work page
[17]

In Proceedings of the 38th ACM International Conference on Supercomputing

CommBench: Micro-Benchmarking Hierarchical Networks with Multi- GPU, Multi-NIC Nodes. In Proceedings of the 38th ACM International Conference on Supercomputing. 426–436

work page
[18]

Roger W. Hockney. 1994. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Comput. 20, 3 (1994), 389–398. doi:10.1016/S0167- 8191(06)80021-9

work page doi:10.1016/s0167- 1994
[19]

Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, and Torsten Hoefler. 2025. Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms. arXiv:2507.04786 [cs.DC] https://arxiv.org/abs/2507.04786

work page arXiv 2025
[20]

Hwang and Z

K. Hwang and Z. Xu. 1998. Scalable Parallel Computing: Technology, Archi- tecture, Programming. WCB/McGraw-Hill. https://books.google.it/books?id= OJNQAAAAMAAJ

work page 1998
[21]

Intel Corporation. 2021. Intel ® MPI Benchmarks User Guide, version 2021.2. On- line. https://www.intel.com/content/www/us/en/docs/mpi-library/user-guide- benchmarks/2021-2/overview.html Accessed Jul. 21, 2025

work page 2021
[22]

Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop, and Dhabaleswar K. Panda. 2009. Designing multi-leader-based Allgather algorithms for multi-core clusters. In 2009 IEEE International Symposium on Parallel & Dis- tributed Processing. 1–8. doi:10.1109/IPDPS.2009.5160896

work page doi:10.1109/ipdps.2009.5160896 2009
[23]

Xiang ke Liao, Kai Lu, Can qun Yang, Jin wen Li, Yuan Yuan, Ming che Lai, Li bo Huang, Ping jing Lu, Jian bin Fang, Jing Ren, and Jie Shen. 2018. Moving from exascale to zettascale computing: challenges and techniques. Frontiers of Information Technology & Electronic Engineering 19, 10 (2018), 1236–1244. doi:10.1631/FITEE.1800494

work page doi:10.1631/fitee.1800494 2018
[24]

Jongryoul Kim, William Dally, Steve Scott, and Dennis Abts. 2008. Technology- Driven, Highly-Scalable Dragonfly Topology. ACM SIGARCH Computer Architec- ture News 36, 77–88. doi:10.1109/ISCA.2008.19

work page doi:10.1109/isca.2008.19 2008
[25]

Andrii Kovalov, Elisabeth Lobe, Andreas Gerndt, and Daniel Lüdtke. 2017. Task- Node Mapping in an Arbitrary Computer Network Using SMT Solver. In Inte- grated Formal Methods, Nadia Polikarpova and Steve Schneider (Eds.). Springer International Publishing, Cham, 177–191

work page 2017
[26]

Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Ruefenacht, Anthony Skjellum, and Nawrin Sultana. 2019. A large-scale study of MPI usage in open- source HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, Ne...

work page doi:10.1145/3295500.3356176 2019
[27]

Lawrence Livermore National Laboratory. 2025. Lawrence Livermore Na- tional Laboratory’s El Capitan verified as world’s fastest supercomputer. Press release. https://www.llnl.gov/article/52061/lawrence-livermore-national- laboratorys-el-capitan-verified-worlds-fastest-supercomputer

work page 2025
[28]

Dongkyun Lim and John Kim. 2025. TidalMesh: Topology-Driven AllRe- duce Collective Communication for Mesh Topology. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA) . 1526–1540. doi:10.1109/HPCA61900.2025.00114

work page doi:10.1109/hpca61900.2025.00114 2025
[29]

Ping-Jing Lu, Ming-Che Lai, and Jun-Sheng Chang. 2022. A Survey of High- Performance Interconnection Networks in High-Performance Computer Systems. Electronics 11, 9 (2022). doi:10.3390/electronics11091369

work page doi:10.3390/electronics11091369 2022
[30]

Robert Lucas, James Ang, Keren Bergman, Shekhar Borkar, William Carlson, Laura Carrington, George Chiu, Robert Colwell, William Dally, Jack Dongarra, Al Geist, Rud Haring, Jeffrey Hittinger, Adolfy Hoisie, Dean Micron Klein, Peter Kogge, Richard Lethin, Vivek Sarkar, Robert Schreiber, John Shalf, Thomas Ster- ling, Rick Stevens, Jon Bashor, Ron Brightwell...

work page doi:10.2172/1222713 2014
[31]

Rick Merritt. 2023. What Is NVLink? NVIDIA Official Blog. https://blogs.nvidia. com/blog/what-is-nvidia-nvlink/ Accessed August 8, 2025

work page 2023
[32]

NVIDIA Corporation. [n. d.]. NCCL-Tests: Performance and correctness micro- benchmarks for NVIDIA NCCL. GitHub repository. https://github.com/NVIDIA/ nccl-tests Accessed Jul. 21, 2025

work page 2025
[33]

NVIDIA Corporation. 2025. NVIDIA Collective Communication Library (NCCL) Documentation. Online documentation. https://docs.nvidia.com/deeplearning/ nccl/index.html Accessed July 23, 2025

work page 2025
[34]

NVIDIA Corporation. 2025. NVIDIA GB200 NVL72 (Grace + Blackwell) – Rack- Scale AI System. Product web page. https://www.nvidia.com/en-us/data-center/ gb200-nvl72/ Accessed Aug. 11, 2025. Conference’17, July 2017, Washington, DC, USA Pasqualoni et al

work page 2025
[35]

OpenFabrics Interfaces Working Group (OFIWG). 2025. libfabric: Open Fabric Interfaces framework for high-performance networking. GitHub repository. https://github.com/ofiwg/libfabric Accessed Jul. 23, 2025

work page 2025
[36]

Pitch Patarasuk and Xin Yuan. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 69, 2 (Feb. 2009), 117–124. doi:10.1016/j.jpdc.2008.09.002

work page doi:10.1016/j.jpdc.2008.09.002 2009
[37]

Muhammad Shuaib Qureshi, Muhammad Bilal Qureshi, Muhammad Fayaz, Wali Khan Mashwani, Samir Brahim Belhaouari, Saima Hassan, and Asadul- lah Shah. 2020. A comparative analysis of resource allocation schemes for real-time services in high-performance computing systems. Interna- tional Journal of Distributed Sensor Networks 16, 8 (2020), 1550147720932750. ar...

work page doi:10.1177/1550147720932750 2020
[38]

Paul Sack and William Gropp. 2015. Collective Algorithms for Multiported Torus Networks. ACM Trans. Parallel Comput. 1, 2, Article 12 (Feb. 2015), 33 pages. doi:10.1145/2686882

work page doi:10.1145/2686882 2015
[39]

David Schor. 2018. ISSCC 2018: AMD’s Zeppelin; Multi-chip routing and packag- ing. WikiChip Fuse blog. https://fuse.wikichip.org/news/1064/isscc-2018-amds- zeppelin-multi-chip-routing-and-packaging/ Accessed Aug. 11, 2025

work page 2018
[40]

Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoe- fler. 2024. Swing: Short-cutting Rings for Higher Bandwidth Allreduce. arXiv:2401.09356 [cs.DC] https://arxiv.org/abs/2401.09356

work page arXiv 2024
[41]

Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, and Torsten Hoefler. 2022. Noise in the Clouds: Influence of Network Performance Variability on Application Scalability. arXiv:2210.15315 [cs.DC] https://arxiv.org/abs/2210.15315

work page arXiv 2022
[42]

Andres Sewell, Ke Fan, Ahmedur Rahman Shovon, Landon Dyken, Sidharth Kumar, and Steve Petruzza. 2024. Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. In Proceedings of the International Confer- ence on High Performance Computing in Asia-Pacific Region (Nagoya, Japan) (HP- CAsia ’24). Association for Computing Machinery, New Yo...

work page doi:10.1145/3635035.3635047 2024
[43]

Alexander Shpiner, Zachy Haramaty, Saar Eliad, Vladimir Zdornov, Barak Gafni, and Eitan Zahavi. 2017. Dragonfly+: Low Cost Topology for Scaling Datacenters. doi:10.1109/HiPINEB.2017.11

work page doi:10.1109/hipineb.2017.11 2017
[44]

Mohak Shroff and Robert van de Geijn. 2000. CollMark: MPI Collective Commu- nication Benchmark. (01 2000)

work page 2000
[45]

The Ohio State University. [n. d.]. OSU Micro-Benchmarks (OMB). Online. https://mvapich.cse.ohio-state.edu/benchmarks/ Accessed Jul. 21, 2025

work page 2025
[46]

The Open MPI Community. 2025. Open MPI 5.0: 11.10. Tuning Collectives (coll- tuned). Online documentation. https://docs.open-mpi.org/en/v5.0.x/tuning- apps/coll-tuned.html Last updated Jul. 31, 2025; accessed Aug. 11, 2025

work page 2025
[47]

The Unified Communication X Library [n. d.]. The Unified Communication X Library. http://www.openucx.org

work page
[48]

TOP500 Project. 2025. TOP500 List – June 2025. https://top500.org/lists/top500/ 2025/06/. Accessed: Jul. 18, 2025

work page 2025
[49]

Jesper Larsson Träff. 2006. Efficient Allgather for Regular SMP-Clusters. InRecent Advances in Parallel Virtual Machine and Message Passing Interface , Bernd Mohr, Jesper Larsson Träff, Joachim Worringen, and Jack Dongarra (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 58–65

work page 2006
[50]

Matteo Turisini, Giorgio Amati, and Mirko Cestari. 2023. LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI Applications. arXiv:2307.16885 [cs.DC] https://arxiv.org/abs/2307.16885

work page arXiv 2023
[51]

Ultra Ethernet Consortium. 2025. Ultra Ethernet ™ Specification v1.0. Technical specification (PDF). https://ultraethernet.org/wp-content/uploads/sites/20/2025/ 06/UE-Specification-6.11.25.pdf Accessed Aug. 11, 2025

work page 2025
[52]

Manjunath Gorentla Venkata, Valentine Petrov, Sergey Lebedev, Devendar Bu- reddy, Ferrol Aderholdt, Joshua Ladd, Gil Bloch, Mike Dubman, and Gilad Shainer. 2024. Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives. In IEEE Symposium on High-Performance Interconnects, HOTI 2024, Albuquerque, NM, USA, August 21-23, 2...

work page doi:10.1109/hoti63208.2024.00018 2024
[53]

Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, and Xiaoyi Lu. 2023. xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology 38, 1 (Feb. 2023), 166–195. doi:10.1007/s11390-023-2894-6

work page doi:10.1007/s11390-023-2894-6 2023
[54]

Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, and Songtao Wang

work page
[55]

arXiv:2409.04202 [cs.DC] https://arxiv.org/abs/2409.04202

Revisiting the Time Cost Model of AllReduce. arXiv:2409.04202 [cs.DC] https://arxiv.org/abs/2409.04202

work page arXiv
[56]

Wright, and Ayse K

Yijia Zhang, Taylor Groves, Brandon Cook, Nicholas J. Wright, and Ayse K. Coskun. 2020. Quantifying the impact of network congestion on application performance and network metrics. In2020 IEEE International Conference on Cluster Computing (CLUSTER). 162–168. doi:10.1109/CLUSTER49012.2020.00026

[57]

Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, et al. 2025. An Extensible Software Transport Layer for GPU Networking. arXiv preprint arXiv:2504.17307 (2025)

[58]

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao...

[59]


T. Zwinger, J. Heikonen, and P. Manninen. 2023. LUMI supercomputer for European researchers. In Galileo Conference: Solid Earth and Geohazards in the Exascale Era (2023-05-23/2023-05-26). Barcelona, Spain, GC11–solidearth–25. doi:10.5194/egusphere-gc11-solidearth-25

