PICO: Performance Insights for Collective Operations
Pith reviewed 2026-05-18 20:42 UTC · model grok-4.3
The pith
Default collective algorithms and transport settings can be up to 5× slower than the best available choice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PICO decouples experiment setup from execution, offers backend-adaptive parameter selection across MPI and NCCL, supplies instrumentable plain-MPI references, and records full runtime environments. Applied on three major supercomputers, this machinery shows that default collective algorithms and transport settings can be up to 5× slower than the best available choice. When the resulting optimized profiles are replayed in the ATLAHS simulator on open-source LLM training traces, simulated training times drop by up to 44%.
What carries the argument
PICO's decoupling of portable experiment setup from platform execution together with its backend-adaptive parameter selection interface for MPI and NCCL.
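To make backend-adaptive selection concrete, here is a minimal sketch. It is not PICO's actual interface: the `env_for` helper and its lookup tables are hypothetical. It maps one backend-neutral request (for example, "allreduce with a ring algorithm") to the environment variables each library reads. Open MPI's `coll_tuned` MCA parameters and NCCL's `NCCL_ALGO` variable are existing tuning knobs, but the Open MPI algorithm indices below are assumptions that should be checked with `ompi_info` on the target installation.

```python
# Hypothetical sketch: translate a backend-neutral algorithm request into
# the environment variables each library reads. Not PICO's real interface.

# Assumed Open MPI coll_tuned allreduce indices; verify with `ompi_info --all`.
OMPI_ALLREDUCE_IDS = {"recursive_doubling": 3, "ring": 4, "rabenseifner": 6}

# NCCL_ALGO accepts algorithm names such as "Ring" and "Tree".
NCCL_ALGO_NAMES = {"ring": "Ring", "tree": "Tree"}


def env_for(backend: str, collective: str, algorithm: str) -> dict[str, str]:
    """Return environment overrides that pin `algorithm` for `backend`."""
    if backend == "openmpi":
        if collective != "allreduce" or algorithm not in OMPI_ALLREDUCE_IDS:
            raise ValueError(f"unsupported Open MPI request: {collective}/{algorithm}")
        return {
            # Dynamic rules must be enabled for the forced algorithm to apply.
            "OMPI_MCA_coll_tuned_use_dynamic_rules": "1",
            "OMPI_MCA_coll_tuned_allreduce_algorithm": str(OMPI_ALLREDUCE_IDS[algorithm]),
        }
    if backend == "nccl":
        if algorithm not in NCCL_ALGO_NAMES:
            raise ValueError(f"unsupported NCCL algorithm: {algorithm}")
        # NCCL_ALGO is not per-collective; it constrains every collective.
        return {"NCCL_ALGO": NCCL_ALGO_NAMES[algorithm]}
    raise ValueError(f"unknown backend: {backend}")


if __name__ == "__main__":
    print(env_for("openmpi", "allreduce", "ring"))
    print(env_for("nccl", "allreduce", "tree"))
```

The point of the design is that the experiment description names the algorithm once, while the per-backend translation lives in a single place that can be audited and recorded alongside the results.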
If this is right
- Default collective choices in production systems are often far from optimal and can be diagnosed by isolating topology-sensitive algorithmic decisions.
- Instrumentation inside PICO yields per-algorithm breakdowns that explain where time is lost.
- Simulator replay of real training traces converts micro-benchmark speedups into concrete application-level time savings.
- Portable experiment definitions plus configuration capture make performance comparisons reproducible across platforms.
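The "up to 5× slower" figure is a ratio between the default's time and the best measured alternative. A minimal sketch of that calculation follows, assuming a hypothetical `default_slowdown` helper and made-up timings that are not measurements from the paper.

```python
def default_slowdown(timings_us: dict[str, float], default: str) -> tuple[str, float]:
    """Return the fastest algorithm and how many times slower the default is."""
    best = min(timings_us, key=timings_us.get)
    return best, timings_us[default] / timings_us[best]


# Illustrative numbers only (microseconds for one message size).
timings = {"ring": 120.0, "recursive_doubling": 48.0, "rabenseifner": 24.0}
best, slowdown = default_slowdown(timings, default="ring")
print(best, slowdown)  # rabenseifner 5.0
```

In practice the ratio would be computed per collective, message size, and node count, since the best algorithm typically changes across those dimensions.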
Where Pith is reading between the lines
- Many running HPC and AI jobs are likely leaving substantial performance on the table simply by accepting library defaults.
- Embedding PICO-style selection logic into runtime systems could automate better collective choices without user intervention.
- The same measurement approach could be applied to other communication patterns such as point-to-point or one-sided operations.
Load-bearing premise
The ATLAHS simulator accurately converts measured collective performance gains into end-to-end training time predictions for real LLM workloads.
What would settle it
Run the same open-source LLM training traces on actual hardware once with default collectives and once with the PICO-identified optimized profiles, then compare measured wall-clock training times against the simulator predictions.
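The comparison described above reduces to two relative reductions and their difference. The sketch below assumes hypothetical helper names and placeholder wall-clock values; none of the numbers are results from the paper.

```python
def reduction(baseline_s: float, tuned_s: float) -> float:
    """Fractional reduction in training time relative to the baseline."""
    return 1.0 - tuned_s / baseline_s


def prediction_gap(sim_baseline_s: float, sim_tuned_s: float,
                   hw_baseline_s: float, hw_tuned_s: float) -> tuple[float, float, float]:
    """Return (predicted, measured, predicted - measured) reductions."""
    predicted = reduction(sim_baseline_s, sim_tuned_s)
    measured = reduction(hw_baseline_s, hw_tuned_s)
    return predicted, measured, predicted - measured


# Placeholder values: the simulator predicts a 44% reduction, while the
# hypothetical hardware runs show 30%.
predicted, measured, gap = prediction_gap(100.0, 56.0, 100.0, 70.0)
print(f"predicted={predicted:.2f} measured={measured:.2f} gap={gap:.2f}")
```

A gap near zero would support the simulator's replay model; a large positive gap would indicate that the simulator overstates how much of the collective speedup reaches end-to-end time.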
Original abstract
Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies optionally instrumentable plain-MPI reference collective implementations, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to 5× slower than the best available choice. It provides diagnostic evidence by isolating topology-sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in the ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to 44%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PICO, an open-source framework that decouples portable experiment setup from platform execution for benchmarking collective operations. It supports MPI and NCCL backends, provides plain-MPI reference implementations, records system configurations for reproducibility, and offers diagnostic profiling. Evaluations on three supercomputers show default collective algorithms and transport settings can be up to 5× slower than optimized choices, with evidence on topology sensitivity. To assess end-to-end impact, the authors replay open-source LLM training traces in the ATLAHS simulator after substituting PICO-identified optimized collective profiles, reporting training time reductions of up to 44%.
Significance. If the simulator-based end-to-end results are validated, the work provides concrete evidence that collective tuning can yield substantial gains in large-scale AI training, complementing the direct benchmarking data from multiple production systems. The framework's emphasis on portability, instrumentation, and reproducibility is a practical contribution to the HPC and distributed systems community. The open-source release and focus on diagnostic breakdowns strengthen the utility for practitioners.
major comments (1)
- [Evaluation with simulator replays] In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.
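One way to see why compute-communication overlap matters here is an Amdahl-style bound. The sketch below is a simplified model, not the ATLAHS replay model: it assumes that only the exposed (non-overlapped) communication fraction of a training step shrinks, that it shrinks uniformly by the collective speedup, and that compute time is unchanged. The function names are hypothetical, and combining the 5× and 44% figures is purely illustrative because they come from different experiments.

```python
def end_to_end_reduction(exposed_comm_fraction: float, collective_speedup: float) -> float:
    """Fractional step-time reduction when only exposed communication speeds up."""
    return exposed_comm_fraction * (1.0 - 1.0 / collective_speedup)


def required_exposed_fraction(target_reduction: float, collective_speedup: float) -> float:
    """Exposed communication fraction needed to reach `target_reduction`."""
    return target_reduction / (1.0 - 1.0 / collective_speedup)


# Under this model, a 44% reduction with a uniform 5x collective speedup
# would need roughly 55% of the step time to be exposed communication.
needed = required_exposed_fraction(0.44, 5.0)
print(f"{needed:.2f}")
```

If most communication is already hidden behind compute, the exposed fraction is small and the same micro-benchmark speedup yields a much smaller end-to-end gain, which is why the referee's request for validation is consequential.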
minor comments (2)
- [Abstract and Evaluation] The abstract and evaluation sections should report the number of runs averaged, presence of error bars or statistical measures, and how post-hoc algorithm selection was performed to support the 'up to 5×' and 'up to 44%' claims.
- [Framework description] Clarify the exact interface and automation level of the 'backend-adaptive parameter selection' for MPI versus NCCL to aid reproducibility by readers.
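The statistical reporting requested in the first minor comment could look like the following sketch, which summarizes repeated timings with a median and a percentile-bootstrap confidence interval. The `median_with_ci` helper, its defaults, and the sample values are assumptions for illustration and are not taken from the paper.

```python
import random
import statistics


def median_with_ci(samples: list[float], n_boot: int = 2000,
                   alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Return (median, lower, upper) using a percentile bootstrap of the median."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    boot_medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lower = boot_medians[int((alpha / 2) * n_boot)]
    upper = boot_medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(samples), lower, upper


# Illustrative timings in microseconds, including one outlier run.
runs_us = [10.0, 11.0, 12.0, 13.0, 50.0]
med, lo, hi = median_with_ci(runs_us)
print(med, lo, hi)
```

The median is used instead of the mean because collective timings on shared systems often contain outliers caused by network contention.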
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to improve the clarity of our evaluation. We address the major comment below.
Referee: In the evaluation section describing ATLAHS simulator replays of LLM training traces, the 44% training-time reduction claim is obtained by substituting PICO-optimized collective profiles into the simulator without reported cross-validation against actual end-to-end LLM training runs on the target hardware. This leaves unaddressed whether the simulator correctly propagates isolated collective speedups while accounting for compute-communication overlap, trace fidelity, and hardware-specific effects not captured in the micro-benchmarks; this assumption is load-bearing for the headline application-level result in the abstract.
Authors: We agree that the simulator-based results rely on the fidelity of ATLAHS in propagating isolated collective improvements. The simulator replays are intended to provide indicative estimates of end-to-end impact rather than definitive measurements, using publicly available traces and PICO-derived performance profiles. We did not conduct direct cross-validation against full-scale LLM training runs on the target systems, as such experiments would require prohibitive resources and coordinated access beyond the scope of this benchmarking-focused study. The ATLAHS framework is documented in its reference publication to model communication-computation overlap and trace replay with reasonable accuracy for distributed training workloads. In revision we will add an explicit limitations paragraph in the evaluation section that (1) states the absence of direct hardware cross-validation, (2) summarizes the simulator's modeling assumptions and known limitations with respect to hardware-specific effects not present in micro-benchmarks, and (3) qualifies the reported 44% figure as an upper-bound estimate under the simulator's replay model. These changes will make the scope of the claim transparent without altering the core contribution of the PICO framework itself.
Revision: yes
Circularity Check
No significant circularity in empirical benchmarking and simulator replay
The paper presents PICO as an empirical benchmarking framework that measures collective performance on real supercomputers and then replays traces in the external ATLAHS simulator using the measured profiles. No mathematical derivation, closed-form prediction, or first-principles result is claimed; the 5× slowdown and 44% training-time figures are obtained directly from instrumentation and trace replay rather than by fitting parameters to the target quantity and renaming the fit as a prediction. The evaluation chain therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: collective operations produce consistent timing behavior across repeated runs under controlled conditions.