Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora
Pith reviewed 2026-05-10 16:45 UTC · model grok-4.3
The pith
System-level choices enable scaling FP64 HPL to 1.01 EF/s while delivering an 11.5x speedup in mixed precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system sustained exascale performance in production through deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy introduced after synchronization stalls appeared at scale. FP64 HPL scaled from 0.585 EF/s to 1.01 EF/s, while HPL-MxP reached 11.64 EF/s, an 11.5x gain over full precision enabled by mixed-precision arithmetic and Intel AMX acceleration.
What carries the argument
A classification, by role at production scale, of the system-level choices that sustained the results: deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy.
If this is right
- Scaling to thousands of nodes requires these coordinated practices to avoid stalls that appear only at production scale.
- Mixed-precision orchestration can deliver more than tenfold performance gains over full precision for suitable workloads.
- Hybrid peer-to-peer and collective methods reduce the impact of failures and synchronization issues as system size grows.
- Explicit pipelining between processors improves overall efficiency in tightly coupled heterogeneous environments.
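The mixed-precision mechanism behind the tenfold-plus gains listed above is iterative refinement: do the expensive factorization in low precision, then recover high-precision accuracy with cheap residual corrections. A minimal NumPy sketch of the idea (illustrative only; it precomputes an FP32 inverse to stand in for low-precision LU factors, whereas HPL-MxP itself uses FP16/BF16 GEMMs with GMRES-based refinement):

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iters=50):
    """Solve Ax = b to near-FP64 accuracy from a float32 factorization.

    The O(n^3) work happens once, entirely in FP32; each refinement
    step costs only an O(n^2) FP64 residual plus an FP32 apply.
    """
    A32 = A.astype(np.float32)
    Ainv32 = np.linalg.inv(A32)                 # stand-in for FP32 LU factors
    x = (Ainv32 @ b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                           # residual computed in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x += (Ainv32 @ r.astype(np.float32)).astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
rel_res = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print(rel_res)  # far below the ~1e-7 single-precision roundoff level
```

Refinement converges only when the matrix is well enough conditioned for the low-precision factorization, which is why the benchmark's gains are stated for "suitable workloads".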
Where Pith is reading between the lines
- The same mapping and pipelining practices could be tested on other scientific workloads to check if they improve sustained efficiency.
- The role-based classification offers a template for prioritizing optimizations when designing software stacks for future large systems.
- Resilience strategies of this form may become more critical as hardware fault rates rise with increasing component counts.
Load-bearing premise
The observed choices generalize beyond this specific deployment to other tightly coupled heterogeneous systems at extreme scale.
What would settle it
Measure whether removing locality-aware mapping or the hybrid resilience strategy on a comparable large heterogeneous system causes performance to fall below linear scaling expectations or to encounter more frequent synchronization stalls.
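The settling experiment reduces to comparing an ablated run against a linear-scaling baseline. A minimal helper for that comparison, demonstrated on the paper's own reported FP64 points (the function and its interpretation threshold are illustrative, not from the paper):

```python
def scaling_efficiency(base_nodes, base_rate, scaled_nodes, scaled_rate):
    """Ratio of the achieved rate to a linear extrapolation from a baseline run.

    Values near or above 1.0 mean per-node throughput was sustained; an
    ablated configuration falling clearly below 1.0 would support the
    causal claim for the removed practice.
    """
    expected = base_rate * (scaled_nodes / base_nodes)  # linear-scaling expectation
    return scaled_rate / expected

# Aurora's reported FP64 HPL points: 0.585 EF/s on 5,439 nodes -> 1.01 EF/s on 9,234 nodes
eff = scaling_efficiency(5439, 0.585, 9234, 1.01)
print(f"{eff:.3f}")  # 1.017: per-node throughput was slightly better at the larger scale
```

By this measure the reported progression already sits at linear scaling; the open question is whether it would stay there with the mapping or resilience strategy removed.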
Original abstract
Sustaining exascale performance in production requires engineering choices and operational practices that emerge only under real deployment constraints and demand coordination across system layers. This paper reports experience from three successive campaigns running HPL and HPL-MxP on Aurora, an Intel-based exascale system featuring the first large-scale deployment of Intel discrete GPUs, CPU-attached network interfaces, and the largest production Slingshot-11 interconnect. Aurora progressed from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes in FP64 HPL, while HPL-MxP reached 11.64 EF/s, an 11.5x speedup over FP64 enabled by mixed-precision arithmetic and Intel AMX acceleration. We identify and classify by role at production scale the system-level choices that sustained these results, including deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy introduced after synchronization stalls at scale. While some observations are Aurora-specific, the broader lessons are likely to apply to tightly coupled heterogeneous systems at extreme scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports empirical results from three campaigns running HPL and HPL-MxP on the Aurora exascale system (first large-scale Intel discrete GPU deployment with Slingshot-11 interconnect). It documents FP64 HPL scaling from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes and an 11.64 EF/s HPL-MxP result (11.5x over FP64 via mixed-precision and AMX). The central contribution is a post-hoc classification of production-scale choices—deterministic locality-aware mapping, CPU-GPU pipelining, mixed-precision orchestration, and hybrid P2P/collective resilience—by their role in sustaining these results, with a claim that broader lessons apply to tightly coupled heterogeneous systems.
Significance. If the attributions are substantiated, the work supplies concrete, production-derived guidance on cross-layer coordination for exascale heterogeneous computing, including quantified speedups and scaling behavior on a first-of-kind platform. The explicit node counts, achieved rates, and role-based classification of practices constitute a useful reference point for operators of similar systems, even if some observations remain Aurora-specific.
major comments (2)
- [Abstract / results classification] Abstract and results sections: the claim that the enumerated choices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) 'sustained' the reported scaling and 11.5x speedup is not supported by isolation experiments or before/after metrics on identical node counts. The progression from 5,439 to 9,234 nodes and the mixed-precision gain could be explained by hardware scaling, interconnect properties, or unlisted factors; without controlled comparisons the causal link remains correlational and undermines both internal attribution and the transferability assertion.
- [Campaigns and choices sections] Methodology description (campaigns section): insufficient detail is provided on how each choice was implemented, measured, or varied across the three campaigns, including absence of error bars, run-to-run variability, or explicit controls for confounding variables such as job placement policies or network contention. This limits independent verification that the listed practices, rather than Aurora-specific hardware (discrete GPUs, CPU-attached NICs), produced the gains.
minor comments (2)
- [Abstract] The abstract states 'three successive campaigns' but does not tabulate the exact node counts, software versions, or configuration differences between campaigns, which would improve traceability of the scaling steps.
- [Results presentation] Notation for performance units (EF/s) and speedups is clear, but the paper would benefit from an explicit table summarizing all reported rates, node counts, and the precise definition of the 11.5x factor (HPL-MxP vs. FP64 HPL on the final configuration).
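The requested summary table and the precise definition of the 11.5x factor follow directly from the abstract's numbers. A plain-text sketch (the HPL-MxP node count is not stated in the abstract and is left as n/a rather than guessed):

```python
# Rates and node counts exactly as reported in the abstract.
runs = [
    ("FP64 HPL, earlier campaign", "5,439", 0.585),
    ("FP64 HPL, final campaign",   "9,234", 1.01),
    ("HPL-MxP",                    "n/a",   11.64),
]
print(f"{'run':28s} {'nodes':>6s} {'EF/s':>7s}")
for name, nodes, rate in runs:
    print(f"{name:28s} {nodes:>6s} {rate:7.3f}")

# The 11.5x factor: HPL-MxP rate over FP64 HPL on the final configuration.
speedup = 11.64 / 1.01
print(f"speedup: {speedup:.1f}x")  # 11.5x
```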
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments have prompted us to refine the language around attribution and expand methodological details. We address each major comment below.
Point-by-point responses
-
Referee: [Abstract / results classification] Abstract and results sections: the claim that the enumerated choices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) 'sustained' the reported scaling and 11.5x speedup is not supported by isolation experiments or before/after metrics on identical node counts. The progression from 5,439 to 9,234 nodes and the mixed-precision gain could be explained by hardware scaling, interconnect properties, or unlisted factors; without controlled comparisons the causal link remains correlational and undermines both internal attribution and the transferability assertion.
Authors: We agree that the evidence presented is observational and correlational rather than derived from controlled isolation experiments or fixed-node before/after comparisons. Such experiments were not feasible given the production nature of the campaigns, the high cost of exascale allocations, and the incremental way improvements were deployed as bottlenecks were identified through profiling. The attributions rest on in-situ measurements (e.g., synchronization stall detection leading to the hybrid resilience strategy) and the documented progression across campaigns. We have revised the abstract and results sections to use more precise phrasing such as 'contributed to' and 'enabled the observed' performance, while adding a brief discussion of the limitations of causal inference in production settings. The practical lessons remain grounded in the actual deployment experience. revision: yes
-
Referee: [Campaigns and choices sections] Methodology description (campaigns section): insufficient detail is provided on how each choice was implemented, measured, or varied across the three campaigns, including absence of error bars, run-to-run variability, or explicit controls for confounding variables such as job placement policies or network contention. This limits independent verification that the listed practices, rather than Aurora-specific hardware (discrete GPUs, CPU-attached NICs), produced the gains.
Authors: We appreciate the call for greater transparency. The revised manuscript expands the Campaigns and Choices sections with concrete implementation details: locality-aware mapping was realized through topology-aware job scheduling that aligned processes to Slingshot-11 fabric links and CPU-attached NICs; CPU-GPU pipelining used explicit asynchronous transfers and kernel overlap measured via oneAPI profiling tools; mixed-precision orchestration coordinated FP64, TF32, and AMX paths with per-kernel timing; and the hybrid resilience strategy combined P2P and collective checkpoints after observing stalls at scale. Where multiple runs were possible, we now report observed variability. Full error bars and exhaustive controls for every confounder (e.g., dynamic network contention or scheduler policies) were not attainable in the production environment; we have added an explicit limitations paragraph acknowledging this constraint while providing the best available data from the three campaigns. revision: partial
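The deterministic mapping described in this response can be illustrated with a node-local binding rule. A hedged sketch (the device and NIC counts and the round-robin policy are illustrative assumptions, not Aurora's published topology or its launcher's actual behavior):

```python
def bind_resources(local_rank, gpus_per_node=6, nics_per_node=8):
    """Deterministically map a node-local MPI rank to a GPU, NIC, and socket.

    Determinism matters at scale: every job sees the same placement, so
    performance anomalies are reproducible rather than placement noise.
    The counts here are illustrative defaults; a real mapping also has
    to respect which CPU socket each GPU tile and NIC hangs off.
    """
    gpu = local_rank % gpus_per_node            # round-robin over GPUs
    nic = local_rank % nics_per_node            # pick a NIC the same way
    numa = gpu // (gpus_per_node // 2)          # 0 = first socket, 1 = second
    return {"gpu": gpu, "nic": nic, "numa": numa}

# e.g. 12 ranks per node -> two ranks per GPU, spread across both sockets
for r in range(12):
    print(r, bind_resources(r))
```

The point of such a rule is not the particular modulo arithmetic but that the rank-to-resource function is a pure function of the rank, so placement cannot drift between runs.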
Circularity Check
No circularity: pure empirical benchmark report with no derivations or fitted predictions
full rationale
The manuscript is an empirical report of measured HPL and HPL-MxP benchmark results on the Aurora system across three campaigns. It states concrete achieved rates (0.585 EF/s on 5,439 nodes scaling to 1.01 EF/s on 9,234 nodes for FP64 HPL; 11.64 EF/s for HPL-MxP) and classifies observed practices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) from those runs. No equations, first-principles derivations, parameter fits, or predictions appear in the provided text or abstract. The central claims are direct measurements and post-hoc classification of practices that coincided with the results; they do not reduce to inputs by construction, self-citation chains, or renaming. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] B. S. Allen, J. Anchell, V. Anisimov, T. Applencourt, A. Bagusetty, R. Balakrishnan, R. Balin, S. Bekele, C. Bertoni, C. Blackworth, R. Bustamante, K. Canada, J. Carrier, C. Chan-nui, L. C. Cheney, T. Childers, P. Coffman, S. Coghlan, M. D'Mello, M. Emani, K. G. Felker, S. Foreman, O. Franza, L. Gao, M. García, M. Garzarán, B. Gerofi, Y. Ghadar, N. ..., "Aurora: Architecting Argonne's First Exascale Supercomputer for Science and Discovery."
- [2] Aurora. [Online]. Available: https://www.alcf.anl.gov/aurora
- [3] N. Nassif, A. O. Munch, C. L. Molnar, G. Pasdast, S. V. Lyer, Z. Yang, O. Mendoza, M. Huddart, S. Venkataraman, S. Kandula, R. Marom, A. M. Kern, B. Bowhill, D. R. Mulvihill, S. Nimmagadda, V. Kalidindi, J. Krause, M. M. Haq, R. Sharma, and K. Duda, "Sapphire Rapids: the next-generation Intel Xeon scalable processor," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), ...
- [4] W. Gomes, A. Koker, P. Stover, D. Ingerly, S. Siers, S. Venkataraman, C. Pelto, T. Shah, A. Rao, F. O'Mahony, E. Karl, L. Cheney, I. Rajwani, H. Jain, R. Cortez, A. Chandrasekhar, B. Kanthi, and R. Koduri, "Ponte Vecchio: a multi-tile 3D stacked processor for exascale computing," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, ...
- [5] D. Blythe, "XeHPC Ponte Vecchio," in 2021 IEEE Hot Chips 33 Symposium (HCS). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2021, pp. 1–34. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/HCS52781.2021.9567038
- [6] "HPC Slingshot launched into network space," in Cray User Group 2022 (CUG2022) Proceedings, May 2022. [Online]. Available: https://cug.org/proceedings/cug2022_proceedings/includes/files/pap121s2-file1.pdf
- [7] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," in 2008 International Symposium on Computer Architecture, 2008, pp. 77–88. [Online]. Available: https://doi.org/10.1109/ISCA.2008.19
- [8] H. Ibeid, A.-T. Nguyen, A. Nishtala, P. Sakarda, L. Kaplan, N. Mahadevan, M. Woodacre, V. Anisimov, K. Kumaran, J. Kwack, V. Morozov, S. Muralidharan, and S. Parker, "Scaling MPI applications on Aurora," 2025. [Online]. Available: https://arxiv.org/abs/2512.04291
- [9] Intel oneAPI. [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html
- [10] TOP500 list. [Online]. Available: https://www.top500.org/lists/top500/
- [11] HPL-MxP. [Online]. Available: https://hpl-mxp.org/results.md
- [12] S. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V. Melesse Vergara, T. Beck, M. Brim, R. Budiardja, S. Chandrasekaran, M. Eisenbach, T. Evans, M. Ezell, N. Frontiere, A. Georgiadou, J. Glenski, P. Grete, S. Hamilton, J. Holmen, A. Huebl, D. Jacobson, W. Joubert, K. Mcmahon, E. Merzari, S. Moore, A. Myers, S. Nichols, S. Oral, T. Papatheodore, D. Perez, D. ...
- [13] N. Chalmers, J. Kurzak, D. Mcdougall, and P. Bauman, "Optimizing high-performance Linpack for exascale accelerated architectures," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/...
- [14] H. Ibeid, V. Narayana, J. Kim, A. Nguyen, V. Morozov, and Y. Luo, "Performance analysis of HPC applications on the Aurora supercomputer: exploring the impact of HBM-enabled Intel Xeon Max CPUs," pp. 1–11.
- [15] [Online]. Available: https://doi.org/10.23919/ISC.2025.11018301
- [16] Optimizing machine learning (ML) models with Intel® Advanced Matrix Extensions (Intel® AMX), 2022. [Online]. Available: https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-12/optimizing-ml-models-with-amx-brief.pdf
- [17] H. Lu, M. Matheson, N. Chalmers, A. Kashi, N. Malaya, and F. Wang, "Insights from optimizing HPL performance on exascale systems: a comparative analysis of panel factorization," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '25. New York, NY, USA: Association for Computing Machiner...
- [18] J. Dongarra and P. Luszczek, "HPL-MxP benchmark: mixed-precision algorithms, iterative refinement, and scalable data generation," The International Journal of High Performance Computing Applications, Sep. 2025. [Online]. Available: http://dx.doi.org/10.1177/10943420251382476
- [19] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 2018, pp. 603–613. [Online]. Available: https://doi.org/10.1109/SC.2018.00050
- [20] H. Lu, M. Matheson, V. Oles, A. Ellis, W. Joubert, and F. Wang, "Climbing the summit and pushing the frontier of mixed precision benchmarks at extreme scale," in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15. [Online]. Available: https://doi.org/10.1109/SC41404.2022.00083
- [21] S. Kudo, K. Nitadori, T. Ina, and T. Imamura, "Prompt report on exa-scale HPL-AI benchmark," in 2020 IEEE International Conference on Cluster Computing (CLUSTER). Los Alamitos, CA, USA: IEEE Computer Society, Sep. 2020, pp. 418–419. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CLUSTER49012.2020.00058
- [22] J. J. Dongarra, P. Luszczek, and A. Petitet, "The Linpack benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003. [Online]. Available: https://doi.org/10.1002/cpe.728
- [23] A. Petitet, R. C. Whaley, J. Dongarra, and J. Cleary, "HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers," University of Tennessee, Tech. Rep. UT-CS-01-448, 2001. [Online]. Available: http://www.netlib.org/benchmark/hpl/
- [24] Y. Guo, K. Raffenetti, H. Zhou, P. Balaji, M. Si, A. Amer, S. Iwasaki, S. Seo, G. Congiu, R. Latham, L. Oden, T. Gillis, R. Zambre, K. Ouyang, C. Archer, W. Bland, J. Jose, S. Sur, H. Fujita, D. Durnov, M. Chuvelev, G. Zheng, A. Brooks, S. Thapaliya, T. Doodi, M. Garzaran, S. Oyanagi, M. Snir, and R. Thakur, "Preparing MPICH for exascale," The Internation...
- [25] Intel(R) oneAPI Math Kernel Library (oneMKL). [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
- [26] K. Raffenetti, A. Amer, L. Oden, C. Archer, W. Bland, H. Fujita, Y. Guo, T. Janjusic, D. Durnov, M. Blocksome, M. Si, S. Seo, A. Langer, G. Zheng, M. Takagi, P. Coffman, J. Jose, S. Sur, A. Sannikov, S. Oblomov, M. Chuvelev, M. Hatanaka, X. Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, and P. Balaji, "Why is MPI so slow? Analyzing the fundamental li...
- [27] [Online]. Available: https://doi.org/10.1145/3126908.3126963
- [28] T. Doodi, N. Islam, G. Zheng, R. Kalidas, A. Langer, and G. Maria, "High radix collective algorithms," in Proceedings of EuroMPI, 2021. [Online]. Available: https://doi.org/10.1007/978-3-031-29927-8_31
- [29] R. Zambre, A. Chandramowliswharan, and P. Balaji, "How I learned to stop worrying about user-visible endpoints and love MPI," in Proceedings of the 34th ACM International Conference on Supercomputing, ser. ICS '20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3392717.3392773
- [30] High-Performance Data Type Engine. [Online]. Available: https://www.yaksa.org
- [31] Application programming interface for exascale systems. [Online]. Available: https://pmix.github.io/
- [32] HPE Parallel Application Launch Service (PALS). [Online]. Available: https://support.hpe.com/hpesc/public/docDisplay?docId=a00117940en_us&page=Parallel_Application_Launch_Service_PALS.html&docLocale=en_US
- [33] Y. Levitt, R. Barella, S. Zeltner, T. Musta, L. Cheney, G. Espinosa, O. Franza, and B. Gerofi, "Fine-grained automated failure management for extreme-scale GPU accelerated systems," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '25. New York, NY, USA: Association for Computing Mac...
- [34] S. Chunduri, T. Groves, P. Mendygral, B. Austin, J. Balma, K. Kandalla, K. Kumaran, G. Lockwood, S. Parker, S. Warren, N. Wichmann, and N. Wright, "GPCNET: designing a benchmark suite for inducing and measuring contention in HPC networks," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser....