pith. machine review for the scientific record.

arxiv: 2604.09517 · v1 · submitted 2026-04-10 · 💻 cs.DC

Recognition: unknown

Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:45 UTC · model grok-4.3

classification 💻 cs.DC
keywords exascale performance · high performance linpack · mixed precision · resource mapping · processor pipelining · resilience strategies · heterogeneous systems · production scale

The pith

System-level choices enable scaling FP64 HPL to 1.01 EF/s while delivering an 11.5x speedup in mixed precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports experience from running HPL and HPL-MxP benchmarks on a large-scale heterogeneous computing system. Performance in double precision advanced from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes. Mixed-precision runs reached 11.64 EF/s through arithmetic adjustments and hardware acceleration. The authors classify, by the role each played at production scale, the practices that supported these results: locality-aware resource mapping, explicit processor pipelining, precision orchestration, and hybrid resilience. These findings matter because they expose the cross-layer coordination that becomes necessary only under real deployment constraints at extreme scale.

Core claim

By applying deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy introduced after synchronization stalls at scale, the system sustained exascale performance in production. This produced scaling from 0.585 EF/s to 1.01 EF/s in FP64 HPL and 11.64 EF/s in HPL-MxP, an 11.5x gain over full precision enabled by mixed-precision arithmetic and Intel AMX acceleration.

What carries the argument

The role-based classification, grounded in production-scale runs, of system-level choices: deterministic locality-aware resource mapping, explicit processor pipelining, mixed-precision orchestration, and hybrid resilience strategies.

If this is right

  • Scaling to thousands of nodes requires these coordinated practices to avoid stalls that appear only at production scale.
  • Mixed-precision orchestration can deliver more than tenfold performance gains over full precision for suitable workloads.
  • Hybrid peer-to-peer and collective methods reduce the impact of failures and synchronization issues as system size grows.
  • Explicit pipelining between processors improves overall efficiency in tightly coupled heterogeneous environments (a sketch of the overlap pattern follows this list).
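
To make the pipelining point concrete, the sketch below shows the double-buffered overlap pattern that explicit CPU-GPU pipelining typically denotes: while one buffer is being computed on, the next chunk is staged into the other. The thread-based transfer stand-in, chunk sizes, and work functions are illustrative assumptions, not the paper's implementation (the rebuttal below describes real asynchronous device transfers profiled with oneAPI tools).

    // Minimal double-buffering sketch: overlap staging of chunk i+1 with
    // compute on chunk i. A host thread stands in for an async copy engine;
    // transfer() and compute() are placeholders for real DMA and GEMM work.
    #include <cstdio>
    #include <functional>
    #include <numeric>
    #include <thread>
    #include <vector>

    constexpr int kChunks = 8;
    constexpr int kChunkLen = 1 << 20;

    void transfer(std::vector<double>& buf, int chunk) {
        std::iota(buf.begin(), buf.end(), double(chunk));    // pretend copy-in
    }
    double compute(const std::vector<double>& buf) {
        return std::accumulate(buf.begin(), buf.end(), 0.0); // pretend GEMM
    }

    int main() {
        std::vector<double> bufs[2] = {std::vector<double>(kChunkLen),
                                       std::vector<double>(kChunkLen)};
        double total = 0.0;
        transfer(bufs[0], 0);                      // prime the pipeline
        for (int i = 0; i < kChunks; ++i) {
            std::thread next;                      // stage chunk i+1 ...
            if (i + 1 < kChunks)
                next = std::thread(transfer, std::ref(bufs[(i + 1) % 2]), i + 1);
            total += compute(bufs[i % 2]);         // ... while computing chunk i
            if (next.joinable()) next.join();      // chunk i+1 now resident
        }
        std::printf("checksum %.1f\n", total);
    }

Without the overlap, each iteration pays transfer time plus compute time; with it, the slower of the two dominates, which is the efficiency gain the bullet refers to.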

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping and pipelining practices could be tested on other scientific workloads to check if they improve sustained efficiency.
  • The role-based classification offers a template for prioritizing optimizations when designing software stacks for future large systems.
  • Resilience strategies of this form may become more critical as hardware fault rates rise with increasing component counts (see the sketch after this list).
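
As a concrete rendering of the hybrid idea, the sketch below pairs cheap point-to-point heartbeats between ring neighbors on every step with an expensive global agreement only every few steps. The paper names only the P2P-plus-collective shape; the ring protocol, intervals, and health flags here are illustrative assumptions.

    // Hedged sketch of a hybrid P2P/collective health check: each rank
    // exchanges a heartbeat with its ring neighbors every step (cheap, local),
    // and all ranks agree globally via a collective only occasionally.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int kSteps = 10, kGlobalEvery = 5;
        const int right = (rank + 1) % size, left = (rank + size - 1) % size;

        for (int step = 0; step < kSteps; ++step) {
            // P2P phase: send a heartbeat right, receive one from the left.
            int ok = 1, neighbor_ok = 0;
            MPI_Sendrecv(&ok, 1, MPI_INT, right, 0,
                         &neighbor_ok, 1, MPI_INT, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // Collective phase, only every kGlobalEvery steps: global AND.
            if (step % kGlobalEvery == kGlobalEvery - 1) {
                int all_ok = 0;
                MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
                if (rank == 0)
                    std::printf("step %d: global health = %d\n", step, all_ok);
            }
        }
        MPI_Finalize();
    }

The tradeoff this encodes is the one the bullet points at: P2P checks scale with neighbor count while collectives scale with system size, so reserving the collective for rare global agreement keeps overhead bounded as node counts grow.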

Load-bearing premise

The observed choices generalize beyond this specific deployment to other tightly coupled heterogeneous systems at extreme scale.

What would settle it

Measure whether removing locality-aware mapping or the hybrid resilience strategy on a comparable large heterogeneous system causes performance to fall below linear scaling expectations or to encounter more frequent synchronization stalls.
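
To pin down what "linear scaling expectations" means here, a back-of-envelope check using only the rates quoted in this review: per-node throughput at 5,439 nodes, extrapolated linearly to 9,234 nodes, predicts about 0.99 EF/s, so the measured 1.01 EF/s corresponds to slightly over 100% scaling efficiency across the two campaigns.

    // Scaling-efficiency check from the figures reported in this review
    // (0.585 EF/s on 5,439 nodes; 1.01 EF/s on 9,234 nodes).
    #include <cstdio>

    int main() {
        const double ef1 = 0.585, nodes1 = 5439;  // earlier campaign
        const double ef2 = 1.01,  nodes2 = 9234;  // ~1 EF/s campaign
        const double per_node1 = ef1 / nodes1;    // EF/s per node
        const double per_node2 = ef2 / nodes2;
        std::printf("linear expectation at %.0f nodes: %.3f EF/s\n",
                    nodes2, per_node1 * nodes2);
        std::printf("measured: %.2f EF/s -> efficiency %.1f%%\n",
                    ef2, 100.0 * per_node2 / per_node1);
    }

Any ablation that pushed the larger configuration visibly below the linear expectation computed this way would be evidence for the causal role of the removed practice.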

Figures

Figures reproduced from arXiv: 2604.09517 by Aditya Nishtala, Anthony-Trung Nguyen, Huda Ibeid, Kalyan Kumaran, Kazushige Goto, Servesh Muralidharan.

Figure 1
Figure 1: HPL cost model. Early iterations are dominated by compute-bound… view at source ↗
Figure 2
Figure 2: Aurora Exascale Compute Blade (ECB). view at source ↗
Figure 3
Figure 3: Aurora’s 1-D Dragonfly interconnect topology. view at source ↗
Figure 5
Figure 5: Phase breakdown of the ∼1 EF/s HPL run on 9,234 nodes. GEMM (purple) dominates in the early stages, while SWAP (green) becomes dominant as the trailing submatrix shrinks. The system transitions from compute-bound to communication-bound execution when GEMM time falls below SWAP, which aligns with the decline in the estimated performance line (dashed orange). view at source ↗
Figure 6
Figure 6: HPL-MxP performance on Aurora at 9,500 nodes. The curve shows… view at source ↗
read the original abstract

Sustaining exascale performance in production requires engineering choices and operational practices that emerge only under real deployment constraints and demand coordination across system layers. This paper reports experience from three successive campaigns running HPL and HPL-MxP on Aurora, an Intel-based exascale system featuring the first large-scale deployment of Intel discrete GPUs, CPU-attached network interfaces, and the largest production Slingshot-11 interconnect. Aurora progressed from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes in FP64 HPL, while HPL-MxP reached 11.64 EF/s, an 11.5x speedup over FP64 enabled by mixed-precision arithmetic and Intel AMX acceleration. We identify and classify by role at production scale the system-level choices that sustained these results, including deterministic locality-aware resource mapping, explicit CPU-GPU pipelining, mixed-precision orchestration, and a hybrid P2P/collective resilience strategy introduced after synchronization stalls at scale. While some observations are Aurora-specific, the broader lessons are likely to apply to tightly coupled heterogeneous systems at extreme scale.
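
The mechanism behind the 11.5x figure is mixed-precision iterative refinement: factor the matrix once in low precision, then recover full accuracy with cheap correction solves driven by double-precision residuals. Below is a minimal numerical sketch of that idea on a tiny well-conditioned system; HPL-MxP's actual algorithm (reduced-precision factorization, AMX paths, distributed solves) is far more involved.

    // Mixed-precision iterative refinement in miniature: LU-factor A in
    // float, then refine x with residuals computed in double. The expensive
    // O(n^3) factorization happens once, in low precision; each refinement
    // step costs only an O(n^2) residual and an O(n^2) triangular solve.
    #include <cmath>
    #include <cstdio>

    constexpr int N = 3;

    // In-place LU factorization with partial pivoting, in float.
    void lu(float A[N][N], int piv[N]) {
        for (int k = 0; k < N; ++k) {
            int p = k;
            for (int i = k + 1; i < N; ++i)
                if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
            piv[k] = p;
            for (int j = 0; j < N; ++j)
                { float t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
            for (int i = k + 1; i < N; ++i) {
                A[i][k] /= A[k][k];
                for (int j = k + 1; j < N; ++j) A[i][j] -= A[i][k] * A[k][j];
            }
        }
    }

    // Solve LUx = b in float, applying the recorded row swaps to b first.
    void lu_solve(const float A[N][N], const int piv[N], float b[N]) {
        for (int k = 0; k < N; ++k)
            { float t = b[k]; b[k] = b[piv[k]]; b[piv[k]] = t; }
        for (int i = 1; i < N; ++i)                 // forward, unit lower
            for (int j = 0; j < i; ++j) b[i] -= A[i][j] * b[j];
        for (int i = N - 1; i >= 0; --i) {          // backward, upper
            for (int j = i + 1; j < N; ++j) b[i] -= A[i][j] * b[j];
            b[i] /= A[i][i];
        }
    }

    int main() {
        const double A[N][N] = {{4, 1, 2}, {1, 5, 1}, {2, 1, 6}};
        const double b[N] = {7, 7, 9};     // exact solution x = (1, 1, 1)
        float Af[N][N]; int piv[N];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) Af[i][j] = float(A[i][j]);
        lu(Af, piv);                       // one low-precision factorization

        double x[N] = {0, 0, 0};
        for (int it = 0; it < 5; ++it) {   // refinement loop
            float r[N];
            for (int i = 0; i < N; ++i) {  // residual in DOUBLE precision
                double ri = b[i];
                for (int j = 0; j < N; ++j) ri -= A[i][j] * x[j];
                r[i] = float(ri);
            }
            lu_solve(Af, piv, r);          // correction solve in FLOAT
            for (int i = 0; i < N; ++i) x[i] += r[i];
        }
        std::printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
    }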

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports empirical results from three campaigns running HPL and HPL-MxP on the Aurora exascale system (first large-scale Intel discrete GPU deployment with Slingshot-11 interconnect). It documents FP64 HPL scaling from 0.585 EF/s on 5,439 nodes to 1.01 EF/s on 9,234 nodes and an 11.64 EF/s HPL-MxP result (11.5x over FP64 via mixed-precision and AMX). The central contribution is a post-hoc classification of production-scale choices—deterministic locality-aware mapping, CPU-GPU pipelining, mixed-precision orchestration, and hybrid P2P/collective resilience—by their role in sustaining these results, with a claim that broader lessons apply to tightly coupled heterogeneous systems.

Significance. If the attributions are substantiated, the work supplies concrete, production-derived guidance on cross-layer coordination for exascale heterogeneous computing, including quantified speedups and scaling behavior on a first-of-kind platform. The explicit node counts, achieved rates, and role-based classification of practices constitute a useful reference point for operators of similar systems, even if some observations remain Aurora-specific.

major comments (2)
  1. [Abstract / results classification] Abstract and results sections: the claim that the enumerated choices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) 'sustained' the reported scaling and 11.5x speedup is not supported by isolation experiments or before/after metrics on identical node counts. The progression from 5,439 to 9,234 nodes and the mixed-precision gain could be explained by hardware scaling, interconnect properties, or unlisted factors; without controlled comparisons the causal link remains correlational and undermines both internal attribution and the transferability assertion.
  2. [Campaigns and choices sections] Methodology description (campaigns section): insufficient detail is provided on how each choice was implemented, measured, or varied across the three campaigns, including absence of error bars, run-to-run variability, or explicit controls for confounding variables such as job placement policies or network contention. This limits independent verification that the listed practices, rather than Aurora-specific hardware (discrete GPUs, CPU-attached NICs), produced the gains.
minor comments (2)
  1. [Abstract] The abstract states 'three successive campaigns' but does not tabulate the exact node counts, software versions, or configuration differences between campaigns, which would improve traceability of the scaling steps.
  2. [Results presentation] Notation for performance units (EF/s) and speedups is clear, but the paper would benefit from an explicit table summarizing all reported rates, node counts, and the precise definition of the 11.5x factor (HPL-MxP vs. FP64 HPL on the final configuration).

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments have prompted us to refine the language around attribution and expand methodological details. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / results classification] Abstract and results sections: the claim that the enumerated choices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) 'sustained' the reported scaling and 11.5x speedup is not supported by isolation experiments or before/after metrics on identical node counts. The progression from 5,439 to 9,234 nodes and the mixed-precision gain could be explained by hardware scaling, interconnect properties, or unlisted factors; without controlled comparisons the causal link remains correlational and undermines both internal attribution and the transferability assertion.

    Authors: We agree that the evidence presented is observational and correlational rather than derived from controlled isolation experiments or fixed-node before/after comparisons. Such experiments were not feasible given the production nature of the campaigns, the high cost of exascale allocations, and the incremental way improvements were deployed as bottlenecks were identified through profiling. The attributions rest on in-situ measurements (e.g., synchronization stall detection leading to the hybrid resilience strategy) and the documented progression across campaigns. We have revised the abstract and results sections to use more precise phrasing such as 'contributed to' and 'enabled the observed' performance, while adding a brief discussion of the limitations of causal inference in production settings. The practical lessons remain grounded in the actual deployment experience. revision: yes

  2. Referee: [Campaigns and choices sections] Methodology description (campaigns section): insufficient detail is provided on how each choice was implemented, measured, or varied across the three campaigns, including absence of error bars, run-to-run variability, or explicit controls for confounding variables such as job placement policies or network contention. This limits independent verification that the listed practices, rather than Aurora-specific hardware (discrete GPUs, CPU-attached NICs), produced the gains.

    Authors: We appreciate the call for greater transparency. The revised manuscript expands the Campaigns and Choices sections with concrete implementation details: locality-aware mapping was realized through topology-aware job scheduling that aligned processes to Slingshot-11 fabric links and CPU-attached NICs; CPU-GPU pipelining used explicit asynchronous transfers and kernel overlap measured via oneAPI profiling tools; mixed-precision orchestration coordinated FP64, TF32, and AMX paths with per-kernel timing; and the hybrid resilience strategy combined P2P and collective checkpoints after observing stalls at scale. Where multiple runs were possible, we now report observed variability. Full error bars and exhaustive controls for every confounder (e.g., dynamic network contention or scheduler policies) were not attainable in the production environment; we have added an explicit limitations paragraph acknowledging this constraint while providing the best available data from the three campaigns. revision: partial
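
As a sketch of what the deterministic mapping described above could look like in code: derive a node-local rank from the shared-memory domain and use it to pin each process to one accelerator tile, so the layout is identical on every run. The tile count and assignment rule below are illustrative assumptions, not Aurora's actual scheduler logic.

    // Hedged sketch of deterministic locality-aware mapping: split the world
    // communicator by shared-memory node, then map node-local rank -> tile.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Group ranks that share a node (one shared-memory domain each).
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);
        int local_rank, local_size;
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_size(node_comm, &local_size);

        // Deterministic rank -> accelerator assignment: same layout each run.
        const int kTilesPerNode = 12;  // illustrative assumption
        const int tile = local_rank % kTilesPerNode;
        std::printf("world rank %d -> node-local %d/%d -> tile %d\n",
                    world_rank, local_rank, local_size, tile);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
    }

Determinism matters here because it removes run-to-run placement variance, one of the confounders the referee flags.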

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark report with no derivations or fitted predictions

full rationale

The manuscript is an empirical report of measured HPL and HPL-MxP benchmark results on the Aurora system across three campaigns. It states concrete achieved rates (0.585 EF/s on 5,439 nodes scaling to 1.01 EF/s on 9,234 nodes for FP64 HPL; 11.64 EF/s for HPL-MxP) and classifies observed practices (locality-aware mapping, pipelining, mixed-precision orchestration, hybrid resilience) from those runs. No equations, first-principles derivations, parameter fits, or predictions appear in the provided text or abstract. The central claims are direct measurements and post-hoc classification of practices that coincided with the results; they do not reduce to inputs by construction, self-citation chains, or renaming. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical experience report based on benchmark runs. No mathematical derivations, free parameters, axioms, or invented entities are present; all claims rest on observed execution times and throughput on the deployed system.

pith-pipeline@v0.9.0 · 5531 in / 1163 out tokens · 39675 ms · 2026-05-10T16:45:14.497149+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 21 canonical work pages

  1. [1]

    Aurora: Architecting Argonne’s First Exascale Supercomputer for Science and Discovery,

    B. S. Allen, J. Anchell, V. Anisimov, T. Applencourt, A. Bagusetty, R. Balakrishnan, R. Balin, S. Bekele, C. Bertoni, C. Blackworth, R. Bustamante, K. Canada, J. Carrier, C. Chan-nui, L. C. Cheney, T. Childers, P. Coffman, S. Coghlan, M. D’Mello, M. Emani, K. G. Felker, S. Foreman, O. Franza, L. Gao, M. García, M. Garzarán, B. Gerofi, Y. Ghadar, N. ...

  2. [2]

    [Online]

    Aurora. [Online]. Available: https://www.alcf.anl.gov/aurora

  3. [3]

    In: 2022 IEEE International Solid-State Circuits Conference (ISSCC)

    N. Nassif, A. O. Munch, C. L. Molnar, G. Pasdast, S. V. Iyer, Z. Yang, O. Mendoza, M. Huddart, S. Venkataraman, S. Kandula, R. Marom, A. M. Kern, B. Bowhill, D. R. Mulvihill, S. Nimmagadda, V. Kalidindi, J. Krause, M. M. Haq, R. Sharma, and K. Duda, “Sapphire rapids: The next-generation intel xeon scalable processor,” in 2022 IEEE International Solid-Sta...

  4. [4]

    In: 2022 IEEE International Solid-State Circuits Conference (ISSCC)

    W. Gomes, A. Koker, P. Stover, D. Ingerly, S. Siers, S. Venkataraman, C. Pelto, T. Shah, A. Rao, F. O’Mahony, E. Karl, L. Cheney, I. Rajwani, H. Jain, R. Cortez, A. Chandrasekhar, B. Kanthi, and R. Koduri, “Ponte vecchio: A multi-tile 3d stacked processor for exascale computing,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, ...

  5. [5]

    XeHPC Ponte Vecchio,

    D. Blythe, “XeHPC Ponte Vecchio,” in 2021 IEEE Hot Chips 33 Symposium (HCS). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2021, pp. 1–34. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/HCS52781.2021.9567038

  6. [6]

    HPC Slingshot launched into network space,

    “HPC Slingshot launched into network space,” in Cray User Group 2022 (CUG2022) Proceedings, May 2022. [Online]. Available: https://cug.org/proceedings/cug2022_proceedings/includes/files/pap121s2-file1.pdf

  7. [7]

    Technology-driven, highly-scalable dragonfly topology,

    J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven, highly-scalable dragonfly topology,” in 2008 International Symposium on Computer Architecture, 2008, pp. 77–88. [Online]. Available: https://doi.org/10.1109/ISCA.2008.19

  8. [8]

    Scaling mpi applications on aurora,

    H. Ibeid, A.-T. Nguyen, A. Nishtala, P. Sakarda, L. Kaplan, N. Mahadevan, M. Woodacre, V. Anisimov, K. Kumaran, J. Kwack, V. Morozov, S. Muralidharan, and S. Parker, “Scaling mpi applications on aurora,” 2025. [Online]. Available: https://arxiv.org/abs/2512.04291

  9. [9]

    [Online]

    Intel oneAPI. [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html

  10. [10]

    [Online]

    TOP500 list. [Online]. Available: https://www.top500.org/lists/top500/

  11. [11]

    [Online]

    HPL-MxP. [Online]. Available: https://hpl-mxp.org/results.md

  12. [12]

    Frontier: Exploring Exascale,

    S. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V. Melesse Vergara, T. Beck, M. Brim, R. Budiardja, S. Chandrasekaran, M. Eisenbach, T. Evans, M. Ezell, N. Frontiere, A. Georgiadou, J. Glenski, P. Grete, S. Hamilton, J. Holmen, A. Huebl, D. Jacobson, W. Joubert, K. Mcmahon, E. Merzari, S. Moore, A. Myers, S. Nichols, S. Oral, T. Papatheodore, D. Perez, D....

  13. [13]

    Optimizing high-performance linpack for exascale accelerated architectures,

    N. Chalmers, J. Kurzak, D. Mcdougall, and P. Bauman, “Optimizing high-performance linpack for exascale accelerated architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/...

  14. [14]

    Performance analysis of hpc applications on the aurora supercomputer: Exploring the impact of hbm-enabled intel xeon max cpus,

    H. Ibeid, V. Narayana, J. Kim, A. Nguyen, V. Morozov, and Y. Luo, “Performance analysis of hpc applications on the aurora supercomputer: Exploring the impact of hbm-enabled intel xeon max cpus,” pp. 1–11,

  15. [15]

    [Online]

    [Online]. Available: https://doi.org/10.23919/ISC.2025.11018301

  16. [16]

    [Online]

    Optimizing machine learning (ml) models with intel® advanced matrix extensions (intel® amx). [Online]. Available: https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-12/optimizing-ml-models-with-amx-brief.pdf

  17. [17]

    Insights from optimizing hpl performance on exascale systems: A comparative analysis of panel factorization,

    H. Lu, M. Matheson, N. Chalmers, A. Kashi, N. Malaya, and F. Wang, “Insights from optimizing hpl performance on exascale systems: A comparative analysis of panel factorization,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’25. New York, NY, USA: Association for Computing Machiner...

  18. [18]

    HPL-MxP benchmark: Mixed-precision algorithms, iterative refinement, and scalable data generation,

    J. Dongarra and P. Luszczek, “HPL-MxP benchmark: Mixed-precision algorithms, iterative refinement, and scalable data generation,” The International Journal of High Performance Computing Applications, Sep. 2025. [Online]. Available: http://dx.doi.org/10.1177/10943420251382476

  19. [19]

    Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers,

    A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 2018, pp. 603–613. [Online]. Available: https://doi.org/10.1109/SC.2018.00050

  20. [20]

    Climbing the summit and pushing the frontier of mixed precision benchmarks at extreme scale,

    H. Lu, M. Matheson, V. Oles, A. Ellis, W. Joubert, and F. Wang, “Climbing the summit and pushing the frontier of mixed precision benchmarks at extreme scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15. [Online]. Available: https://doi.org/10.1109/SC41404.2022.00083

  21. [21]

    Prompt Report on Exa-Scale HPL-AI Benchmark,

    S. Kudo, K. Nitadori, T. Ina, and T. Imamura, “Prompt Report on Exa-Scale HPL-AI Benchmark,” in 2020 IEEE International Conference on Cluster Computing (CLUSTER). Los Alamitos, CA, USA: IEEE Computer Society, Sep. 2020, pp. 418–419. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CLUSTER49012.2020.00058

  22. [22]

    The linpack benchmark: past, present and future,

    J. J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003. [Online]. Available: https://doi.org/10.1002/cpe.728

  23. [23]

    HPL - a portable implementation of the high-performance linpack benchmark for distributed-memory computers,

    A. Petitet, R. C. Whaley, J. Dongarra, and J. Cleary, “HPL - a portable implementation of the high-performance linpack benchmark for distributed-memory computers,” University of Tennessee, Tech. Rep. UT-CS-01-448, 2001. [Online]. Available: http://www.netlib.org/benchmark/hpl/

  24. [24]

    Preparing MPICH for exascale,

    Y. Guo, K. Raffenetti, H. Zhou, P. Balaji, M. Si, A. Amer, S. Iwasaki, S. Seo, G. Congiu, R. Latham, L. Oden, T. Gillis, R. Zambre, K. Ouyang, C. Archer, W. Bland, J. Jose, S. Sur, H. Fujita, D. Durnov, M. Chuvelev, G. Zheng, A. Brooks, S. Thapaliya, T. Doodi, M. Garzaran, S. Oyanagi, M. Snir, and R. Thakur, “Preparing MPICH for exascale,” The Internation...

  25. [25]

    [Online]

    Intel(R) oneAPI Math Kernel Library (oneMKL). [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html

  26. [26]

    Why is mpi so slow? analyzing the fundamental limits in implementing mpi-3.1,

    K. Raffenetti, A. Amer, L. Oden, C. Archer, W. Bland, H. Fujita, Y. Guo, T. Janjusic, D. Durnov, M. Blocksome, M. Si, S. Seo, A. Langer, G. Zheng, M. Takagi, P. Coffman, J. Jose, S. Sur, A. Sannikov, S. Oblomov, M. Chuvelev, M. Hatanaka, X. Zhao, P. Fischer, T. Rathnayake, M. Otten, M. Min, and P. Balaji, “Why is mpi so slow? analyzing the fundamental li...

  27. [27]

    [Online]

    [Online]. Available: https://doi.org/10.1145/3126908.3126963

  28. [28]

    High radix collective algorithms,

    T. Doodi, N. Islam, G. Zheng, R. Kalidas, A. Langer, and G. Maria, “High radix collective algorithms,” in Proceedings of EuroMPI, 2021. [Online]. Available: https://doi.org/10.1007/978-3-031-29927-8_31

  29. [29]

    How I learned to stop worrying about user-visible endpoints and love MPI,

    R. Zambre, A. Chandramowliswharan, and P. Balaji, “How I learned to stop worrying about user-visible endpoints and love MPI,” in Proceedings of the 34th ACM International Conference on Supercomputing, ser. ICS ’20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3392717.3392773

  30. [30]

    [Online]

    High-Performance Data Type Engine. [Online]. Available: https://www.yaksa.org

  31. [31]

    [Online]

    Application programming interface for exascale systems. [Online]. Available: https://pmix.github.io/

  32. [32]

    [Online]

    HPE Parallel Application Launch Service (PALS). [Online]. Available: https://support.hpe.com/hpesc/public/docDisplay?docId=a00117940en_us&page=Parallel_Application_Launch_Service_PALS.html&docLocale=en_US

  33. [33]

    Fine-grained automated failure management for extreme-scale gpu accelerated systems,

    Y. Levitt, R. Barella, S. Zeltner, T. Musta, L. Cheney, G. Espinosa, O. Franza, and B. Gerofi, “Fine-grained automated failure management for extreme-scale gpu accelerated systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’25. New York, NY, USA: Association for Computing Mac...

  34. [34]

    GPCNeT: designing a benchmark suite for inducing and measuring contention in hpc networks,

    S. Chunduri, T. Groves, P. Mendygral, B. Austin, J. Balma, K. Kandalla, K. Kumaran, G. Lockwood, S. Parker, S. Warren, N. Wichmann, and N. Wright, “GPCNeT: designing a benchmark suite for inducing and measuring contention in hpc networks,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser....