
arxiv: 2604.06056 · v2 · submitted 2026-04-07 · 💻 cs.DC · cs.AR

Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.DC cs.AR
keywords power attribution · energy measurement · exascale computing · AMD GPUs · sensor characterization · mixed precision · performance profiling · Cray EX systems

The pith

A methodology using square-wave workloads reconstructs power from energy counters for accurate attribution on AMD exascale GPU nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a methodology for fine-grained power and energy attribution on AMD GPU- and APU-based exascale systems. Sensors on these machines differ in update rate, delay, and filtering, which obscures short periods of activity. Using square-wave test workloads, the authors measure these effects and correct for them by reconstructing power from energy counters. The corrected data is then used in a profiling tool to analyze benchmarks. The results show large energy reductions from mixed precision, and the method distinguishes whether the savings come from faster execution or from lower power draw.

Core claim

Modern exascale systems with AMD Instinct GPUs and APUs provide multiple power sensors whose differences in scope, update rate, timing, and filtering complicate attribution of short-lived activity. A methodology using controlled square-wave workloads quantifies these effects across hundreds of devices, enables reconstruction of power from cumulative energy counters for faster response, validates against on-chip and off-chip sensors, and integrates the streams into a Score-P/PAPI tool for phase-level attribution. When applied to linear algebra benchmarks, this separates energy savings due to reduced runtime from changes in power draw, demonstrating that mixed precision reduces node energy by

What carries the argument

Square-wave workload characterization of sensor timing and reconstruction of power from cumulative energy counters

Load-bearing premise

Controlled square-wave workloads sufficiently capture the timing, aliasing, and variability effects that occur in real, irregular HPC application workloads.
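
To make the premise concrete, here is a toy model of how a filtered, slowly updating sensor can hide short activity. It is a sketch under stated assumptions, not the authors' code: the sensor model (a fixed update interval that publishes the mean of true power over a fixed window), the function names, the tick counts, and the 100 W and 500 W levels are all hypothetical.

```python
def true_power(tick, period_ticks, high_ticks, low_w, high_w):
    """Square wave: high_w for the first high_ticks of every period, else low_w."""
    return high_w if tick % period_ticks < high_ticks else low_w


def sensor_readings(n_ticks, period_ticks, high_ticks, low_w, high_w,
                    update_ticks, window_ticks):
    """Hypothetical sensor: every update_ticks it publishes the mean of the
    true power over the preceding window_ticks ticks."""
    truth = [true_power(k, period_ticks, high_ticks, low_w, high_w)
             for k in range(n_ticks)]
    readings = []
    for k in range(window_ticks, n_ticks + 1, update_ticks):
        window = truth[k - window_ticks:k]
        readings.append((k, sum(window) / window_ticks))
    return truth, readings


# Long pulses (100 ticks high, 100 ticks low): the sensor sees both levels.
_, slow = sensor_readings(400, 200, 100, 100.0, 500.0, 10, 10)
# Short pulses (5 ticks high, 5 ticks low): every window spans one full
# period, so the sensor only ever reports the average.
_, fast = sensor_readings(400, 10, 5, 100.0, 500.0, 10, 10)

print(min(r for _, r in slow), max(r for _, r in slow))
print(sorted({r for _, r in fast}))
```

In this toy model the short-pulse workload draws 500 W half the time, yet the sensor never reports more than the average. Whether irregular application phases behave like either square-wave case is exactly what the premise assumes.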

What would settle it

If the reconstructed power from energy counters fails to match high-resolution measurements during the execution of irregular, non-square-wave HPC applications, the method's accuracy for general use would be refuted.

Figures

Figures reproduced from arXiv: 2604.06056 by Adam McDaniel, Aditya Kashi, Ashesh Sharma, Brandon Neth, Bruno Villasenor Alvarez, Michael Jantz, Oscar Hernandez, Shreyas Khandekar, Steve Abbott, Steven Martin, Wael Elwasif.

Figure 1: Three asynchronous stages for sensor production, driver
Figure 2: GPU power sensor behavior during a kernel execution.
Figure 3: Illustration of sampling effects: (a) signal captured
Figure 4: Update-interval and timestamp-delta behavior measured across 128 nodes on Frontier (left column, 512 AMD Instinct
Figure 5: Delay, response, and recovery behavior for AMD Instinct™ MI250X (left) and AMD Instinct™ MI300A (right) under
Figure 6: Median impact of aliasing on power state transition
Figure 7: Frontier (AMD Instinct MI250x) rocHPL and rocHPL-MxP: side-by-side comparison of stacked instantaneous power.
Figure 8: Portage (AMD Instinct MI300A) HPG-MxP full-precision and mixed-precision stacked instantaneous power traces
Figure 9: Different GPU/APUs power and energy sensor on Frontier and Portage nodes, highlighting on-chip, off-chip, and
Figure 10: Time and frequency domain visualizations for two square-wave workloads measured using derived power from
Original abstract

Modern exascale GPU- and APU-based systems provide multiple power and energy sensors, but differences in scope, update rate, timing, and filtering complicate the attribution of short-lived accelerator activity. This paper presents a methodology to characterize and correct these effects on Cray EX systems with AMD Instinct MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, we quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs with on-chip (rocm-smi/amd-smi) and off-chip Cray Power Management sensors. We reconstruct power from cumulative energy counters to achieve faster response times, validate it against on-chip, off-chip, and node-level sensors, and integrate the resulting streams into a Score-P/PAPI-based tool for time-aligned, phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method separates energy savings due to reduced runtime from changes in power. Mixed precision reduces node energy on Frontier by 79% for rocHPL-MxP and 31% for HPG-MxP, with similar trends on Portage. These results provide portable guidance for sensor validation and power-aware optimization on current and future exascale systems.
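
The separation of runtime and power effects rests on the identity that energy equals mean power times runtime, so the energy ratio between two runs factors into a power ratio and a time ratio. The sketch below illustrates that arithmetic. The log-share split is one common convention, chosen here for illustration and not taken from the paper, and the input numbers are invented, not measurements.

```python
import math


def energy_split(p_base_w, t_base_s, p_new_w, t_new_s):
    """Split an energy change between two runs into power and runtime factors.

    Uses E = mean_power * runtime, so E_new/E_base = power_ratio * time_ratio.
    """
    power_ratio = p_new_w / p_base_w
    time_ratio = t_new_s / t_base_s
    energy_ratio = power_ratio * time_ratio
    result = {
        "energy_reduction": 1.0 - energy_ratio,
        "power_ratio": power_ratio,
        "time_ratio": time_ratio,
        "time_share": None,
        "power_share": None,
    }
    if energy_ratio > 0 and energy_ratio != 1.0:
        # Shares of log(energy_ratio); they always sum to 1. A negative
        # share means that factor worked against the overall change.
        log_e = math.log(energy_ratio)
        result["time_share"] = math.log(time_ratio) / log_e
        result["power_share"] = math.log(power_ratio) / log_e
    return result


# Invented example: the new run draws 20% more power but finishes in half
# the time.
print(energy_split(1000.0, 100.0, 1200.0, 50.0))
```

In this invented case the runtime reduction accounts for more than the whole saving, and the higher power draw claws some of it back, which is the kind of distinction the method is meant to expose.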

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a methodology to characterize and correct differences in scope, update rate, timing, and filtering among on-chip (rocm-smi/amd-smi) and off-chip Cray power sensors on AMD MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, the authors quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs, reconstruct instantaneous power from cumulative energy counters, validate the streams against independent sensors, and integrate them into a Score-P/PAPI tool for phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method reports that mixed precision reduces node energy by 79% for rocHPL-MxP and 31% for HPG-MxP on Frontier (with similar trends on Portage) while separating savings due to shorter runtime from changes in average power.
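
As an illustration of what time-aligned, phase-level attribution involves, the sketch below integrates a sampled power trace over one phase interval using the trapezoidal rule, with linear interpolation at the phase boundaries. It is a minimal stand-in, not the Score-P/PAPI tool described in the paper; the function names and the sample trace are hypothetical, and it assumes strictly increasing timestamps.

```python
def interp_power(samples, t):
    """Linearly interpolate power at time t from sorted (time_s, power_w) samples."""
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the sampled range")


def phase_energy(samples, start_s, end_s):
    """Energy in joules over [start_s, end_s] by trapezoidal integration."""
    points = [(start_s, interp_power(samples, start_s))]
    points += [(t, p) for t, p in samples if start_s < t < end_s]
    points.append((end_s, interp_power(samples, end_s)))
    return sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(points, points[1:]))


# Invented trace: 100 W until t=1 s, ramp to 300 W by t=2 s, then flat.
trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 300.0), (3.0, 300.0)]
print(phase_energy(trace, 0.5, 2.5))
```

The attribution is only as good as the alignment between phase timestamps and sensor timestamps, which is why the sensor delay characterization matters for short phases.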

Significance. If the central claims hold, the work supplies portable, sensor-validated guidance for power-aware optimization on current and future exascale GPU/APU systems. The large-scale controlled experiments (hundreds of devices) and cross-validation against on-chip, off-chip, and node-level sensors constitute a concrete strength; the integration into an existing performance tool (Score-P/PAPI) increases practical utility for HPC practitioners.

major comments (2)
  1. [§5] §5 (Application to rocHPL/HPG-MxP workloads): The separation of energy savings into runtime reduction versus power reduction for the reported 79% (rocHPL-MxP) and 31% (HPG-MxP) node-energy figures rests on the assumption that the square-wave-derived reconstruction accurately captures aliasing, delay, and variability during the irregular, bursty matrix-multiplication phases of real HPL. No direct comparison of reconstructed versus measured power traces is shown for the actual application phases, leaving open the possibility that residual filtering mismatch distorts the attributed split.
  2. [§4.2] §4.2 (Validation of reconstructed power): While the paper validates the reconstructed power against independent sensors on square-wave workloads, the error bounds and phase-dependent residuals are not quantified for the non-periodic, high-variability patterns present in HPL. This weakens the claim that the method cleanly separates runtime and power contributions in the mixed-precision results.
minor comments (2)
  1. [Figures 3,5] Figure 3 and Figure 5: axis labels and legend entries are too small to read at print size; consider increasing font size or splitting into multiple panels.
  2. [§4] The description of the Score-P/PAPI integration (end of §4) does not specify the exact PAPI event names or the time-alignment window size used, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the scale of the experiments and the practical value of the Score-P/PAPI integration. We address each major comment below and will incorporate additional validation material in the revised manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (Application to rocHPL/HPG-MxP workloads): The separation of energy savings into runtime reduction versus power reduction for the reported 79% (rocHPL-MxP) and 31% (HPG-MxP) node-energy figures rests on the assumption that the square-wave-derived reconstruction accurately captures aliasing, delay, and variability during the irregular, bursty matrix-multiplication phases of real HPL. No direct comparison of reconstructed versus measured power traces is shown for the actual application phases, leaving open the possibility that residual filtering mismatch distorts the attributed split.

    Authors: We agree that the current manuscript does not include direct side-by-side comparisons of reconstructed versus measured power traces for the actual rocHPL and HPG-MxP execution phases. The square-wave workloads were designed to reproduce the high-variability, bursty behavior characteristic of matrix-multiplication kernels, and the reconstruction itself operates on cumulative energy counters, which yields instantaneous power independently of workload periodicity. To strengthen the separation claim, we will add representative power traces from HPL phases together with quantitative error metrics (e.g., RMS and phase-wise residuals) in the revised §5 and a new appendix. revision: yes

  2. Referee: [§4.2] §4.2 (Validation of reconstructed power): While the paper validates the reconstructed power against independent sensors on square-wave workloads, the error bounds and phase-dependent residuals are not quantified for the non-periodic, high-variability patterns present in HPL. This weakens the claim that the method cleanly separates runtime and power contributions in the mixed-precision results.

    Authors: The validation in §4.2 quantifies sensor characteristics and reconstruction fidelity under controlled, repeatable conditions across up to 512 GPUs and 480 APUs. The application results rely on time-aligned phase attribution via Score-P. We concur that explicit error bounds for non-periodic HPL patterns are not reported. In the revision we will augment §4.2 (or add an appendix) with reconstruction-error analysis on sampled non-periodic segments extracted from the HPL runs, including mean-absolute and phase-specific residual statistics. revision: yes
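
The residual statistics promised in these responses could be computed along the following lines. This is a hedged sketch of mean-absolute and RMS residuals, overall and per phase, for two time-aligned traces. The function name, the phase labels, and the sample values are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict


def residual_stats(reconstructed_w, reference_w, phases):
    """Return {phase: (mean_abs_error, rms_error)} plus an "all" entry.

    The three sequences must be time-aligned and of equal length.
    """
    if not (len(reconstructed_w) == len(reference_w) == len(phases)):
        raise ValueError("inputs must have equal length")
    residuals = defaultdict(list)
    for rec, ref, phase in zip(reconstructed_w, reference_w, phases):
        residuals[phase].append(rec - ref)
        residuals["all"].append(rec - ref)
    return {
        phase: (
            sum(abs(r) for r in vals) / len(vals),
            math.sqrt(sum(r * r for r in vals) / len(vals)),
        )
        for phase, vals in residuals.items()
    }


# Invented samples with two hypothetical phase labels.
stats = residual_stats(
    [100.0, 110.0, 300.0, 290.0],
    [100.0, 100.0, 300.0, 300.0],
    ["idle", "idle", "gemm", "gemm"],
)
print(stats)
```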

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper's central claims rest on empirical sensor characterization via square-wave workloads, followed by reconstruction of power streams from cumulative counters and cross-validation against independent on-chip, off-chip, and node-level sensors. These steps produce time-aligned attribution for rocHPL variants without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. The separation of runtime versus power contributions follows directly from the validated measurements rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that square-wave workloads faithfully expose sensor timing artifacts without introducing new biases; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Cumulative energy counters can be differentiated to reconstruct instantaneous power with improved temporal resolution.
    Invoked to achieve faster response times than native power sensors.
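
A minimal sketch of this axiom, assuming samples arrive as (timestamp in seconds, cumulative energy in joules) pairs: dividing the energy difference by the time difference gives the average power over each interval. The function name, the units, and the sample values are assumptions for illustration. Real counters may use other units and may wrap around, which this sketch does not handle.

```python
from typing import List, Tuple


def power_from_energy(
    samples: List[Tuple[float, float]]
) -> List[Tuple[float, float]]:
    """Differentiate a cumulative energy counter by finite differences.

    samples: (timestamp_s, cumulative_energy_j) pairs in time order.
    Returns (interval_midpoint_s, average_power_w) for each interval in
    which the counter value changed.
    """
    out = []
    prev_t, prev_e = samples[0]
    for t, e in samples[1:]:
        if e == prev_e or t <= prev_t:
            # Polled faster than the counter updates (or a repeated
            # timestamp): keep the earlier anchor and wait for a change.
            continue
        out.append(((prev_t + t) / 2.0, (e - prev_e) / (t - prev_t)))
        prev_t, prev_e = t, e
    return out


# Invented samples: the counter is polled every 1 ms but updates every 2 ms.
polled = [(0.0, 0.0), (0.001, 0.0), (0.002, 0.2), (0.003, 0.2), (0.004, 0.8)]
for mid_t, watts in power_from_energy(polled):
    print(f"t={mid_t:.3f} s  P={watts:.1f} W")
```

Each value is an average over one counter-update interval, so the temporal resolution is bounded by the counter's update rate and by the accuracy of the timestamps paired with it.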


