Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes
Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3
The pith
A methodology using square-wave workloads reconstructs power from energy counters for accurate attribution on AMD exascale GPU nodes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern exascale systems with AMD Instinct GPUs and APUs provide multiple power sensors whose differences in scope, update rate, timing, and filtering complicate attribution of short-lived activity. A methodology using controlled square-wave workloads quantifies these effects across hundreds of devices, enables reconstruction of power from cumulative energy counters for faster response, validates against on-chip and off-chip sensors, and integrates the streams into a Score-P/PAPI tool for phase-level attribution. When applied to linear algebra benchmarks, this separates energy savings due to reduced runtime from changes in power draw, demonstrating that mixed precision reduces node energy by
What carries the argument
Square-wave workload characterization of sensor timing and reconstruction of power from cumulative energy counters
Load-bearing premise
Controlled square-wave workloads sufficiently capture the timing, aliasing, and variability effects that occur in real, irregular HPC application workloads.
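To make the premise concrete, the following minimal Python sketch models a square-wave workload and an idealized sensor that reports the true power only once per update interval, then counts how many load transitions survive sampling. It is an illustration only: the function names, the 100 ms period, and the power levels are assumptions, not values from the paper, and real sensors add delay and filtering that this toy model omits.

```python
def square_wave_power(t_ms, period_ms=100, low_w=100.0, high_w=500.0):
    """Ideal load: high power in the first half of each period, low in the second."""
    return high_w if (t_ms % period_ms) < period_ms // 2 else low_w


def sample_sensor(t_end_ms, update_ms, period_ms=100):
    """Idealized sensor that reports the true power once every update_ms."""
    return [(t, square_wave_power(t, period_ms)) for t in range(0, t_end_ms, update_ms)]


def count_transitions(samples, threshold_w=300.0):
    """Number of low/high switches visible in the sampled trace."""
    states = [p > threshold_w for _, p in samples]
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Compare a fast sensor with two sensors slower than the half-period (50 ms).
for update_ms in (1, 90, 100):
    seen = count_transitions(sample_sensor(1000, update_ms))
    print(f"update every {update_ms:3d} ms: {seen} transitions seen")
```

With a 1 ms update the sampled trace shows all 19 switches that occur in one second. At 90 ms only 3 appear, so the load looks far slower than it is, and at 100 ms the sensor reports a constant high value. This is the aliasing behavior that irregular application workloads would also need to be checked against.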
What would settle it
If the reconstructed power from energy counters fails to match high-resolution measurements during the execution of irregular, non-square-wave HPC applications, the method's accuracy for general use would be refuted.
Original abstract
Modern exascale GPU- and APU-based systems provide multiple power and energy sensors, but differences in scope, update rate, timing, and filtering complicate the attribution of short-lived accelerator activity. This paper presents a methodology to characterize and correct these effects on Cray EX systems with AMD Instinct MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, we quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs with on-chip (rocm-smi/amd-smi) and off-chip Cray Power Management sensors. We reconstruct power from cumulative energy counters to achieve faster response times, validate it against on-chip, off-chip, and node-level sensors, and integrate the resulting streams into a Score-P/PAPI-based tool for time-aligned, phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method separates energy savings due to reduced runtime from changes in power. Mixed precision reduces node energy on Frontier by 79% for rocHPL-MxP and 31% for HPG-MxP, with similar trends on Portage. These results provide portable guidance for sensor validation and power-aware optimization on current and future exascale systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a methodology to characterize and correct differences in scope, update rate, timing, and filtering among on-chip (rocm-smi/amd-smi) and off-chip Cray power sensors on AMD MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, the authors quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs, reconstruct instantaneous power from cumulative energy counters, validate the streams against independent sensors, and integrate them into a Score-P/PAPI tool for phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method reports that mixed precision reduces node energy by 79% for rocHPL-MxP and 31% for HPG-MxP on Frontier (with similar trends on Portage) while separating savings due to shorter runtime from changes in average power.
Significance. If the central claims hold, the work supplies portable, sensor-validated guidance for power-aware optimization on current and future exascale GPU/APU systems. The large-scale controlled experiments (hundreds of devices) and cross-validation against on-chip, off-chip, and node-level sensors constitute a concrete strength; the integration into an existing performance tool (Score-P/PAPI) increases practical utility for HPC practitioners.
major comments (2)
- [§5] §5 (Application to rocHPL/HPG-MxP workloads): The separation of energy savings into runtime reduction versus power reduction for the reported 79% (rocHPL-MxP) and 31% (HPG-MxP) node-energy figures rests on the assumption that the square-wave-derived reconstruction accurately captures aliasing, delay, and variability during the irregular, bursty matrix-multiplication phases of real HPL. No direct comparison of reconstructed versus measured power traces is shown for the actual application phases, leaving open the possibility that residual filtering mismatch distorts the attributed split.
- [§4.2] §4.2 (Validation of reconstructed power): While the paper validates the reconstructed power against independent sensors on square-wave workloads, the error bounds and phase-dependent residuals are not quantified for the non-periodic, high-variability patterns present in HPL. This weakens the claim that the method cleanly separates runtime and power contributions in the mixed-precision results.
minor comments (2)
- [Figures 3,5] Figure 3 and Figure 5: axis labels and legend entries are too small to read at print size; consider increasing font size or splitting into multiple panels.
- [§4] The description of the Score-P/PAPI integration (end of §4) does not specify the exact PAPI event names or the time-alignment window size used, which would aid reproducibility.
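As an illustration of what a time-alignment step and phase-level attribution could look like, the sketch below shifts a power trace by an assumed sensor delay and integrates it between phase boundaries with the trapezoidal rule. It does not use Score-P or PAPI; the function names, the delay value, and the sample data are hypothetical, and the paper's actual event names and window size remain unspecified.

```python
def align(samples, delay_s):
    """Shift (time_s, power_w) samples earlier by an assumed sensor delay."""
    return [(t - delay_s, p) for t, p in samples]


def interp(samples, t):
    """Linearly interpolate power (W) at time t from time-sorted samples."""
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the sampled range")


def phase_energy(samples, start_s, end_s):
    """Trapezoidal energy (J) of the trace between two phase boundaries."""
    pts = [(start_s, interp(samples, start_s))]
    pts += [(t, p) for t, p in samples if start_s < t < end_s]
    pts.append((end_s, interp(samples, end_s)))
    return sum((tb - ta) * (pa + pb) / 2 for (ta, pa), (tb, pb) in zip(pts, pts[1:]))


# Made-up trace: 100 W until t = 1 s, ramp to 300 W at t = 2 s, then flat.
trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 300.0), (3.0, 300.0)]
print(phase_energy(trace, 0.5, 2.5))              # energy of a phase spanning the ramp
print(phase_energy(align(trace, 0.5), 0.5, 1.5))  # same trace after a 0.5 s delay correction
```

Interpolating at the phase boundaries avoids attributing a whole sensor interval to whichever phase happens to contain the sample timestamp, which matters when phases are short relative to the update interval.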
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the scale of the experiments and the practical value of the Score-P/PAPI integration. We address each major comment below and will incorporate additional validation material in the revised manuscript.
- Referee: [§5] §5 (Application to rocHPL/HPG-MxP workloads): The separation of energy savings into runtime reduction versus power reduction for the reported 79% (rocHPL-MxP) and 31% (HPG-MxP) node-energy figures rests on the assumption that the square-wave-derived reconstruction accurately captures aliasing, delay, and variability during the irregular, bursty matrix-multiplication phases of real HPL. No direct comparison of reconstructed versus measured power traces is shown for the actual application phases, leaving open the possibility that residual filtering mismatch distorts the attributed split.
Authors: We agree that the current manuscript does not include direct side-by-side comparisons of reconstructed versus measured power traces for the actual rocHPL and HPG-MxP execution phases. The square-wave workloads were designed to reproduce the high-variability, bursty behavior characteristic of matrix-multiplication kernels, and the reconstruction itself operates on cumulative energy counters, which yields instantaneous power independently of workload periodicity. To strengthen the separation claim, we will add representative power traces from HPL phases together with quantitative error metrics (e.g., RMS and phase-wise residuals) in the revised §5 and a new appendix.
Revision: yes
- Referee: [§4.2] §4.2 (Validation of reconstructed power): While the paper validates the reconstructed power against independent sensors on square-wave workloads, the error bounds and phase-dependent residuals are not quantified for the non-periodic, high-variability patterns present in HPL. This weakens the claim that the method cleanly separates runtime and power contributions in the mixed-precision results.
Authors: The validation in §4.2 quantifies sensor characteristics and reconstruction fidelity under controlled, repeatable conditions across up to 512 GPUs and 480 APUs. The application results rely on time-aligned phase attribution via Score-P. We concur that explicit error bounds for non-periodic HPL patterns are not reported. In the revision we will augment §4.2 (or add an appendix) with reconstruction-error analysis on sampled non-periodic segments extracted from the HPL runs, including mean-absolute and phase-specific residual statistics.
Revision: yes
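The residual statistics promised in the rebuttal (mean-absolute, RMS, and phase-wise residuals) can be sketched as follows. This is a hedged illustration under stated assumptions: both traces are assumed to be already resampled onto the same timestamps, and the function names, phase names, and numbers are hypothetical rather than taken from the paper.

```python
import math


def residual_stats(reconstructed_w, reference_w):
    """Return (mean absolute error, RMS error) in watts for two equal-length traces."""
    if len(reconstructed_w) != len(reference_w) or not reference_w:
        raise ValueError("traces must be non-empty and of equal length")
    res = [r - m for r, m in zip(reconstructed_w, reference_w)]
    mae = sum(abs(x) for x in res) / len(res)
    rms = math.sqrt(sum(x * x for x in res) / len(res))
    return mae, rms


def phase_residuals(times_s, reconstructed_w, reference_w, phases):
    """Compute residual_stats separately for each phase, given {name: (start_s, end_s)}."""
    out = {}
    for name, (start, end) in phases.items():
        idx = [i for i, t in enumerate(times_s) if start <= t < end]
        out[name] = residual_stats([reconstructed_w[i] for i in idx],
                                   [reference_w[i] for i in idx])
    return out


# Made-up traces on a shared time base, split into two hypothetical phases.
times = [0.0, 1.0, 2.0, 3.0]
recon = [100.0, 200.0, 300.0, 400.0]
ref = [110.0, 190.0, 300.0, 420.0]
print(residual_stats(recon, ref))
print(phase_residuals(times, recon, ref, {"factorize": (0.0, 2.0), "update": (2.0, 4.0)}))
```

Reporting the statistics per phase rather than only over the whole run would expose the case the referee is concerned about, where a small global error hides a large mismatch in short, bursty phases.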
Circularity Check
No circularity in derivation chain
The paper's central claims rest on empirical sensor characterization via square-wave workloads, followed by reconstruction of power streams from cumulative counters and cross-validation against independent on-chip, off-chip, and node-level sensors. These steps produce time-aligned attribution for rocHPL variants without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. The separation of runtime versus power contributions follows directly from the validated measurements rather than reducing to the inputs by construction.
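One common way to separate runtime from power contributions, not necessarily the one the paper uses, starts from energy = average power × runtime and splits the logarithm of the energy ratio into a runtime term and a power term. The sketch below assumes that formulation; the function name and the input numbers are illustrative and are not measurements from the paper.

```python
import math


def split_energy_change(t_base_s, p_base_w, t_new_s, p_new_w):
    """Split an energy change into runtime and power shares via log ratios.

    Uses E = P_avg * T, so ln(E_new/E_base) = ln(T_new/T_base) + ln(P_new/P_base).
    """
    e_ratio = (t_new_s * p_new_w) / (t_base_s * p_base_w)
    log_e = math.log(e_ratio)
    if log_e == 0.0:
        raise ValueError("energy is unchanged; shares are undefined")
    return {
        "energy_reduction": 1.0 - e_ratio,
        "runtime_share": math.log(t_new_s / t_base_s) / log_e,
        "power_share": math.log(p_new_w / p_base_w) / log_e,
    }


# Hypothetical run pair: 4x shorter runtime at 20% lower average power.
print(split_energy_change(t_base_s=100.0, p_base_w=2000.0, t_new_s=25.0, p_new_w=1600.0))
```

The two shares always sum to one, which makes the split easy to report. The result depends directly on the average power values, so any residual sensor bias in those averages carries over into the attributed shares.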
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Cumulative energy counters can be differentiated to reconstruct instantaneous power with improved temporal resolution.
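The assumption above amounts to a finite difference, P ≈ ΔE/Δt, between consecutive counter readings. A minimal sketch follows, assuming the readings have already been converted to joules and seconds. The function name and the optional `counter_max_j` wraparound handling are hypothetical and do not describe the paper's implementation or any rocm-smi/amd-smi interface.

```python
def power_from_energy(samples, counter_max_j=None):
    """Reconstruct power from time-sorted (time_s, energy_j) readings of a cumulative counter.

    Returns (midpoint_time_s, average_power_w) for each pair of consecutive readings.
    """
    out = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        de = e1 - e0
        if de < 0 and counter_max_j is not None:
            de += counter_max_j  # assume a single wraparound of the counter
        out.append(((t0 + t1) / 2, de / dt))
    return out


# Made-up readings: 50 J in the first half second, 200 J in the second.
print(power_from_energy([(0.0, 0.0), (0.5, 50.0), (1.0, 250.0)]))
```

Each value is the average power over one sampling interval rather than a truly instantaneous reading, so the achievable time resolution is bounded by how often the counter itself is updated and read.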
APPENDIX
The Cray Power Management (PM) subsystem ...
- APU-only power using rocm-smi energy counters,
- per-APU PM counters for all four APUs, and
- the node-level PM total.
Across the 50 sampled EX255a nodes, the PM measurements for APU 0 and APU 2 were consistently higher than the corresponding APU-only rocm-smi values by approximately 30±2 W per sawtooth card. APU 1 and APU 3 did not exhibit this offset. To correct for this static NIC power, we adjust the PM-based accelerator readings as follows: ac...