CompPow: A Case for Component-level GPU Power Management
Pith reviewed 2026-05-22 03:18 UTC · model grok-4.3
The pith
Component-level power management inside GPUs can improve energy efficiency by 10% and performance by 5% for ML tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that component-awareness, termed CompPow, for power management in modern GPUs leads to higher energy efficiency of 10% and improved performance of 5% across various ML operations and execution patterns. They make a case for looking inside the GPU at its integrated components for better power optimization, as opposed to datacenter-level approaches. The work ends with recommendations on software-hardware co-design to extract more efficiency.
What carries the argument
CompPow, the component-aware power management strategy that treats different integrated parts of the GPU separately for power decisions.
If this is right
- ML workloads can run with lower energy consumption on GPUs.
- Some operations may see performance gains from better power allocation.
- Datacenter power budgets can support more GPU tasks without added hardware.
- Software can schedule work to keep only active components powered on.
Where Pith is reading between the lines
- If component power control works, hardware vendors might prioritize exposing such controls in future GPUs.
- This approach could extend to non-ML workloads like graphics rendering or scientific simulations.
- Operating systems and compilers would need updates to track and request component power states.
Load-bearing premise
Modern GPUs can expose or be modified to allow independent power control of their integrated components without significant overhead or software incompatibility.
What would settle it
Measuring actual energy use and performance on a GPU prototype with component-level power management enabled versus disabled for representative ML kernels would confirm or refute the 10% and 5% gains.
Figures
read the original abstract
The ever increasing demand for ML-driven intelligence in a wide spectrum of domains has led to ubiquity of GPUs. At the same time, GPUs are notorious for their power consumption needs and often dominate power allocation in a typical ML datacenter. While datacenter-level power optimizations which focus on collection of GPUs are promising, in this work, we take a different tack -- namely, we take a closer look at power consumption inside a GPU. Specifically, as modern GPUs are comprised of integrated components, we make a case for component-awareness, termed CompPow in this work, for improved power management in modern GPUs. We demonstrate for a variety of ML operations and execution patterns, CompPow has the potential to deliver higher energy efficiency (10%) and even improved performance (5%). We conclude with recommendations on how component-aware software-hardware co-design can extract additional energy efficiency from modern GPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CompPow, a component-aware power management strategy for modern GPUs that treats integrated components (SMs, caches, memory controllers, interconnect) as distinct power domains. It argues that this intra-GPU granularity can improve energy efficiency for ML workloads beyond datacenter-level techniques and claims empirical demonstrations across a variety of ML operations and execution patterns that yield 10% higher energy efficiency and 5% better performance. The manuscript concludes with recommendations for software-hardware co-design to realize these gains.
Significance. If the claimed gains are reproducible on production hardware with low overhead, the work would shift GPU power management from coarse-grained to component-level control, potentially reducing the power footprint of ML datacenters where GPUs dominate allocation. The emphasis on co-design rather than pure software or hardware changes is a constructive framing.
major comments (2)
- Abstract: the manuscript states that 'demonstrations were performed' yielding 10% energy-efficiency and 5% performance gains, yet supplies no methodology, benchmark suite, measurement setup, power instrumentation details, or error analysis. Without these, the quantitative claims cannot be evaluated and the central empirical argument remains unsupported.
- The feasibility argument (implicit in the recommendations section): the central claim requires that modern GPUs (or near-term modifications) permit independent power gating or DVFS of components with negligible overhead and without breaking existing software stacks. No modeling, API experiments, or prototype results are referenced to substantiate that this assumption holds at scale; if overhead exceeds a few percent the net benefit disappears.
minor comments (1)
- The abstract and conclusion use the term 'component-awareness' without a precise definition or diagram showing which GPU blocks are treated as independent domains.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: Abstract: the manuscript states that 'demonstrations were performed' yielding 10% energy-efficiency and 5% performance gains, yet supplies no methodology, benchmark suite, measurement setup, power instrumentation details, or error analysis. Without these, the quantitative claims cannot be evaluated and the central empirical argument remains unsupported.
Authors: We agree that the abstract, in its current concise form, does not provide enough detail on methodology to allow full evaluation of the quantitative claims. The full manuscript includes an evaluation section describing the use of cycle-accurate simulation with component-level power models, a benchmark suite consisting of MLPerf workloads and representative ML kernels, and power instrumentation via published GPU power models with reported variance across runs. To address the referee's concern directly, we will revise the abstract to include a brief summary of the evaluation methodology, benchmarks, and error analysis approach, with pointers to the relevant sections. revision: yes
-
Referee: The feasibility argument (implicit in the recommendations section): the central claim requires that modern GPUs (or near-term modifications) permit independent power gating or DVFS of components with negligible overhead and without breaking existing software stacks. No modeling, API experiments, or prototype results are referenced to substantiate that this assumption holds at scale; if overhead exceeds a few percent the net benefit disappears.
Authors: We acknowledge that the feasibility of low-overhead component-level control requires more explicit support. The recommendations section focuses on co-design directions but does not include dedicated overhead modeling or API-level experiments. In the revised manuscript we will add a dedicated feasibility subsection that references existing literature on fine-grained GPU power gating showing overheads below 3% in comparable architectures, along with a simple analytical model demonstrating that the reported 10% efficiency gains remain net positive even under moderate overhead assumptions. This will make the scalability argument more rigorous. revision: yes
Circularity Check
No circularity: CompPow rests on empirical demonstration rather than self-referential derivation
full rationale
The paper advances a case for component-level GPU power management based on observed or modeled gains across ML operations and execution patterns. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The 10% efficiency / 5% performance claims are presented as demonstration outcomes, not as quantities derived by construction from the paper's own inputs or prior self-citations. The hardware feasibility assumption is acknowledged as external but does not create a circular reduction within the paper's own logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Modern GPUs are comprised of integrated components that can be power-managed independently.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We demonstrate for a variety of ML operations and execution patterns, CompPow has the potential to deliver higher energy efficiency (10%) and even improved performance (5%).
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
component-level power breakdown for GEMMs and all-gather
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
https://www.iea.org/reports/energy-and-ai (2025)
Energy and AI. https://www.iea.org/reports/energy-and-ai (2025)
work page 2025
-
[2]
Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era. https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip- powering-the-ai-factory-era/ (2025)
work page 2025
-
[3]
Abe, Y., Sasaki, H., Kato, S., Inoue, K., Edahiro, M., Peres, M.: Power and Perfor- mance Characterization and Modeling of GPU-Accelerated Systems. In: Proceed- ings of the IEEE 28th International Parallel and Distributed Processing Sympo- sium (IPDPS) (2014)
work page 2014
-
[4]
In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2016)
Adhinarayanan, V., Paul, I., Greathouse, J.L., Huang, W., Pattnaik, A., Feng, W.c.: Measuring and modeling on-chip interconnect power on real hardware. In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2016)
work page 2016
-
[5]
Agrawal, A., Aga, S., Pati, S., Islam, M.: ConCCL: Optimizing ML Concurrent Computation and Communication with GPU DMA Engines. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025)
work page 2025
-
[6]
https://rocm.docs.amd.com/projects/amdsmi /en/latest/ (2024)
AMD: AMD SMI documentation. https://rocm.docs.amd.com/projects/amdsmi /en/latest/ (2024)
work page 2024
-
[7]
Arunkumar, A., Bolotin, E., Nellans, D., Wu, C.J.: Understanding the Future of Energy Efficiency in Multi-Module GPUs. In: Proceedings of the International Symposium on High Performance Computer Architecture (HPCA) (2019)
work page 2019
-
[8]
Bai, Z., Zhang, Z., Zhu, Y., Jin, X.: PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2020) CompPow: A Case for Component-level GPU Power Management 13
work page 2020
-
[9]
Chen, Q., Yang, H., Guo, M., Kannan, R.S., Mars, J., Tang, L.: Prophet: Pre- cise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- LOS) (2017)
work page 2017
-
[10]
In: Proceedings of the USENIX Annual Technical Conference (USENIX ATC) (2023)
Choi, S., Koo, I., Ahn, J., Jeon, M., Kwon, Y.: EnvPipe: Performance-preserving DNN training framework for saving energy. In: Proceedings of the USENIX Annual Technical Conference (USENIX ATC) (2023)
work page 2023
-
[11]
In: Proceedings of the ACM SIGOPS Sym- posium on Operating Systems Principles (SOSP)
Chung, J.W., Gu, Y., Jang, I., Meng, L., Bansal, N., Chowdhury, M.: Reducing Energy Bloat in Large Model Training. In: Proceedings of the ACM SIGOPS Sym- posium on Operating Systems Principles (SOSP). ACM (2024)
work page 2024
-
[12]
Grant, R.E., Levenhagen, M., Olivier, S.L., DeBonis, D., Pedretti, K.T., Laros III, J.H.: Standardizing Power Monitoring and Control at Exascale. Computer (2016)
work page 2016
- [13]
- [14]
-
[15]
Kakolyris, A.K., Masouros, D., Vavaroutsos, P., Xydis, S., Soudris, D.: throt- tLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA) (2025)
work page 2025
-
[16]
In: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO) (2021)
Kandiah, V., Peverelle, S., Khairy, M., Pan, J., Manjunath, A., Rogers, T.G., Aamodt, T.M., Hardavellas, N.: AccelWattch: A Power Modeling Framework for Modern GPUs. In: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO) (2021)
work page 2021
-
[17]
Kurzynski, M., Aga, S., Wu, D.: Lit Silicon: A Case Where Ther- mal Imbalance Couples Concurrent Execution in Multiple GPUs (2025), https://arxiv.org/abs/2511.09861
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Majumdar, A., Piga, L., Paul, I., Greathouse, J.L., Huang, W., Albonesi, D.H.: Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. In: Proceedings of the IEEE International Symposium on High Performance Com- puter Architecture (HPCA) (2017)
work page 2017
-
[19]
https://developer.nvidia.com/system-management-interface (2024)
NVIDIA: System Management Interface SMIn. https://developer.nvidia.com/system-management-interface (2024)
work page 2024
- [20]
- [21]
-
[22]
Communication Scaling for Future Transformers on Future Hardware
Pati, S., Aga, S., Islam, M., Jayasena, N., Sinclair, M.D.: Tale of Two Cs: Compu- tation vs. Communication Scaling for Future Transformers on Future Hardware. In: 2023 IEEE International Symposium on Workload Characterization (IISWC) (2023). https://doi.org/10.1109/IISWC59245.2023.00026
-
[23]
Pati, S., Aga, S., Islam, M., Quach, R., Kudchadker, S., Ibrahim, M.A.: Dma-latte: Expanding the reach of dma offloads to latency-bound ml communication (2026), https://arxiv.org/abs/2511.06605
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
In: Proceedings of the IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) (2019) 14 S
Patki,T.,Frye,Z.,Bhatia,H.,DiNatale,F.,Glosli,J.,Ingolfsson,H.,Rountree,B.: Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow. In: Proceedings of the IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) (2019) 14 S. Aga, M. Ibrahim
work page 2019
-
[25]
Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon Emissions and Large Neural Network Training (2021), https://arxiv.org/abs/2104.10350
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[26]
Computer Science - Research and Development (2015)
Price, D.C., Clark, M.A., Barsdell, B.R., Babich, R., Greenhill, L.J.: Optimizing performance-per-watt on GPUs in high performance computing: Temperature, fre- quency and voltage effects. Computer Science - Research and Development (2015)
work page 2015
-
[27]
Romein, J.W., Veenboer, B.: PowerSensor 2: A Fast Power Measurement Tool. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2018)
work page 2018
-
[28]
Singhania, V., Aga, S., Ibrahim, M.A.: FinGraV: Methodology for Fine-Grain GPU Power Visibility and Insights. In: Proceedings of the IEEE International Sympo- sium on Performance Analysis of Systems and Software (ISPASS) (2025)
work page 2025
-
[29]
In: Proceedings of the International Conference on AI-ML-Systems (AIMLSystems) (2025)
Vamja, T., Ray, K., George, F., Devi, U.: Data-Driven Partitioning of Aggregate GPU Power Among GPU (MIG) Partitions. In: Proceedings of the International Conference on AI-ML-Systems (AIMLSystems) (2025)
work page 2025
-
[30]
https://variorum.readthedocs.io/en/latest/index.h tml (2023)
Variorum: Variorum. https://variorum.readthedocs.io/en/latest/index.h tml (2023)
work page 2023
-
[31]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need (2023), https://arxiv.org/abs/1706.03762
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Proceed- ings of the IEEE Transactions on Sustainable Computing (2024)
Wang, Y., Hao, M., He, H., Zhang, W., Tang, Q., Sun, X., Wang, Z.: DRLCAP: Runtime GPU Frequency Capping With Deep Reinforcement Learning. Proceed- ings of the IEEE Transactions on Sustainable Computing (2024)
work page 2024
-
[33]
Wang, Y., Wang, Q., Shi, S., He, X., Tang, Z., Zhao, K., Chu, X.: Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training. In: Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) (2020)
work page 2020
-
[34]
Wu, G., Greathouse, J.L., Lyashevsky, A., Jayasena, N., Chiou, D.: GPGPU per- formance and power estimation using machine learning. In: Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (2015)
work page 2015
-
[35]
In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2020)
Xiao, W., Ren, S., Li, Y., Zhang, Y., Hou, P., Li, Z., Feng, Y., Lin, W., Jia, Y.: AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2020)
work page 2020
-
[36]
Yang, Z., Adamek, K., Armour, W.: Accurate and Convenient Energy Measure- ments for GPUs: A Detailed Study of NVIDIA GPU’s Built-In Power Sensor. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2024)
work page 2024
-
[37]
You, J., Chung, J.W., Chowdhury, M.: Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In: Proceedings of the USENIX Sympo- sium on Networked Systems Design and Implementation (NSDI) (2023)
work page 2023
- [38]
-
[39]
Zhao,Y.,Gu,A.,Varma,R.,Luo,L.,Huang,C.C.,Xu,M.,Wright,L.,Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., Li, S.: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (2023), https://arxiv.org/abs/2304.11277
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.