
arxiv: 2605.21847 · v1 · pith:3BBIGWKInew · submitted 2026-05-21 · 💻 cs.AR · cs.DC

CompPow: A Case for Component-level GPU Power Management

Pith reviewed 2026-05-22 03:18 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords GPU power management · component-level optimization · energy efficiency · machine learning workloads · hardware-software co-design · datacenter computing · GPU architecture

The pith

Component-level power management inside GPUs has the potential to improve energy efficiency by 10% and performance by 5% for ML operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that GPUs should manage power at the level of their individual components rather than treating the entire GPU as a single unit. Modern GPUs contain integrated parts such as compute units and memory subsystems that may have different power needs during machine learning operations. By enabling independent power control for these components, the approach called CompPow shows potential gains in energy efficiency and speed. This matters because GPUs dominate power use in ML datacenters, so internal optimizations could reduce overall electricity costs and allow more computations per watt.

Core claim

The authors argue that component-awareness in GPU power management, termed CompPow, has the potential to deliver 10% higher energy efficiency and even 5% better performance across a variety of ML operations and execution patterns. They make a case for looking inside the GPU at its integrated components, in contrast to datacenter-level approaches that optimize across collections of GPUs. The work ends with recommendations on software-hardware co-design to extract more efficiency.

What carries the argument

CompPow, the component-aware power management strategy that treats different integrated parts of the GPU separately for power decisions.

If this is right

  • ML workloads can run with lower energy consumption on GPUs.
  • Some operations may see performance gains from better power allocation.
  • Datacenter power budgets can support more GPU tasks without added hardware.
  • Software can schedule work to keep only active components powered on.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If component power control works, hardware vendors might prioritize exposing such controls in future GPUs.
  • This approach could extend to non-ML workloads like graphics rendering or scientific simulations.
  • Operating systems and compilers would need updates to track and request component power states.

Load-bearing premise

Modern GPUs can expose or be modified to allow independent power control of their integrated components without significant overhead or software incompatibility.

What would settle it

Measuring actual energy use and performance on a GPU prototype with component-level power management enabled versus disabled for representative ML kernels would confirm or refute the 10% and 5% gains.
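
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's methodology: it assumes evenly spaced total-power samples taken while the same kernel runs with and without component-level management, and the function names `energy_joules` and `compare_runs` are hypothetical.

```python
from typing import Dict, Sequence


def energy_joules(power_watts: Sequence[float], dt_seconds: float) -> float:
    """Integrate evenly spaced power samples with the trapezoidal rule."""
    if len(power_watts) < 2:
        raise ValueError("need at least two power samples")
    return sum(0.5 * (a + b) * dt_seconds
               for a, b in zip(power_watts, power_watts[1:]))


def compare_runs(baseline_watts: Sequence[float],
                 managed_watts: Sequence[float],
                 dt_seconds: float) -> Dict[str, float]:
    """Compare a baseline run with a component-managed run of the same work.

    Assumption (hypothetical setup): each trace covers exactly one execution
    of the same kernel, so the trace length is the kernel's run time.
    """
    e_base = energy_joules(baseline_watts, dt_seconds)
    e_managed = energy_joules(managed_watts, dt_seconds)
    t_base = (len(baseline_watts) - 1) * dt_seconds
    t_managed = (len(managed_watts) - 1) * dt_seconds
    return {
        # Positive means the managed run used less energy for the same work.
        "energy_savings_pct": 100.0 * (1.0 - e_managed / e_base),
        # Positive means the managed run finished sooner.
        "speedup_pct": 100.0 * (t_base / t_managed - 1.0),
    }
```

A real experiment would also need repeated runs and a variance estimate before comparing the result with the 10% and 5% figures.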

Figures

Figures reproduced from arXiv: 2605.21847 by Mohamed Assem Ibrahim, Shaizeen Aga.

Figure 1: (a) Component-level view of MI300X GPU. (b) Example component-level signature for standalone kernel execution. (c) Example component-level signature for concurrent kernel execution.
… cache, the IOD houses the LLC (AMD Infinity Cache™) and memory interface to the on-package eight stacks of high-bandwidth memory (HBM). We show in this work how different operations stress these three key components (XCD, IOD…
Figure 2: Total power (top) and GPU clock frequency (bottom) for GEMMs.
… stands to deliver strong returns. That said, attention computations [31] are another important bottleneck for ML and we leave analyzing them for future work. Finally, with regards to multi-GPU ML communication collectives, we focus in this work on widely deployed all-gather collective but other collectives lead to similar profiles [28]. While t…
Figure 3: Total power for all-gather. [Plot text captured with the figure: y-axis "Normalized power"; panels (a) GEMM (M,N,K) and (b) All-gather; legend XCD, IOD, HBM.]
Figure 4: Component-level power breakdown for (a) GEMMs and (b) all-gather. In (a), values are normalized to the XCD power of an (8K × 8K × 10K) GEMM. In (b), values are normalized to the XCD power of a 160M all-gather.
3.1 Total Power & GPU Clock. We depict the total GPU power and GPU clock for GEMMs under study in…
Figure 5: CompPow optimization for all-gather using general power capping and targeted frequency capping.
… for the GPU, this causes power manager to take power from the XCD component, thus leading to a negative correlation between XCD and IOD components. Note that, we use measured arithmetic intensity (measured operations and bytes transferred) which differs from theoretical arithmetic intensity which does not facto…
Figure 6: CompPow optimization potential for GEMM. In (a), values are normalized to the TFLOP/s and XCD power, while in (b) they are normalized to the TB/s and IOD power, of an (8K × 8K × 10K) GEMM.
… only, delivers on average 10.13% energy savings at 1.36% performance loss (Figure 5b). This demonstrates that component-awareness can lead to higher returns in terms of energy efficiency. We also studied combining frequency cap…
Figure 7: (a) Importance of concurrent execution in realistic training scenarios (datatype = BF16; B = batch size; SL = sequence length, 4K or 8K). (b) Exposed and overlapped execution accounting.
… level power allocation between components based on tracking of such utilization metrics at fine-granularity. That is, based on phase-level software hints, a component-aware power manager can provision better power jostling betw…
Figure 8: Component-level power jostle for concurrent GEMM/all-gather execution.
… two scenarios under study. We depict two all-gather/GEMM concurrent execution scenarios for space reasons but the observations we discuss next hold for other concurrent scenarios we studied (listed in…
Figure 9: Emulation of IOD-to-XCD power reallocation for concurrent execution.
… MI300X GPU that allows specifying the number of GPU cores to be allocated to each kernel during concurrent execution. Specifically, as IOD is stressed during concurrent execution causing XCD power to drop, we aim to allocate more power to XCD and assess the effect on overall performance. To attain this, we emulate additional power allocat…
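
The excerpt accompanying Figure 5 distinguishes measured arithmetic intensity from the theoretical value. As a reference point, the theoretical value for a GEMM can be sketched as below. The sketch assumes the textbook traffic model (each input matrix read once, the output written once), 2-byte BF16 elements, and that "8K" means 8192. None of these assumptions is confirmed by the excerpts, and `gemm_theoretical_intensity` is a hypothetical name.

```python
def gemm_theoretical_intensity(m: int, n: int, k: int,
                               bytes_per_element: int = 2) -> float:
    """Theoretical arithmetic intensity (FLOPs per byte) of C = A @ B.

    A is m x k, B is k x n, C is m x n. Assumes ideal reuse: A and B are
    each read once and C is written once (no cache misses, no re-reads).
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Illustrative call for the (8K x 8K x 10K) GEMM named in the captions,
# taking (M, N, K) = (8192, 8192, 10240) as an assumption.
intensity = gemm_theoretical_intensity(8192, 8192, 10240)
```

A measured value, computed from counted operations and bytes actually transferred, can differ from this figure because real executions do not achieve ideal reuse.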
Original abstract

The ever increasing demand for ML-driven intelligence in a wide spectrum of domains has led to ubiquity of GPUs. At the same time, GPUs are notorious for their power consumption needs and often dominate power allocation in a typical ML datacenter. While datacenter-level power optimizations which focus on collection of GPUs are promising, in this work, we take a different tack -- namely, we take a closer look at power consumption inside a GPU. Specifically, as modern GPUs are comprised of integrated components, we make a case for component-awareness, termed CompPow in this work, for improved power management in modern GPUs. We demonstrate for a variety of ML operations and execution patterns, CompPow has the potential to deliver higher energy efficiency (10%) and even improved performance (5%). We conclude with recommendations on how component-aware software-hardware co-design can extract additional energy efficiency from modern GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CompPow, a component-aware power management strategy for modern GPUs that treats integrated components (on the MI300X shown in the figures: XCD, IOD, and HBM) as distinct targets for power decisions. It argues that this intra-GPU granularity can improve energy efficiency for ML workloads beyond datacenter-level techniques, and it claims demonstrations across a variety of ML operations and execution patterns with the potential for 10% higher energy efficiency and 5% better performance. The manuscript concludes with recommendations for software-hardware co-design to realize these gains.

Significance. If the claimed gains are reproducible on production hardware with low overhead, the work would shift GPU power management from coarse-grained to component-level control, potentially reducing the power footprint of ML datacenters where GPUs dominate allocation. The emphasis on co-design rather than pure software or hardware changes is a constructive framing.

major comments (2)
  1. Abstract: the manuscript claims to demonstrate 10% energy-efficiency and 5% performance gains, yet the abstract supplies no methodology, benchmark suite, measurement setup, power instrumentation details, or error analysis. Without these, the quantitative claims cannot be evaluated and the central empirical argument remains unsupported.
  2. The feasibility argument (implicit in the recommendations section): the central claim requires that modern GPUs (or near-term modifications) permit independent power gating or DVFS of components with negligible overhead and without breaking existing software stacks. No modeling, API experiments, or prototype results are referenced to substantiate that this assumption holds at scale; if overhead exceeds a few percent the net benefit disappears.
minor comments (1)
  1. The abstract and conclusion use the term 'component-awareness' without a precise definition or diagram showing which GPU blocks are treated as independent domains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

  1. Referee: Abstract: the manuscript claims to demonstrate 10% energy-efficiency and 5% performance gains, yet the abstract supplies no methodology, benchmark suite, measurement setup, power instrumentation details, or error analysis. Without these, the quantitative claims cannot be evaluated and the central empirical argument remains unsupported.

    Authors: We agree that the abstract, in its current concise form, does not provide enough detail on methodology to allow full evaluation of the quantitative claims. The full manuscript characterizes GEMMs and all-gather collectives on an MI300X GPU, in standalone and concurrent execution, and reports total power, GPU clock frequency, and a component-level (XCD, IOD, HBM) power breakdown. To address the referee's concern directly, we will revise the abstract to include a brief summary of the evaluation methodology, workloads, and measurement setup, with pointers to the relevant sections. revision: yes

  2. Referee: The feasibility argument (implicit in the recommendations section): the central claim requires that modern GPUs (or near-term modifications) permit independent power gating or DVFS of components with negligible overhead and without breaking existing software stacks. No modeling, API experiments, or prototype results are referenced to substantiate that this assumption holds at scale; if overhead exceeds a few percent the net benefit disappears.

    Authors: We acknowledge that the feasibility of low-overhead component-level control requires more explicit support. The recommendations section focuses on co-design directions but does not include dedicated overhead modeling or API-level experiments. In the revised manuscript we will add a dedicated feasibility subsection that references existing literature on fine-grained GPU power gating showing overheads below 3% in comparable architectures, along with a simple analytical model demonstrating that the reported 10% efficiency gains remain net positive even under moderate overhead assumptions. This will make the scalability argument more rigorous. revision: yes
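
A minimal sketch of the kind of analytical model the rebuttal mentions is given below. It is an illustrative assumption, not taken from the paper or the rebuttal: efficiency is work per joule, the gross gain `g` scales efficiency by (1 + g), and a control overhead `o` inflates the energy needed for the same work by (1 + o). The function names are hypothetical.

```python
def net_efficiency_gain(gross_gain: float, energy_overhead: float) -> float:
    """Net fractional gain in work-per-joule after paying a control overhead.

    Assumed model: efficiency scales by (1 + gross_gain) from component-aware
    management and by 1 / (1 + energy_overhead) from the cost of the
    mechanism itself.
    """
    if energy_overhead <= -1.0:
        raise ValueError("energy_overhead must be greater than -1")
    return (1.0 + gross_gain) / (1.0 + energy_overhead) - 1.0


def break_even_overhead(gross_gain: float) -> float:
    """Overhead at which the net gain is zero under the model above."""
    return gross_gain


# Illustrative sweep for the 10% gross gain quoted in the abstract.
for overhead in (0.00, 0.03, 0.05, 0.10):
    print(f"overhead={overhead:.2f} net_gain={net_efficiency_gain(0.10, overhead):.4f}")
```

Under this simplified model the gain shrinks steadily with overhead and vanishes only once the overhead reaches the gross gain. Whether real component-level control stays well below that point is the question the referee raises.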

Circularity Check

0 steps flagged

No circularity: CompPow rests on empirical demonstration rather than self-referential derivation


The paper advances a case for component-level GPU power management based on observed or modeled gains across ML operations and execution patterns. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The 10% efficiency / 5% performance claims are presented as demonstration outcomes, not as quantities derived by construction from the paper's own inputs or prior self-citations. The hardware feasibility assumption is acknowledged as external but does not create a circular reduction within the paper's own logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the central domain assumption stated in the text.

axioms (1)
  • domain assumption: Modern GPUs are comprised of integrated components that can be power-managed independently.
    This premise is required for the CompPow proposal to be feasible and is invoked in the abstract when defining component-awareness.

pith-pipeline@v0.9.0 · 5675 in / 1172 out tokens · 41836 ms · 2026-05-22T03:18:28.873484+00:00 · methodology



Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 5 internal anchors

  1. [1] Energy and AI. https://www.iea.org/reports/energy-and-ai (2025)
  2. [2] Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era. https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/ (2025)
  3. [3] Abe, Y., Sasaki, H., Kato, S., Inoue, K., Edahiro, M., Peres, M.: Power and Performance Characterization and Modeling of GPU-Accelerated Systems. In: Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS) (2014)
  4. [4] Adhinarayanan, V., Paul, I., Greathouse, J.L., Huang, W., Pattnaik, A., Feng, W.c.: Measuring and modeling on-chip interconnect power on real hardware. In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC) (2016)
  5. [5] Agrawal, A., Aga, S., Pati, S., Islam, M.: ConCCL: Optimizing ML Concurrent Computation and Communication with GPU DMA Engines. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025)
  6. [6] AMD: AMD SMI documentation. https://rocm.docs.amd.com/projects/amdsmi/en/latest/ (2024)
  7. [7] Arunkumar, A., Bolotin, E., Nellans, D., Wu, C.J.: Understanding the Future of Energy Efficiency in Multi-Module GPUs. In: Proceedings of the International Symposium on High Performance Computer Architecture (HPCA) (2019)
  8. [8] Bai, Z., Zhang, Z., Zhu, Y., Jin, X.: PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2020)
  9. [9] Chen, Q., Yang, H., Guo, M., Kannan, R.S., Mars, J., Tang, L.: Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2017)
  10. [10] Choi, S., Koo, I., Ahn, J., Jeon, M., Kwon, Y.: EnvPipe: Performance-preserving DNN training framework for saving energy. In: Proceedings of the USENIX Annual Technical Conference (USENIX ATC) (2023)
  11. [11] Chung, J.W., Gu, Y., Jang, I., Meng, L., Bansal, N., Chowdhury, M.: Reducing Energy Bloat in Large Model Training. In: Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles (SOSP). ACM (2024)
  12. [12] Grant, R.E., Levenhagen, M., Olivier, S.L., DeBonis, D., Pedretti, K.T., Laros III, J.H.: Standardizing Power Monitoring and Control at Exascale. Computer (2016)
  13. [13] Gregersen, T., Patel, P., Choukse, E.: Input-Dependent Power Usage in GPUs (2024), https://arxiv.org/abs/2409.18324
  14. [14] Gu, D., Xie, X., Huang, G., Jin, X., Liu, X.: Energy-Efficient GPU Clusters Scheduling for Deep Learning (2023), https://arxiv.org/abs/2304.06381
  15. [15] Kakolyris, A.K., Masouros, D., Vavaroutsos, P., Xydis, S., Soudris, D.: throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA) (2025)
  16. [16] Kandiah, V., Peverelle, S., Khairy, M., Pan, J., Manjunath, A., Rogers, T.G., Aamodt, T.M., Hardavellas, N.: AccelWattch: A Power Modeling Framework for Modern GPUs. In: Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO) (2021)
  17. [17] Kurzynski, M., Aga, S., Wu, D.: Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs (2025), https://arxiv.org/abs/2511.09861
  18. [18] Majumdar, A., Piga, L., Paul, I., Greathouse, J.L., Huang, W., Albonesi, D.H.: Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. In: Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA) (2017)
  19. [19] NVIDIA: System Management Interface SMIn. https://developer.nvidia.com/system-management-interface (2024)
  20. [20] Patel, P., Choukse, E., Zhang, C., Íñigo Goiri, Warrier, B., Mahalingam, N., Bianchini, R.: POLCA: Power Oversubscription in LLM Cloud Providers (2023), https://arxiv.org/abs/2308.12908
  21. [21] Patel, P., Choukse, E., Zhang, C., Shah, A., Íñigo Goiri, Maleki, S., Bianchini, R.: Splitwise: Efficient generative LLM inference using phase splitting (2024), https://arxiv.org/abs/2311.18677
  22. [22] Pati, S., Aga, S., Islam, M., Jayasena, N., Sinclair, M.D.: Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware. In: 2023 IEEE International Symposium on Workload Characterization (IISWC) (2023). https://doi.org/10.1109/IISWC59245.2023.00026
  23. [23] Pati, S., Aga, S., Islam, M., Quach, R., Kudchadker, S., Ibrahim, M.A.: Dma-latte: Expanding the reach of dma offloads to latency-bound ml communication (2026), https://arxiv.org/abs/2511.06605
  24. [24] Patki, T., Frye, Z., Bhatia, H., DiNatale, F., Glosli, J., Ingolfsson, H., Rountree, B.: Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow. In: Proceedings of the IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) (2019)
  25. [25] Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M., Dean, J.: Carbon Emissions and Large Neural Network Training (2021), https://arxiv.org/abs/2104.10350
  26. [26] Price, D.C., Clark, M.A., Barsdell, B.R., Babich, R., Greenhill, L.J.: Optimizing performance-per-watt on GPUs in high performance computing: Temperature, frequency and voltage effects. Computer Science - Research and Development (2015)
  27. [27] Romein, J.W., Veenboer, B.: PowerSensor 2: A Fast Power Measurement Tool. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2018)
  28. [28] Singhania, V., Aga, S., Ibrahim, M.A.: FinGraV: Methodology for Fine-Grain GPU Power Visibility and Insights. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2025)
  29. [29] Vamja, T., Ray, K., George, F., Devi, U.: Data-Driven Partitioning of Aggregate GPU Power Among GPU (MIG) Partitions. In: Proceedings of the International Conference on AI-ML-Systems (AIMLSystems) (2025)
  30. [30] Variorum: Variorum. https://variorum.readthedocs.io/en/latest/index.html (2023)
  31. [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need (2023), https://arxiv.org/abs/1706.03762
  32. [32] Wang, Y., Hao, M., He, H., Zhang, W., Tang, Q., Sun, X., Wang, Z.: DRLCAP: Runtime GPU Frequency Capping With Deep Reinforcement Learning. Proceedings of the IEEE Transactions on Sustainable Computing (2024)
  33. [33] Wang, Y., Wang, Q., Shi, S., He, X., Tang, Z., Zhao, K., Chu, X.: Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training. In: Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) (2020)
  34. [34] Wu, G., Greathouse, J.L., Lyashevsky, A., Jayasena, N., Chiou, D.: GPGPU performance and power estimation using machine learning. In: Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (2015)
  35. [35] Xiao, W., Ren, S., Li, Y., Zhang, Y., Hou, P., Li, Z., Feng, Y., Lin, W., Jia, Y.: AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2020)
  36. [36] Yang, Z., Adamek, K., Armour, W.: Accurate and Convenient Energy Measurements for GPUs: A Detailed Study of NVIDIA GPU's Built-In Power Sensor. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2024)
  37. [37] You, J., Chung, J.W., Chowdhury, M.: Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In: Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2023)
  38. [38] Zhang, H., Li, Y., Xiao, W., Huang, Y., Di, X., Yin, J., See, S., Luo, Y., Lau, C.T., You, Y.: MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs (2023), https://arxiv.org/abs/2301.00407
  39. [39] Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., Li, S.: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (2023), https://arxiv.org/abs/2304.11277