pith. machine review for the scientific record.

arxiv: 2511.09861 · v3 · pith:FCIVDTSEnew · submitted 2025-11-13 · 💻 cs.DC · cs.AR

Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

Pith reviewed 2026-05-17 23:07 UTC · model grok-4.3

classification 💻 cs.DC cs.AR
keywords: Lit Silicon · thermal imbalance · multi-GPU stragglers · concurrent computation and communication · performance variation · LLM training · power management · GPU systems

The pith

Thermal imbalance across GPUs introduces stragglers that slow down the system when using concurrent computation and communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that thermal imbalances in multi-GPU nodes cause performance variation by creating hotter straggler GPUs that slow down cooler faster ones during overlapped computation and communication. This matters because such variation reduces efficiency in large-scale AI and HPC workloads like LLM training. The authors provide models to analyze the effect and test mitigation strategies that improve both speed and power use. Experiments confirm gains of up to 6 percent in performance and 4 percent in power savings.

Core claim

Lit Silicon is the effect in which thermal imbalance across the GPUs of a multi-GPU node creates node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Coupled with concurrent computation and communication (C3), this leads to node-level performance variation and inefficiency. Analytical performance and power models quantify the potential gains, and detection and mitigation techniques address the issue through three power-management solutions: power optimization under GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing.

What carries the argument

The Lit Silicon effect, a mechanism where thermal imbalance couples with concurrent computation and communication to produce stragglers that limit overall node performance.
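
A minimal sketch of why one hot GPU can limit the whole node, under a deliberately simple model that is not the paper's analytical model: each GPU's step time is assumed to scale inversely with its clock, and a collective at the end of every step is assumed to make all GPUs wait for the slowest one. The function names and clock values are hypothetical.

```python
def step_time(work, freq_ghz):
    """Time for one GPU to finish `work` units (toy assumption: time = work / freq)."""
    return work / freq_ghz


def node_iteration_time(freqs_ghz, work_per_step=1.0, steps=4):
    """Each step ends with a collective, so every GPU waits for the slowest one."""
    total = 0.0
    for _ in range(steps):
        total += max(step_time(work_per_step, f) for f in freqs_ghz)
    return total


# Hypothetical clocks: seven cooler "leader" GPUs and one hotter "straggler".
balanced = [2.0] * 8
imbalanced = [2.0] * 7 + [1.6]

t_balanced = node_iteration_time(balanced)
t_imbalanced = node_iteration_time(imbalanced)
slowdown = t_imbalanced / t_balanced
print(f"slowdown from one straggler: {slowdown:.2f}x")  # 1.25x
```

In this toy setting the seven faster GPUs gain nothing from their higher clocks, which is the coupling the pith describes in words.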

If this is right

  • Thermal imbalance leads to node-level performance variation in multi-GPU systems.
  • Models predict gains from balancing thermal effects.
  • Detection and mitigation techniques can reduce straggling.
  • Power management solutions yield up to 6% performance and 4% power improvements.
  • Datacenters running LLM training could save on electricity costs.
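
The text reproduced here does not spell out the paper's capping algorithm, so the following is only one hypothetical policy for node-level GPU power capping: move a few watts from the GPU that leads most to the GPU that lags most while keeping the node total fixed. The function name, the step size, and the per-GPU limits are all assumptions made for illustration.

```python
def rebalance_power_caps(caps_w, lead_us, step_w=5.0, min_cap_w=400.0, max_cap_w=750.0):
    """Move up to `step_w` watts from the biggest leader to the biggest straggler.

    caps_w:  current per-GPU power caps in watts.
    lead_us: how far ahead each GPU starts its kernels (larger = more of a leader).
    The sum of the caps is preserved, so a node-level budget is never exceeded.
    """
    leader = max(range(len(caps_w)), key=lambda i: lead_us[i])
    straggler = min(range(len(caps_w)), key=lambda i: lead_us[i])
    new_caps = list(caps_w)
    # Transfer only what both per-GPU limits allow.
    delta = min(step_w, new_caps[leader] - min_cap_w, max_cap_w - new_caps[straggler])
    if leader != straggler and delta > 0:
        new_caps[leader] -= delta
        new_caps[straggler] += delta
    return new_caps


# Hypothetical four-GPU node: GPU 0 leads most, GPU 2 lags most.
caps = [700.0, 700.0, 700.0, 700.0]
lead = [12.0, 3.0, 0.0, 9.0]
new_caps = rebalance_power_caps(caps, lead)
print(new_caps)  # [695.0, 700.0, 705.0, 700.0]
```

Repeating such a step once per measurement window would give an iterative scheme; whether it converges, and to what, depends on hardware behavior this sketch does not model.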

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Implementing temperature-aware scheduling could extend these benefits to larger clusters.
  • The effect may compound with other sources of variation in distributed training.
  • Broader adoption of node-level power sloshing might optimize energy use across facilities.
  • Similar thermal coupling could appear in other parallel hardware architectures.

Load-bearing premise

That kernel-level performance variation is primarily driven by thermal imbalance interacting with concurrent computation and communication rather than workload imbalance or other hardware factors.

What would settle it

Running the same C3 workloads on a multi-GPU node with enforced uniform GPU temperatures and observing whether performance variation is eliminated.
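
One way such a test could be scored, as a sketch under stated assumptions: take the durations of the same kernel on every GPU, once in the baseline run and once with temperatures held uniform, and compare the cross-GPU coefficient of variation. The metric choice, function names, and sample numbers are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(durations_ms):
    """Population standard deviation divided by the mean; 0.0 means no variation."""
    return pstdev(durations_ms) / mean(durations_ms)


def variation_reduction(baseline_ms, uniform_temp_ms):
    """Fraction of cross-GPU variation that disappears under uniform temperatures."""
    base_cv = coefficient_of_variation(baseline_ms)
    uniform_cv = coefficient_of_variation(uniform_temp_ms)
    return 1.0 - uniform_cv / base_cv


# Hypothetical per-GPU durations (ms) of the same kernel, one value per GPU.
baseline = [10.0, 10.2, 10.1, 11.5, 10.0, 10.3, 11.8, 10.1]
uniform = [10.1, 10.2, 10.1, 10.3, 10.0, 10.2, 10.3, 10.1]

reduction = variation_reduction(baseline, uniform)
print(f"variation removed: {reduction:.0%}")
```

A reduction close to 1 would support the thermal explanation, while a value near 0 would point toward workload imbalance or other hardware factors.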

Figures

Figures reproduced from arXiv: 2511.09861 by Di Wu, Marco Kurzynski, Shaizeen Aga.

Figure 1: Overview of this paper. We start from the performance…
Figure 2: Concurrent computation and communication in FSDP.
Figure 3: Comparison between the overlap ratio and the kernel…
Figure 4: Correlation between overlap ratio and kernel duration of kernels across GPUs (numbered).
Figure 5: Temperature and frequency over three training itera…
Figure 6: Dynamic coupling towards Lit Silicon. 1 - 4 represent four phases of Lit Silicon in one training iteration. The bold black lines, which connect the start time of identical kernels running on different GPUs, are called straggler waves. The difference in a kernel’s start time on a leader and a straggler is defined as the lead value. a - d denote the lead values for four different kernels.
Figure 7: Lead values from two test nodes, with node 1 in the…
Figure 8: Our framework to solve Lit Silicon with three use cases. It only needs about 200 lines of PyTorch code.
Figure 9: Visualization of the convergence process for all use…
Figure 10: Measured frequency and power for different con…
Figure 11: Different warm-up periods swept. Baseline is the…
Figure 12: Final power caps set for different scenarios and initial…
Figure 13: Sensitivity study of knobs in Table II. A higher value is better (e.g., less variation has a larger bar value). The rolling…
Figure 14: Power and throughput metrics are the same as…
Figure 15: Metrics are the same as Figure 14.
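
The Figure 6 caption defines the lead value as the difference in a kernel's start time on a leader and on a straggler. A minimal sketch of that computation follows, assuming the leader is the GPU that starts the kernel earliest and the straggler the one that starts it latest; the trace layout and function names are hypothetical.

```python
def lead_value(start_times_us):
    """Lead value for one kernel: latest start (straggler) minus earliest start (leader)."""
    return max(start_times_us.values()) - min(start_times_us.values())


def lead_values(trace):
    """Map each kernel name to its lead value across GPUs."""
    return {kernel: lead_value(starts) for kernel, starts in trace.items()}


# Hypothetical trace: kernel name -> {GPU id: start time in microseconds}.
trace = {
    "a": {0: 100, 1: 104, 2: 101, 3: 112},
    "b": {0: 300, 1: 309, 2: 302, 3: 325},
}

leads = lead_values(trace)
print(leads)  # {'a': 12, 'b': 25}
```

In this made-up trace the lead value grows from kernel a to kernel b, meaning the straggler falls further behind as the iteration proceeds.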
Original abstract

GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems can exhibit performance variation at the node and cluster levels. Such performance variation can significantly impact both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). In this work, we analyze the performance of a single-node multi-GPU system running LLM training, and observe that the kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique to overlap computation and communication across GPUs for performance gains. We then take a further step to reason that thermally induced straggling coupled with C3 impacts performance variation, which we coin the Lit Silicon effect. More specifically, Lit Silicon describes that in a multi-GPU node, thermal imbalance across GPUs can introduce node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Lit Silicon can lead to node-level performance variation and inefficiency, potentially impacting the entire datacenter. We propose analytical performance and power models for Lit Silicon, to understand the potential system-level gains. We further design simple detection and mitigation techniques to effectively address the Lit Silicon problem, and evaluate three different power management solutions, including (1) power optimization under GPU thermal design power, (2) performance optimization under node-level GPU power capping, and (3) performance optimization under node-level CPU power sloshing. We conduct experiments on two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving several tens of millions of dollars in electricity costs in datacenters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that thermal imbalance across GPUs in a multi-GPU node induces straggler GPUs (hotter/slower) that, when combined with concurrent computation and communication (C3), slow down leader GPUs (cooler/faster), producing node-level performance variation termed the 'Lit Silicon' effect. It supports this with observations from LLM training on AMD MI300X systems, proposes analytical performance and power models, and evaluates three power-management mitigations that yield up to 6% performance and 4% power gains.

Significance. If the proposed causal mechanism can be isolated from confounders, the work identifies a practically relevant source of inefficiency in large-scale GPU clusters that could inform both scheduling and power-management policies. The evaluation on two real MI300X nodes, two workloads, and two training frameworks provides concrete empirical grounding, and the explicit mitigation techniques constitute a useful engineering contribution.

major comments (2)
  1. [Experimental evaluation and methodology sections] The central causal claim—that thermal imbalance is the dominant driver of C3-induced straggling—rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.
  2. [Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.
minor comments (2)
  1. [Introduction and Lit Silicon definition] Clarify the precise definition of 'node-level straggler' versus 'leader' GPUs and how these roles are identified in the C3 overlap measurements.
  2. [Results and evaluation] The abstract states 'up to 6% performance and 4% power improvements'; the corresponding tables or figures should report confidence intervals or statistical significance for these gains across the two frameworks and workloads.
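
The second minor comment asks for confidence intervals on the reported gains. One common way to obtain them from repeated runs is a percentile bootstrap. This is a sketch only: the sample values, function name, and resampling settings are hypothetical, and the paper may use a different procedure.

```python
import random
from statistics import mean


def bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the relative gain mean(treated) / mean(baseline) - 1."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sample sizes.
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treated) for _ in treated]
        gains.append(mean(t) / mean(b) - 1.0)
    gains.sort()
    lo = gains[int((alpha / 2) * n_resamples)]
    hi = gains[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical throughput samples (tokens/s) from repeated runs.
baseline = [1000, 1010, 995, 1005, 990, 1002]
mitigated = [1050, 1062, 1041, 1058, 1047, 1055]

low, high = bootstrap_ci(baseline, mitigated)
print(f"95% CI for relative gain: [{low:.1%}, {high:.1%}]")
```

An interval that excludes zero for each framework and workload pair would address the referee's concern more directly than a single "up to" figure.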

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the concerns on causal isolation in experiments and model derivation below, with revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Experimental evaluation and methodology sections] The central causal claim—that thermal imbalance is the dominant driver of C3-induced straggling—rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.

    Authors: We acknowledge the value of controlled interventions for stronger causal isolation. Our evaluation relies on natural thermal variations observed across repeated runs of real LLM training workloads on two distinct MI300X nodes and two frameworks, where temperature differentials consistently correlate with C3-induced straggling while workload balance is enforced by the frameworks and interconnect patterns do not align with the observed performance variation. In the revised manuscript we have added a dedicated subsection discussing potential confounders with supporting measurements, arguing that thermal effects are the most parsimonious explanation. Full active control of per-GPU temperature would require hardware not available in our production testbed.
    Revision: partial

  2. Referee: [Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.

    Authors: The models are derived from first principles using standard thermal throttling equations (frequency scaling with temperature from vendor datasheets) and power models from the literature. We have revised the manuscript to include an explicit derivation section with the base equations, references to MI300X specifications, and validation on held-out data points to demonstrate predictive use beyond the original observations.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper reports empirical correlations between kernel-level performance variation, C3 overlap, and thermal imbalance on MI300X nodes, coins the Lit Silicon effect from these observations, and proposes analytical performance and power models to quantify potential gains from mitigations. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the central claims to tautological inputs by construction. The models are presented as explanatory tools for system-level understanding and are evaluated via separate mitigation experiments yielding measured improvements, keeping the chain independent of the initial observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that thermal imbalance is the dominant cause of the observed straggling and that simple analytical models can capture the resulting performance and power behavior; no explicit free parameters are listed in the abstract but models are implied to exist.

axioms (1)
  • Domain assumption: Kernel-level performance variation is highly correlated with concurrent computation and communication (C3)
    Stated directly as an observation in the abstract that underpins the thermal attribution.
invented entities (1)
  • Lit Silicon effect (no independent evidence)
    Purpose: To name and explain the thermal imbalance causing straggler GPUs to slow leader GPUs during C3
    Newly coined term whose independent evidence is limited to the experiments described in the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1471 out tokens · 48511 ms · 2026-05-17T23:07:45.952214+00:00 · methodology

