arxiv: 2511.09861 · v3 · pith:FCIVDTSEnew · submitted 2025-11-13 · 💻 cs.DC · cs.AR

Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

Marco Kurzynski, Shaizeen Aga, Di Wu

Pith reviewed 2026-05-17 23:07 UTC · model grok-4.3

classification 💻 cs.DC cs.AR

keywords Lit Silicon · thermal imbalance · multi-GPU stragglers · concurrent computation and communication · performance variation · LLM training · power management · GPU systems


The pith

Thermal imbalance across GPUs introduces stragglers that slow down the system when using concurrent computation and communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that thermal imbalances in multi-GPU nodes cause performance variation by creating hotter straggler GPUs that slow down cooler faster ones during overlapped computation and communication. This matters because such variation reduces efficiency in large-scale AI and HPC workloads like LLM training. The authors provide models to analyze the effect and test mitigation strategies that improve both speed and power use. Experiments confirm gains of up to 6 percent in performance and 4 percent in power savings.

Core claim

Lit Silicon describes that in a multi-GPU node, thermal imbalance across GPUs can introduce node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). This coupling with concurrent computation and communication leads to node-level performance variation and inefficiency. Analytical performance and power models quantify potential gains, while detection and mitigation techniques, including power optimization under thermal design power, node-level GPU power capping, and CPU power sloshing, address the issue.
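
A toy model can make the coupling concrete. The sketch below is an illustration only, not the paper's analytical model: it assumes every GPU has the same work per iteration, that compute time scales inversely with clock frequency, and that a synchronizing collective makes all GPUs wait for the slowest one. The clock values, the cycle count, and the function names are hypothetical.

```python
def iteration_times(freqs_mhz, work_cycles):
    """Per-GPU compute time in seconds if each GPU ran alone: cycles / frequency."""
    return [work_cycles / (f * 1e6) for f in freqs_mhz]


def node_iteration_time(freqs_mhz, work_cycles):
    """With a synchronizing collective, the node advances at the pace of its slowest GPU."""
    return max(iteration_times(freqs_mhz, work_cycles))


# Hypothetical clocks: seven cooler GPUs and one hotter, throttled GPU.
freqs = [2100] * 7 + [1900]
work = 2.1e9  # hypothetical cycles of work per iteration (assumption)

balanced = node_iteration_time([2100] * 8, work)
imbalanced = node_iteration_time(freqs, work)
slowdown = imbalanced / balanced - 1.0
print(f"node slowdown from one straggler: {slowdown:.1%}")
```

Under these assumptions a single throttled GPU sets the pace for the whole node, which is the intuition behind the straggler and leader roles described above. The model deliberately ignores the contention between overlapped compute and communication kernels that the paper ties to C3.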

What carries the argument

The Lit Silicon effect, a mechanism where thermal imbalance couples with concurrent computation and communication to produce stragglers that limit overall node performance.

If this is right

Thermal imbalance leads to node-level performance variation in multi-GPU systems.
Models predict gains from balancing thermal effects.
Detection and mitigation techniques can reduce straggling.
Power management solutions yield up to 6% performance and 4% power improvements.
Savings in electricity costs for datacenters running LLM training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Implementing temperature-aware scheduling could extend these benefits to larger clusters.
The effect may compound with other sources of variation in distributed training.
Broader adoption of node-level power sloshing might optimize energy use across facilities.
Similar thermal coupling could appear in other parallel hardware architectures.

Load-bearing premise

That kernel-level performance variation is primarily driven by thermal imbalance interacting with concurrent computation and communication rather than workload imbalance or other hardware factors.

What would settle it

Running the same C3 workloads on a multi-GPU node with enforced uniform GPU temperatures and observing whether performance variation is eliminated.

Figures

Figures reproduced from arXiv: 2511.09861 by Di Wu, Marco Kurzynski, Shaizeen Aga.

**Figure 2.** Concurrent computation and communication in FSDP.

**Figure 3.** Comparison between the overlap ratio and the kernel…

**Figure 4.** Correlation between overlap ratio and kernel duration of kernels across GPUs (numbered).

**Figure 5.** Temperature and frequency over three training itera…

**Figure 6.** Dynamic coupling towards Lit Silicon. 1 - 4 represent four phases of Lit Silicon in one training iteration. The bold black lines, which connect the start time of identical kernels running on different GPUs, are called straggler waves. The difference in a kernel’s start time on a leader and a straggler is defined as the lead value. a - d denote the lead values for four different kernels.

**Figure 7.** Lead values from two test nodes, with node 1 in the…

**Figure 8.** Our framework to solve Lit Silicon with three use cases. It only needs about 200 lines of PyTorch code.

**Figure 9.** Visualization of the convergence process for all use…

**Figure 10.** Measured frequency and power for different con…

**Figure 11.** Different warm-up periods swept. Baseline is the…

**Figure 12.** Final power caps set for different scenarios and initial…

**Figure 13.** Sensitivity study of knobs in Table II. A higher value is better (e.g., less variation has a larger bar value). The rolling…

**Figure 14.** Power and throughput metrics are the same as…

**Figure 15.** Metrics are the same as Figure 14.
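
The Figure 6 caption defines the lead value as the difference in a kernel's start time on a leader and on a straggler. A minimal sketch of that computation follows. It assumes, for illustration, that the leader is the GPU that starts a given kernel earliest and the straggler is the one that starts it latest; the kernel names, GPU ids, timestamps, and the `lead_values` helper are hypothetical and do not come from the paper's tooling.

```python
def lead_values(start_times):
    """Compute one lead value per kernel.

    start_times maps kernel name -> {gpu_id: start time in ms}.
    Returns kernel name -> (leader_gpu, straggler_gpu, lead_ms), where the
    leader is the earliest starter and the straggler the latest (assumption).
    """
    result = {}
    for kernel, per_gpu in start_times.items():
        leader = min(per_gpu, key=per_gpu.get)
        straggler = max(per_gpu, key=per_gpu.get)
        result[kernel] = (leader, straggler, per_gpu[straggler] - per_gpu[leader])
    return result


# Hypothetical trace: the same two kernels observed on three GPUs.
starts = {
    "gemm_0": {0: 10.0, 1: 10.5, 2: 12.0},
    "gemm_1": {0: 20.0, 1: 21.0, 2: 23.5},
}
leads = lead_values(starts)
for kernel, (leader, straggler, lead_ms) in leads.items():
    print(f"{kernel}: leader=GPU{leader} straggler=GPU{straggler} lead={lead_ms} ms")
```

In this made-up trace the lead grows from one kernel to the next, which is the kind of pattern a straggler wave would produce if one GPU keeps falling further behind within an iteration.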

read the original abstract

GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems can exhibit performance variation at the node and cluster levels. Such performance variation can significantly impact both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). In this work, we analyze the performance of a single-node multi-GPU system running LLM training, and observe that the kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique to overlap computation and communication across GPUs for performance gains. We then take a further step to reason that thermally induced straggling coupled with C3 impacts performance variation, which we coin the Lit Silicon effect. More specifically, Lit Silicon describes that in a multi-GPU node, thermal imbalance across GPUs can introduce node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Lit Silicon can lead to node-level performance variation and inefficiency, potentially impacting the entire datacenter. We propose analytical performance and power models for Lit Silicon, to understand the potential system-level gains. We further design simple detection and mitigation techniques to effectively address the Lit Silicon problem, and evaluate three different power management solutions, including (1) power optimization under GPU thermal design power, (2) performance optimization under node-level GPU power capping, and (3) performance optimization under node-level CPU power sloshing. We conduct experiments on two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving several tens of millions of dollars in electricity costs in datacenters.
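
To make the idea of performance optimization under a node-level GPU power cap more tangible, here is a hedged sketch of one rebalancing step. It is not the authors' algorithm: it simply assumes that moving a small amount of power budget from the GPU that is furthest ahead to the GPU that is furthest behind, while keeping the node total fixed, is a plausible feedback move. The `rebalance_caps` function, the step size, the per-GPU ceiling and floor, and the lag values are all hypothetical.

```python
def rebalance_caps(caps_w, lag_s, step_w=5.0, ceiling_w=750.0, floor_w=400.0):
    """Shift up to step_w watts from the most-leading GPU to the most-lagging one.

    caps_w: current per-GPU power caps in watts.
    lag_s:  how far each GPU trails the fastest one, in seconds (0 = leader).
    The sum of caps is preserved, so the node-level budget is unchanged.
    ceiling_w and floor_w are assumed per-GPU limits, not vendor-verified values.
    """
    caps = list(caps_w)
    straggler = max(range(len(caps)), key=lambda i: lag_s[i])
    leader = min(range(len(caps)), key=lambda i: lag_s[i])
    if straggler == leader:
        return caps
    # Never exceed the ceiling on the straggler or drop below the floor on the leader.
    delta = min(step_w, ceiling_w - caps[straggler], caps[leader] - floor_w)
    if delta > 0:
        caps[straggler] += delta
        caps[leader] -= delta
    return caps


# Hypothetical four-GPU node with equal caps; GPU 2 lags the most.
caps = [700.0, 700.0, 700.0, 700.0]
lags = [0.0, 0.2, 1.5, 0.1]
new_caps = rebalance_caps(caps, lags)
print(new_caps, "total:", sum(new_caps))
```

Repeating such a step over several iterations, with fresh lag measurements each time, would form a simple control loop. Actually applying the caps requires a vendor management interface, which is outside the scope of this sketch.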

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Thermal imbalance across GPUs creates node-level stragglers under C3 overlap, with some practical power tweaks showing gains, but the causal isolation is thin.


The paper's core observation is that in multi-GPU nodes running LLM training, hotter GPUs act as stragglers and drag down cooler ones specifically when computation and communication overlap. They name this Lit Silicon and link it to thermal imbalance on MI300X hardware. They also report up to 6% performance and 4% power improvements from three simple power management approaches, including node-level capping and CPU sloshing. Those measured gains on real systems with two workloads and two frameworks are the most useful part here. The experiments give a concrete sense of the scale of the inefficiency in datacenter settings.

What is new is the specific framing of the thermal-C3 interaction as a distinct effect plus the detection and mitigation ideas. Prior work already covers GPU thermal throttling and performance variation, so the addition is mostly the named mechanism and the applied fixes rather than a fundamental discovery.

The soft spot is the jump from correlation to causation. The abstract and described results tie temperature differences to kernel variation and then to C3 slowdown, but they do not appear to include controlled interventions that vary temperature independently while holding workload balance, interconnect traffic, and scheduling fixed. Without that isolation, other factors such as minor workload skew or hardware differences remain plausible alternatives. The analytical performance and power models are mentioned but their derivation and validation against the data are not detailed enough to judge how much they add beyond fitting the observations.

This work is aimed at people who tune large-scale multi-GPU training or manage power in GPU clusters. A reader running similar workloads on AMD hardware would find the mitigation results directly relevant and worth testing. The measurements are grounded enough that it deserves a serious referee rather than a desk reject, mainly to press on the causal claims and model details. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper claims that thermal imbalance across GPUs in a multi-GPU node induces straggler GPUs (hotter/slower) that, when combined with concurrent computation and communication (C3), slow down leader GPUs (cooler/faster), producing node-level performance variation termed the 'Lit Silicon' effect. It supports this with observations from LLM training on AMD MI300X systems, proposes analytical performance and power models, and evaluates three power-management mitigations that yield up to 6% performance and 4% power gains.

Significance. If the proposed causal mechanism can be isolated from confounders, the work identifies a practically relevant source of inefficiency in large-scale GPU clusters that could inform both scheduling and power-management policies. The evaluation on two real MI300X nodes, two workloads, and two training frameworks provides concrete empirical grounding, and the explicit mitigation techniques constitute a useful engineering contribution.

major comments (2)

[Experimental evaluation and methodology sections] The central causal claim—that thermal imbalance is the dominant driver of C3-induced straggling—rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.
[Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.

minor comments (2)

[Introduction and Lit Silicon definition] Clarify the precise definition of 'node-level straggler' versus 'leader' GPUs and how these roles are identified in the C3 overlap measurements.
[Results and evaluation] The abstract states 'up to 6% performance and 4% power improvements'; the corresponding tables or figures should report confidence intervals or statistical significance for these gains across the two frameworks and workloads.
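
The request for confidence intervals in the second minor comment can be illustrated with a percentile bootstrap. The sketch below is one common way to do this, not a method taken from the paper; the throughput samples are invented for illustration, and `bootstrap_ci` is a hypothetical helper built only on the Python standard library.

```python
import random
import statistics


def bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative gain in mean throughput."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    gains = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes.
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treated) for _ in treated]
        gains.append(statistics.mean(t) / statistics.mean(b) - 1.0)
    gains.sort()
    lo = gains[int((alpha / 2) * n_resamples)]
    hi = gains[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Invented per-run throughput numbers (arbitrary units), for illustration only.
baseline_runs = [100.0, 101.2, 99.5, 100.8, 99.9, 100.3]
mitigated_runs = [104.1, 105.0, 103.6, 104.8, 104.4, 103.9]

point = statistics.mean(mitigated_runs) / statistics.mean(baseline_runs) - 1.0
ci_lo, ci_hi = bootstrap_ci(baseline_runs, mitigated_runs)
print(f"gain {point:.1%}, 95% CI [{ci_lo:.1%}, {ci_hi:.1%}]")
```

Reporting an interval like this for each framework and workload would show whether a headline figure such as "up to 6%" is distinguishable from run-to-run noise.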

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the concerns on causal isolation in experiments and model derivation below, with revisions to improve clarity and rigor.


Referee: [Experimental evaluation and methodology sections] The central causal claim—that thermal imbalance is the dominant driver of C3-induced straggling—rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.

Authors: We acknowledge the value of controlled interventions for stronger causal isolation. Our evaluation relies on natural thermal variations observed across repeated runs of real LLM training workloads on two distinct MI300X nodes and two frameworks, where temperature differentials consistently correlate with C3-induced straggling while workload balance is enforced by the frameworks and interconnect patterns do not align with the observed performance variation. In the revised manuscript we have added a dedicated subsection discussing potential confounders with supporting measurements, arguing that thermal effects are the most parsimonious explanation. Full active control of per-GPU temperature would require hardware not available in our production testbed. revision: partial
Referee: [Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.

Authors: The models are derived from first principles using standard thermal throttling equations (frequency scaling with temperature from vendor datasheets) and power models from the literature. We have revised the manuscript to include an explicit derivation section with the base equations, references to MI300X specifications, and validation on held-out data points to demonstrate predictive use beyond the original observations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The paper reports empirical correlations between kernel-level performance variation, C3 overlap, and thermal imbalance on MI300X nodes, coins the Lit Silicon effect from these observations, and proposes analytical performance and power models to quantify potential gains from mitigations. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the central claims to tautological inputs by construction. The models are presented as explanatory tools for system-level understanding and are evaluated via separate mitigation experiments yielding measured improvements, keeping the chain independent of the initial observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that thermal imbalance is the dominant cause of the observed straggling and that simple analytical models can capture the resulting performance and power behavior; no explicit free parameters are listed in the abstract but models are implied to exist.

axioms (1)

domain assumption Kernel-level performance variation is highly correlated with concurrent computation and communication (C3)
Stated directly as an observation in the abstract that underpins the thermal attribution.

invented entities (1)

Lit Silicon effect no independent evidence
purpose: To name and explain the thermal imbalance causing straggler GPUs to slow leader GPUs during C3
Newly coined term whose independent evidence is limited to the experiments described in the abstract.



Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 7 internal anchors

[1]

A High-Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication ,

R. C. Agarwal, F. G. Gustavson, and M. Zubair, “A High-Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication ,”IBM Journal of Re- search and Development, vol. 38, no. 6, pp. 673–681, 1994

work page 1994
[2]

ConCCL: Optimizing ML Concurrent Computation and Communication with GPU DMA Engines,

A. Agrawal, S. Aga, S. Pati, and M. Islam, “ConCCL: Optimizing ML Concurrent Computation and Communication with GPU DMA Engines,” inIEEE International Symposium on Performance Analysis of Systems and Software, 2025

work page 2025
[3]

Accelerating SQL database operations on a GPU with CUDA,

P. Bakkum and K. Skadron, “Accelerating SQL database operations on a GPU with CUDA,” inWorkshop on General-Purpose Computation on Graphics Processing Units, 2010

work page 2010
[4]

Language Models are Few-Shot Learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[5]

GPU Database Systems Characterization and Optimization,

J. Cao, R. Sen, M. Interlandi, J. Arulraj, and H. Kim, “GPU Database Systems Characterization and Optimization,”VLDB Endowment, vol. 17, no. 3, p. 441–454, Nov. 2023

work page 2023
[6]

Cen- tauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning,

C. Chen, X. Li, Q. Zhu, J. Duan, P. Sun, X. Zhang, and C. Yang, “Cen- tauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

work page 2024
[7]

Reducing Energy Bloat in Large Model Training,

J.-W. Chung, Y . Gu, I. Jang, L. Meng, N. Bansal, and M. Chowdhury, “Reducing Energy Bloat in Large Model Training,” inSymposium on Operating Systems Principles, 2024

work page 2024
[8]

LogP: Towards a Realistic Model of Parallel Computation,

D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. V on Eicken, “LogP: Towards a Realistic Model of Parallel Computation,” inACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993

work page 1993
[9]

Efficient AllReduce with Stragglers,

A. Devraj, E. Ding, A. V . Kumar, R. Kleinberg, and R. Singh, “Efficient AllReduce with Stragglers,”arXiv preprint arXiv:2505.23523, 2025

work page arXiv 2025
[10]

Temperature Management in Data Centers: Why Some (Might) Like It Hot,

N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and B. Schroeder, “Temperature Management in Data Centers: Why Some (Might) Like It Hot,” inACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, 2012

work page 2012
[11]

Nvidia Hopper GPU and Grace CPU Highlights,

A. C. Elster and T. A. Haugdahl, “Nvidia Hopper GPU and Grace CPU Highlights,”Computing in Science & Engineering, vol. 24, no. 2, pp. 95–100, 2022

work page 2022
[12]

Power Provisioning for a Warehouse-sized Computer,

X. Fan, W.-D. Weber, and L. A. Barroso, “Power Provisioning for a Warehouse-sized Computer,” inInternational Symposium on Computer Architecture, 2007

work page 2007
[13]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,”Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

work page 2022
[14]

Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters,

P. Garraghan, X. Ouyang, R. Yang, D. McKee, and J. Xu, “Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters,”IEEE Transactions on Services Computing, vol. 12, no. 1, pp. 91–104, 2019

work page 2019
[15]

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

R. Gond, N. Kwatra, and R. Ramjee, “TokenWeave: Efficient Compute- Communication Overlap for Distributed LLM Inference,”arXiv preprint arXiv:2505.11329, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Mar...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-scale Datacenters,

C.-H. Hsu, Q. Deng, J. Mars, and L. Tang, “SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-scale Datacenters,” inInternational Conference on Architectural Support for Programming Languages and Operating Systems, 2018

work page 2018
[18]

Demystifying NCCL: An In- depth Analysis of GPU Communication Protocols and Algorithms ,

Z. Hu, S. Shen, T. Bonato, S. Jeaugey, C. Alexander, E. Spada, J. Dinan, J. Hammond, and T. Hoefler, “Demystifying NCCL: An In- depth Analysis of GPU Communication Protocols and Algorithms ,” arXiv preprint arXiv:2507.04786, 2025

work page arXiv 2025
[19]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,

Y . Huang, Y . Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V . Le, Y . Wuet al., “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,”Advances in neural information processing systems, vol. 32, 2019

work page 2019
[20]

Tutel: Adaptive Mixture-of-Experts at Scale,

C. Hwang, W. Cui, Y . Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram, H. Chau, P. Cheng, F. Yang, M. Yang, and Y . Xiong, “Tutel: Adaptive Mixture-of-Experts at Scale,” inMachine Learning and Systems, 2023

work page 2023
[21]

DMA-Assisted, Intranode Communication in GPU Accelerated Systems,

F. Ji, A. M. Aji, J. Dinan, D. Buntinas, P. Balaji, R. Thakur, W.- c. Feng, and X. Ma, “DMA-Assisted, Intranode Communication in GPU Accelerated Systems,” inIEEE International Conference on High Performance Computing and Communication & IEEE International Conference on Embedded Software and Systems, 2012

work page 2012
[22]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of Experts,”arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Prediction-Based Power Oversubscription in Cloud Platforms,

A. G. Kumbhare, R. Azimi, I. Manousakis, A. Bonde, F. Frujeri, N. Mahalingam, P. A. Misra, S. A. Javadi, B. Schroeder, M. Fontoura, and R. Bianchini, “Prediction-Based Power Oversubscription in Cloud Platforms,” inUSENIX Annual Technical Conference, 2021

work page 2021
[24]

Efficient Memory Management for Large Lan- guage Model Serving with PagedAttention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient Memory Management for Large Lan- guage Model Serving with PagedAttention,” inSymposium on Operating Systems Principles, 2023

work page 2023
[25]

Characterizing Compute- Communication Overlap in GPU-Accelerated Distributed Deep Learn- ing: Performance and Power Implications,

S. Lee, J. Oh, S. Go, and D. Mahajan, “Characterizing Compute- Communication Overlap in GPU-Accelerated Distributed Deep Learn- ing: Performance and Power Implications,” inIEEE International Sym- posium on Performance Analysis of Systems and Software, 2025

work page 2025
[26]

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV- SLI, NVSwitch and GPUDirect,

A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. R. Tallent, and K. J. Barker, “Evaluating Modern GPU Interconnect: PCIe, NVLink, NV- SLI, NVSwitch and GPUDirect,”IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 1, pp. 94–110, 2019

work page 2019
[27]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

S. Li, Y . Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, “PyTorch Dis- tributed: Experiences on Accelerating Data Parallel Training,”arXiv preprint arXiv:2006.15704, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[28]

Understanding Stragglers in Large Model Training Using What- if Analysis,

J. Lin, Z. Jiang, Z. Song, S. Zhao, M. Yu, Z. Wang, C. Wang, Z. Shi, X. Shi, W. Jia, Z. Liu, S. Wang, H. Lin, X. Liu, A. Panda, and J. Li, “Understanding Stragglers in Large Model Training Using What- if Analysis,” inUSENIX Conference on Operating Systems Design and Implementation, 2025

work page 2025
[29]

RingAttention with Blockwise Transformers for Near-Infinite Context ,

H. Liu, M. Zaharia, and P. Abbeel, “RingAttention with Blockwise Transformers for Near-Infinite Context ,” inInternational Conference on Learning Representations, 2024

work page 2024
[30]

Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach,

V . Marjanovi ´c, J. Labarta, E. Ayguad ´e, and M. Valero, “Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach,” inInternational Conference on Supercomputing, 2010

work page 2010
[31]

A Measurement Study of GPU DVFS on Energy Conservation,

X. Mei, L. S. Yung, K. Zhao, and X. Chu, “A Measurement Study of GPU DVFS on Energy Conservation,” inWorkshop on Power-Aware Computing and Systems, 2013

work page 2013
[32]

PipeDream: Generalized Pipeline Parallelism for DNN Training,

D. Narayanan, A. Harlap, A. Phanishayee, V . Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “PipeDream: Generalized Pipeline Parallelism for DNN Training,” inSymposium on Operating Systems Principles, 2019

work page 2019
[33]

AMD and OpenAI Announce Strategic Partnership to Deploy 6 Gigawatts of AMD GPUs,

OpenAI, “AMD and OpenAI Announce Strategic Partnership to Deploy 6 Gigawatts of AMD GPUs,” https://openai.com/index/openai-amd- strategic-partnership/, Oct 6 2025

work page 2025
[34]

Characterizing Power Management Opportunities for LLMs in the Cloud,

P. Patel, E. Choukse, C. Zhang, . I. n. Goiri, B. Warrier, N. Mahalingam, and R. Bianchini, “Characterizing Power Management Opportunities for LLMs in the Cloud,” inInternational Conference on Architectural Support for Programming Languages and Operating Systems, 2024

work page 2024
[35]

Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware,

S. Pati, S. Aga, M. Islam, N. Jayasena, and M. D. Sinclair, “Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware,” inIEEE International Symposium on Workload Characterization, 2023

work page 2023
[36]

T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives,

——, “T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives,” inInternational Conference on Architectural Support for Programming Languages and Operating Systems, 2024

work page 2024
[37]

Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow,

T. Patki, Z. Frye, H. Bhatia, F. Di Natale, J. Glosli, H. Ingolfsson, and B. Rountree, “Comparing GPU Power and Frequency Capping: A Case Study with the MuMMI Workflow,” inIEEE/ACM Workflows in Support of Large-Scale Science, 2019

work page 2019
[38]

Power-aware Deep Learning Model Serving with µ-serve,

H. Qiu, W. Mao, A. Patke, S. Cui, S. Jha, C. Wang, H. Franke, Z. T. Kalbarczyk, T. Bas ¸ar, and R. K. Iyer, “Power-aware Deep Learning Model Serving with µ-serve,” inUSENIX Annual Technical Conference, 2024

work page 2024
[39]

ZeRO: Memory opti- mizations Toward Training Trillion Parameter Models,

S. Rajbhandari, J. Rasley, O. Ruwase, and Y . He, “ZeRO: Memory opti- mizations Toward Training Trillion Parameter Models,” inInternational Conference for High Performance Computing, Networking, Storage and Analysis, 2020

work page 2020
[40]

Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms,

S. Rashidi, M. Denton, S. Sridharan, S. Srinivasan, A. Suresh, J. Nie, and T. Krishna, “Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms,” inInternational Symposium on Computer Architecture, 2021

work page 2021
[41]

Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications,

J. C. Sancho, K. J. Barker, D. J. Kerbyson, and K. Davis, “Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications,” inACM/IEEE Conference on Supercomputing, 2006

work page 2006
[42]

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric,

G. Schieffer, R. Shi, S. Markidis, A. Herten, J. Faj, and I. Peng, “Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric,” inWorkshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2024

work page 2024
[43]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- zaro, “Megatron-LM: Training Multi-Billion Parameter Language Mod- els Using Model Parallelism,”arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[44]

The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study,

Z. Tang, Y . Wang, Q. Wang, and X. Chu, “The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study,” inACM International Conference on Future Energy Systems, 2019

work page 2019
[45]

Electric Power Monthly: Table ES1.A. Total Electric Power Industry Summary Statis- 13 tics,

U.S. Energy Information Administration (EIA), “Electric Power Monthly: Table ES1.A. Total Electric Power Industry Summary Statis- 13 tics,” https://www.eia.gov/electricity/monthly/epm table grapher.php?t= table es1a, 2025

[46]

J. S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford et al., “Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community,” Computing in Science & Engineering, vol. 13, no. 05, pp. 90–95, 2011.

[47]

L. Wang, G. von Laszewski, J. Dayal, and F. Wang, “Towards Energy Aware Scheduling for Precedence Constrained Parallel Tasks in a Cluster with DVFS,” in IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.

[48]

S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang, S. Kumar, T. Guo, Y. Xu, and Z. Zhou, “Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2022.

[49]

Z. Wang, Y. Zhang, F. Wei, B. Wang, Y. Liu, Z. Hu, J. Zhang, X. Xu, J. He, X. Wang, W. Dou, G. Chen, and C. Tian, “Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2025.

[50]

Y. Wei, M. Langer, F. Yu, M. Lee, J. Liu, J. Shi, and Z. Wang, “A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models,” in ACM Conference on Recommender Systems, 2022.

[51]

T. J. Whitney Zhao, C. Chen, S. Taveallaei, and Z. Wu, “OCP Accelerator Module Design Specification,” Open Compute Project. Retrieved February, vol. 13, p. 2021, 2019.

[52]

Q. Wu, Q. Deng, L. Ganesh, C.-H. Hsu, Y. Jin, S. Kumar, B. Li, J. Meza, and Y. J. Song, “Dynamo: Facebook’s Data Center-wide Power Management System,” in International Symposium on Computer Architecture, 2016.

[53]

Y. Xiao, S. Zhao, Z. Zhou, Z. Huan, L. Ju, X. Zhang, L. Wang, and J. Zhou, “G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems,” in International Conference on Information and Knowledge Management, 2023.

[54]

G. Xu, Z. Le, Y. Chen, Z. Lin, Z. Jin, Y. Miao, and C. Li, “AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training,” in USENIX Symposium on Networked Systems Design and Implementation, 2025.

[55]

H. Zhang and H. Hoffmann, “Maximizing Performance Under a Power Cap: A Comparison of Hardware, Software, and Hybrid Techniques,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2016.

[56]

Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer et al., “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,” arXiv preprint arXiv:2304.11277, 2023.
