Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
Pith reviewed 2026-05-17 23:07 UTC · model grok-4.3
The pith
Thermal imbalance across GPUs introduces stragglers that slow down the whole node when computation and communication run concurrently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lit Silicon is the observation that, in a multi-GPU node, thermal imbalance across GPUs can create node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Coupled with concurrent computation and communication (C3), this straggling produces node-level performance variation and inefficiency. Analytical performance and power models quantify the potential gains, and detection and mitigation techniques address the issue through three power-management solutions: power optimization under GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing.
What carries the argument
The Lit Silicon effect, a mechanism where thermal imbalance couples with concurrent computation and communication to produce stragglers that limit overall node performance.
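The coupling can be illustrated with a toy model. This is a minimal sketch, not the paper's analytical model: it assumes every training step ends with a synchronizing collective, so the step finishes only when the slowest GPU does. The function names and the per-GPU times are hypothetical.

```python
def step_time_ms(compute_ms):
    """A synchronizing collective completes only when the slowest GPU arrives."""
    return max(compute_ms)


def wait_fractions(compute_ms):
    """Fraction of each step that every GPU spends waiting on the slowest one."""
    step = step_time_ms(compute_ms)
    return [(step - c) / step for c in compute_ms]


# Hypothetical node: seven cooler GPUs and one hotter GPU that runs slower.
compute_ms = [100.0] * 7 + [106.0]

print("step time (ms):", step_time_ms(compute_ms))
print("leader wait fraction:", round(wait_fractions(compute_ms)[0], 4))
```

In this toy setting a single slow GPU sets the pace for the whole node, and the faster GPUs spend the difference waiting. That is the sense in which stragglers "slow down" leaders.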
If this is right
- Thermal imbalance leads to node-level performance variation in multi-GPU systems.
- Models predict gains from balancing thermal effects.
- Detection and mitigation techniques can reduce straggling.
- Power management solutions yield up to 6% performance and 4% power improvements.
- Datacenters running LLM training could save on electricity, potentially several tens of millions of dollars.
Where Pith is reading between the lines
- Implementing temperature-aware scheduling could extend these benefits to larger clusters.
- The effect may compound with other sources of variation in distributed training.
- Broader adoption of node-level power sloshing might optimize energy use across facilities.
- Similar thermal coupling could appear in other parallel hardware architectures.
Load-bearing premise
That kernel-level performance variation is primarily driven by thermal imbalance interacting with concurrent computation and communication rather than workload imbalance or other hardware factors.
What would settle it
Running the same C3 workloads on a multi-GPU node with enforced uniform GPU temperatures and observing whether performance variation is eliminated.
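One way to score such an experiment is to compare the spread of per-GPU kernel durations with and without enforced uniform temperatures. The sketch below uses the coefficient of variation as the spread metric. That metric, the function names, and the sample durations are assumptions for illustration, not taken from the paper.

```python
import statistics


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean (unitless spread)."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def variation_reduction(baseline, uniform_temp):
    """Relative drop in spread; 1.0 means the variation disappeared entirely."""
    cv_base = coefficient_of_variation(baseline)
    if cv_base == 0:
        return 0.0  # nothing to reduce
    return 1.0 - coefficient_of_variation(uniform_temp) / cv_base


# Hypothetical mean kernel durations (ms) per GPU, made up for illustration.
baseline = [100.0, 100.0, 100.0, 110.0]      # one hot, slow GPU
uniform_temp = [100.0, 100.0, 100.0, 100.0]  # temperatures equalized

print(variation_reduction(baseline, uniform_temp))
```

A reduction close to 1.0 would support the thermal explanation. A reduction close to 0.0 would point to workload imbalance or other hardware factors instead.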
Original abstract
GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems can exhibit performance variation at the node and cluster levels. Such performance variation can significantly impact both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). In this work, we analyze the performance of a single-node multi-GPU system running LLM training, and observe that the kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique to overlap computation and communication across GPUs for performance gains. We then take a further step to reason that thermally induced straggling coupled with C3 impacts performance variation, which we coin the Lit Silicon effect. More specifically, Lit Silicon describes that in a multi-GPU node, thermal imbalance across GPUs can introduce node-level straggler GPUs (hotter and slower), which in turn slow down the leader GPUs (cooler and faster). Lit Silicon can lead to node-level performance variation and inefficiency, potentially impacting the entire datacenter. We propose analytical performance and power models for Lit Silicon, to understand the potential system-level gains. We further design simple detection and mitigation techniques to effectively address the Lit Silicon problem, and evaluate three different power management solutions, including (1) power optimization under GPU thermal design power, (2) performance optimization under node-level GPU power capping, and (3) performance optimization under node-level CPU power sloshing. We conduct experiments on two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving several tens of millions of dollars in electricity costs in datacenters.
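The abstract does not spell out the detection technique. As a rough illustration of what a "simple" detector could look like, the sketch below flags GPUs whose mean compute-kernel duration exceeds the node median by more than a tolerance. The threshold rule, the 2% tolerance, the function name, and the sample data are all assumptions and should not be read as the authors' method.

```python
import statistics


def find_stragglers(kernel_ms_by_gpu, tolerance=0.02):
    """Return GPU ids whose mean kernel time exceeds the node median by > tolerance."""
    means = {gpu: statistics.fmean(times) for gpu, times in kernel_ms_by_gpu.items()}
    median = statistics.median(means.values())
    return sorted(gpu for gpu, mean in means.items() if mean > median * (1 + tolerance))


# Hypothetical per-GPU kernel durations (ms), made up for illustration.
samples = {
    0: [100.0, 101.0],
    1: [100.0, 100.0],
    2: [99.0, 100.0],
    3: [106.0, 107.0],  # the hotter, slower GPU in this toy example
}

print(find_stragglers(samples))
```

Comparing against the node median rather than the mean keeps a single extreme GPU from shifting the reference point it is judged against.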
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that thermal imbalance across GPUs in a multi-GPU node induces straggler GPUs (hotter/slower) that, when combined with concurrent computation and communication (C3), slow down leader GPUs (cooler/faster), producing node-level performance variation termed the 'Lit Silicon' effect. It supports this with observations from LLM training on AMD MI300X systems, proposes analytical performance and power models, and evaluates three power-management mitigations that yield up to 6% performance and 4% power gains.
Significance. If the proposed causal mechanism can be isolated from confounders, the work identifies a practically relevant source of inefficiency in large-scale GPU clusters that could inform both scheduling and power-management policies. The evaluation on two real MI300X nodes, two workloads, and two training frameworks provides concrete empirical grounding, and the explicit mitigation techniques constitute a useful engineering contribution.
major comments (2)
- [Experimental evaluation and methodology sections] The central causal claim—that thermal imbalance is the dominant driver of C3-induced straggling—rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.
- [Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.
minor comments (2)
- [Introduction and Lit Silicon definition] Clarify the precise definition of 'node-level straggler' versus 'leader' GPUs and how these roles are identified in the C3 overlap measurements.
- [Results and evaluation] The abstract states 'up to 6% performance and 4% power improvements'; the corresponding tables or figures should report confidence intervals or statistical significance for these gains across the two frameworks and workloads.
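The confidence intervals requested in the second minor comment could be obtained with a percentile bootstrap over repeated runs. This is a minimal sketch under stated assumptions: the function name, the resampling settings, and the iteration times are hypothetical, and the paper may report uncertainty differently.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline, mitigated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the relative speedup mean(baseline)/mean(mitigated) - 1."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        b = [rng.choice(baseline) for _ in baseline]
        m = [rng.choice(mitigated) for _ in mitigated]
        estimates.append(statistics.fmean(b) / statistics.fmean(m) - 1.0)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical iteration times (s) from repeated runs, made up for illustration.
baseline = [10.2, 10.4, 10.3, 10.5, 10.1]
mitigated = [9.8, 9.9, 9.7, 10.0, 9.6]

low, high = bootstrap_speedup_ci(baseline, mitigated)
print(f"95% CI for speedup: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero for each framework and workload would give the headline "up to" figures a firmer footing than a single best-case number.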
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address the concerns on causal isolation in experiments and model derivation below, with revisions to improve clarity and rigor.
Point-by-point responses
- Referee: [Experimental evaluation and methodology sections] The central causal claim, that thermal imbalance is the dominant driver of C3-induced straggling, rests on observed correlations between temperature and kernel-level performance variation. However, the experimental description does not report controlled interventions that independently vary per-GPU temperature or power while holding workload balance, interconnect traffic, and software scheduling fixed. Without such isolation, alternative explanations (workload imbalance, interconnect variability, or unmeasured hardware differences) cannot be ruled out, weakening the move from correlation to the proposed analytical models and mitigation claims.
  Authors: We acknowledge the value of controlled interventions for stronger causal isolation. Our evaluation relies on natural thermal variations observed across repeated runs of real LLM training workloads on two distinct MI300X nodes and two frameworks, where temperature differentials consistently correlate with C3-induced straggling while workload balance is enforced by the frameworks and interconnect patterns do not align with the observed performance variation. In the revised manuscript we have added a dedicated subsection discussing potential confounders with supporting measurements, arguing that thermal effects are the most parsimonious explanation. Full active control of per-GPU temperature would require hardware not available in our production testbed.
  Revision: partial
- Referee: [Analytical models section] The analytical performance and power models are introduced to quantify system-level gains, yet the manuscript provides no indication whether they are derived from first principles or fitted to the same experimental observations used to identify the effect. If the latter, the models risk circularity and cannot be used to predict behavior under new thermal or C3 conditions.
  Authors: The models are derived from first principles using standard thermal throttling equations (frequency scaling with temperature from vendor datasheets) and power models from the literature. We have revised the manuscript to include an explicit derivation section with the base equations, references to MI300X specifications, and validation on held-out data points to demonstrate predictive use beyond the original observations.
  Revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
The paper reports empirical correlations between kernel-level performance variation, C3 overlap, and thermal imbalance on MI300X nodes, coins the Lit Silicon effect from these observations, and proposes analytical performance and power models to quantify potential gains from mitigations. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the central claims to tautological inputs by construction. The models are presented as explanatory tools for system-level understanding and are evaluated via separate mitigation experiments yielding measured improvements, keeping the chain independent of the initial observations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Kernel-level performance variation is highly correlated with concurrent computation and communication (C3)
invented entities (1)
- Lit Silicon effect: no independent evidence