pith. sign in

arxiv: 2605.21427 · v1 · pith:KLGG565Fnew · submitted 2026-05-20 · 💻 cs.AI · cs.DC

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Pith reviewed 2026-05-21 04:02 UTC · model grok-4.3

classification 💻 cs.AI cs.DC
keywords power-aware LLM servingGPU power cappingenergy efficiencymixture-of-experts modelsfeedback controlvLLMQoS management
0
0 comments X

The pith

PALS improves LLM serving energy efficiency up to 26.3% by treating GPU power caps as a tunable control knob alongside batch size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PALS, a runtime system for large language model inference that jointly optimizes GPU power limits and software settings like batch size. It builds simple offline power-performance models and feeds them into a feedback controller to pick configurations that hit throughput targets while using less energy. This matters for data centers where LLMs drive high GPU power draw, because power has usually been treated as a fixed limit rather than something the serving system can actively manage. The implementation sits inside vLLM, requires no model retraining or API changes, and works for both dense models and mixture-of-experts architectures on multi-GPU hardware. Experiments show the approach cuts energy use, sharply reduces quality-of-service violations when power is constrained, and follows changing power budgets.

Core claim

PALS treats GPU power caps as a first-class control knob that is optimized together with batch size. Lightweight offline power-performance models combined with a feedback-driven controller select operating points that satisfy throughput targets while maximizing energy efficiency. The system runs inside an unmodified vLLM serving stack and delivers up to 26.3% better energy efficiency, 4x to 7x fewer QoS violations under power constraints, and the ability to track dynamic power budgets across multi-GPU setups for both dense and MoE models.

What carries the argument

Lightweight offline power-performance models paired with a feedback-driven controller that jointly tunes GPU power caps and batch size to meet throughput targets.

If this is right

  • LLM serving systems can operate closer to energy-proportional behavior by actively lowering power when load permits.
  • Data centers gain the ability to respect dynamic power caps from the grid without large drops in delivered throughput.
  • The same power-aware control loop applies to both dense and sparse mixture-of-experts models without separate tuning paths.
  • Existing inference frameworks can adopt the technique through a runtime layer rather than hardware or model changes.
  • Quality-of-service targets become easier to maintain when power availability fluctuates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar offline modeling plus feedback control could be applied to other GPU-heavy workloads such as training or scientific simulation if comparable power-performance surfaces exist.
  • Integration with demand-response signals from utilities would let AI clusters participate in grid stabilization without custom hardware.
  • Online refinement of the power models during operation might further reduce the gap between predicted and actual energy use under changing thermal conditions.

Load-bearing premise

Lightweight offline power-performance models built without model retraining can accurately guide a feedback controller to choose batch sizes and power caps that meet throughput targets on both dense and MoE models.

What would settle it

Run the controller on a held-out GPU architecture or workload trace and measure whether the selected power-cap and batch-size pairs consistently miss the target throughput by more than a few percent; sustained misses would show the models do not transfer well enough to support the claims.

Figures

Figures reproduced from arXiv: 2605.21427 by Ayse K. Coskun, Can Hankendi, Minlan Yu, Rana Shahout.

Figure 1
Figure 1. Figure 1: (a) tokens/J vs. power cap showing divergent behavior: compute-bound Mixtral continues to improve while communication-bound Qwen-MoE and OLMoE peak at 200 W and decline. (b) tokens/J vs. batch size: efficiency gains are substantial for all model families. that cannot be captured by either layer alone. This paper introduces the first LLM serving runtime that jointly opti￾mizes hardware power limits and soft… view at source ↗
Figure 2
Figure 2. Figure 2: Compute vs. communication time breakdown by model and configuration. Mixtral remains compute-bound; Qwen-MoE and OLMoE become communication-bound at higher TP and batch size. 3 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pareto frontier expansion for three MoE models (single node, 4×A100). Four frontiers are shown: SW only (batch sweep, fixed cap), HW only (cap sweep, fixed batch), HW+SW (joint cap×batch), and full joint (HW+SW+TP). The full frontier dominates any single-knob approach; gains are model-dependent and follow the compute/communication ratio. frontiers: it achieves the high-throughput end of the SW-only frontie… view at source ↗
Figure 4
Figure 4. Figure 4: Multi-node scaling: (a) efficiency drops as node count grows, especially for communication-bound Qwen￾MoE; (b) throughput grows but at diminishing efficiency returns. 2.3 Multi-Node Scaling Behavior MoE models do not always scale efficiently simply by adding more nodes [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Energy efficiency under expert parallelism across different node counts. Each group shows normalized tokens/J for a given model at 1, 2, and 3 nodes. Models with higher communication intensity exhibit larger efficiency degrada￾tion as parallelism increases. Telemetry layer. The telemetry layer monitors system state in real time, including GPU power consumption, through￾put (tokens/s) and GPU utilization. T… view at source ↗
Figure 6
Figure 6. Figure 6: PALS runtime design. Telemetry from inference execution is aggregated and fed to a controller that predicts feasible operating points and issues hardware- and software-level actuation decisions. a dataset that captures the mapping from inference config￾urations to performance and power, which is later used to train predictive models for runtime control. We perform controlled parameter sweeps over the follo… view at source ↗
Figure 8
Figure 8. Figure 8: Efficiency headroom from TP under varying power caps. Each curve shows the gap between a fixed-TP deploy￾ment and the best offline TP choice. Headroom varies with changing power caps, indicating that the benefit of TP selec￾tion is power-dependent. over 60-minute runs. We compare four strategies correspond￾ing to different levels of control: no adaptation (Baseline), software-only adaptation (Adaptive Batc… view at source ↗
Figure 7
Figure 7. Figure 7: Normalized tokens/J, average of five MoE models and three dense models. PALS achieves 26.3% improvement over baseline and reaches 95% of oracle efficiency [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Multi-node power-constrained evaluation (3 nodes, 60 min). (a) QoS violation rates: PALS reduces violations by 4×–7×. (b) Normalized aggregate efficiency by strategy. baseline, while able to track the power signal, suffers from underutilization at low power levels, as large batch sizes become inefficient under constrained power. PALS improves throughput by up to 22% at low power targets compared to the sta… view at source ↗
Figure 10
Figure 10. Figure 10: Grid demand-response tracking (1-hour, DeepSeek-MoE, 3-nodes). PALS maintains higher through￾put at low power targets by co-adapting batch size. PALS improves throughput by up to 22% at low power targets compared to the static-batch baseline [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
read the original abstract

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PALS, a power-aware runtime for LLM serving that treats GPU power caps as a first-class control knob. It combines lightweight offline power-performance models with a feedback-driven controller to jointly tune batch sizes and power caps, aiming to meet throughput targets while maximizing energy efficiency. Implemented in vLLM with no model retraining or API changes, the system is evaluated on multi-GPU setups for both dense and MoE models, claiming up to 26.3% energy efficiency gains, 4x–7x reductions in QoS violations under power constraints, and the ability to track dynamic power budgets.

Significance. If the results hold, this work could meaningfully advance energy-proportional LLM inference by integrating power control into serving runtimes. The practical focus on deployment without retraining or API modifications, along with explicit evaluation on MoE models, is a strength that addresses an increasingly relevant architecture.

major comments (2)
  1. [§4.2] §4.2 (Offline Power-Performance Models): The central claim that lightweight offline models can accurately guide the feedback controller for MoE models rests on the assumption that profiling runs capture input-dependent expert activation patterns. The manuscript does not describe how the models account for variability in routing decisions or token distributions; if constructed from fixed or average-case traces, predictions may deviate in deployment and directly undermine the reported energy-efficiency and QoS-violation results.
  2. [§6] §6 (Evaluation): The quantitative claims (26.3% efficiency improvement, 4x–7x QoS reduction) are presented without reported error bars, number of runs, or explicit description of power-measurement methodology and baselines. This information is load-bearing for assessing whether the gains are robust across input distributions and hardware configurations.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'up to 26.3%' would be clearer if the specific model, hardware configuration, and workload that achieve this maximum were stated.
  2. [§3] Notation: The symbols for power cap (P_cap) and target throughput could be introduced with a small table or equation early in §3 to improve readability for readers unfamiliar with the control loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, indicating where we agree and plan revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Offline Power-Performance Models): The central claim that lightweight offline models can accurately guide the feedback controller for MoE models rests on the assumption that profiling runs capture input-dependent expert activation patterns. The manuscript does not describe how the models account for variability in routing decisions or token distributions; if constructed from fixed or average-case traces, predictions may deviate in deployment and directly undermine the reported energy-efficiency and QoS-violation results.

    Authors: We agree that §4.2 would benefit from a more explicit description of how variability is handled. The offline models were constructed from profiling runs using a diverse collection of input traces drawn from real-world workloads, deliberately including sequences with varying lengths, content, and resulting expert routing patterns to capture input-dependent activation behavior in MoE models. The feedback controller then uses online measurements to compensate for any residual deviations from the profiled averages. We will revise the section to detail the trace selection process, the range of routing variability observed, and how this informs the lightweight model construction. revision: yes

  2. Referee: [§6] §6 (Evaluation): The quantitative claims (26.3% efficiency improvement, 4x–7x QoS reduction) are presented without reported error bars, number of runs, or explicit description of power-measurement methodology and baselines. This information is load-bearing for assessing whether the gains are robust across input distributions and hardware configurations.

    Authors: The referee is correct that these details are necessary for rigorous evaluation. We performed 5 independent runs for each reported configuration and will add error bars showing standard deviation. Power was measured via the NVIDIA NVML API with a 100 ms sampling interval on the multi-GPU testbed; baselines were unmodified vLLM with a static power cap matching the hardware limit. We will expand §6 and the experimental setup to include this methodology, the number of runs, and a discussion of robustness across the tested input distributions and hardware setups. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical offline models plus runtime feedback form a self-contained systems design.

full rationale

The paper presents PALS as a runtime system that constructs lightweight offline power-performance models from profiling runs and feeds them into a feedback-driven controller for joint batch-size and power-cap selection. No equations, uniqueness theorems, or derivations are shown that reduce any claimed prediction or result to its own fitted inputs by construction. The central claims rest on implementation inside vLLM and experimental measurements across dense and MoE models; these are externally falsifiable via reproduction on the same hardware rather than being forced by self-citation chains or definitional loops. The approach therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full text would be required to audit modeling assumptions or fitted constants.

pith-pipeline@v0.9.0 · 5741 in / 1222 out tokens · 45727 ms · 2026-05-21T04:02:59.831015+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 9 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Am- mar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jian- min Bao, Harkirat Behl, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.arXiv preprint arXiv:2404.14219(2024)

  2. [2]

    Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Kiwan Maeng, Udit Gupta, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu

  3. [3]

    InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’23)

    Carbon Explorer: A Holistic Framework for Designing Car- bon Aware Datacenters. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’23). 118–132. doi:10.1145/3575693.3575754

  4. [4]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 117–134

  5. [5]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Am- mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, and Yuxiong He. 2022. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. InSC22: International Conference for High Performance Comput- ing, Networking, Storage ...

  6. [6]

    Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. 2019. The Datacenter as a Computer: Designing Warehouse-Scale Machines (3rd ed.). Morgan & Claypool Publishers

  7. [7]

    Rishabh Bhoria, Anubhav Sehgal, Divyanshu Saxena, Debadatta Mishra, and Purushottam Kulkarni. 2025. TAPAS: Thermal and Power- Aware Scheduling for GPU Clusters. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

  8. [8]

    Rishabh Bhoria, Anubhav Sehgal, Divyanshu Saxena, Debadatta Mishra, and Purushottam Kulkarni. 2025. TAPAS: Thermal and Power- Aware Scheduling for GPU Clusters. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25)

  9. [9]

    Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, and Mosharaf Chowdhury. 2023. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In20th USENIX Sym- posium on Networked Systems Design and Implementation (NSDI ’23). USENIX Association, 119–139

  10. [10]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman

  11. [11]

    Training Verifiers to Solve Math Word Problems.arXiv preprint arXiv:2110.14168(2021)

  12. [12]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Spe- cialization in Mixture-of-Experts Language Models.arXiv preprint arXiv:2401.06066(2024)

  13. [13]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

  14. [14]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.arXiv preprint ...

  15. [15]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

  16. [16]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woojin Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica

  17. [17]

    Kim, K., Seo, A

    Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the ACM SIGOPS 29th Sym- posium on Operating Systems Principles. doi:10.1145/3600006.3613165

  18. [18]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.arXiv preprint arXiv:2006.16668 (2020)

  19. [19]

    Gonzalez, and Ion Stoica

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2022. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 663–679

  20. [20]

    Rohan Mahajan, Minsung Jang, Arjun Singhvi, Krishnan Kutty, Aditya Akella, and Shivaram Venkataraman. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA ’25)

  21. [21]

    Rohan Mahajan, Minsung Jang, Arjun Singhvi, Krishnan Kutty, Aditya Akella, and Shivaram Venkataraman. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 12 PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

  22. [22]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25). 1–17. doi:10.1145/3669940.3707215

  23. [23]

    OLMoE: Open Mixture-of-Experts Language Models

    Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. 2...

  24. [24]

    Seyed Morteza Nabavinejad, Sherief Reda, and Masoumeh Ebrahimi

  25. [25]

    doi:10.1109/TPDS.2021.3137867

    Coordinated Batching and DVFS for DNN Inference on GPU Accelerators.IEEE Transactions on Parallel and Distributed Systems33, 10 (2022), 2496–2508. doi:10.1109/TPDS.2021.3137867

  26. [26]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phan- ishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. InProceedings of the International Conference ...

  27. [27]

    NVIDIA Corporation.https://docs.nvidia.com/ deploy/nvml-api/index.htmlVersion R550

    NVIDIA Corporation 2024.NVIDIA Management Library (NVML) API Reference Guide. NVIDIA Corporation.https://docs.nvidia.com/ deploy/nvml-api/index.htmlVersion R550

  28. [28]

    NVIDIA Corporation.https://developer.nvidia.com/nvidia- system-management-interface

    NVIDIA Corporation 2024.nvidia-smi: NVIDIA System Management Interface. NVIDIA Corporation.https://developer.nvidia.com/nvidia- system-management-interface

  29. [29]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Charac- terizing Power Management Opportunities for LLMs in the Cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’24). 207–222. doi:10.114...

  30. [30]

    Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 of the Parameters.Qwen Blog(2024).https://qwenlm.github. io/blog/qwen-moe/

  31. [31]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.OpenAI Blog1, 8 (2019), 9.https://openai.com/research/ language-unsupervised

  32. [32]

    Ana Radovanovic, Ross Koningstein, Ian Schneider, Bokan Chen, Alexandre Duarte, Binz Roy, Diyue Xiao, Maya Haridasan, Patrick Hung, Nick Care, Saurav Talukdar, Eric Mullen, Kendal Smith, MariEllen Cottman, and Walfredo Cirne. 2023. Carbon-Aware Com- puting for Datacenters.IEEE Transactions on Power Systems38, 2 (2023), 1270–1280. doi:10.1109/TPWRS.2022.3173250

  33. [33]

    Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, and Michael Mitzenmacher. 2024. Fast inference for augmented large language models.arXiv preprint arXiv:2410.18248(2024)

  34. [34]

    Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, and Michael Mitzenmacher. 2024. Don’t Stop Me Now: Embedding Based Scheduling for LLMs.arXiv preprint arXiv:2410.01035(2024)

  35. [35]

    Le, Geoffrey E

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In5th International Conference on Learning Representations (ICLR ’17). https://openreview.net/forum?id=B1ckMDqlg

  36. [36]

    Fu, Zhiqiang Xie, Beidi Chen, Clark W

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark W. Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InProceedings of the 40th International Conference on Machine...

  37. [37]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. InarXiv preprint arXiv:1909.08053

  38. [38]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Alma- hairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

  39. [39]

    Runxin Wu, Yushi Bai, Siyang He, Shengding Hu, Yukun Zhou, Zhiyuan Liu, Furu Wei, and Maosong Sun. 2024. Fast Inference of Mixture-of-Experts Language Models with Offloading.arXiv preprint arXiv:2312.17238(2024)

  40. [40]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538

  41. [41]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence?. InProceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics (ACL ’19). 4791–4800. doi:10.18653/v1/P19-1472

  42. [42]

    Zhuoran Zhang, Daniel Wang, and Ayse K. Coskun. 2021. HPC Data Center Participation in Demand Response: An Adaptive Policy with QoS Assurance.IEEE Transactions on Sustainable Computing8, 3 (2021), 754–768. doi:10.1109/TSUSC.2021.3079166

  43. [43]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 193–210. 13