arxiv: 2305.05920 · v3 · submitted 2023-05-10 · 💻 cs.LG · cs.DC


Fast Distributed Inference Serving for Large Language Models

Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin


Pith reviewed 2026-05-17 11:17 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords LLM inference serving · preemptive scheduling · multi-level feedback queue · token-level preemption · GPU memory management · distributed serving system · latency-aware scheduling


The pith

FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LLM serving systems process each request to completion, which creates head-of-line blocking when new requests arrive. FastServe instead allows preemption after every generated token by exploiting the autoregressive structure of inference. It introduces a skip-join Multi-Level Feedback Queue scheduler that places each job into an initial queue using only its input length and skips higher-priority queues to limit demotions. The system also offloads and reloads intermediate GPU states to host memory as needed to free capacity. Experiments show these changes raise throughput by up to 31.4 times under average-latency bounds and 17.9 times under tail-latency bounds relative to the prior state-of-the-art system.
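
The head-of-line blocking argument can be illustrated with a toy simulation. The sketch below is not FastServe's scheduler. It assumes one output token costs one time step, switching is free, and the preemptive policy is a simple least-attained-service rule; the names `Job`, `run_to_completion`, and `token_level_preemption` are hypothetical. It only shows why preempting at token boundaries helps a short request that arrives behind a long one.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: int     # time step at which the job arrives
    out_tokens: int  # output tokens to generate (unknown to a real scheduler)


def run_to_completion(jobs):
    """FCFS: each job generates all of its tokens before the next one starts."""
    t, finish = 0, {}
    for job in sorted(jobs, key=lambda j: j.arrival):
        t = max(t, job.arrival) + job.out_tokens
        finish[job.name] = t
    return finish


def token_level_preemption(jobs):
    """After every token, run the arrived job with the fewest tokens generated so far.

    Assumption for illustration: least-attained-service with zero switch cost.
    """
    done = {j.name: 0 for j in jobs}
    finish, t = {}, 0
    while len(finish) < len(jobs):
        ready = [j for j in jobs if j.arrival <= t and j.name not in finish]
        if not ready:
            t += 1  # idle until the next arrival
            continue
        job = min(ready, key=lambda j: (done[j.name], j.arrival))
        done[job.name] += 1  # generate exactly one token, then reschedule
        t += 1
        if done[job.name] == job.out_tokens:
            finish[job.name] = t
    return finish


def mean_latency(jobs, finish):
    return sum(finish[j.name] - j.arrival for j in jobs) / len(jobs)


jobs = [Job("long", arrival=0, out_tokens=100), Job("short", arrival=1, out_tokens=5)]
rtc = run_to_completion(jobs)
pre = token_level_preemption(jobs)
print("run-to-completion:", rtc, mean_latency(jobs, rtc))
print("token-level:      ", pre, mean_latency(jobs, pre))
```

In this toy trace the short request finishes at step 10 instead of step 105, while the long request slips from step 100 to step 105, so mean latency falls from 102 to 57 steps. Real gains depend on the switching and offload costs that the sketch ignores.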

Core claim

FastServe is a distributed inference serving system for large language models that performs preemption at the granularity of each output token, employs a skip-join Multi-Level Feedback Queue scheduler that uses input length to assign an appropriate initial queue while skipping higher ones to reduce demotions, and uses proactive offloading of intermediate states between GPU and host memory; these mechanisms together produce throughput gains of up to 31.4x under average latency requirements and 17.9x under tail latency requirements compared with vLLM.

What carries the argument

The skip-join Multi-Level Feedback Queue scheduler, which assigns each arriving job to an initial queue based on its input length and skips the higher-priority queues to reduce demotions, while supporting token-level preemption.
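
A minimal sketch of the skip-join idea follows, under stated assumptions. The queue quanta, the linear prefill-time model in `estimated_prefill_ms`, and the class name `SkipJoinMLFQ` are all hypothetical and chosen only for illustration; the text above says only that input length determines the initial queue and that higher-priority queues are skipped.

```python
from collections import deque


class SkipJoinMLFQ:
    """Toy multi-level feedback queue with skip-join admission."""

    def __init__(self, quanta):
        # quanta[i] is the time slice of queue i; index 0 is the highest priority.
        self.quanta = list(quanta)
        self.queues = [deque() for _ in self.quanta]

    def initial_level(self, first_iter_time):
        # Skip every queue whose quantum is too small for the first iteration,
        # so the job is not demoted immediately after joining.
        for level, quantum in enumerate(self.quanta):
            if first_iter_time <= quantum:
                return level
        return len(self.quanta) - 1

    def admit(self, job_id, first_iter_time):
        level = self.initial_level(first_iter_time)
        self.queues[level].append(job_id)
        return level

    def demote(self, job_id, level):
        # A job that used up its quantum moves one level down (lowest level is sticky).
        new_level = min(level + 1, len(self.quanta) - 1)
        self.queues[new_level].append(job_id)
        return new_level

    def next_job(self):
        # Serve the highest-priority non-empty queue first.
        for level, queue in enumerate(self.queues):
            if queue:
                return queue.popleft(), level
        return None


def estimated_prefill_ms(input_len, ms_per_token=0.5):
    # Assumption: first-iteration time grows linearly with input length.
    return input_len * ms_per_token


mlfq = SkipJoinMLFQ(quanta=[10, 20, 40, 80])  # illustrative quanta in ms
for job_id, input_len in [("short", 16), ("long", 100), ("mid", 30)]:
    level = mlfq.admit(job_id, estimated_prefill_ms(input_len))
    print(job_id, "joins queue", level)
```

A plain MLFQ would place all three jobs in queue 0 and demote the long ones step by step; skip-join places them directly, which is where the reduction in demotions comes from.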

If this is right

More concurrent interactive sessions can be supported on the same GPU cluster without violating latency targets.
Requests of widely varying input lengths experience less interference with one another.
GPU memory is freed more promptly for new work, raising overall hardware utilization.
Serving clusters can be sized smaller while still meeting service-level objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-level preemption idea could be applied to other autoregressive generation tasks such as code completion or image captioning.
Integration with existing model-parallel frameworks would be needed to test whether the gains scale to models that do not fit on a single device.
Production deployments would benefit from adding runtime feedback to adjust queue levels when input-length statistics drift.

Load-bearing premise

Token-level preemption and the skip-join MLFQ assignment based on input length incur low enough overhead to deliver the reported gains without hidden costs in real workloads.

What would settle it

A workload trace in which the measured preemption and offload overhead exceeds the latency reduction achieved by the scheduler would show that the throughput gains disappear.
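
The break-even condition can be made concrete for a two-job case. The sketch below is a back-of-envelope model, not a measurement: time is counted in units of one decode step, the short job strictly preempts the long one once, and every switch costs a fixed `switch_cost` (a stand-in for preemption plus offload overhead). All function names are hypothetical.

```python
def mean_latency_rtc(long_tokens, short_tokens, short_arrival=1):
    """Run-to-completion: the short job waits for the whole long job."""
    long_latency = long_tokens
    short_latency = long_tokens + short_tokens - short_arrival
    return (long_latency + short_latency) / 2


def mean_latency_preempt(long_tokens, short_tokens, switch_cost, short_arrival=1):
    """Preempt the long job once when the short job arrives, then switch back."""
    short_finish = short_arrival + switch_cost + short_tokens
    long_finish = short_finish + switch_cost + (long_tokens - short_arrival)
    return (long_finish + (short_finish - short_arrival)) / 2


def break_even_switch_cost(long_tokens, short_tokens, short_arrival=1):
    """Switch cost at which preemption stops improving mean latency."""
    return (long_tokens - short_tokens - short_arrival) / 3


L, S = 100, 5
print("run-to-completion mean:", mean_latency_rtc(L, S))
print("preemptive mean, free switches:", mean_latency_preempt(L, S, 0))
print("break-even switch cost:", break_even_switch_cost(L, S))
```

For a 100-token job and a 5-token job, preemption keeps its advantage until each switch costs roughly 31 decode steps. A trace with many switches per job would lower that threshold, which is the situation the test above looks for.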

Original abstract

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FastServe's token-level preemption and skip-join MLFQ look like a practical step forward for LLM serving latency, but the KV cache offload costs need explicit measurement to back the big throughput claims.


The core new piece is the skip-join multi-level feedback queue, which uses input length to place each job in an initial queue and skips the higher-priority queues to limit demotions. It is combined with token-granularity preemption that exploits the autoregressive generation pattern, plus proactive offloading of intermediate state to host memory when GPU memory pressure builds. This moves away from the run-to-completion model in systems like vLLM and targets head-of-line blocking directly.

The reported gains are large: up to 31.4x throughput under fixed average latency and 17.9x under tail latency constraints. For anyone running interactive LLM workloads, those numbers would matter if they hold up in practice. The prototype evaluation gives a concrete basis for the claims rather than just simulation. The memory management looks like standard engineering done carefully for this setting.

The main soft spot is the preemption cost. A token-level switch that evicts a job's state means copying its growing KV cache over PCIe, and if the scheduler triggers many demotions the cumulative transfer time could eat into the latency budget. The abstract gives no direct bound on offload time versus compute, or on how often the skip-join rule actually demotes jobs under realistic output-length distributions. If the full paper shows low demotion rates and measures the PCIe overhead explicitly, that concern shrinks; otherwise the headline numbers rest on an assumption that still needs checking.

No mathematical circularity or hidden fitted parameters here; it's a systems measurement result. This is for people building or tuning LLM inference clusters who care about latency-throughput tradeoffs. A reader working on serving systems would get usable ideas from the scheduler and offload design even if they adapt rather than copy it. It deserves peer review because the problem is real, the approach is distinct from prior run-to-completion work, and the empirical claims are specific enough to test.
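
The size of the PCIe concern can be estimated with simple arithmetic. The sketch below assumes standard multi-head attention with fp16 keys and values, an illustrative 40-layer model with hidden size 5120 (roughly the shape of a 13B-parameter decoder), and an assumed effective host-link bandwidth of 25 GB/s. None of these values come from the paper, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem=2):
    # One key vector and one value vector of width hidden_size per layer.
    return 2 * num_layers * hidden_size * bytes_per_elem


def transfer_ms(num_tokens, num_layers, hidden_size, bandwidth_gb_s, bytes_per_elem=2):
    # Time to move the KV cache of num_tokens tokens in one direction.
    total_bytes = num_tokens * kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem)
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
ms = transfer_ms(1024, 40, 5120, bandwidth_gb_s=25.0)  # assumed effective bandwidth
print(f"{per_token / 1024:.0f} KiB per token, {ms:.1f} ms to move 1024 tokens")
```

Under these assumptions a 1024-token context carries about 800 MiB of KV cache and takes tens of milliseconds to move one way. Whether that is negligible depends on the measured per-token decode time and on how much of the transfer can overlap with compute, which is the measurement the letter asks for.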

Referee Report

2 major / 2 minor

Summary. The manuscript presents FastServe, a distributed LLM inference serving system that exploits the autoregressive generation pattern to support preemption at token granularity. It introduces a skip-join Multi-Level Feedback Queue scheduler that assigns jobs to initial queues using input length information while skipping higher-priority queues to limit demotions, paired with a proactive GPU memory manager that offloads and restores KV-cache states to host memory. A prototype implementation is evaluated against vLLM, with the central empirical claim being throughput gains of up to 31.4× under equivalent average latency and 17.9× under tail latency constraints.

Significance. If the performance claims are substantiated with overhead measurements and reproducible workloads, the work would offer a practical advance for latency-sensitive LLM serving by showing how token-level preemption can reduce head-of-line blocking. The semi-information-agnostic scheduling heuristic represents a pragmatic engineering compromise that could be adopted in production systems.

major comments (2)

[System Design] Preemption and memory management description: the central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler rather than hidden costs.
[Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.

minor comments (2)

[Introduction] The phrase 'semi-information-agnostic setting' is used in the abstract and introduction without a precise definition or comparison to fully agnostic or fully aware baselines; adding a short clarifying paragraph would improve accessibility.
[Evaluation] Figure captions and axis labels in the evaluation section should explicitly state the latency SLO values used for the throughput comparisons to allow direct interpretation of the 31.4× and 17.9× numbers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of overhead analysis and experimental details.


Referee: [System Design] Preemption and memory management description: the central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler rather than hidden costs.

Authors: We agree that explicit quantitative bounds on overheads would improve attribution of the reported gains. In the revised manuscript we have added a dedicated analysis subsection that measures PCIe transfer latency for KV-cache offload/restore operations relative to per-token generation time across the evaluated models and hardware. We also report demotion frequencies measured under output-length distributions drawn from public conversation traces, showing that the skip-join mechanism keeps demotions low. These new results confirm that the overhead remains small and that the throughput improvements are primarily due to reduced head-of-line blocking.
revision: yes

Referee: [Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.

Authors: We acknowledge that the original experimental section lacked sufficient detail for full reproducibility and generalization assessment. The revised manuscript expands the evaluation section to specify the exact workload traces (both synthetic and real traces from public sources), the model sizes and architectures tested, the precise hardware configuration (GPU models, memory, and interconnect), the concrete average and tail latency targets used to compute throughput, and an explicit statement that no post-hoc parameter tuning was applied beyond the design choices described in the paper. We have also made the evaluation configurations and scripts available as supplementary material.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation


The paper presents a systems design for LLM inference serving that relies on token-granularity preemption, proactive KV-cache offload, and a skip-join MLFQ scheduler whose initial queue assignment uses only input length. All load-bearing claims are throughput and latency improvements measured on a prototype implementation versus vLLM; no equations, fitted parameters, or first-principles derivations appear in the provided text. Consequently no step reduces by construction to its own inputs, self-citations, or ansatzes. The result is an independent empirical observation rather than a tautological renaming or prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper; the central claim rests on the assumption that token-level preemption is feasible with low overhead and that input length is known at arrival. No explicit free parameters, axioms, or invented entities are stated in the abstract.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
cs.LG 2026-05 conditional novelty 8.0
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
cs.DC 2026-05 unverdicted novelty 7.0
Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC 2026-04 unverdicted novelty 7.0
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
cs.DC 2026-03 unverdicted novelty 7.0
GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

Efficient Remote KV Cache Reuse with GPU-native Video Codec
cs.DC 2026-02 conditional novelty 7.0
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
cs.LG 2026-05 unverdicted novelty 6.0
PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
cs.DC 2026-05 unverdicted novelty 6.0
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
cs.DC 2026-05 unverdicted novelty 6.0
BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
cs.LG 2026-05 unverdicted novelty 6.0
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

EdgeFM: Efficient Edge Inference for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 5.0
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
cs.AR 2025-12 unverdicted novelty 5.0
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
cs.IR 2025-04 unverdicted novelty 5.0
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0
The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
cs.DC 2026-04 unverdicted novelty 3.0
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
cs.DC 2026-04 unverdicted novelty 2.0
This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
cs.LG 2026-03 unverdicted novelty 2.0
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
