pith. machine review for the scientific record.

arxiv: 2305.05920 · v3 · submitted 2023-05-10 · 💻 cs.LG · cs.DC


Fast Distributed Inference Serving for Large Language Models


Pith reviewed 2026-05-17 11:17 UTC · model grok-4.3

keywords LLM inference serving · preemptive scheduling · multi-level feedback queue · token-level preemption · GPU memory management · distributed serving system · latency-aware scheduling

The pith

FastServe enables token-level preemption and skip-join scheduling for LLM inference to raise throughput while holding latency fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LLM serving systems run each request to completion, which causes head-of-line blocking when new requests arrive behind long ones. FastServe instead exploits the autoregressive structure of inference to allow preemption after every generated token. It introduces a skip-join Multi-Level Feedback Queue scheduler that uses each job's input length to choose its initial queue and skips the higher-priority queues to limit demotions. The system also proactively offloads intermediate state from GPU memory to host memory and loads it back when needed, which frees GPU capacity. Experiments show these changes raise throughput by up to 31.4 times under the same average-latency requirement and 17.9 times under the same tail-latency requirement, relative to vLLM.
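
To make the head-of-line blocking argument concrete, the following toy simulation contrasts run-to-completion with switching jobs after every token. It is a minimal sketch under strong assumptions that are not from the paper: one job runs at a time (no batching), every token costs one time unit, preemption is free, and the preemptive policy picks the arrived job with the least service so far. The function names are hypothetical.

```python
def fcfs_run_to_completion(jobs):
    """Mean latency when each job runs to completion in arrival order.

    jobs: list of (arrival_time, output_tokens); one token costs one time unit.
    """
    clock, latencies = 0, []
    for arrival, tokens in sorted(jobs):
        clock = max(clock, arrival) + tokens
        latencies.append(clock - arrival)
    return sum(latencies) / len(latencies)


def token_level_preemptive(jobs):
    """Mean latency when the scheduler may switch jobs after every token.

    Assumed policy: run the arrived, unfinished job with the fewest tokens
    generated so far (ties broken by arrival order).
    """
    jobs = sorted(jobs)
    remaining = [tokens for _, tokens in jobs]
    served = [0] * len(jobs)
    finish = [None] * len(jobs)
    clock = 0
    while any(f is None for f in finish):
        ready = [i for i, (arrival, _) in enumerate(jobs)
                 if arrival <= clock and finish[i] is None]
        if not ready:
            # Idle until the next unfinished job arrives.
            clock = min(a for i, (a, _) in enumerate(jobs) if finish[i] is None)
            continue
        i = min(ready, key=lambda k: (served[k], k))
        served[i] += 1
        remaining[i] -= 1
        clock += 1  # one token generated
        if remaining[i] == 0:
            finish[i] = clock
    return sum(f - a for f, (a, _) in zip(finish, jobs)) / len(jobs)


# One long job followed by two short ones.
trace = [(0, 100), (1, 2), (2, 2)]
print(fcfs_run_to_completion(trace))   # 101.0
print(token_level_preemptive(trace))   # about 37.33
```

On this three-job trace the mean latency falls from 101 to about 37.3 time units, because the two short jobs no longer wait behind the 100-token job. The long job finishes slightly later (at time 104 instead of 100), which is the usual trade-off of preemptive scheduling.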

Core claim

FastServe is a distributed inference serving system for large language models that performs preemption at the granularity of each output token, employs a skip-join Multi-Level Feedback Queue scheduler that uses input length to assign an appropriate initial queue while skipping higher ones to reduce demotions, and uses proactive offloading of intermediate states between GPU and host memory; these mechanisms together produce throughput gains of up to 31.4x under average latency requirements and 17.9x under tail latency requirements compared with vLLM.

What carries the argument

A skip-join Multi-Level Feedback Queue scheduler carries the argument: it assigns each arriving job to an initial queue based on its input length, skips the higher-priority queues to reduce demotions, and supports token-level preemption.
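
The queue-assignment idea can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes that quanta double from one level to the next, that the first-iteration (prefill) time is proportional to input length, and that a job joins the highest-priority queue whose quantum covers that estimated time. The class name, parameters, and constants are all hypothetical.

```python
from collections import deque


class SkipJoinMLFQ:
    """Toy skip-join multi-level feedback queue (level 0 = highest priority)."""

    def __init__(self, num_queues=4, base_quantum=1.0, prefill_cost_per_token=0.01):
        # Assumed: quantum doubles at each lower-priority level.
        self.quanta = [base_quantum * 2 ** i for i in range(num_queues)]
        self.queues = [deque() for _ in range(num_queues)]
        # Assumed: prefill time grows linearly with input length.
        self.prefill_cost_per_token = prefill_cost_per_token

    def initial_level(self, input_len):
        """Skip every queue whose quantum is too short for the first iteration."""
        first_iter = input_len * self.prefill_cost_per_token
        for level, quantum in enumerate(self.quanta):
            if quantum >= first_iter:
                return level
        return len(self.quanta) - 1

    def admit(self, job_id, input_len):
        level = self.initial_level(input_len)
        self.queues[level].append(job_id)
        return level

    def demote(self, job_id, level):
        """Move a job that used up its quantum one level down (bounded)."""
        new_level = min(level + 1, len(self.queues) - 1)
        self.queues[new_level].append(job_id)
        return new_level

    def next_job(self):
        """Pop from the highest-priority non-empty queue, or return None."""
        for level, queue in enumerate(self.queues):
            if queue:
                return queue.popleft(), level
        return None


sched = SkipJoinMLFQ()
print(sched.admit("short", input_len=50))   # 0
print(sched.admit("long", input_len=300))   # 2
print(sched.next_job())                     # ('short', 0)
```

A plain MLFQ would place both jobs in level 0 and let the long prompt be demoted twice before it reaches a queue with a large enough quantum. Joining at level 2 directly avoids those demotions.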

If this is right

  • More concurrent interactive sessions can be supported on the same GPU cluster without violating latency targets.
  • Requests of widely varying input lengths experience less interference with one another.
  • GPU memory is freed more promptly for new work, raising overall hardware utilization.
  • Serving clusters can be sized smaller while still meeting service-level objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-level preemption idea could be applied to other autoregressive generation tasks such as code completion or image captioning.
  • Integration with existing model-parallel frameworks would be needed to test whether the gains scale to models that do not fit on a single device.
  • Production deployments would benefit from adding runtime feedback to adjust queue levels when input-length statistics drift.

Load-bearing premise

Token-level preemption and the skip-join MLFQ assignment based on input length incur low enough overhead to deliver the reported gains without hidden costs in real workloads.

What would settle it

A workload trace in which the measured preemption and offload overhead exceeds the latency reduction achieved by the scheduler would show that the throughput gains disappear.

original abstract

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FastServe, a distributed LLM inference serving system that exploits the autoregressive generation pattern to support preemption at token granularity. It introduces a skip-join Multi-Level Feedback Queue scheduler that assigns jobs to initial queues using input length information while skipping higher-priority queues to limit demotions, paired with a proactive GPU memory manager that offloads and restores KV-cache states to host memory. A prototype implementation is evaluated against vLLM, with the central empirical claim being throughput gains of up to 31.4× under equivalent average latency and 17.9× under tail latency constraints.

Significance. If the performance claims are substantiated with overhead measurements and reproducible workloads, the work would offer a practical advance for latency-sensitive LLM serving by showing how token-level preemption can reduce head-of-line blocking. The semi-information-agnostic scheduling heuristic represents a pragmatic engineering compromise that could be adopted in production systems.

major comments (2)
  1. [System Design] Preemption and memory management: the central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler, and hidden costs cannot be ruled out.
  2. [Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.
minor comments (2)
  1. [Introduction] The phrase 'semi-information-agnostic setting' is used in the abstract and introduction without a precise definition or comparison to fully agnostic or fully aware baselines; adding a short clarifying paragraph would improve accessibility.
  2. [Evaluation] Figure captions and axis labels in the evaluation section should explicitly state the latency SLO values used for the throughput comparisons to allow direct interpretation of the 31.4× and 17.9× numbers.
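
The overhead question in the first major comment can be framed with a back-of-envelope estimate. The sketch below assumes standard multi-head attention, where each token stores one key and one value vector per layer, and takes the link bandwidth and the per-token decode time as user-supplied inputs. The helper names, the 32 GB/s bandwidth, and the 30 ms decode step are illustrative assumptions, not measurements from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_elem=2):
    """KV-cache size for standard multi-head attention.

    Factor 2 accounts for keys and values; bytes_per_elem=2 assumes fp16.
    """
    return 2 * num_layers * hidden_size * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Ideal one-way transfer time, ignoring latency and contention."""
    return num_bytes / bandwidth_bytes_per_s


def offload_to_decode_ratio(num_tokens, num_layers, hidden_size,
                            bandwidth_bytes_per_s, decode_step_s):
    """How many decode steps one full offload of a job's KV cache costs."""
    size = kv_cache_bytes(num_tokens, num_layers, hidden_size)
    return transfer_seconds(size, bandwidth_bytes_per_s) / decode_step_s


# Example shape: 40 layers, hidden size 5120 (the OPT-13B configuration), fp16.
per_token = kv_cache_bytes(1, num_layers=40, hidden_size=5120)
print(per_token)  # 819200 bytes per token

# Assumed inputs: 1,000-token context, 32 GB/s link, 30 ms per decode step.
print(offload_to_decode_ratio(1000, 40, 5120, 32e9, 0.030))  # about 0.85
```

Under these assumed inputs, moving a 1,000-token context costs roughly 26 ms, a little less than one decode step. Whether that cost can be hidden depends on how much of the transfer overlaps with computation, which is the measurement the referee asks for.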

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of overhead analysis and experimental details.

point-by-point responses
  1. Referee: [System Design] Preemption and memory management: the central throughput claims rest on the assumption that token-granularity preemption plus repeated KV-cache offload/restore incurs negligible overhead. No quantitative bound is given on PCIe transfer time relative to per-token compute time, nor is the demotion frequency under realistic output-length distributions analyzed; without this, the 31.4×/17.9× gains cannot be confidently attributed to the scheduler, and hidden costs cannot be ruled out.

    Authors: We agree that explicit quantitative bounds on overheads would improve attribution of the reported gains. In the revised manuscript we have added a dedicated analysis subsection that measures PCIe transfer latency for KV-cache offload/restore operations relative to per-token generation time across the evaluated models and hardware. We also report demotion frequencies measured under output-length distributions drawn from public conversation traces, showing that the skip-join mechanism keeps demotions low. These new results confirm that the overhead remains small and that the throughput improvements are primarily due to reduced head-of-line blocking. revision: yes

  2. Referee: [Experimental evaluation] The reported speedups versus vLLM are presented without sufficient detail on workload traces, model sizes, hardware configuration, exact latency targets, or whether post-hoc tuning occurred. This information is load-bearing for assessing whether the gains generalize beyond the specific prototype runs.

    Authors: We acknowledge that the original experimental section lacked sufficient detail for full reproducibility and generalization assessment. The revised manuscript expands the evaluation section to specify the exact workload traces (both synthetic and real traces from public sources), the model sizes and architectures tested, the precise hardware configuration (GPU models, memory, and interconnect), the concrete average and tail latency targets used to compute throughput, and an explicit statement that no post-hoc parameter tuning was applied beyond the design choices described in the paper. We have also made the evaluation configurations and scripts available as supplementary material. revision: yes
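
The demotion-frequency measurement described in the first response could be approximated offline before running a full system. The sketch below is one hypothetical way to do that. It assumes a purely time-based service model, doubling quanta, one-level demotion whenever a job exhausts its quantum, and the same initial-queue rule as a skip-join scheduler (join the first queue whose quantum covers the first-iteration time). The exact rules in the paper may differ, and the job list is made up for illustration.

```python
def start_level(first_iter_time, quanta):
    """First (highest-priority) level whose quantum covers the first iteration."""
    for level, quantum in enumerate(quanta):
        if quantum >= first_iter_time:
            return level
    return len(quanta) - 1


def count_demotions(total_time, level, quanta):
    """Demotions a job suffers when it starts at `level` and needs `total_time`."""
    demotions, remaining = 0, total_time
    while level < len(quanta) - 1 and remaining > quanta[level]:
        remaining -= quanta[level]  # quantum used up, job not finished
        level += 1
        demotions += 1
    return demotions


def mean_demotions(jobs, quanta, skip_join):
    """jobs: list of (first_iteration_time, decode_time) pairs."""
    total = 0
    for first_iter_time, decode_time in jobs:
        level = start_level(first_iter_time, quanta) if skip_join else 0
        total += count_demotions(first_iter_time + decode_time, level, quanta)
    return total / len(jobs)


quanta = [1.0, 2.0, 4.0, 8.0]
jobs = [(0.2, 0.3), (3.0, 2.0), (6.0, 1.0)]  # made-up service times
print(mean_demotions(jobs, quanta, skip_join=False))  # about 1.33
print(mean_demotions(jobs, quanta, skip_join=True))   # about 0.33
```

Feeding the same function with service times derived from a real conversation trace would give the kind of demotion statistics the referee requested, under the stated modeling assumptions.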

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation

full rationale

The paper presents a systems design for LLM inference serving that relies on token-granularity preemption, proactive KV-cache offload, and a skip-join MLFQ scheduler whose initial queue assignment uses only input length. All load-bearing claims are throughput and latency improvements measured on a prototype implementation versus vLLM; no equations, fitted parameters, or first-principles derivations appear in the provided text. Consequently no step reduces by construction to its own inputs, self-citations, or ansatzes. The result is an independent empirical observation rather than a tautological renaming or prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper; the central claim rests on the assumption that token-level preemption is feasible with low overhead and that input length is known at arrival. No explicit free parameters, axioms, or invented entities are stated in the abstract.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  2. Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

  3. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC 2026-04 unverdicted novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  4. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  5. Efficient Remote KV Cache Reuse with GPU-native Video Codec

    cs.DC 2026-02 conditional novelty 7.0

    KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

  6. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  7. PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

    cs.LG 2026-05 unverdicted novelty 6.0

    PRISM reduces P99 TTFT by 23.3-37.1% and raises exact-prefix KV-cache hit rates by 5.9-12.2 points versus the strongest baseline on 4B and 13B models by jointly optimizing scheduling and memory.

  8. Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

    cs.DC 2026-05 unverdicted novelty 6.0

    BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and A...

  9. Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

    cs.DC 2026-05 unverdicted novelty 6.0

    BalanceRoute reduces data-parallel imbalance in LLM inference via F-score routing and lookahead, yielding higher end-to-end throughput on 144-NPU clusters versus vLLM baselines.

  10. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

    cs.LG 2026-05 unverdicted novelty 6.0

    A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

  11. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  12. ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

    cs.AR 2025-12 unverdicted novelty 5.0

    ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

  13. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

    cs.IR 2025-04 unverdicted novelty 5.0

    The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

  14. Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...

  15. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

  16. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  17. Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

    cs.DC 2026-04 unverdicted novelty 2.0

    This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.

  18. Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

    cs.LG 2026-03 unverdicted novelty 2.0

    The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
