Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arxiv: 2605.19593 · v1 · pith:ZSADPI3Rnew · submitted 2026-05-19 · 💻 cs.AI · cs.DC

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Mert Yildiz , Pietro Spadaccino , Alexey Rolich , Francesca Cuomo , Andrea Baiocchi This is my paper

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.AI cs.DC

keywords multi-model LLMGPU offloadingpreemptiondecode throughputschedulingheterogeneous hardwareempirical study

0 comments p. Extension

pith:ZSADPI3R Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{ZSADPI3R}

Prints a linked pith:ZSADPI3R badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Offloading LLM layers to CPU causes non-linear and model-specific drops in decode throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the challenges of running several large language models together on the same hardware when GPU memory is limited. It measures what happens when parts of a model are moved to CPU memory or when one model is paused to run another. The results show that moving layers off the GPU does not slow things down gradually but in sharp, uneven drops, and smaller models are more sensitive to this change. Preempting a model to switch to another adds big delays, mostly from reloading the whole model rather than just the temporary data. Understanding these costs helps design better systems that can handle many different models at once without wasting resources.

Core claim

The central claim is that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. Preemption incurs substantial overhead dominated by model state reload rather than key-value cache transfer, with costs varying significantly across models and hardware platforms. Sequence length and interconnect bandwidth amplify these effects.

What carries the argument

Measurements of throughput under varying GPU residency fractions and preemption events for different LLMs on multiple hardware platforms.

Load-bearing premise

The specific models, hardware platforms, sequence lengths, and workloads tested are representative of the conditions under which future multi-model LLM schedulers will operate.

What would settle it

Running experiments with a wider range of models and finding that all exhibit similar linear degradation under offloading would disprove the model-dependent non-linear claim.

Figures

Figures reproduced from arXiv: 2605.19593 by Alexey Rolich, Andrea Baiocchi, Francesca Cuomo, Mert Yildiz, Pietro Spadaccino.

**Figure 2.** Figure 2: Throughput by the Layer allocation of GPU % for Qwen3:32B model. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Throughput by the Layer allocation of GPU % for Llama2:70B model. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Normalized throughput [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Preemption overhead vs. preemption point for four model configurations. Shaded bands indicate one standard deviation. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Overhead composition at the earliest and latest preemption points for different models. The GPU used is RTX 5000. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Overhead composition at the earliest and latest preemption points for different models. The GPU used is RTX A6000. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents an empirical study of how different LLMs behave across hardware platforms under layer offloading and preemption in multi-model serving scenarios with GPU memory constraints. It reports that offloading produces strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency; that preemption incurs substantial overhead dominated by model state reload rather than KV cache transfer, varying across models and platforms; and that sequence length and interconnect bandwidth amplify data movement inefficiencies. From these observations the authors identify key features (model-specific offloading sensitivity, workload characteristics, preemption and data-transfer cost structures) that future schedulers must consider.

Significance. If the reported patterns generalize, the work supplies concrete empirical guidance for scheduler design in heterogeneous multi-model LLM deployments, a setting that existing single-model throughput optimizers do not address. The emphasis on model-dependent sensitivities and the dominance of reload costs over KV-cache movement could directly inform practical resource-allocation policies.

major comments (1)

Abstract: the claim that the findings 'provide guidance for the design of next-generation LLM serving systems' is load-bearing for the paper's contribution, yet the representativeness of the tested models, hardware platforms, sequence lengths, and synthetic workloads is not established. No sensitivity sweeps, cross-validation on held-out configurations, or explicit justification for extrapolation are provided, leaving the central utility claim dependent on an untested generalization step.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the scope and generalizability of our empirical study. We address the single major comment below and will revise the manuscript to improve transparency around our experimental choices and limitations.

read point-by-point responses

Referee: Abstract: the claim that the findings 'provide guidance for the design of next-generation LLM serving systems' is load-bearing for the paper's contribution, yet the representativeness of the tested models, hardware platforms, sequence lengths, and synthetic workloads is not established. No sensitivity sweeps, cross-validation on held-out configurations, or explicit justification for extrapolation are provided, leaving the central utility claim dependent on an untested generalization step.

Authors: We acknowledge that our work is an empirical investigation rather than an exhaustive benchmark and that stronger justification for the tested configurations would better support the guidance claim. The models were selected to cover a practical range of sizes (7B–70B) and families (Llama, Mistral, and others) that appear in current multi-model deployments; hardware comprised A100 and H100 GPUs, which dominate production clusters; sequence lengths reached 4096 tokens, consistent with typical serving traces; and workloads were synthetic to isolate offloading and preemption effects. While we did not perform additional sensitivity sweeps or held-out cross-validation, the reported non-linear throughput degradation and reload-cost dominance appeared consistently across every configuration we measured. In the revision we will add an explicit “Experimental Scope and Limitations” subsection that justifies the chosen points by reference to their prevalence in production systems and that states the boundaries of extrapolation. This addition will make the basis for our guidance transparent without requiring new experiments or altering the core observations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observations only

full rationale

The paper is an empirical study reporting measured throughput degradation under offloading and preemption for specific LLMs, hardware, and workloads. No equations, fitted parameters, derivations, or self-citation chains appear in the provided abstract or description. Claims rest on direct experimental results rather than any reduction to inputs by construction, self-definitional loops, or imported uniqueness results. The representativeness concern raised by the skeptic is a question of external validity, not circularity in the derivation. This matches the default case of a self-contained empirical paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As an empirical measurement study, the claims rest primarily on the validity and representativeness of the experimental conditions rather than on new mathematical axioms or invented entities.

axioms (1)

domain assumption The tested hardware platforms, models, and workloads are representative of real-world multi-model LLM serving scenarios.
The abstract draws general implications for future schedulers from measurements on specific platforms.

pith-pipeline@v0.9.0 · 5786 in / 1222 out tokens · 43637 ms · 2026-05-20T05:53:57.028054+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

offloading leads to strongly non-linear and model-dependent degradation in decode throughput... preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1]

Efﬁcient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Y u, J. Gonzalez, H. Zhang, and I. Stoica, “Efﬁcient memory management for large language model serving with pagedattention,” in Proceedings of the 29th symposium on operating systems principles , 2023, pp. 611–626

work page 2023
[2]

Flexgen: High-throughput generative inference of large language models with a single gpu,

Y . Sheng, L. Zheng, B. Y uan, Z. Li, M. Ryabinin, B. Chen, P . Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” in International Conference on Machine Learning . PMLR, 2023, pp. 31 094–31 116

work page 2023
[3]

Lia: A single-gpu llm inference acceleration with cooperative amx- enabled cpu-gpu computation and cxl ofﬂoading,

H. Kim, N. Wang, Q. Xia, J. Huang, A. Y azdanbakhsh, and N. S. Kim, “Lia: A single-gpu llm inference acceleration with cooperative amx- enabled cpu-gpu computation and cxl ofﬂoading,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture , 2025, pp. 544–558

work page 2025
[4]

Powerinfer: Fast large language model serving with a consumer-grade gpu,

Y . Song, Z. Mi, H. Xie, and H. Chen, “Powerinfer: Fast large language model serving with a consumer-grade gpu,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles , 2024, pp. 590–606

work page 2024
[5]

Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models,

K. Kamahori, T. Tang, Y . Gu, K. Zhu, and B. Kasikci, “Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models,” 2024

work page 2024
[6]

Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models,

H. Chen, W. Xie, B. Zhang, J. Tang, J. Wang, J. Dong, S. Chen, Z. Y uan, C. Lin, C. Qiu et al. , “Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles , 2025, pp. 1014–1029

work page 2025
[7]

Neo: Saving gpu memory crisis with cpu ofﬂoading for online llm inference,

X. Jiang, Y . Zhou, S. Cao, I. Stoica, and M. Y u, “Neo: Saving gpu memory crisis with cpu ofﬂoading for online llm inference,” vol. 7, 2025

work page 2025
[8]

Lm-ofﬂoad: Performance model-guided generative inference of large language models with parallelism control,

J. Wu, J. Ren, S. Y ang, K. Parasyris, G. Georgakoudis, I. Laguna, and D. Li, “Lm-ofﬂoad: Performance model-guided generative inference of large language models with parallelism control,” IEEE, pp. 840–849, 2025

work page 2025
[9]

Kvquant: Towards 10 million context length llm inference with kv cache quantization,

C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y . S. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million context length llm inference with kv cache quantization,” Advances in Neural Information Processing Systems , vol. 37, pp. 1270–1303, 2024

work page 2024
[10]

Orca: A distributed serving system for Transformer-Based generative models,

G.-I. Y u, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX symposium on operating systems design and imple- mentation (OSDI 22) , 2022, pp. 521–538

work page 2022
[11]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efﬁcient llm inference by piggybacking decodes with chunked preﬁlls,” arXiv preprint arXiv:2308.16369 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Taming throughput-latency tradeoff in llm inference with sarathi-serve,

A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, and R. Ramjee, “Taming throughput-latency tradeoff in llm inference with sarathi-serve,” in 18th USENIX symposium on operating systems design and implementation (OSDI 24) , 2024, pp. 117–134

work page 2024
[13]

Splitwise: Efﬁcient generative llm inference using phase splitting,

P . Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efﬁcient generative llm inference using phase splitting,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) . IEEE, 2024, pp. 118–132

work page 2024
[14]

Distserve: Disaggregating preﬁll and decoding for goodput- optimized large language model serving,

Y . Zhong, S. Liu, J. Chen, J. Hu, Y . Zhu, X. Liu, X. Jin, and H. Zhang, “Distserve: Disaggregating preﬁll and decoding for goodput- optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 193–210

work page 2024
[15]

Déjàvu: Kv-cache streaming for fast, fault-tolerant generative llm serving,

F. Strati, S. McAllister, A. Phanishayee, J. Tarnawski, and A. Klimovic, “Déjàvu: Kv-cache streaming for fast, fault-tolerant generative llm serving,” in Proceedings of the 41st International Conference on Machine Learning , 2024, pp. 46 745–46 771

work page 2024
[16]

Inﬁnigen: Efﬁcient generative inference of large language models with dynamic kv cache manage- ment,

W. Lee, J. Lee, J. Seo, and J. Sim, “Inﬁnigen: Efﬁcient generative inference of large language models with dynamic kv cache manage- ment,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 155–172

work page 2024
[17]

Instattention: in-storage attention ofﬂoading for cost- effective long-context llm inference,

X. Pan, E. Li, Q. Li, S. Liang, Y . Shan, K. Zhou, Y . Luo, X. Wang, and J. Zhang, “Instattention: in-storage attention ofﬂoading for cost- effective long-context llm inference,” pp. 1510–1525, 2025

work page 2025
[18]

Llumnix: Dynamic scheduling for large language model serving,

B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y . Li, and W. Lin, “Llumnix: Dynamic scheduling for large language model serving,” in 18th USENIX symposium on operating systems design and implemen- tation (OSDI 24) , 2024, pp. 173–191

work page 2024
[19]

Fastswitch: Optimizing context switching efﬁciency in fairness-aware large language model serving,

A. Shen, Z. Li, and M. Gao, “Fastswitch: Optimizing context switching efﬁciency in fairness-aware large language model serving,” arXiv preprint arXiv:2411.18424 , 2024

work page arXiv 2024
[20]

Fast Distributed Inference Serving for Large Language Models

B. Wu, Y . Zhong, Z. Zhang, S. Liu, F. Liu, Y . Sun, G. Huang, X. Liu, and X. Jin, “Fast distributed inference serving for large language models,” arXiv preprint arXiv:2305.05920 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

s3: Increasing GPU utilization during generative inference for higher throughput,

Y . Jin, C.-F. Wu, D. Brooks, and G.-Y . Wei, “ s3: Increasing GPU utilization during generative inference for higher throughput,” vol. 36, 2023, pp. 18 015–18 027

work page 2023
[22]

Response length perception and sequence scheduling: An llm-empowered llm in- ference pipeline,

Z. Zheng, X. Ren, F. Xue, Y . Luo, X. Jiang, and Y . Y ou, “Response length perception and sequence scheduling: An llm-empowered llm in- ference pipeline,” Advances in Neural Information Processing Systems , vol. 36, pp. 65 517–65 530, 2023

work page 2023
[23]

The effect of scheduling and preemption on the efﬁciency of llm inference serving,

K.-M. Kim, K. Hong, C. Gulcehre, and A. Ailamaki, “The effect of scheduling and preemption on the efﬁciency of llm inference serving,” 2024

work page 2024
[24]

Spotserve: Serving generative large language models on preemptible instances,

X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia, “Spotserve: Serving generative large language models on preemptible instances,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume 2, 2024, pp. 1112–1127

work page 2024
[25]

Llm inference scheduling: A survey of techniques, frameworks, and trade-offs,

M. Heisler, Z. Y ouseﬁjamarani, X. Wang, Q. Wang, G. Shi, H. Sadri, T. Y u, Y . Li, H. Li, G. Singh et al. , “Llm inference scheduling: A survey of techniques, frameworks, and trade-offs,” Authorea Preprints, 2025

work page 2025
[26]

Muxserve: ﬂexible spatial-temporal multiplexing for multiple llm serving,

J. Duan, R. Lu, H. Duanmu, X. Li, X. Zhang, D. Lin, I. Stoica, and H. Zhang, “Muxserve: ﬂexible spatial-temporal multiplexing for multiple llm serving,” 2024

work page 2024
[27]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market,

Y . Xiang, X. Li, K. Qian, Y . Y ang, D. Zhu, W. Y u, E. Zhai, X. Liu, X. Jin, and J. Zhou, “Aegaeon: Effective gpu pooling for concurrent llm serving on the market,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles , 2025, pp. 1030–1045

work page 2025
[28]

Prism: Unleashing gpu sharing for cost-efﬁcient multi- llm serving,

S. Y u, J. Xing, Y . Qiao, M. Ma, Y . Li, Y . Wang, S. Y ang, Z. Xie, S. Cao, K. Bao et al. , “Prism: Unleashing gpu sharing for cost-efﬁcient multi- llm serving,” arXiv preprint arXiv:2505.04021 , 2025

work page arXiv 2025
[29]

Serverlessllm: Low-latency serverless inference for large language models,

Y . Fu, L. Xue, Y . Huang, A.-O. Brabete, D. Ustiugov, Y . Patel, and L. Mai, “Serverlessllm: Low-latency serverless inference for large language models,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 135–153

work page 2024
[30]

[Online]

NVIDIA Corporation, CUDA C++ Programming Guide – API Synchronization Behavior , 2025, accessed: 2025. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html# api-synchronization-behavior

work page 2025

[1] [1]

Efﬁcient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Y u, J. Gonzalez, H. Zhang, and I. Stoica, “Efﬁcient memory management for large language model serving with pagedattention,” in Proceedings of the 29th symposium on operating systems principles , 2023, pp. 611–626

work page 2023

[2] [2]

Flexgen: High-throughput generative inference of large language models with a single gpu,

Y . Sheng, L. Zheng, B. Y uan, Z. Li, M. Ryabinin, B. Chen, P . Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” in International Conference on Machine Learning . PMLR, 2023, pp. 31 094–31 116

work page 2023

[3] [3]

Lia: A single-gpu llm inference acceleration with cooperative amx- enabled cpu-gpu computation and cxl ofﬂoading,

H. Kim, N. Wang, Q. Xia, J. Huang, A. Y azdanbakhsh, and N. S. Kim, “Lia: A single-gpu llm inference acceleration with cooperative amx- enabled cpu-gpu computation and cxl ofﬂoading,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture , 2025, pp. 544–558

work page 2025

[4] [4]

Powerinfer: Fast large language model serving with a consumer-grade gpu,

Y . Song, Z. Mi, H. Xie, and H. Chen, “Powerinfer: Fast large language model serving with a consumer-grade gpu,” in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles , 2024, pp. 590–606

work page 2024

[5] [5]

Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models,

K. Kamahori, T. Tang, Y . Gu, K. Zhu, and B. Kasikci, “Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models,” 2024

work page 2024

[6] [6]

Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models,

H. Chen, W. Xie, B. Zhang, J. Tang, J. Wang, J. Dong, S. Chen, Z. Y uan, C. Lin, C. Qiu et al. , “Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles , 2025, pp. 1014–1029

work page 2025

[7] [7]

Neo: Saving gpu memory crisis with cpu ofﬂoading for online llm inference,

X. Jiang, Y . Zhou, S. Cao, I. Stoica, and M. Y u, “Neo: Saving gpu memory crisis with cpu ofﬂoading for online llm inference,” vol. 7, 2025

work page 2025

[8] [8]

Lm-ofﬂoad: Performance model-guided generative inference of large language models with parallelism control,

J. Wu, J. Ren, S. Y ang, K. Parasyris, G. Georgakoudis, I. Laguna, and D. Li, “Lm-ofﬂoad: Performance model-guided generative inference of large language models with parallelism control,” IEEE, pp. 840–849, 2025

work page 2025

[9] [9]

Kvquant: Towards 10 million context length llm inference with kv cache quantization,

C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y . S. Shao, K. Keutzer, and A. Gholami, “Kvquant: Towards 10 million context length llm inference with kv cache quantization,” Advances in Neural Information Processing Systems , vol. 37, pp. 1270–1303, 2024

work page 2024

[10] [10]

Orca: A distributed serving system for Transformer-Based generative models,

G.-I. Y u, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX symposium on operating systems design and imple- mentation (OSDI 22) , 2022, pp. 521–538

work page 2022

[11] [11]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efﬁcient llm inference by piggybacking decodes with chunked preﬁlls,” arXiv preprint arXiv:2308.16369 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Taming throughput-latency tradeoff in llm inference with sarathi-serve,

A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, and R. Ramjee, “Taming throughput-latency tradeoff in llm inference with sarathi-serve,” in 18th USENIX symposium on operating systems design and implementation (OSDI 24) , 2024, pp. 117–134

work page 2024

[13] [13]

Splitwise: Efﬁcient generative llm inference using phase splitting,

P . Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efﬁcient generative llm inference using phase splitting,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) . IEEE, 2024, pp. 118–132

work page 2024

[14] [14]

Distserve: Disaggregating preﬁll and decoding for goodput- optimized large language model serving,

Y . Zhong, S. Liu, J. Chen, J. Hu, Y . Zhu, X. Liu, X. Jin, and H. Zhang, “Distserve: Disaggregating preﬁll and decoding for goodput- optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 193–210

work page 2024

[15] [15]

Déjàvu: Kv-cache streaming for fast, fault-tolerant generative llm serving,

F. Strati, S. McAllister, A. Phanishayee, J. Tarnawski, and A. Klimovic, “Déjàvu: Kv-cache streaming for fast, fault-tolerant generative llm serving,” in Proceedings of the 41st International Conference on Machine Learning , 2024, pp. 46 745–46 771

work page 2024

[16] [16]

Inﬁnigen: Efﬁcient generative inference of large language models with dynamic kv cache manage- ment,

W. Lee, J. Lee, J. Seo, and J. Sim, “Inﬁnigen: Efﬁcient generative inference of large language models with dynamic kv cache manage- ment,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 155–172

work page 2024

[17] [17]

Instattention: in-storage attention ofﬂoading for cost- effective long-context llm inference,

X. Pan, E. Li, Q. Li, S. Liang, Y . Shan, K. Zhou, Y . Luo, X. Wang, and J. Zhang, “Instattention: in-storage attention ofﬂoading for cost- effective long-context llm inference,” pp. 1510–1525, 2025

work page 2025

[18] [18]

Llumnix: Dynamic scheduling for large language model serving,

B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y . Li, and W. Lin, “Llumnix: Dynamic scheduling for large language model serving,” in 18th USENIX symposium on operating systems design and implemen- tation (OSDI 24) , 2024, pp. 173–191

work page 2024

[19] [19]

Fastswitch: Optimizing context switching efﬁciency in fairness-aware large language model serving,

A. Shen, Z. Li, and M. Gao, “Fastswitch: Optimizing context switching efﬁciency in fairness-aware large language model serving,” arXiv preprint arXiv:2411.18424 , 2024

work page arXiv 2024

[20] [20]

Fast Distributed Inference Serving for Large Language Models

B. Wu, Y . Zhong, Z. Zhang, S. Liu, F. Liu, Y . Sun, G. Huang, X. Liu, and X. Jin, “Fast distributed inference serving for large language models,” arXiv preprint arXiv:2305.05920 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

s3: Increasing GPU utilization during generative inference for higher throughput,

Y . Jin, C.-F. Wu, D. Brooks, and G.-Y . Wei, “ s3: Increasing GPU utilization during generative inference for higher throughput,” vol. 36, 2023, pp. 18 015–18 027

work page 2023

[22] [22]

Response length perception and sequence scheduling: An llm-empowered llm in- ference pipeline,

Z. Zheng, X. Ren, F. Xue, Y . Luo, X. Jiang, and Y . Y ou, “Response length perception and sequence scheduling: An llm-empowered llm in- ference pipeline,” Advances in Neural Information Processing Systems , vol. 36, pp. 65 517–65 530, 2023

work page 2023

[23] [23]

The effect of scheduling and preemption on the efﬁciency of llm inference serving,

K.-M. Kim, K. Hong, C. Gulcehre, and A. Ailamaki, “The effect of scheduling and preemption on the efﬁciency of llm inference serving,” 2024

work page 2024

[24] [24]

Spotserve: Serving generative large language models on preemptible instances,

X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia, “Spotserve: Serving generative large language models on preemptible instances,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, V olume 2, 2024, pp. 1112–1127

work page 2024

[25] [25]

Llm inference scheduling: A survey of techniques, frameworks, and trade-offs,

M. Heisler, Z. Y ouseﬁjamarani, X. Wang, Q. Wang, G. Shi, H. Sadri, T. Y u, Y . Li, H. Li, G. Singh et al. , “Llm inference scheduling: A survey of techniques, frameworks, and trade-offs,” Authorea Preprints, 2025

work page 2025

[26] [26]

Muxserve: ﬂexible spatial-temporal multiplexing for multiple llm serving,

J. Duan, R. Lu, H. Duanmu, X. Li, X. Zhang, D. Lin, I. Stoica, and H. Zhang, “Muxserve: ﬂexible spatial-temporal multiplexing for multiple llm serving,” 2024

work page 2024

[27] [27]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market,

Y . Xiang, X. Li, K. Qian, Y . Y ang, D. Zhu, W. Y u, E. Zhai, X. Liu, X. Jin, and J. Zhou, “Aegaeon: Effective gpu pooling for concurrent llm serving on the market,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles , 2025, pp. 1030–1045

work page 2025

[28] [28]

Prism: Unleashing gpu sharing for cost-efﬁcient multi- llm serving,

S. Y u, J. Xing, Y . Qiao, M. Ma, Y . Li, Y . Wang, S. Y ang, Z. Xie, S. Cao, K. Bao et al. , “Prism: Unleashing gpu sharing for cost-efﬁcient multi- llm serving,” arXiv preprint arXiv:2505.04021 , 2025

work page arXiv 2025

[29] [29]

Serverlessllm: Low-latency serverless inference for large language models,

Y . Fu, L. Xue, Y . Huang, A.-O. Brabete, D. Ustiugov, Y . Patel, and L. Mai, “Serverlessllm: Low-latency serverless inference for large language models,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 135–153

work page 2024

[30] [30]

[Online]

NVIDIA Corporation, CUDA C++ Programming Guide – API Synchronization Behavior , 2025, accessed: 2025. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html# api-synchronization-behavior

work page 2025