
arxiv: 2605.19593 · v1 · pith:ZSADPI3R · submitted 2026-05-19 · 💻 cs.AI · cs.DC

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.AI cs.DC
keywords multi-model LLM · GPU offloading · preemption · decode throughput · scheduling · heterogeneous hardware · empirical study

The pith

Offloading LLM layers to CPU causes non-linear and model-specific drops in decode throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the challenges of running several large language models on the same hardware when GPU memory is limited. It measures what happens when some of a model's layers are moved to CPU memory, or when one model is paused so another can run. The results show that moving layers off the GPU slows decoding in sharp, uneven drops rather than gradually, and that smaller models are more sensitive to the change. Preempting one model to switch to another adds substantial delay, most of it from reloading the model state rather than from transferring the key-value cache. Understanding these costs helps in designing systems that serve many different models at once without wasting resources.

Core claim

The central claim is that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. Preemption incurs substantial overhead dominated by model state reload rather than key-value cache transfer, with costs varying significantly across models and hardware platforms. Sequence length and interconnect bandwidth amplify these effects.
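A toy latency model gives some intuition for why throughput need not fall linearly with GPU residency. The sketch below is not the paper's methodology (the paper reports measurements). It assumes that each decoded token passes through every layer in sequence, that layers are either fully on the GPU or fully on the CPU, and that per-layer times are constants. The function name and the parameters `t_gpu_layer` and `t_cpu_layer` are hypothetical, and transfer cost, batching, and overlap are ignored.

```python
def decode_throughput(gpu_fraction, n_layers, t_gpu_layer, t_cpu_layer):
    """Tokens per second under a serial per-layer latency model.

    gpu_fraction: share of layers resident on the GPU, in [0, 1].
    t_gpu_layer, t_cpu_layer: assumed per-layer decode time in seconds
    (illustrative constants, not measured values).
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_layers = round(gpu_fraction * n_layers)
    cpu_layers = n_layers - gpu_layers
    # Per-token latency is the sum over all layers, so throughput is
    # the reciprocal of a linear function of residency, not linear itself.
    per_token_s = gpu_layers * t_gpu_layer + cpu_layers * t_cpu_layer
    return 1.0 / per_token_s


# Hypothetical numbers: 32 layers, 1 ms per GPU layer, 10 ms per CPU layer.
full = decode_throughput(1.0, 32, 0.001, 0.010)
half = decode_throughput(0.5, 32, 0.001, 0.010)
none = decode_throughput(0.0, 32, 0.001, 0.010)
```

With these assumed constants, half residency yields far less than the midpoint between `none` and `full`, because the slow CPU layers dominate per-token latency. The model does not by itself explain the model-dependent sensitivity the paper reports.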

What carries the argument

Measurements of throughput under varying GPU residency fractions and preemption events for different LLMs on multiple hardware platforms.

Load-bearing premise

The specific models, hardware platforms, sequence lengths, and workloads tested are representative of the conditions under which future multi-model LLM schedulers will operate.

What would settle it

Running experiments with a wider range of models and finding that all exhibit similar linear degradation under offloading would disprove the model-dependent non-linear claim.
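One simple way to operationalize such a check is to fit a straight line to throughput versus GPU residency and inspect the coefficient of determination. The sketch below uses ordinary least squares in plain Python. The function name `linear_fit_r2` and the sample data are hypothetical and synthetic, not taken from the paper, and a high R² alone would not prove linearity.

```python
def linear_fit_r2(xs, ys):
    """Return R^2 of the least-squares line through (xs, ys)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares around the fitted line and the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic residency fractions and two made-up throughput curves.
residency = [0.0, 0.25, 0.5, 0.75, 1.0]
linear_curve = [2.0 * x + 1.0 for x in residency]
convex_curve = [1.0 / (1.0 - 0.9 * x) for x in residency]
r2_linear = linear_fit_r2(residency, linear_curve)
r2_convex = linear_fit_r2(residency, convex_curve)
```

Under this sketch, a curve that stays flat and then jumps near full residency scores a visibly lower R² than a straight line, which is the kind of contrast the test above would look for across models.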

Figures

Figures reproduced from arXiv: 2605.19593 by Alexey Rolich, Andrea Baiocchi, Francesca Cuomo, Mert Yildiz, Pietro Spadaccino.

Figure 1: Throughput by GPU layer allocation (%) for the Llama3:8B model.
Figure 2: Throughput by GPU layer allocation (%) for the Qwen3:32B model.
Figure 3: Throughput by GPU layer allocation (%) for the Llama2:70B model.
Figure 4: Normalized throughput.
Figure 5: Preemption overhead vs. preemption point for four model configurations. Shaded bands indicate one standard deviation.
Figure 6: Overhead composition at the earliest and latest preemption points for different models. The GPU used is RTX 5000.
Figure 7: Overhead composition at the earliest and latest preemption points for different models. The GPU used is RTX A6000.
Original abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.
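A back-of-the-envelope size comparison suggests why model state reload could outweigh key-value cache transfer during preemption. The sketch below is illustrative arithmetic, not the paper's measurement procedure. The layer count, KV-head count, head dimension, parameter count, precision, sequence length, and link bandwidth are all assumptions chosen to resemble an 8B-parameter model with grouped-query attention, and the helper names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Size of the KV cache for one sequence: keys and values per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def weight_bytes(n_params, bytes_per_param):
    """Size of the model weights at a given precision."""
    return n_params * bytes_per_param


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time over a link, ignoring latency and overhead."""
    return n_bytes / bandwidth_bytes_per_s


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 2-byte elements,
# 2048-token sequence, 8e9 parameters, 16 GB/s host-device link.
kv = kv_cache_bytes(32, 8, 128, 2048, 2)
weights = weight_bytes(8_000_000_000, 2)
kv_time = transfer_seconds(kv, 16e9)
weights_time = transfer_seconds(weights, 16e9)
```

Under these assumptions the weights are tens of times larger than a single sequence's KV cache, so moving them takes correspondingly longer on the same link. Longer sequences, larger batches, or quantized weights would shift the ratio, which is consistent with the abstract's point that costs vary with workload and platform.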

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents an empirical study of how different LLMs behave across hardware platforms under layer offloading and preemption in multi-model serving scenarios with GPU memory constraints. It reports that offloading produces strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency; that preemption incurs substantial overhead dominated by model state reload rather than KV cache transfer, varying across models and platforms; and that sequence length and interconnect bandwidth amplify data movement inefficiencies. From these observations the authors identify key features (model-specific offloading sensitivity, workload characteristics, preemption and data-transfer cost structures) that future schedulers must consider.

Significance. If the reported patterns generalize, the work supplies concrete empirical guidance for scheduler design in heterogeneous multi-model LLM deployments, a setting that existing single-model throughput optimizers do not address. The emphasis on model-dependent sensitivities and the dominance of reload costs over KV-cache movement could directly inform practical resource-allocation policies.

major comments (1)
  1. Abstract: the claim that the findings 'provide guidance for the design of next-generation LLM serving systems' is load-bearing for the paper's contribution, yet the representativeness of the tested models, hardware platforms, sequence lengths, and synthetic workloads is not established. No sensitivity sweeps, cross-validation on held-out configurations, or explicit justification for extrapolation are provided, leaving the central utility claim dependent on an untested generalization step.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the scope and generalizability of our empirical study. We address the single major comment below and will revise the manuscript to improve transparency around our experimental choices and limitations.

Point-by-point responses
  1. Referee: Abstract: the claim that the findings 'provide guidance for the design of next-generation LLM serving systems' is load-bearing for the paper's contribution, yet the representativeness of the tested models, hardware platforms, sequence lengths, and synthetic workloads is not established. No sensitivity sweeps, cross-validation on held-out configurations, or explicit justification for extrapolation are provided, leaving the central utility claim dependent on an untested generalization step.

    Authors: We acknowledge that our work is an empirical investigation rather than an exhaustive benchmark, and that stronger justification for the tested configurations would better support the guidance claim. The models were selected to cover a practical range of sizes (8B to 70B: Llama3:8B, Qwen3:32B, and Llama2:70B) and more than one model family; the hardware included RTX 5000 and RTX A6000 GPUs; sequence lengths reached 4096 tokens, consistent with typical serving traces; and workloads were synthetic to isolate offloading and preemption effects. While we did not perform additional sensitivity sweeps or held-out cross-validation, the reported non-linear throughput degradation and reload-cost dominance appeared consistently across every configuration we measured. In the revision we will add an explicit “Experimental Scope and Limitations” subsection that justifies the chosen configurations and states the boundaries of extrapolation. This addition will make the basis for our guidance transparent without requiring new experiments or altering the core observations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observations only

Full rationale

The paper is an empirical study reporting measured throughput degradation under offloading and preemption for specific LLMs, hardware, and workloads. No equations, fitted parameters, derivations, or self-citation chains appear in the provided abstract or description. Claims rest on direct experimental results rather than on any reduction to inputs by construction, self-definitional loops, or imported uniqueness results. The representativeness concern raised by the referee is a question of external validity, not of circularity in the derivation. This matches the default case of a self-contained empirical paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an empirical measurement study, the claims rest primarily on the validity and representativeness of the experimental conditions rather than on new mathematical axioms or invented entities.

axioms (1)
  • domain assumption The tested hardware platforms, models, and workloads are representative of real-world multi-model LLM serving scenarios.
    The abstract draws general implications for future schedulers from measurements on specific platforms.

pith-pipeline@v0.9.0 · 5786 in / 1222 out tokens · 43637 ms · 2026-05-20T05:53:57.028054+00:00 · methodology

