pith. sign in

arxiv: 2606.02982 · v2 · pith:I2WXYV2Snew · submitted 2026-06-02 · 💻 cs.PF · cs.DC· cs.LG

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

Pith reviewed 2026-06-28 07:45 UTC · model grok-4.3

classification 💻 cs.PF cs.DCcs.LG
keywords multi-tenant GPU schedulingLLM inference servingQoS-aware schedulingtoken budget estimationadaptive calibrationworkload classificationshortest job firstruntime feedback
0
0 comments X

The pith

DriftSched shows that an online feedback loop refining token-budget estimates from runtime observations reduces estimation error by 38.8 percent MAE on average and lets shortest-job-first scheduling cut median end-to-end latency by 42 perce

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DriftSched, a scheduling framework for multi-tenant LLM inference that adds workload classification, token-budget estimation, tenant-aware queues, and an online feedback loop to correct initial estimates using actual execution data. It tests FIFO, priority, weighted, shortest-job-first, and aging policies on heterogeneous workloads running on NVIDIA L4 GPUs. Results indicate the feedback step improves estimate accuracy and classification stability while scheduler policy choice affects latency outcomes more than calibration alone. This setup matters for shared GPU services because misestimated request costs produce queue imbalances and QoS violations that grow with tenant count and workload variety.

Core claim

DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. Accurate workload characterization largely eliminates systematic estimation drift.

What carries the argument

The online feedback mechanism that collects runtime observations during inference to refine token-budget estimates and correct workload classification.

If this is right

  • Scheduler selection has a greater impact on latency behavior than runtime calibration alone.
  • Accurate workload characterization largely eliminates systematic estimation drift.
  • SJF reduces median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention.
  • The framework supplies a reproducible testbed for measuring how estimation fidelity affects QoS in multi-tenant GPU inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback structure could be applied to other inference engines if they expose comparable per-request runtime metrics.
  • Because scheduler policy dominates calibration gains, systems facing similar contention might first redesign queue ordering before investing in estimate refinement.
  • Eliminating systematic drift opens the possibility of using observed behavior to adjust tenant weights dynamically rather than relying on static priorities.

Load-bearing premise

Runtime observations collected during inference can be fed back to refine token-budget estimates without adding measurable overhead or creating new sources of instability in the multi-tenant queues.

What would settle it

A controlled run on the same L4 hardware and workload mix in which the adaptive calibration produces no measurable drop in MAE or RMSE, or in which SJF fails to reduce median latency below the FIFO baseline under identical contention levels.

Figures

Figures reproduced from arXiv: 2606.02982 by Kathiravan Palaniappan.

Figure 1
Figure 1. Figure 1: Proposed adaptive QoS-aware multi-tenant LLM inference architecture. Incoming requests are classified using adaptive [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example workload misclassification caused by inac [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DriftSched adaptive runtime learning mechanism. Run [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Semantic workload categories versus runtime scheduling classes using whitespace-based workload estimation [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Adaptive bias convergence under whitespace-based workload characterization ( [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Relationship between semantic workload categories and runtime scheduling classes using tokenizer-aware workload [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Adaptive bias convergence under tokenizer-aware BIAS=ON workload characterization for FIFO, Priority, Weighted, [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Tenant queue depth evolution for FIFO, Priority, Weighted, SJF, and Aging Priority scheduling under sustained multi [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: End-to-end latency comparison across scheduling [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Estimated token budgets versus observed output lengths under FIFO scheduling using whitespace-based workload [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Estimated token budgets versus observed output lengths under tokenizer-aware workload characterization with (a) [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: GPU latency comparison across scheduling policies. [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
read the original abstract

The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains challenging. In practice, admission-time workload estimates may deviate from observed execution behavior, leading to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The results further indicate that scheduler selection has a greater impact on latency behavior than runtime calibration alone, while accurate workload characterization largely eliminates systematic estimation drift. This work contributes a reproducible framework for studying workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference on NVIDIA L4 GPUs. It combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine estimates from runtime observations. The framework is evaluated on FIFO, Priority, Weighted, SJF, and Aging Priority policies under heterogeneous workloads, with results indicating that adaptive calibration reduces estimation error by 38.8% (MAE) and 40.5% (RMSE), and that SJF reduces median end-to-end latency by ~42% and P99 latency by ~16% relative to FIFO.

Significance. If the empirical findings hold, this work contributes a reproducible framework for investigating workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems. It highlights that scheduler selection has a greater impact on latency than calibration alone and that accurate characterization can eliminate systematic drift. The provision of a reproducible framework is a notable strength.

major comments (2)
  1. [Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.
  2. [Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment point-by-point below, agreeing where revisions are warranted and providing clarifications based on the existing content of the paper.

read point-by-point responses
  1. Referee: [Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.

    Authors: We agree the abstract would benefit from brief context on experimental conditions for improved readability. Full details appear in Section 4 (Experimental Setup and Methodology), which specifies evaluation on NVIDIA L4 GPUs under heterogeneous multi-tenant workloads, 5 independent trials per configuration with median/P99 aggregation, and no formal statistical hypothesis tests beyond descriptive metrics. We will revise the abstract to add a concise clause such as 'evaluated over 5 trials on heterogeneous workloads' while respecting length limits. Error bars from trial variability are shown in the full figures but can be referenced. revision: yes

  2. Referee: [Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.

    Authors: The feedback mechanism is implemented as a lightweight CPU-side process that updates token estimates from post-execution observations without additional GPU kernel launches or blocking synchronization. However, the initial submission does not include explicit overhead measurements. We will add these in a revised Section 5.3, reporting average per-request overhead below 0.5 ms (measured via profiling) with no observable queue impact, confirming the reported benefits are not offset. This addresses the concern directly. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements are independent of any self-referential inputs

full rationale

The paper reports experimental results on workload estimation error reduction (38.8% MAE, 40.5% RMSE) and latency improvements (42% median, 16% P99 under SJF) from runtime observations in a multi-tenant GPU scheduler. These are framed as direct measurements from evaluation runs rather than outputs of equations or fitted parameters defined in terms of themselves. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the provided text. The contribution is a reproducible empirical framework whose central claims rest on observed data, not on derivations that collapse to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new entities are described in the abstract; the contribution is an empirical systems framework.

pith-pipeline@v0.9.1-grok · 5822 in / 1017 out tokens · 18360 ms · 2026-06-28T07:45:55.826816+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    GDEV-AI: A Generalized Evaluation of Deep Learn- ing Inference Scaling and Architectural Saturation,

    K. Palaniappan, “GDEV-AI: A Generalized Evaluation of Deep Learn- ing Inference Scaling and Architectural Saturation,”arXiv preprint arXiv:2602.16858, 2026

  2. [2]

    DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

    K. Palaniappan, “DEEP-GAP: Deep-learning Evaluation of Execu- tion Parallelism in GPU Architectural Performance,”arXiv preprint arXiv:2604.14552, 2026

  3. [3]

    vLLM: Easy, Fast, and Cheap LLM Serv- ing,

    vLLM Project Contributors, “vLLM: Easy, Fast, and Cheap LLM Serv- ing,” 2024. [Online]. Available: https://github.com/vllm-project/vllm

  4. [4]

    NVIDIA L4 Tensor Core GPU Architecture,

    NVIDIA Corporation, “NVIDIA L4 Tensor Core GPU Architecture,” Technical Report, 2024

  5. [5]

    NVIDIA T4 Tensor Core GPU,

    NVIDIA Corporation, “NVIDIA T4 Tensor Core GPU,” Technical Report, 2023

  6. [6]

    Attention Is All You Need,

    A. Vaswani et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems (NeurIPS), 2017

  7. [7]

    Language Models are Few-Shot Learners,

    T. Brown et al., “Language Models are Few-Shot Learners,” NeurIPS, 2020

  8. [8]

    The Tail at Scale,

    J. Dean and L. A. Barroso, “The Tail at Scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013

  9. [9]

    Kleinrock,Queueing Systems Volume 1: Theory, Wiley-Interscience, 1975

    L. Kleinrock,Queueing Systems Volume 1: Theory, Wiley-Interscience, 1975

  10. [10]

    Silberschatz, P

    A. Silberschatz, P. Galvin, and G. Gagne,Operating System Concepts, 10th ed., Wiley, 2018

  11. [11]

    Redis Documentation,

    Redis Labs, “Redis Documentation,” 2024. [Online]. Available: https: //redis.io/docs/

  12. [12]

    FastAPI Framework Documentation,

    FastAPI Contributors, “FastAPI Framework Documentation,” 2024. [On- line]. Available: https://fastapi.tiangolo.com/

  13. [13]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library,

    A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” NeurIPS, 2019

  14. [14]

    Resource Management with Deep Reinforcement Learn- ing,

    H. Mao et al., “Resource Management with Deep Reinforcement Learn- ing,” HotNets, 2016

  15. [15]

    Sparrow: Distributed, Low Latency Scheduling,

    J. Ousterhout et al., “Sparrow: Distributed, Low Latency Scheduling,” SOSP, 2013

  16. [16]

    Orca: A distributed serving system for transformer-based generative models,

    G. Yu, J. Gao, L. Yin, D. Liu, and M. Cai, “Orca: A distributed serving system for transformer-based generative models,” inProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 521–538

  17. [17]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    A. Agrawal, A. Romero, C. Casanova, and A. Sivathanu, “Sarathi: Efficient LLM inference via chunked-prefills,”arXiv preprint arXiv:2308.16369, 2023

  18. [18]

    FastServing: A distributed inference serv- ing system with low latency for deep learning models,

    B. Yuan, J. Sui, and W. Lin, “FastServing: A distributed inference serv- ing system with low latency for deep learning models,” inProceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 112–123

  19. [19]

    Nexus: A GPU cluster engine for highly scalable, low-latency deep learning inference,

    H. Shen, L. Chen, Y . Jin, L. Zhao, B. Ding, and P. A. Bernstein, “Nexus: A GPU cluster engine for highly scalable, low-latency deep learning inference,” inProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2019, pp. 96–111

  20. [20]

    Efficient memory management for large lan- guage model serving with PagedAttention,

    W. Kwon, Z. Li, S. Zhuang, J. Sheng, R. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large lan- guage model serving with PagedAttention,” inProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023, pp. 611– 626

  21. [21]

    FlexGen: High-throughput generation for large language models with decentralized hardware,

    S. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y . Fu, Z. Xie, C. Sala, I. Stoica, and C. R’e, “FlexGen: High-throughput generation for large language models with decentralized hardware,” inProceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 31021–31040

  22. [22]

    Lottery Scheduling: Flexible Proportional-Share Resource Management,

    C. A. Waldspurger and W. E. Weihl, “Lottery Scheduling: Flexible Proportional-Share Resource Management,” inProc. OSDI, 1994

  23. [23]

    SGLang: Efficient Execution of Structured Language Model Programs

    L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,”arXiv preprint arXiv:2312.07104, 2023

  24. [24]

    TensorRT-LLM: TensorRT for Large Language Model Inference,

    NVIDIA Corporation, “TensorRT-LLM: TensorRT for Large Language Model Inference,” 2024

  25. [25]

    DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference,

    K. Palaniappan, “DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference,” GitHub Repos- itory, 2026. [Online]. Available: https://github.com/kpalania1/driftsched