DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference
Pith reviewed 2026-06-28 07:45 UTC · model grok-4.3
The pith
DriftSched shows that an online feedback loop refining token-budget estimates from runtime observations reduces estimation error by 38.8 percent MAE on average and lets shortest-job-first scheduling cut median end-to-end latency by 42 perce
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. Accurate workload characterization largely eliminates systematic estimation drift.
What carries the argument
The online feedback mechanism that collects runtime observations during inference to refine token-budget estimates and correct workload classification.
If this is right
- Scheduler selection has a greater impact on latency behavior than runtime calibration alone.
- Accurate workload characterization largely eliminates systematic estimation drift.
- SJF reduces median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention.
- The framework supplies a reproducible testbed for measuring how estimation fidelity affects QoS in multi-tenant GPU inference.
Where Pith is reading between the lines
- The same feedback structure could be applied to other inference engines if they expose comparable per-request runtime metrics.
- Because scheduler policy dominates calibration gains, systems facing similar contention might first redesign queue ordering before investing in estimate refinement.
- Eliminating systematic drift opens the possibility of using observed behavior to adjust tenant weights dynamically rather than relying on static priorities.
Load-bearing premise
Runtime observations collected during inference can be fed back to refine token-budget estimates without adding measurable overhead or creating new sources of instability in the multi-tenant queues.
What would settle it
A controlled run on the same L4 hardware and workload mix in which the adaptive calibration produces no measurable drop in MAE or RMSE, or in which SJF fails to reduce median latency below the FIFO baseline under identical contention levels.
Figures
read the original abstract
The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains challenging. In practice, admission-time workload estimates may deviate from observed execution behavior, leading to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The results further indicate that scheduler selection has a greater impact on latency behavior than runtime calibration alone, while accurate workload characterization largely eliminates systematic estimation drift. This work contributes a reproducible framework for studying workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference on NVIDIA L4 GPUs. It combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine estimates from runtime observations. The framework is evaluated on FIFO, Priority, Weighted, SJF, and Aging Priority policies under heterogeneous workloads, with results indicating that adaptive calibration reduces estimation error by 38.8% (MAE) and 40.5% (RMSE), and that SJF reduces median end-to-end latency by ~42% and P99 latency by ~16% relative to FIFO.
Significance. If the empirical findings hold, this work contributes a reproducible framework for investigating workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems. It highlights that scheduler selection has a greater impact on latency than calibration alone and that accurate characterization can eliminate systematic drift. The provision of a reproducible framework is a notable strength.
major comments (2)
- [Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.
- [Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address each major comment point-by-point below, agreeing where revisions are warranted and providing clarifications based on the existing content of the paper.
read point-by-point responses
-
Referee: [Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.
Authors: We agree the abstract would benefit from brief context on experimental conditions for improved readability. Full details appear in Section 4 (Experimental Setup and Methodology), which specifies evaluation on NVIDIA L4 GPUs under heterogeneous multi-tenant workloads, 5 independent trials per configuration with median/P99 aggregation, and no formal statistical hypothesis tests beyond descriptive metrics. We will revise the abstract to add a concise clause such as 'evaluated over 5 trials on heterogeneous workloads' while respecting length limits. Error bars from trial variability are shown in the full figures but can be referenced. revision: yes
-
Referee: [Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.
Authors: The feedback mechanism is implemented as a lightweight CPU-side process that updates token estimates from post-execution observations without additional GPU kernel launches or blocking synchronization. However, the initial submission does not include explicit overhead measurements. We will add these in a revised Section 5.3, reporting average per-request overhead below 0.5 ms (measured via profiling) with no observable queue impact, confirming the reported benefits are not offset. This addresses the concern directly. revision: yes
Circularity Check
No circularity; empirical measurements are independent of any self-referential inputs
full rationale
The paper reports experimental results on workload estimation error reduction (38.8% MAE, 40.5% RMSE) and latency improvements (42% median, 16% P99 under SJF) from runtime observations in a multi-tenant GPU scheduler. These are framed as direct measurements from evaluation runs rather than outputs of equations or fitted parameters defined in terms of themselves. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the provided text. The contribution is a reproducible empirical framework whose central claims rest on observed data, not on derivations that collapse to their inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
GDEV-AI: A Generalized Evaluation of Deep Learn- ing Inference Scaling and Architectural Saturation,
K. Palaniappan, “GDEV-AI: A Generalized Evaluation of Deep Learn- ing Inference Scaling and Architectural Saturation,”arXiv preprint arXiv:2602.16858, 2026
-
[2]
DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
K. Palaniappan, “DEEP-GAP: Deep-learning Evaluation of Execu- tion Parallelism in GPU Architectural Performance,”arXiv preprint arXiv:2604.14552, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
vLLM: Easy, Fast, and Cheap LLM Serv- ing,
vLLM Project Contributors, “vLLM: Easy, Fast, and Cheap LLM Serv- ing,” 2024. [Online]. Available: https://github.com/vllm-project/vllm
2024
-
[4]
NVIDIA L4 Tensor Core GPU Architecture,
NVIDIA Corporation, “NVIDIA L4 Tensor Core GPU Architecture,” Technical Report, 2024
2024
-
[5]
NVIDIA T4 Tensor Core GPU,
NVIDIA Corporation, “NVIDIA T4 Tensor Core GPU,” Technical Report, 2023
2023
-
[6]
Attention Is All You Need,
A. Vaswani et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[7]
Language Models are Few-Shot Learners,
T. Brown et al., “Language Models are Few-Shot Learners,” NeurIPS, 2020
2020
-
[8]
The Tail at Scale,
J. Dean and L. A. Barroso, “The Tail at Scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013
2013
-
[9]
Kleinrock,Queueing Systems Volume 1: Theory, Wiley-Interscience, 1975
L. Kleinrock,Queueing Systems Volume 1: Theory, Wiley-Interscience, 1975
1975
-
[10]
Silberschatz, P
A. Silberschatz, P. Galvin, and G. Gagne,Operating System Concepts, 10th ed., Wiley, 2018
2018
-
[11]
Redis Documentation,
Redis Labs, “Redis Documentation,” 2024. [Online]. Available: https: //redis.io/docs/
2024
-
[12]
FastAPI Framework Documentation,
FastAPI Contributors, “FastAPI Framework Documentation,” 2024. [On- line]. Available: https://fastapi.tiangolo.com/
2024
-
[13]
PyTorch: An Imperative Style, High-Performance Deep Learning Library,
A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” NeurIPS, 2019
2019
-
[14]
Resource Management with Deep Reinforcement Learn- ing,
H. Mao et al., “Resource Management with Deep Reinforcement Learn- ing,” HotNets, 2016
2016
-
[15]
Sparrow: Distributed, Low Latency Scheduling,
J. Ousterhout et al., “Sparrow: Distributed, Low Latency Scheduling,” SOSP, 2013
2013
-
[16]
Orca: A distributed serving system for transformer-based generative models,
G. Yu, J. Gao, L. Yin, D. Liu, and M. Cai, “Orca: A distributed serving system for transformer-based generative models,” inProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 521–538
2022
-
[17]
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
A. Agrawal, A. Romero, C. Casanova, and A. Sivathanu, “Sarathi: Efficient LLM inference via chunked-prefills,”arXiv preprint arXiv:2308.16369, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
FastServing: A distributed inference serv- ing system with low latency for deep learning models,
B. Yuan, J. Sui, and W. Lin, “FastServing: A distributed inference serv- ing system with low latency for deep learning models,” inProceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 112–123
2021
-
[19]
Nexus: A GPU cluster engine for highly scalable, low-latency deep learning inference,
H. Shen, L. Chen, Y . Jin, L. Zhao, B. Ding, and P. A. Bernstein, “Nexus: A GPU cluster engine for highly scalable, low-latency deep learning inference,” inProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2019, pp. 96–111
2019
-
[20]
Efficient memory management for large lan- guage model serving with PagedAttention,
W. Kwon, Z. Li, S. Zhuang, J. Sheng, R. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large lan- guage model serving with PagedAttention,” inProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023, pp. 611– 626
2023
-
[21]
FlexGen: High-throughput generation for large language models with decentralized hardware,
S. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y . Fu, Z. Xie, C. Sala, I. Stoica, and C. R’e, “FlexGen: High-throughput generation for large language models with decentralized hardware,” inProceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 31021–31040
2023
-
[22]
Lottery Scheduling: Flexible Proportional-Share Resource Management,
C. A. Waldspurger and W. E. Weihl, “Lottery Scheduling: Flexible Proportional-Share Resource Management,” inProc. OSDI, 1994
1994
-
[23]
SGLang: Efficient Execution of Structured Language Model Programs
L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,”arXiv preprint arXiv:2312.07104, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
TensorRT-LLM: TensorRT for Large Language Model Inference,
NVIDIA Corporation, “TensorRT-LLM: TensorRT for Large Language Model Inference,” 2024
2024
-
[25]
DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference,
K. Palaniappan, “DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference,” GitHub Repos- itory, 2026. [Online]. Available: https://github.com/kpalania1/driftsched
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.