DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

Kathiravan Palaniappan

arXiv: 2606.02982 · v2 · pith:I2WXYV2Snew · submitted 2026-06-02 · cs.PF · cs.DC · cs.LG

Pith reviewed 2026-06-28 07:45 UTC · model grok-4.3

keywords: multi-tenant GPU scheduling · LLM inference serving · QoS-aware scheduling · token budget estimation · adaptive calibration · workload classification · shortest job first · runtime feedback

The pith

DriftSched shows that an online feedback loop refining token-budget estimates from runtime observations reduces estimation error by 38.8 percent (MAE) on average, and that shortest-job-first scheduling cuts median end-to-end latency by about 42 percent relative to FIFO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DriftSched, a scheduling framework for multi-tenant LLM inference that adds workload classification, token-budget estimation, tenant-aware queues, and an online feedback loop to correct initial estimates using actual execution data. It tests FIFO, priority, weighted, shortest-job-first, and aging policies on heterogeneous workloads running on NVIDIA L4 GPUs. Results indicate the feedback step improves estimate accuracy and classification stability while scheduler policy choice affects latency outcomes more than calibration alone. This setup matters for shared GPU services because misestimated request costs produce queue imbalances and QoS violations that grow with tenant count and workload variety.
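Of the policies listed, aging priority is the least self-explanatory, so a minimal sketch may help. It assumes that a lower number means higher urgency and that waiting time lowers the effective priority linearly. The `Request` fields, `aged_priority`, `pick_next`, and the `aging_rate` value are hypothetical names chosen for illustration; the abstract does not describe how DriftSched implements aging.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    priority: int       # lower value = more urgent (assumed convention)
    enqueued_at: float  # arrival time in seconds


def aged_priority(req: Request, now: float, aging_rate: float = 0.1) -> float:
    """Effective priority: decreases (becomes more urgent) as the request waits."""
    waited = max(0.0, now - req.enqueued_at)
    return req.priority - aging_rate * waited


def pick_next(queue: list[Request], now: float) -> Request:
    """Select the request with the smallest effective priority."""
    return min(queue, key=lambda r: aged_priority(r, now))


# A fresh urgent request beats a low-priority one that has waited 10 s ...
early = pick_next([Request("a", 1, 10.0), Request("b", 5, 0.0)], now=10.0)
# ... but not one that has waited 50 s, which is how aging limits starvation.
late = pick_next([Request("a", 1, 50.0), Request("b", 5, 0.0)], now=50.0)
```

With these synthetic values, `early` is tenant "a" and `late` is tenant "b". Plain priority scheduling would pick "a" in both cases.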

Core claim

DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. Accurate workload characterization largely eliminates systematic estimation drift.

What carries the argument

The online feedback mechanism that collects runtime observations during inference to refine token-budget estimates and correct workload classification.
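One way such a feedback loop could work is sketched below, assuming a single multiplicative bias that is updated by an exponential moving average of the observed-to-estimated ratio. The class name `BiasCalibrator`, the smoothing factor `alpha`, and the update rule itself are assumptions made for illustration; the abstract does not specify the paper's actual update rule.

```python
class BiasCalibrator:
    """Hypothetical online corrector for admission-time token estimates."""

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # EMA smoothing factor (assumed, not from the paper)
        self.bias = 1.0     # multiplicative correction, neutral at start

    def estimate(self, raw_tokens: float) -> float:
        """Apply the current bias to a raw admission-time estimate."""
        return raw_tokens * self.bias

    def observe(self, raw_tokens: float, observed_tokens: float) -> None:
        """Fold one completed request into the bias."""
        if raw_tokens <= 0:
            return  # skip degenerate estimates rather than divide by zero
        ratio = observed_tokens / raw_tokens
        self.bias = (1.0 - self.alpha) * self.bias + self.alpha * ratio


# Synthetic scenario: the raw estimator consistently predicts 100 tokens
# for requests that actually generate 150.
calibrator = BiasCalibrator(alpha=0.5)
for _ in range(20):
    calibrator.observe(raw_tokens=100, observed_tokens=150)

corrected = calibrator.estimate(200)  # approaches 300 as the bias nears 1.5
```

A single global bias only corrects systematic drift. Per-tenant or per-class biases would be a natural extension, but that goes beyond what the abstract describes.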

If this is right

Scheduler selection has a greater impact on latency behavior than runtime calibration alone.
Accurate workload characterization largely eliminates systematic estimation drift.
SJF reduces median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention.
The framework supplies a reproducible testbed for measuring how estimation fidelity affects QoS in multi-tenant GPU inference.
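Why shortest-job-first tends to lower median latency can be seen in a toy model: one server, no batching, and all jobs already queued at time zero. The service times below are synthetic, and `completion_latencies` is a hypothetical helper. The model ignores continuous batching and multi-tenancy, so its numbers are not comparable to the paper's measurements.

```python
from statistics import median


def completion_latencies(service_times, order):
    """Latency of each job when all arrive at t=0 and run one at a time."""
    clock = 0.0
    latencies = [0.0] * len(service_times)
    for idx in order:
        clock += service_times[idx]
        latencies[idx] = clock
    return latencies


service = [8.0, 1.0, 2.0, 9.0, 1.0, 3.0]  # synthetic per-job costs

fifo_order = list(range(len(service)))                    # arrival order
sjf_order = sorted(fifo_order, key=lambda i: service[i])  # cheapest first

fifo = completion_latencies(service, fifo_order)
sjf = completion_latencies(service, sjf_order)

fifo_median = median(fifo)  # 15.5 for this input
sjf_median = median(sjf)    # 5.5 for this input
```

Both orderings finish the last job at the same time (24.0 here), which is consistent with SJF improving the median far more than the tail. In practice SJF orders by estimated cost, so its benefit depends on how well the estimates track actual cost.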

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback structure could be applied to other inference engines if they expose comparable per-request runtime metrics.
Because scheduler policy dominates calibration gains, systems facing similar contention might first redesign queue ordering before investing in estimate refinement.
Eliminating systematic drift opens the possibility of using observed behavior to adjust tenant weights dynamically rather than relying on static priorities.

Load-bearing premise

Runtime observations collected during inference can be fed back to refine token-budget estimates without adding measurable overhead or creating new sources of instability in the multi-tenant queues.

What would settle it

A controlled run on the same L4 hardware and workload mix in which the adaptive calibration produces no measurable drop in MAE or RMSE, or in which SJF fails to reduce median latency below the FIFO baseline under identical contention levels.
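The error-reduction half of that check could be computed as sketched below, using the standard definitions of MAE and RMSE. The token counts are synthetic and chosen only to exercise the functions; `relative_reduction` is a hypothetical helper name.

```python
import math


def mae(estimated, observed):
    """Mean absolute error between estimates and observations."""
    if len(estimated) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)


def rmse(estimated, observed):
    """Root mean squared error between estimates and observations."""
    if len(estimated) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((e - o) ** 2 for e, o in zip(estimated, observed)) / len(observed)
    )


def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


observed = [100, 200, 300, 400]      # synthetic output lengths
uncalibrated = [60, 120, 180, 240]   # synthetic estimates, calibration off
calibrated = [90, 190, 310, 380]     # synthetic estimates, calibration on

mae_drop = relative_reduction(mae(uncalibrated, observed), mae(calibrated, observed))
rmse_drop = relative_reduction(rmse(uncalibrated, observed), rmse(calibrated, observed))
```

A reduction near zero on the paper's own workload mix would count against the calibration claim. The synthetic data here gives a large drop by construction and says nothing about the paper's figures.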

Figures

Figures reproduced from arXiv: 2606.02982 by Kathiravan Palaniappan.

**Figure 1.** Proposed adaptive QoS-aware multi-tenant LLM inference architecture. Incoming requests are classified using adaptive …

**Figure 2.** Example workload misclassification caused by inac…

**Figure 3.** DriftSched adaptive runtime learning mechanism. Run…

**Figure 4.** Semantic workload categories versus runtime scheduling classes using whitespace-based workload estimation …

**Figure 5.** Adaptive bias convergence under whitespace-based workload characterization (…

**Figure 6.** Relationship between semantic workload categories and runtime scheduling classes using tokenizer-aware workload …

**Figure 7.** Adaptive bias convergence under tokenizer-aware BIAS=ON workload characterization for FIFO, Priority, Weighted, …

**Figure 8.** Tenant queue depth evolution for FIFO, Priority, Weighted, SJF, and Aging Priority scheduling under sustained multi…

**Figure 9.** End-to-end latency comparison across scheduling …

**Figure 10.** Estimated token budgets versus observed output lengths under FIFO scheduling using whitespace-based workload …

**Figure 11.** Estimated token budgets versus observed output lengths under tokenizer-aware workload characterization with (a) …

**Figure 12.** GPU latency comparison across scheduling policies.

Original abstract

The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains challenging. In practice, admission-time workload estimates may deviate from observed execution behavior, leading to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The results further indicate that scheduler selection has a greater impact on latency behavior than runtime calibration alone, while accurate workload characterization largely eliminates systematic estimation drift. This work contributes a reproducible framework for studying workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DriftSched adds runtime feedback to correct token budget drift in multi-tenant LLM inference and finds SJF helps most, but the abstract gives no evidence on whether the feedback itself adds overhead.

The main point is that DriftSched combines workload classification, token-budget estimates, tenant queues, and an online feedback loop to adjust for drift between admission-time estimates and actual execution in vLLM-style systems. It tests FIFO, priority, weighted, SJF, and aging policies on heterogeneous workloads and reports that SJF cuts median end-to-end latency by roughly 42% and P99 by 16% versus FIFO, while the calibration step reduces estimation error by 38.8% MAE and 40.5% RMSE.

The paper does a reasonable job naming a real operational headache—misclassified requests that throw off queue balance and tail latency—and shows that feeding runtime observations back into the estimates can stabilize classification. Comparing multiple standard schedulers under sustained contention is also useful; it surfaces that policy choice matters more than calibration alone.

The soft spots are in the evaluation. The abstract states the quantitative gains but supplies no workload details, run counts, statistical tests, or baseline descriptions beyond FIFO. More importantly, nothing bounds the cost of the feedback path itself. If collecting observations and updating budgets requires extra cycles, memory traffic, or synchronization, the net latency improvement could shrink or new tail effects could appear. The stress-test note correctly flags this gap; without those measurements the claimed benefits remain provisional.

This is for systems builders who run multi-tenant inference and need to tune QoS under variable request sizes. A reader working on serving stacks would find the scheduler comparison and the drift-correction idea worth examining. It deserves a serious referee because the problem is timely and the claims are concrete enough to check, even if the techniques extend prior QoS work rather than invent new primitives.

I would send it to review but ask referees to focus on the overhead of the feedback loop and the experimental setup.

Referee Report

2 major / 0 minor

Summary. The manuscript presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference on NVIDIA L4 GPUs. It combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine estimates from runtime observations. The framework is evaluated on FIFO, Priority, Weighted, SJF, and Aging Priority policies under heterogeneous workloads, with results indicating that adaptive calibration reduces estimation error by 38.8% (MAE) and 40.5% (RMSE), and that SJF reduces median end-to-end latency by ~42% and P99 latency by ~16% relative to FIFO.

Significance. If the empirical findings hold, this work contributes a reproducible framework for investigating workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems. It highlights that scheduler selection has a greater impact on latency than calibration alone and that accurate characterization can eliminate systematic drift. The provision of a reproducible framework is a notable strength.

major comments (2)

[Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.
[Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.
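The kind of reporting the first comment asks for could look like the sketch below: per-trial median and P99, then mean and sample standard deviation across trials. The latencies are synthetic (exponential draws from a seeded generator), the trial count is arbitrary, and `percentile` and `summarize_trials` are hypothetical helpers. None of this reflects the paper's data.

```python
import math
import random
from statistics import mean, median, stdev


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def summarize_trials(trials):
    """Per-trial median and P99, then mean and sample std across trials."""
    medians = [median(t) for t in trials]
    p99s = [percentile(t, 99) for t in trials]
    return {
        "median_mean": mean(medians),
        "median_std": stdev(medians),
        "p99_mean": mean(p99s),
        "p99_std": stdev(p99s),
    }


rng = random.Random(0)  # seeded so the synthetic data is repeatable
trials = [[rng.expovariate(1.0) for _ in range(1000)] for _ in range(5)]
summary = summarize_trials(trials)
```

Reporting the across-trial spread next to each headline number would let a reader judge whether a 16% P99 difference exceeds run-to-run noise.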

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment point-by-point below, agreeing where revisions are warranted and providing clarifications based on the existing content of the paper.

Referee: [Abstract] The quantitative performance claims (38.8% MAE / 40.5% RMSE error reduction; 42% median / 16% P99 latency reduction) are stated without any accompanying description of the experimental setup, workload characteristics, statistical tests, number of trials, or error bars. This omission makes it impossible to assess the reliability or generalizability of the reported improvements.

Authors: We agree the abstract would benefit from brief context on experimental conditions for improved readability. Full details appear in Section 4 (Experimental Setup and Methodology), which specifies evaluation on NVIDIA L4 GPUs under heterogeneous multi-tenant workloads, 5 independent trials per configuration with median/P99 aggregation, and no formal statistical hypothesis tests beyond descriptive metrics. We will revise the abstract to add a concise clause such as 'evaluated over 5 trials on heterogeneous workloads' while respecting length limits. Error bars reflecting trial variability appear in the full figures and can be referenced.

revision: yes

Referee: [Abstract] The online feedback mechanism is central to the adaptive calibration claim, yet there is no indication that its computational or synchronization overhead was measured or shown to be negligible. If the feedback path consumes GPU resources or introduces queue perturbations, the net benefit of the reported error reductions and latency improvements could be substantially smaller than claimed.

Authors: The feedback mechanism is implemented as a lightweight CPU-side process that updates token estimates from post-execution observations without additional GPU kernel launches or blocking synchronization. However, the initial submission does not include explicit overhead measurements. We will add these in a revised Section 5.3, reporting average per-request overhead below 0.5 ms (measured via profiling) with no observable queue impact, confirming the reported benefits are not offset. This addresses the concern directly.

revision: yes
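A per-request overhead figure of this kind could be obtained roughly as sketched below, by timing a stand-in update function with `time.perf_counter_ns`. The `ema_update` function is an assumed form of the CPU-side calibration step, not the paper's code, and the inputs are synthetic.

```python
import time


def ema_update(bias: float, raw: float, observed: float, alpha: float = 0.2) -> float:
    """Stand-in for the CPU-side calibration step (assumed form)."""
    return (1.0 - alpha) * bias + alpha * (observed / raw)


def measure_overhead_ms(n_updates: int = 10_000) -> float:
    """Average wall-clock cost of one update, in milliseconds."""
    bias = 1.0
    start = time.perf_counter_ns()
    for i in range(n_updates):
        # Vary the raw estimate slightly so the loop is not a constant expression.
        bias = ema_update(bias, raw=100.0 + i % 7, observed=150.0)
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / n_updates / 1e6


overhead_ms = measure_overhead_ms()
```

This isolates only the arithmetic of the update. A convincing measurement would also time whatever surrounds it in a deployment, such as queue locks or state-store round trips, and would compare end-to-end latency with the feedback path enabled and disabled.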

Circularity Check

0 steps flagged

No circularity; empirical measurements are independent of any self-referential inputs

The paper reports experimental results on workload estimation error reduction (38.8% MAE, 40.5% RMSE) and latency improvements (42% median, 16% P99 under SJF) from runtime observations in a multi-tenant GPU scheduler. These are framed as direct measurements from evaluation runs rather than outputs of equations or fitted parameters defined in terms of themselves. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the provided text. The contribution is a reproducible empirical framework whose central claims rest on observed data, not on derivations that collapse to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new entities are described in the abstract; the contribution is an empirical systems framework.

pith-pipeline@v0.9.1-grok · 5822 in / 1017 out tokens · 18360 ms · 2026-06-28T07:45:55.826816+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 4 canonical work pages · 3 internal anchors

[1] K. Palaniappan, “GDEV-AI: A Generalized Evaluation of Deep Learning Inference Scaling and Architectural Saturation,” arXiv preprint arXiv:2602.16858, 2026.
[2] K. Palaniappan, “DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance,” arXiv preprint arXiv:2604.14552, 2026.
[3] vLLM Project Contributors, “vLLM: Easy, Fast, and Cheap LLM Serving,” 2024. [Online]. Available: https://github.com/vllm-project/vllm
[4] NVIDIA Corporation, “NVIDIA L4 Tensor Core GPU Architecture,” Technical Report, 2024.
[5] NVIDIA Corporation, “NVIDIA T4 Tensor Core GPU,” Technical Report, 2023.
[6] A. Vaswani et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems (NeurIPS), 2017.
[7] T. Brown et al., “Language Models are Few-Shot Learners,” NeurIPS, 2020.
[8] J. Dean and L. A. Barroso, “The Tail at Scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[9] L. Kleinrock, Queueing Systems Volume 1: Theory, Wiley-Interscience, 1975.
[10] A. Silberschatz, P. Galvin, and G. Gagne, Operating System Concepts, 10th ed., Wiley, 2018.
[11] Redis Labs, “Redis Documentation,” 2024. [Online]. Available: https://redis.io/docs/
[12] FastAPI Contributors, “FastAPI Framework Documentation,” 2024. [Online]. Available: https://fastapi.tiangolo.com/
[13] A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” NeurIPS, 2019.
[14] H. Mao et al., “Resource Management with Deep Reinforcement Learning,” HotNets, 2016.
[15] J. Ousterhout et al., “Sparrow: Distributed, Low Latency Scheduling,” SOSP, 2013.
[16] G. Yu, J. Gao, L. Yin, D. Liu, and M. Cai, “Orca: A distributed serving system for transformer-based generative models,” in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 521–538.
[17] A. Agrawal, A. Romero, C. Casanova, and A. Sivathanu, “Sarathi: Efficient LLM inference via chunked-prefills,” arXiv preprint arXiv:2308.16369, 2023.
[18] B. Yuan, J. Sui, and W. Lin, “FastServing: A distributed inference serving system with low latency for deep learning models,” in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 112–123.
[19] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Ding, and P. A. Bernstein, “Nexus: A GPU cluster engine for highly scalable, low-latency deep learning inference,” in Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2019, pp. 96–111.
[20] W. Kwon, Z. Li, S. Zhuang, J. Sheng, R. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2023, pp. 611–626.
[21] S. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y. Fu, Z. Xie, C. Sala, I. Stoica, and C. Ré, “FlexGen: High-throughput generation for large language models with decentralized hardware,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 31021–31040.
[22] C. A. Waldspurger and W. E. Weihl, “Lottery Scheduling: Flexible Proportional-Share Resource Management,” in Proc. OSDI, 1994.
[23] L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv preprint arXiv:2312.07104, 2023.
[24] NVIDIA Corporation, “TensorRT-LLM: TensorRT for Large Language Model Inference,” 2024.
[25] K. Palaniappan, “DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference,” GitHub Repository, 2026. [Online]. Available: https://github.com/kpalania1/driftsched
