pith. sign in

arxiv: 2606.00946 · v1 · pith:3QJ46CVBnew · submitted 2026-05-31 · 💻 cs.DC · cs.AI· cs.LG

Lodestar: An Online-Learning LLM Inference Router

Pith reviewed 2026-06-28 16:53 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG
keywords LLM inference routingonline learningtime-to-first-tokendistributed GPU clustersrequest schedulingprefix cachecloud serving systems
0
0 comments X

The pith

Lodestar routes each LLM inference request by training an online predictor on per-request cluster snapshots to choose the instance that maximizes a reward such as low time-to-first-token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lodestar as a routing system for LLM inference across distributed GPU clusters. Traditional load balancing and heuristic methods struggle because execution time depends on input length, KV-cache reuse creates dependencies between requests, and hardware can be heterogeneous. Lodestar instead gathers fine-grained snapshots of instance state, request features, and measured outcomes for every request, then continuously trains a reward predictor to decide the best destination. A sympathetic reader would expect this to deliver lower latency and better utilization without needing hand-tuned rules that break when workloads shift. The system is designed to plug into existing engines and to adapt within minutes to new conditions.

Core claim

Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize a given reward such as minimizing TTFT.

What carries the argument

The online reward predictor trained in real time from per-request snapshots of instance state, request characteristics, and observed performance.

If this is right

  • Average TTFT drops by a factor of 1.41 and P99 TTFT by 1.47 relative to a strong prefix-cache and load-aware baseline.
  • Gains reach up to 4.38x on heterogeneous clusters and the system learns effective policies within about five minutes.
  • The same snapshot-driven predictor can be retargeted to other rewards such as throughput or energy without changing the core collection mechanism.
  • No changes are required to the underlying serving engine because the router operates at the request-assignment layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The predictor could be extended to jointly optimize routing and batch-size decisions if the snapshots also captured pending batch queues.
  • Because adaptation happens online, the approach may naturally handle model updates or engine version changes that would invalidate static heuristics.
  • In multi-tenant settings the same machinery could route requests across models of different sizes once the reward function is made model-aware.

Load-bearing premise

Continuously collecting detailed per-request snapshots of instance state and performance is feasible at scale without adding significant overhead, and the online predictor adapts reliably without instability during learning.

What would settle it

Run Lodestar on a production-scale cluster while measuring the latency overhead of the snapshot collection and predictor training; if the added cost erases the reported TTFT gains or routing becomes unstable in the first minutes, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.00946 by Brighten Godfrey, Gangmuk Lim, Jiaxin Shan, Le Xu, Liguang Xie, Wanyu Zhao.

Figure 1
Figure 1. Figure 1: TTFT performance of the four existing routing policies under a high prefix sharing (80%) workload. Experiments were conducted on a seven-NVIDIA-L20 GPU instance cluster using the DeepSeek 7B model with FP16 precision. Two different workloads were used for each experiment (Left: RPS=5, avg input length=4K, Right: RPS=10, avg input length=1K). 20 40 60 80 100 Threshold τ (%) 0 500 1000 1500 Avg TTFT (ms) Lod… view at source ↗
Figure 2
Figure 2. Figure 2: Prefix-cache routing with different prefix hit thresh￾olds. The threshold (τ) means the minimum prefix hit ratio to enable prefix-aware routing. If the max possible prefix hit ratio across all instances is lower than the τ, it uses least request. from offline profiles. These approaches share a fundamen￾tal problem. First, such LLM layer-wise latency models are likely to be inaccurate: request latency emerg… view at source ↗
Figure 3
Figure 3. Figure 3: Figure 3a shows the accuracy performance of the offline trained model when it is evaluated offline and when it was evaluated online. Blue dots show the offline predictions of the offline trained model. Red dots show the online predictions when the same offline trained model was used online. The diagonal line represents perfect prediction. The performance degradation highlights the necessity of an online ad… view at source ↗
Figure 5
Figure 5. Figure 5: Reward (TTFT) prediction performance in linear regres￾sion and neural network. from potential failure, slowdown, or unreliable decision of Routing Service. Next, we describe these components in more detail, be￾ginning with the learning-based routing policy (§4.1), which shows how Lodestar makes decisions for each request. We then cover the system design of Stateful Gateway (§4.2) and Routing Service includ… view at source ↗
Figure 4
Figure 4. Figure 4: Overall architecture. complex routing logic with additional overhead for LLM in￾ference can be compensated with much larger margin. And the per-decision value is also much higher: each decision com￾mits an expensive GPU for a relatively long time and, through shared KV cache and queue state, propagates to subsequent requests—so a single bad choice can waste seconds of ac￾celerator time. This newly widened … view at source ↗
Figure 6
Figure 6. Figure 6: TTFT latency comparison for Mooncake workloads in aggregated deployment. The numbers noted inside the parentheses above Lodestar are its relative performance against Prefix-cache-and-load-aware. 1 2 3 Policy index 0 5000 10000 15000 Avg TTFT (ms) 12665 6197 4297 (0.69) Sharing ratio 70% (RPS 6) 0 20000 40000 33383 16359 14381 (0.88) 1 2 3 Policy index 0 1000 2000 1853 1559 1220 (0.78) Sharing ratio 50% (RP… view at source ↗
Figure 7
Figure 7. Figure 7: TTFT latency comparison for different prefix sharing ratio workloads (10%, 30%, 50%, 70%, Mixed%) in aggregated deployment. 1 2 3 Policy index 0 500 1000 1500 2000 Avg TTFT (ms) 1722 1215 1088 (0.90) Conversation — RPS 21 0 2500 5000 7500 10000 8412 5008 4212 (0.84) 1 2 3 Policy index 0 500 1000 1500 2000 1656 1611 1470 (0.91) ToolAgent — RPS 21 0 2500 5000 7500 10000 7328 8298 5920 (0.71) 1 2 3 Policy ind… view at source ↗
Figure 8
Figure 8. Figure 8: TTFT latency comparison for prefill-only workload. calls; and synthetic is constructed by mixing three datasets: ShareGPT [3], LeVal [13] and LooGLE [34] [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: TTFT latency comparison in heterogeneous GPU cluster in aggregated deployment. The right-most figure shows the experiment with changing loads. not tied to any specific LLM model, engine configuration, or cluster configuration. For example, multi-GPU inference with tensor parallelism or pipeline parallelism can be considered as one logical instance with a group of GPUs serving each request as a whole and ro… view at source ↗
Figure 11
Figure 11. Figure 11: Online adaptation experiment with dynamically chang￾ing prefix sharing ratio distribution. The average prefix sharing ratio changes from 5% to 50% in the middle of the experiment. Lodestar (mid-frozen) stopped learning online right before the work￾load changed. Lodestar kept learning online continuously. In (d) and (e), the solid lines indicate the target metrics of the selected instance by routing policy… view at source ↗
Figure 10
Figure 10. Figure 10: Routing decision example in a heterogeneous GPU cluster for each routing policy. The bars show mean TTFT of each instance, and dots show the total number of requests routed to it. The instances are sorted from higher latency to lower from left to right on the x-axis. that static heuristics cannot navigate. Lodestar finds a better policy and exhibits the opposite pattern: A30 instances see slightly lower a… view at source ↗
Figure 14
Figure 14. Figure 14: K candidate filtering ablation study (ToolAgent workload, RPS 12). cost initially grows linearly and plateaus once the replay buffer saturates. By retaining only informative and recent sam￾ples, Lodestar’s data selection is both effective and bounded in cost. 5.6 Consistent hashing based filtering [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
Figure 13
Figure 13. Figure 13: Training data selection algorithm ablation study (LS: Lodestar). learned policy reduces TTFT. And again, the model observes them and updates the model toward the direction. On the other hand, Lodestar (mid-frozen) policy spreads out KV blocks even in the 50%-shared workload and starts to oversubscribe KV cache space. Eventually, it leads to more KV evictions in the caches and low max KV cache hit ratio. 5… view at source ↗
Figure 15
Figure 15. Figure 15: Mooncake workload distribution (Conversation, ToolA￾gent, Synthetic). 1 2 3 4 Policy index 0 500 1000 1500 Avg TTFT (ms) 796 1183 552 (0.69) 764 (0.65) Conversation — RPS 9 0 2000 4000 6000 8000 10000 6345 8397 3507 (0.55) 4006 (0.48) 1 2 3 4 Policy index 0 500 1000 1500 814 1199 669 (0.82) 843 (0.70) ToolAgent — RPS 10 0 2000 4000 6000 8000 P99 TTFT (ms) 6165 6750 4803 (0.78) 5130 (0.76) 1. Prefix-and-lo… view at source ↗
Figure 16
Figure 16. Figure 16: TTFT performance with and without bitsandbytes quan￾tization configuration on vLLM. bitsandbytes is a popular on-the-fly quantization method that compress FP16 KV into INT4/INT8 when storing them. When it is used, vLLM engine decompresses the KVs back to the original precision. It enables memory-efficient LLM inference at the cost of computation overhead during decompress [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
read the original abstract

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix cache and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Lodestar, a cloud-native online-learning request router for distributed LLM inference clusters that integrates with vLLM. It continuously collects per-request snapshots of instance state, request characteristics, and observed performance to train an online reward predictor, which is then used to route each request to the instance expected to maximize a chosen reward (e.g., minimizing TTFT). Experiments on a public-cloud GPU cluster report that Lodestar achieves 1.41× lower average TTFT and 1.47× lower P99 TTFT on average (with larger gains on heterogeneous clusters) versus a state-of-the-art prefix-cache and load-aware heuristic, while adapting to workload changes within approximately five minutes.

Significance. If the overhead of snapshot collection and online prediction is shown to be negligible, the work would provide a practical, adaptive alternative to static heuristics for LLM serving, particularly valuable in heterogeneous or dynamic environments where input-dependent execution and KV-cache coupling make traditional load balancing ineffective. The emphasis on continuous online adaptation and seamless integration with existing stacks is a positive contribution.

major comments (3)
  1. [Abstract and experimental evaluation (likely §5–6)] The abstract and experimental evaluation sections do not report measurements of the per-request snapshot collection latency, reward-predictor inference time, or overall routing decision overhead when integrated with vLLM. Because the headline TTFT gains are modest on homogeneous clusters (2.15×/1.86×), even a few milliseconds of added latency per request could materially reduce or reverse the claimed benefit; this measurement is load-bearing for the central performance claim.
  2. [System design and online learning sections (likely §3–4)] The paper states that Lodestar “learns these efficient routing strategies within about 5 minutes,” yet provides no details on the online training procedure (update frequency, reward model architecture, handling of non-stationarity, or safeguards against poor decisions during early learning). Without these, it is impossible to assess whether the reported adaptation speed is reproducible or stable across workloads.
  3. [Evaluation setup (likely §5)] The comparison baseline is described only as “a state-of-the-art prefix cache and load-aware heuristic.” The manuscript should explicitly name the baseline, cite its source, and report its configuration parameters so that the 1.41×/1.47× gains can be independently verified.
minor comments (2)
  1. [Figures in evaluation section] Figure captions and axis labels should explicitly state whether TTFT numbers include or exclude the routing decision latency.
  2. [Abstract and §5] The abstract claims results “on average” across experiments; the manuscript should clarify the number of runs, statistical significance tests, and workload characteristics (request arrival rates, context-length distributions) used to compute the averages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major comments below.

read point-by-point responses
  1. Referee: The abstract and experimental evaluation sections do not report measurements of the per-request snapshot collection latency, reward-predictor inference time, or overall routing decision overhead when integrated with vLLM. Because the headline TTFT gains are modest on homogeneous clusters (2.15×/1.86×), even a few milliseconds of added latency per request could materially reduce or reverse the claimed benefit; this measurement is load-bearing for the central performance claim.

    Authors: We agree that quantifying these overheads is essential to support the performance claims. In the revised manuscript we will add measurements of per-request snapshot collection latency, reward-predictor inference time, and end-to-end routing decision overhead (under representative loads) to the experimental evaluation section. revision: yes

  2. Referee: The paper states that Lodestar “learns these efficient routing strategies within about 5 minutes,” yet provides no details on the online training procedure (update frequency, reward model architecture, handling of non-stationarity, or safeguards against poor decisions during early learning). Without these, it is impossible to assess whether the reported adaptation speed is reproducible or stable across workloads.

    Authors: We will expand the system design and online-learning sections with a detailed description of the training procedure, including update frequency, reward-model architecture, handling of non-stationarity, and any safeguards used during early learning. revision: yes

  3. Referee: The comparison baseline is described only as “a state-of-the-art prefix cache and load-aware heuristic.” The manuscript should explicitly name the baseline, cite its source, and report its configuration parameters so that the 1.41×/1.47× gains can be independently verified.

    Authors: We will name the baseline explicitly, add the appropriate citation, and report its configuration parameters in the evaluation-setup section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance claims with no derivation chain

full rationale

The paper presents an online learning router for LLM inference with empirical results on TTFT reductions versus baselines. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains appear in the abstract or described method. The central claims rest on measured performance in cloud experiments rather than any reduction of results to inputs by construction. This is the expected non-finding for an applied systems paper whose value is in implementation and benchmarking.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The online reward predictor and per-request snapshot collection are implied mechanisms but not detailed enough to ledger.

pith-pipeline@v0.9.1-grok · 5824 in / 1075 out tokens · 15860 ms · 2026-06-28T16:53:58.111198+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    https://docs.nvidia.com/ dynamo/latest/

    Nvidia dynamo. https://docs.nvidia.com/ dynamo/latest/. Accessed: 26-Oct-2025

  2. [2]

    https: //www.amd.com/en/products/accelerators/ instinct/mi300/mi300a.html, 2023

    AMD Instinct MI300A Accelerators. https: //www.amd.com/en/products/accelerators/ instinct/mi300/mi300a.html, 2023

  3. [3]

    https://sharegpt.com, 2023

    ShareGPT: Share your wildest ChatGPT conversations. https://sharegpt.com, 2023

  4. [4]

    https: //www.nvidia.com/en-us/data-center/ technologies/blackwell-architecture/, 2024

    NVIDIA Blackwell Architecture. https: //www.nvidia.com/en-us/data-center/ technologies/blackwell-architecture/, 2024

  5. [5]

    https:// blog.google/products/google-cloud/ ironwood-tpu-age-of-inference/, 2025

    Google Ironwood: The first Google TPU for the age of inference. https:// blog.google/products/google-cloud/ ironwood-tpu-age-of-inference/, 2025

  6. [6]

    JITServe: SLO-aware LLM serving with imprecise re- quest information, 2025

  7. [7]

    llm-d.https://github.com/llm-d/llm-d, 2025

  8. [8]

    https://www.nvidia

    NVIDIA Vera Rubin Platform. https://www.nvidia. com/en-us/data-center/technologies/rubin/, 2025

  9. [9]

    Gulavani, Alexey Tu- manov, and Ramachandran Ramjee

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tu- manov, and Ramachandran Ramjee. Taming throughput- latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024

  10. [10]

    Gulavani, and Ramachandran Ramjee

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggyback- ing decodes with chunked prefills, 2023

  11. [11]

    Llmrank: Under- standing llm strengths for model routing, 2025

    Shubham Agrawal and Prasang Gupta. Llmrank: Under- standing llm strengths for model routing, 2025

  12. [12]

    Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale, 2022

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, and Yuxiong He. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale, 2022

  13. [13]

    L-Eval: Insti- tuting standardized evaluation for long context language models, 2023

    Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Insti- tuting standardized evaluation for long context language models, 2023

  14. [14]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Sim- ple llm inference acceleration framework with multiple decoding heads. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  15. [15]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei A. Zaharia, and James Y . Zou. Frugalgpt: How to use large language models while re- ducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

  16. [16]

    {IMPRESS}: An {Importance- Informed}{Multi-Tier} prefix {KV} storage system for large language model inference

    Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. {IMPRESS}: An {Importance- Informed}{Multi-Tier} prefix {KV} storage system for large language model inference. In23rd USENIX Con- ference on File and Storage Technologies (FAST 25), pages 187–201, 2025

  17. [17]

    arXiv preprint arXiv:2510.09665 , year=

    Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang. Lmcache: An efficient kv cache layer for enterprise-scale llm inference.arXiv preprint arXiv:2510.09665, 2025

  18. [18]

    Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anasta- sios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: an open platform for evaluating llms by human preference. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  19. [19]

    Lmdeploy: A toolkit for com- pressing, deploying, and serving llm

    LMDeploy Contributors. Lmdeploy: A toolkit for com- pressing, deploying, and serving llm. https://github. com/InternLM/lmdeploy, 2023

  20. [20]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

  21. [21]

    Pre- fillonly: An inference engine for prefill-only workloads in large language model applications

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaox- uan Liu, Yifan Qiao, Ion Stoica, and Junchen Jiang. Pre- fillonly: An inference engine for prefill-only workloads in large language model applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles, SOSP ’25, page 399–4...

  22. [22]

    Turbotransformers: an efficient gpu serving system for transformer models

    Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. Turbotransformers: an efficient gpu serving system for transformer models. InProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 389–402, 2021

  23. [23]

    Efficient llm scheduling by learn- ing to rank

    Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Sto- ica, and Hao Zhang. Efficient llm scheduling by learn- ing to rank. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 59006–59029. Curran Associates, Inc., 2024

  24. [24]

    Cost-effective attention reuse across multi-turn conversations in large language model serv- ing

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-effective attention reuse across multi-turn conversations in large language model serv- ing. InUSENIX Annual Technical Conference (ATC 24), 2024

  25. [25]

    Prompt cache: Modular attention reuse for low-latency inference.Pro- ceedings of Machine Learning and Systems, 6:325–338, 2024

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Pro- ceedings of Machine Learning and Systems, 6:325–338, 2024

  26. [26]

    Serving DNNs like clockwork: Performance predictability from the bottom up

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In14th USENIX Sym- posium on Operating Systems Design and Implementa- tion (OSDI 20), pages 443–462. USENIX Association, November 2020

  27. [27]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shi- rong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  28. [28]

    Towards generalized routing: Model and agent orchestration for adaptive and efficient inference, 2025

    Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, and Junlan Feng. Towards generalized routing: Model and agent orchestration for adaptive and efficient inference, 2025

  29. [29]

    RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shan- tanu Acharya, Dima Rekesh, Fei Jia, and Boris Gins- burg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

  30. [30]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  31. [31]

    Gonza- lez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonza- lez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with page- dattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  32. [32]

    InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In18th USENIX Symposium on Operating Systems De- sign and Implementation (OSDI 24), 2024

  33. [33]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  34. [34]

    LooGLE: Can long-context language models understand long contexts?, 2023

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. LooGLE: Can long-context language models understand long contexts?, 2023

  35. [35]

    Eagle: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: speculative sampling requires rethinking feature uncertainty. InProceedings of the 41st Inter- national Conference on Machine Learning, ICML’24. JMLR.org, 2024

  36. [36]

    Opportunities and challenges in service layer traffic engineering

    Gangmuk Lim, Aditya Prerepa, Brighten Godfrey, and Radhika Mittal. Opportunities and challenges in service layer traffic engineering. InProceedings of the 23rd ACM Workshop on Hot Topics in Networks, pages 352– 359, 2024

  37. [37]

    KV-Cache Indexer: Architecture

    llm-d Authors. KV-Cache Indexer: Architecture. https://github.com/llm-d/llm-d-kv-cache/ blob/main/docs/architecture.md, 2026. Ac- cessed: 2026-04-23

  38. [38]

    Helix: Serving large language models over heterogeneous gpus and net- work via max-flow

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. Helix: Serving large language models over heterogeneous gpus and net- work via max-flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol- ume 1, pages 586–602, 2025

  39. [39]

    Mitzenmacher

    M. Mitzenmacher. The power of two choices in random- ized load balancing.IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001

  40. [40]

    Heterogeneity-aware cluster scheduling policies for deep learning workloads

    Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. Heterogeneity-aware cluster scheduling policies for deep learning workloads. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 481–498, 2020

  41. [41]

    Fastertransformer

    NVIDIA. Fastertransformer. https://github.com/ NVIDIA/FasterTransformer, 2020

  42. [42]

    Dynamo kv-aware router

    NVIDIA Dynamo team. Dynamo kv-aware router. https://docs.nvidia.com/dynamo/latest/ user-guides/kv-cache-aware-routing, 2025

  43. [43]

    Gonzalez, M Waleed Kadous, and Ion Stoica

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024

  44. [44]

    Mar- coni: Prefix caching for the era of hybrid llms, 2025

    Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Mar- coni: Prefix caching for the era of hybrid llms, 2025

  45. [45]

    Prefill-as-a-service: Kvcache of next-generation models could go cross-datacenter, 2026

    Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xin- ran Xu, Yongwei Wu, Weimin Zheng, and Mingxing Zhang. Prefill-as-a-service: Kvcache of next-generation models could go cross-datacenter, 2026

  46. [46]

    Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25), pages 155–170, Santa Clara, CA, February 2025. USENIX Association

  47. [47]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pag...

  48. [48]

    Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

    Chaoyi Ruan, Yinhe Chen, Dongqi Tian, Yandong Shi, Yongji Wu, Jialin Li, and Cheng Li. Dynaserve: Unified and elastic execution for dynamic disaggregated llm serving, 2025

  49. [49]

    Campbell, Aditya Akella, Christopher J

    Divyanshu Saxena, Jiayi Chen, Sujay Yadalam, Yeonju Ro, Rohit Dwivedula, Eric H. Campbell, Aditya Akella, Christopher J. Rossbach, and Michael Swift. How i learned to stop worrying and love learned os policies. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, HotOS ’25, page 1–7, New York, NY , USA, 2025. Association for Computing Machinery

  50. [50]

    Aibrix: Towards scalable, cost-effective large language model inference infrastructure, 2025

    Jiaxin Shan, Varun Gupta, Le Xu, Haiyang Shi, Jingyuan Zhang, Ning Wang, Linhui Xu, Rong Kang, Tongping Liu, Yifei Zhang, Yiqing Zhu, Shuowei Jin, Gangmuk Lim, Binbin Chen, Zuzhi Chen, Xiao Liu, Xin Chen, Kante Yin, Chak-Pong Chung, Chenyu Jiang, Yicheng Lu, Jianjun Chen, Caixue Lin, Wu Xiang, Rui Shi, and Liguang Xie. Aibrix: Towards scalable, cost-effec...

  51. [51]

    Preble: Efficient distributed prompt scheduling for llm serving, 2024

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dong- ming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for llm serving, 2024

  52. [52]

    C3: Cutting tail latency in cloud data stores via adaptive replica selection

    Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In12th USENIX Sympo- sium on Networked Systems Design and Implementation (NSDI 15), pages 513–527, Oakland, CA, May 2015. USENIX Association

  53. [53]

    AIBrix Gateway: Pre- fix Cache and Load-Aware Routing

    The AIBrix Team. AIBrix Gateway: Pre- fix Cache and Load-Aware Routing. https: //github.com/vllm-project/aibrix/blob/main/ pkg/plugins/gateway/algorithms/README.md,

  54. [54]

    Introduced in AIBrix v0.3.0, accessed 2026-04- 22

  55. [55]

    SGLang model gate- way: prefix_hash routing policy

    The SGLang Team. SGLang model gate- way: prefix_hash routing policy. https: //github.com/sgl-project/sglang/blob/ 95910331797f9d42d69773d847910c10a050c247/ sgl-model-gateway/src/policies/prefix_hash. rs, 2025. Commit 9591033, accessed 2026-04-22

  56. [56]

    GCR: Gradient coreset based re- play buffer selection for continual learning

    Rishabh Tiwari, Krishnateja Killamsetty, Rishabh Iyer, and Pradeep Shenoy. GCR: Gradient coreset based re- play buffer selection for continual learning. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  57. [57]

    vllm seman- tic router

    vLLM Semantic Router Team. vllm seman- tic router. https://github.com/vllm-project/ semantic-router, 2025

  58. [58]

    Kvcache cache in the wild: Char- acterizing and optimizing kvcache cache at a large cloud provider

    Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. Kvcache cache in the wild: Char- acterizing and optimizing kvcache cache at a large cloud provider. In2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025

  59. [59]

    Hetis: Serving LLMs in heterogeneous GPU clusters with fine-grained and dynamic parallelism

    Zizhao Wang, Yuhao Hu, Jiaqi Wang, Jiahao Du, Yanghua Liu, Yuyang Ma, et al. Hetis: Serving LLMs in heterogeneous GPU clusters with fine-grained and dynamic parallelism. InProceedings of the Inter- national Conference for High Performance Comput- ing, Networking, Storage and Analysis (SC ’25), 2025. https://arxiv.org/abs/2509.08309

  60. [60]

    MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clus- ters

    Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clus- ters. In19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 945–960, 2022

  61. [61]

    Fast distributed inference serving for large language models, 2024

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models, 2024

  62. [62]

    Deserve: Towards affordable offline llm inference via decentralization, 2025

    Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, and Dawn Song. Deserve: Towards affordable offline llm inference via decentralization, 2025

  63. [63]

    Rum- ble, and Aaron Archer

    Bartek Wydrowski, Robert Kleinberg, Stephen M. Rum- ble, and Aaron Archer. Load is not what you should balance: Introducing prequal. In21st USENIX Sympo- sium on Networked Systems Design and Implementation (NSDI 24), pages 1285–1299, Santa Clara, CA, April

  64. [64]

    Towards efficient and practical gpu multitasking in the era of llm.arXiv preprint arXiv:2508.08448, 2025

    Jiarong Xing, Yifan Qiao, Simon Mo, Xingqi Cui, Gur- Eyal Sela, Yang Zhou, Joseph Gonzalez, and Ion Stoica. Towards efficient and practical gpu multitasking in the era of llm.arXiv preprint arXiv:2508.08448, 2025

  65. [65]

    Orca: A distributed serving system for Transformer-Based generative mod- els

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo- jeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative mod- els. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association

  66. [66]

    Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

    Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yang- min Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, et al. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving.arXiv preprint arXiv:2505.04021, 2025

  67. [67]

    Gonzalez, Ion Stoica, and Hao Zhang

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. LMSYS-chat-1m: A large-scale real-world LLM conversation dataset. InThe Twelfth International Conference on Learning Representations, 2024

  68. [68]

    Gonzalez, Clark Bar- rett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Bar- rett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024

  69. [69]

    Dist- serve: disaggregating prefill and decoding for goodput- optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Dist- serve: disaggregating prefill and decoding for goodput- optimized large language model serving. InProceed- ings of the 18th USENIX Conference on Operating Sys- tems Design and Implementation, OSDI’24, USA, 2024. USENIX Association

  70. [70]

    NanoFlow: Towards optimal large language model serving through- put

    Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. NanoFlow: Towards optimal large language model serving through- put. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), 2025

  71. [71]

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. Megascale- infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism. InPro...

  72. [72]

    Prefix-and-load-aware (fp16)

  73. [73]

    Prefix-and-load-aware (quant)

  74. [74]

    bitsandbytes is a popular on-the-fly quantization method that compress FP16 KV into INT4/INT8 when storing them

    Lodestar (quant) Figure 16:TTFT performance with and without bitsandbytes quan- tization configuration on vLLM. bitsandbytes is a popular on-the-fly quantization method that compress FP16 KV into INT4/INT8 when storing them. When it is used, vLLM engine decompresses the KVs back to the original precision. It enables memory-efficient LLM inference at the c...