BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.
arXiv preprint arXiv:2508.14544 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.
The paper develops fluid-guided online scheduling algorithms (WAIT and Nested WAIT) for LLM inference that handle endogenous KV-cache memory growth and improve stability and latency over baselines in simulations.
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
citing papers explorer
-
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.
-
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.
-
Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.
-
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
The paper develops fluid-guided online scheduling algorithms (WAIT and Nested WAIT) for LLM inference that handle endogenous KV-cache memory growth and improve stability and latency over baselines in simulations.
-
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.