arXiv preprint arXiv:2508.14544

Bäuerle, N · 2002 · arXiv 2508.14544

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

cs.LG · 2025-04-15 · unverdicted · novelty 6.0

The paper develops fluid-guided online scheduling algorithms (WAIT and Nested WAIT) for LLM inference that handle endogenous KV-cache memory growth and improve stability and latency over baselines in simulations.

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

cs.DC · 2026-05-02 · accept · novelty 4.0

LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.

citing papers explorer

Showing 5 of 5 citing papers.

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
cs.DC · 2026-05-07 · unverdicted · none · ref 3 · 2 links
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
cs.LG · 2026-05-06 · unverdicted · none · ref 18
Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
cs.LG · 2026-04-13 · unverdicted · none · ref 10
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
cs.LG · 2025-04-15 · unverdicted · none · ref 3
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
cs.DC · 2026-05-02 · accept · none · ref 54
