Splitwise: Efficient generative

Patel, Pratyush, Choukse, Esha, Zhang, Chaojie, Shah, Aashaka, Goiri

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.

citing papers explorer

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

cs.DC · 2026-05-07 · unverdicted · none · ref 13 · 2 links

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

cs.LG · 2026-05-06 · unverdicted · none · ref 47
