Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.
Déjàvu: Kv-cache streaming for fast, fault-tolerant generative llm serving,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.