StreamServe achieves 11-18x lower latency than standard vLLM setups for LLM serving by combining disaggregated prefill-decode execution with metric-aware routing and runtime-adaptive speculative decoding.
FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2405.08691, 2024a.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving