DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.DC (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.