Llumnix: Dynamic scheduling for large language model serving
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.DC (2)
Years: 2025 (2)
citing papers explorer
- CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
  CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.
- Janus: Disaggregating Attention and Experts for Scalable MoE Inference
  JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.