Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024
2 Pith papers cite this work. Polarity classification is still indexing.
years
2026 2representative citing papers
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.
citing papers explorer
-
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.