{"total":11,"items":[{"citing_arxiv_id":"2605.19593","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption","primary_cat":"cs.AI","submitted_at":"2026-05-19T09:39:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical study finds non-linear, model-size-dependent throughput degradation from offloading and high model-state reload costs from preemption in multi-LLM serving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06534","ref_index":89,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","primary_cat":"cs.DC","submitted_at":"2026-05-07T16:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Implementation (OSDI 24). 193-210. [89] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. 2025. StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation. arXiv preprint arXiv:2504.15930 (2025). [90] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. 2025. Optimizing RLHF training for large language models with stage fusion. 
In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). 489-503."},{"citing_arxiv_id":"2605.04357","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs","primary_cat":"cs.DC","submitted_at":"2026-05-05T23:25:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08151","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference","primary_cat":"cs.DC","submitted_at":"2026-05-04T01:27:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Let r be the fraction of requests whose pre-generated speculative continuation becomes invalid after verification. These requests cannot reuse the prepared continuation and fall back to autoregressive decoding, contributing Br tokens per batch. The remaining fraction 1 − r successfully reuses the prepared speculative segment and contributes B(1 − r)L tokens per batch. Thus, the total number of output tokens per round is B[r + (1 − r)L]. (42) Under Eq. 
(39), draft-side generation is hidden by target-side verification, so the per-round latency is approximated by T_par ≈ T_T. (43) The throughput of parallel speculative decoding is therefore Thr_par = B[r + (1 − r)L] / T_par ≈ B[r + (1 − r)L] / T_T. (44) Critical fallback ratio. We derive the condition under which ordinary speculative decoding outperforms"},{"citing_arxiv_id":"2604.25080","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration","primary_cat":"cs.DC","submitted_at":"2026-04-28T00:24:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23838","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training","primary_cat":"cs.LG","submitted_at":"2026-04-26T18:45:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025. [63] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. 
In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521-538, 2022. [64] Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, et al. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving. arXiv preprint arXiv:2505.04021, 2025. [65] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al."},{"citing_arxiv_id":"2604.15186","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines","primary_cat":"cs.DC","submitted_at":"2026-04-16T16:15:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07874","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate","primary_cat":"cs.OS","submitted_at":"2026-04-09T06:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT 
impact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06664","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start","primary_cat":"cs.DC","submitted_at":"2026-04-08T04:31:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04745","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Energy Cost of Execution-Idle in GPU Clusters","primary_cat":"cs.DC","submitted_at":"2026-04-06T15:10:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Execution-idle accounts for 19.7% of GPU execution time and 10.7% of energy in a large cluster, motivating power management that treats it as a distinct operating state.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"org/abs/2505.04021 [61] Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, and Haibo Chen. 2025. BLITZSCALE: fast and live large model autoscaling with O(1) host caching. In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (Boston, MA, USA) (OSDI '25). USENIX Association, USA, Article 16, 19 pages. [62] Yijia Zhang, Qiang Wang, Zhe Lin, Pengxiang Xu, and Bingqiang Wang. 2024. Improving GPU Energy Efficiency through an Application-transparent Frequency Scaling Policy with Performance Assurance. 
In Proceedings of the Nineteenth European Conference on Computer Systems (Athens, Greece) (EuroSys '24). Association for Computing Machinery, New York, NY, USA, 769-785."},{"citing_arxiv_id":"2512.09472","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving","primary_cat":"cs.DC","submitted_at":"2025-12-10T09:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WarmServe reduces tail TTFT by up to 50.8× versus autoscaling and supports 2.5× higher throughput than GPU-sharing by using one-for-many prewarming, model placement, KV cache reservation, and efficient tensor switching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}