ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
Pith reviewed 2026-05-21 08:35 UTC · model grok-4.3
The pith
Agentic RL training can borrow idle GPUs from serving clusters to raise end-to-end throughput by 1.3-3.3x over resource-fixed baselines without violating serving service-level objectives (SLOs).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ROSE realizes cooperative elasticity by co-locating heterogeneous serving and rollout models on the same GPUs through an SLO-safe executor that dynamically shares memory and compute, a weight transfer engine that uses shard-aware routing and sparsity for fast synchronization, and an elastic scheduler that routes rollouts across dedicated and opportunistic GPUs. Experiments across model sizes and cluster scales report end-to-end throughput gains of 1.3-3.3x over resource-fixed baselines and rollout time reductions of 1.2-1.5x over resource-elastic baselines, all without serving SLO violations.
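The routing idea behind the elastic scheduler can be illustrated with a small sketch. This is not ROSE's scheduler: the `GpuPool` type, the slot-based capacity unit, and the dedicated-first greedy policy are all assumptions made for illustration, since the summary does not describe the actual routing policy.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    name: str
    free_slots: int      # hypothetical unit: rollouts the pool can accept right now
    opportunistic: bool  # True for borrowed serving GPUs, False for dedicated ones


def route_rollouts(num_rollouts, pools):
    """Greedy routing (assumed policy): fill dedicated pools, then opportunistic ones.

    Returns (assignment, queued), where queued counts rollouts left unplaced.
    """
    assignment = {p.name: 0 for p in pools}
    remaining = num_rollouts
    # sorted() is stable, and False sorts before True, so dedicated pools come first.
    for pool in sorted(pools, key=lambda p: p.opportunistic):
        take = min(remaining, pool.free_slots)
        assignment[pool.name] = take
        remaining -= take
    return assignment, remaining


pools = [
    GpuPool("serving-a", free_slots=3, opportunistic=True),
    GpuPool("dedicated", free_slots=4, opportunistic=False),
    GpuPool("serving-b", free_slots=2, opportunistic=True),
]
assignment, queued = route_rollouts(8, pools)
print(assignment, queued)
```

A dedicated-first order keeps borrowed serving GPUs as overflow capacity, so a serving burst that reclaims them affects as few rollouts as possible. A real scheduler would also have to handle reclamation of in-flight rollouts, which this sketch omits.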
What carries the argument
The SLO-safe co-serving executor that dynamically shares memory and compute between serving and rollout models on the same GPUs while preserving latency guarantees.
If this is right
- Rollout phases complete faster because they access on-demand capacity without allocation delays.
- Overall post-training time for agentic RL decreases as the variable compute demand is met from existing serving pools.
- Serving clusters support additional training workloads without requiring extra dedicated hardware.
- Resource utilization rises because idle capacity in production inference fleets becomes available for training steps.
Where Pith is reading between the lines
- The same co-location pattern could apply to other bursty workloads such as online fine-tuning or evaluation jobs that run alongside serving.
- Cloud operators might redesign GPU fleets to treat serving and training as co-located rather than separate resource pools.
- If weight transfer overhead stays low at larger scales, the approach could extend to multi-tenant environments with more frequent model updates.
Load-bearing premise
Serving clusters consistently leave substantial GPU compute and memory idle and can co-locate heterogeneous models dynamically while preserving serving SLOs under bursty traffic.
What would settle it
Deploying the system on a cluster with consistently high serving load and measuring either no throughput gain or any increase in serving latency violations.
Original abstract
Agentic reinforcement learning (RL) is reshaping LLM post-training, but end-to-end training time is dominated by compute-intensive, multi-turn rollouts whose resource demand varies significantly across training steps. Resource-fixed systems cannot adapt to this variation, while resource-elastic approaches that provision external GPUs on demand suffer from high allocation overhead and limited availability. We observe that serving clusters leave substantial GPU compute and memory idle, and propose cooperative elasticity: sharing already-deployed serving GPUs with rollout workloads to provide on-demand elastic capacity. Realizing this is non-trivial, as it must preserve serving SLOs under bursty traffic while minimizing cross-cluster communication overhead. We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3-3.3x over resource-fixed baselines and reduces rollout time by 1.2-1.5x over resource-elastic baselines, with no serving SLO violations.
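One way weight sparsity can shrink synchronization traffic is to ship only the entries of a shard that changed since the last sync. The sketch below illustrates that general idea on plain Python lists. It is an assumption-laden toy, not the paper's weight transfer engine: `sparse_delta`, `apply_delta`, and the tolerance parameter are hypothetical names, and shard-aware routing is not modeled.

```python
def sparse_delta(old, new, tol=0.0):
    """Return {index: new_value} for entries whose change exceeds tol."""
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if abs(n - o) > tol}


def apply_delta(weights, delta):
    """Apply a sparse delta to a copy of the receiver's weights."""
    updated = list(weights)
    for index, value in delta.items():
        updated[index] = value
    return updated


# Illustrative values only: one shard before and after a training step.
old_shard = [0.5, -1.0, 0.25, 2.0, 0.0]
new_shard = [0.5, -1.0, 0.75, 2.0, 0.125]

delta = sparse_delta(old_shard, new_shard)   # only changed entries are sent
synced = apply_delta(old_shard, delta)       # receiver reconstructs the new shard
print(len(delta), "of", len(new_shard), "entries transferred")
```

The saving depends entirely on how many entries actually change between syncs. The index overhead of a sparse encoding means it only pays off when the changed fraction is small.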
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ROSE, a system realizing cooperative elasticity for agentic RL post-training. It co-locates rollout workloads on already-deployed serving GPUs via an SLO-safe co-serving executor, a shard-aware cross-cluster weight transfer engine, and an elastic rollout scheduler. The central empirical claim is that this yields 1.3–3.3× end-to-end throughput gains over resource-fixed baselines and 1.2–1.5× rollout-time reductions over resource-elastic baselines across model sizes and cluster scales, with no serving SLO violations.
Significance. If the reported speedups and SLO preservation hold under production burst patterns, ROSE would demonstrate a practical way to harvest idle serving capacity for variable-demand RL rollouts, reducing the need for dedicated elastic provisioning. The three-component design and cross-cluster synchronization techniques are concrete contributions to systems for heterogeneous co-location.
major comments (3)
- §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.
- §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.
- §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.
minor comments (3)
- Table 1 and §4.1: Model-size notation (e.g., “7B”, “70B”) is used inconsistently with the text; align the table headers with the exact parameter counts reported in the experimental setup.
- Figure 4: Axis labels and legend text are too small to read at standard print size; increase font size or split into two panels.
- §6 (Related Work): The discussion of prior elastic scheduling and co-location systems omits several recent papers on GPU sharing for inference; add citations to complete the positioning.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor, motivation, and guarantees that we will address to improve the manuscript. We respond to each major comment below and indicate the planned revisions.
Point-by-point responses
- Referee: §5 (Experiments): The headline 1.3–3.3× throughput and 1.2–1.5× rollout-time numbers are presented without reported variance, number of runs, or precise definition of how serving SLOs (latency, throughput) were measured under the simulated bursty traffic; this makes it impossible to judge whether the gains are robust or sensitive to post-hoc tuning.
  Authors: We agree that reporting statistical details is essential for assessing robustness. In the revised manuscript we will add the number of runs performed for each configuration (five independent runs), include error bars or standard deviations in the relevant figures, and provide an explicit description of the SLO measurement methodology. This will include the precise latency percentile (99th), throughput threshold, and how bursty traffic was generated and monitored to ensure no violations occurred.
  Revision: yes
- Referee: §2 and §3.1: The enabling premise that serving clusters consistently leave substantial GPU compute and memory idle under bursty traffic is stated as an observation but is not backed by any production traces, utilization histograms, or worst-case analysis of co-location feasibility for heterogeneous models; if sustained utilization is higher than assumed, the opportunistic capacity and therefore the reported speedups disappear.
  Authors: We acknowledge that the current motivation section relies on general observations rather than public production traces. We will expand §2 with utilization histograms generated from our bursty-traffic simulator across a range of arrival rates and model sizes, plus a new worst-case analysis subsection that quantifies the minimum idle capacity needed for net gains and shows how speedups degrade under higher sustained utilization. While we cannot release proprietary production traces, these additions will make the feasibility argument more concrete and transparent.
  Revision: partial
- Referee: §4.3 (SLO-safe co-serving executor): The dynamic memory and compute sharing mechanism is described at a high level, yet no formal bound or micro-benchmark isolates the latency impact on the serving model when rollout jobs are co-located at varying intensities; the claim of “no SLO violations” therefore rests entirely on the specific experimental traffic rather than a general guarantee.
  Authors: We will revise §4.3 to include dedicated micro-benchmarks that isolate serving-model latency under controlled rollout intensities, varying both compute and memory sharing ratios while holding serving traffic fixed. These experiments will report latency distributions and the maximum rollout intensity at which the 99th-percentile SLO remains satisfied. Although deriving a tight formal latency bound is difficult given nondeterministic GPU scheduling, the added micro-benchmarks will provide empirical evidence beyond the end-to-end traffic scenarios and clarify the operating regime in which SLOs are preserved.
  Revision: yes
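The micro-benchmark analysis promised in the responses above (99th-percentile latency per rollout intensity, and the largest intensity that still meets the SLO) could be computed along the following lines. This is a hedged sketch: the nearest-rank percentile definition, the function names, the 100 ms SLO, and every latency sample are assumptions for illustration, not values or methods from the paper.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the value at rank ceil(q/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def max_safe_intensity(latencies_by_intensity, slo_ms, q=99):
    """Largest intensity such that it and every lower tested intensity meet the SLO.

    Returns None if even the lowest tested intensity violates the SLO.
    """
    safe = None
    for intensity in sorted(latencies_by_intensity):
        if percentile(latencies_by_intensity[intensity], q) > slo_ms:
            break
        safe = intensity
    return safe


# Made-up serving latencies (ms) at three hypothetical rollout intensities.
measurements = {
    0.0: [40] * 99 + [60],
    0.5: [45] * 99 + [90],
    1.0: [50] * 98 + [150, 160],
}

for intensity in sorted(measurements):
    print(intensity, percentile(measurements[intensity], 99))
print("max safe intensity:", max_safe_intensity(measurements, slo_ms=100))
```

Stopping at the first violating intensity, rather than taking the maximum over all passing ones, avoids reporting a high intensity as safe when a lower one already failed. An empirical threshold of this kind describes only the tested traffic and is not a general guarantee.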
Circularity Check
No circularity in ROSE derivation chain
The paper is a systems description of ROSE for cooperative elasticity, with three engineering components (SLO-safe co-serving executor, cross-cluster weight transfer engine, elastic rollout scheduler) and performance claims supported solely by experimental measurements across model sizes and cluster scales. No mathematical derivations, equations, fitted parameters presented as predictions, or first-principles results appear in the provided text. The idle-capacity observation is an empirical premise, not a derived quantity, and the speedups are direct experimental outcomes rather than reductions to inputs by construction. The derivation chain is therefore self-contained with independent empirical content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Serving clusters leave substantial GPU compute and memory idle under normal operation.
- domain assumption Co-location of heterogeneous serving and rollout models can preserve serving SLOs under bursty traffic.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3-3.3x over resource-fixed baselines
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.