AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
Pith reviewed 2026-05-20 20:05 UTC · model grok-4.3
The pith
AstraFlow decouples rollout, dataflow management, and training into autonomous components to handle complex agentic RL workloads on elastic heterogeneous resources without custom engineering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AstraFlow replaces conventional trainer-centered control with principled component abstractions. In this system, rollout services, dataflow management, and training are decoupled into autonomous components. This enables native support for complex multi-policy agentic RL workloads and efficient exploitation of diverse compute resources. Evaluations across math, code, search, and AgentBench workloads show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, it achieves comparable or better accuracy than existing RL systems while speeding up training by 2.7x.
What carries the argument
Autonomous component decoupling in a dataflow-oriented architecture, which separates rollout services, dataflow management, and training to enable flexible workload handling.
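To make the decoupling idea concrete, here is a minimal sketch, not AstraFlow's actual API. It assumes three kinds of autonomous workers that communicate only through queues: rollout services that emit trajectories, a dataflow manager that routes them by policy, and one trainer per policy. Every name in it (`Trajectory`, `rollout_service`, `dataflow_manager`, `trainer`, `run_demo`, the `policy_a`/`policy_b` labels) is hypothetical and chosen for illustration only.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str   # hypothetical tag: which policy produced this rollout
    tokens: list     # variable-length payload
    reward: float


def rollout_service(policy_id, n, out_q):
    """Emit n toy trajectories without knowing who consumes them."""
    for i in range(n):
        out_q.put(Trajectory(policy_id, list(range(i + 1)), float(i)))
    out_q.put(None)  # sentinel: this producer has finished


def dataflow_manager(in_q, n_producers, trainer_qs):
    """Route each trajectory to the queue of the trainer for its policy."""
    finished = 0
    while finished < n_producers:
        item = in_q.get()
        if item is None:
            finished += 1
            continue
        trainer_qs[item.policy_id].put(item)
    for q in trainer_qs.values():
        q.put(None)  # tell every trainer that no more data will arrive


def trainer(in_q, seen):
    """Consume trajectories; a real trainer would take gradient steps here."""
    while True:
        item = in_q.get()
        if item is None:
            break
        seen.append(item)


def run_demo():
    raw_q = queue.Queue()
    trainer_qs = {"policy_a": queue.Queue(), "policy_b": queue.Queue()}
    seen = {"policy_a": [], "policy_b": []}
    threads = [
        threading.Thread(target=rollout_service, args=("policy_a", 3, raw_q)),
        threading.Thread(target=rollout_service, args=("policy_b", 2, raw_q)),
        threading.Thread(target=dataflow_manager, args=(raw_q, 2, trainer_qs)),
        threading.Thread(target=trainer, args=(trainer_qs["policy_a"], seen["policy_a"])),
        threading.Thread(target=trainer, args=(trainer_qs["policy_b"], seen["policy_b"])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {name: len(items) for name, items in seen.items()}
```

In this toy version no component calls another directly, so adding a second policy only means adding one more queue and one more trainer thread. Whether the real system achieves the same property at scale is exactly what the paper's evaluation is meant to show.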
If this is right
- Multi-policy collaborative training runs on the same system with comparable or better accuracy.
- Elastic scaling and heterogeneous cross-region execution are supported without any system code modifications.
- Composable data algorithms can be added directly for different RL workloads.
- Multi-policy collaborative training runs 2.7x faster than on existing RL systems, with comparable or better accuracy.
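One way to picture "composable data algorithms" is as small batch-to-batch functions chained into a pipeline. The sketch below is an assumption about what such composition could look like, not the paper's interface. The two stages (a staleness filter and a filter that drops prompt groups whose rewards are all identical) and the record fields `prompt_id`, `reward`, and `lag` are illustrative choices that do not come from the source.

```python
from functools import reduce


def compose(*stages):
    """Chain batch -> batch stages left to right into one pipeline."""
    def pipeline(batch):
        return reduce(lambda acc, stage: stage(acc), stages, batch)
    return pipeline


def drop_stale(max_lag):
    """Build a stage that keeps rollouts at most max_lag policy versions old."""
    def stage(batch):
        return [r for r in batch if r["lag"] <= max_lag]
    return stage


def drop_uninformative_groups(batch):
    """Drop prompt groups whose rewards are all identical (no learning signal)."""
    groups = {}
    for r in batch:
        groups.setdefault(r["prompt_id"], []).append(r)
    return [
        r
        for group in groups.values()
        if len({x["reward"] for x in group}) > 1
        for r in group
    ]


# Hypothetical batch of rollout records.
batch = [
    {"prompt_id": 0, "reward": 1.0, "lag": 0},
    {"prompt_id": 0, "reward": 0.0, "lag": 1},
    {"prompt_id": 1, "reward": 1.0, "lag": 0},
    {"prompt_id": 1, "reward": 1.0, "lag": 0},
    {"prompt_id": 2, "reward": 0.0, "lag": 5},
    {"prompt_id": 2, "reward": 1.0, "lag": 0},
]

pipeline = compose(drop_stale(max_lag=2), drop_uninformative_groups)
kept = pipeline(batch)
```

Here only the two rollouts for prompt 0 survive: prompt 1 has identical rewards, and prompt 2 loses its stale rollout and is left with a single reward value. Swapping or reordering stages changes the data algorithm without touching any code outside the pipeline definition.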
Where Pith is reading between the lines
- The decoupling pattern may reduce engineering effort when adapting RL systems to new hardware types or training strategies.
- Wider adoption could lower the overall compute cost of developing agentic language models by improving resource sharing across regions.
- Similar component separation might simplify scaling for other distributed training paradigms beyond reinforcement learning.
Load-bearing premise
Decoupling the components into autonomous units creates no significant coordination overhead or integration issues that would undermine the speedups and workload support in large-scale production deployments.
What would settle it
A direct comparison of end-to-end training time for multi-policy collaborative RL on a large heterogeneous cross-region cluster, checking whether coordination costs between the decoupled components reduce the speedup below 2.7x or introduce accuracy loss.
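The arithmetic behind this test is simple enough to sketch. The snippet below assumes speedup is defined as baseline wall-clock time divided by the system's wall-clock time, and shows how extra coordination time erodes it. The hour figures are made-up placeholders chosen to reproduce a 2.7x ratio; they are not measurements from the paper.

```python
def speedup(baseline_time, system_time):
    """End-to-end speedup: baseline wall-clock time over system wall-clock time."""
    if baseline_time <= 0 or system_time <= 0:
        raise ValueError("times must be positive")
    return baseline_time / system_time


def speedup_with_overhead(baseline_time, system_time, extra_coordination_time):
    """Speedup after adding coordination time not present in the original run."""
    if extra_coordination_time < 0:
        raise ValueError("overhead must be non-negative")
    return speedup(baseline_time, system_time + extra_coordination_time)


# Hypothetical numbers: 27 h for the baseline, 10 h for the decoupled system.
reported = speedup(27.0, 10.0)
# Suppose a larger cross-region cluster adds 3.5 h of coordination cost.
degraded = speedup_with_overhead(27.0, 10.0, 3.5)
```

With these placeholder values the speedup falls from 2.7x to 2.0x, which illustrates why the comparison has to be run at the larger scale rather than extrapolated from the reported figure.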
Original abstract
Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AstraFlow, a dataflow-oriented RL system for agentic LLMs. It replaces trainer-centered control with decoupled autonomous components for rollout services, dataflow management, and training. This design is claimed to natively support multi-policy collaborative training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. Evaluations across math, code, search, and AgentBench workloads report that AstraFlow achieves comparable or better accuracy than existing RL systems while delivering a 2.7x speedup in multi-policy training time.
Significance. If the speedup and flexibility claims hold under production-scale conditions, the work would be significant for scaling RL to agentic LLMs by reducing custom engineering overhead through principled component abstractions. The dataflow orientation could enable better exploitation of elastic and heterogeneous resources, addressing a practical bottleneck in current systems.
major comments (1)
- [Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.
minor comments (2)
- [Abstract] The abstract and evaluation summary omit baseline system names, exact hardware configurations, error bars, and workload definitions (e.g., trajectory lengths or AgentBench task subsets), which would strengthen verifiability of the accuracy and speedup numbers.
- [System Design] Notation for the dataflow abstractions (e.g., how policies and rollouts are represented in the dataflow graph) could be clarified with a small diagram or pseudocode example in the system-design section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.
Point-by-point responses
- Referee: [Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.
  Authors: We agree that isolated measurements of inter-component messaging, synchronization, and data-movement costs would provide stronger support for the claim of negligible overhead from the decoupled architecture. The 2.7x speedup reported in the multi-policy collaborative training results is an end-to-end figure that already incorporates all such costs across the evaluated workloads, including variable-length trajectories and cross-region execution. To directly address this point, we will add component-level profiling results (e.g., per-component latency breakdowns and data-transfer overheads) to the revised evaluation section.
  Revision: yes
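A per-component latency breakdown of the kind promised in the rebuttal could be collected with something as small as the sketch below. It is an illustrative assumption, not the authors' instrumentation: `ComponentProfiler`, `span`, and the component labels are hypothetical names, and the injectable `clock` parameter exists only so the example can be checked deterministically.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentProfiler:
    """Accumulate wall-clock time per named component."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def span(self, component):
        """Time the enclosed block and add it to the component's total."""
        start = self._clock()
        try:
            yield
        finally:
            self.totals[component] += self._clock() - start

    def breakdown(self):
        """Return each component's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Deterministic demo using a fake clock instead of real time.
ticks = iter([0.0, 3.0, 3.0, 4.0])
profiler = ComponentProfiler(clock=lambda: next(ticks))
with profiler.span("rollout"):
    pass  # stand-in for generation work
with profiler.span("data_transfer"):
    pass  # stand-in for moving trajectories between components
shares = profiler.breakdown()
```

In this fake-clock run, rollout accounts for 75% of measured time and data transfer for 25%. Reporting shares like these alongside the end-to-end figure would show directly how much of the runtime goes to coordination between the decoupled components.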
Circularity Check
No circularity: claims rest on architecture description and empirical benchmarks
Full rationale
The paper introduces AstraFlow as a dataflow-oriented system that decouples rollout services, dataflow management, and training into autonomous components to support multi-policy agentic RL workloads on elastic and heterogeneous resources. All load-bearing claims (comparable accuracy, 2.7x speedup, support for multi-policy training/elastic scaling/cross-region execution without code changes) are justified by the proposed component abstractions plus direct experimental results on math, code, search, and AgentBench workloads. No equations, fitted parameters, predictions, or uniqueness theorems appear; the evaluation is externally falsifiable via reported benchmarks and does not reduce to self-citation chains or definitional loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existing LLM RL systems require dedicated system engineering for each new extension due to trainer-centered control architectures.
invented entities (1)
- Autonomous rollout services, dataflow management, and training components (no independent evidence)