
arxiv: 2605.15565 · v1 · pith:F6Z2Z74Bnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Pith reviewed 2026-05-20 20:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · large language models · dataflow architecture · agentic systems · multi-policy training · distributed RL · elastic scaling · heterogeneous computing

The pith

AstraFlow decouples rollout, dataflow management, and training into autonomous components to handle complex agentic RL workloads on elastic heterogeneous resources without custom engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AstraFlow, a reinforcement learning system for large language models used as agents. It claims that replacing trainer-centered control with a dataflow-oriented design, where rollout services, dataflow management, and training operate as independent components, allows the same infrastructure to support multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms. This matters to readers because scaling RL for agentic LLMs is currently expensive and requires repeated system engineering for each new capability. If the approach works, developers could train more capable models faster by reusing one flexible system across varied compute setups and workloads.

Core claim

AstraFlow replaces conventional trainer-centered control with principled component abstractions. In this system, rollout services, dataflow management, and training are decoupled into autonomous components. This enables native support for complex multi-policy agentic RL workloads and efficient exploitation of diverse compute resources. Evaluations across math, code, search, and AgentBench workloads show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, it achieves comparable or better accuracy than existing RL systems while using

What carries the argument

Autonomous component decoupling in a dataflow-oriented architecture, which separates rollout services, dataflow management, and training to enable flexible workload handling.
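
The decoupling pattern can be pictured with a small sketch. This is not AstraFlow's API: the function names (`rollout_service`, `dataflow_manager`, `trainer`), the policy names, and the use of in-process queues and threads are all assumptions made for illustration. It only shows the general idea that rollout producers, a routing layer, and per-policy trainers can run independently and communicate through data streams rather than through a central trainer loop.

```python
import queue
import threading


def rollout_service(policy_id, num_trajectories, out_q):
    """Hypothetical rollout service: emits trajectories, unaware of any trainer."""
    for step in range(num_trajectories):
        out_q.put({"policy": policy_id, "step": step, "reward": float(step % 2)})
    out_q.put(None)  # sentinel: this producer is finished


def dataflow_manager(in_q, out_qs, num_producers):
    """Hypothetical dataflow manager: routes each trajectory to its policy's stream."""
    finished = 0
    while finished < num_producers:
        item = in_q.get()
        if item is None:
            finished += 1
            continue
        out_qs[item["policy"]].put(item)
    for q in out_qs.values():
        q.put(None)  # tell every trainer that its stream has ended


def trainer(policy_id, in_q, results):
    """Hypothetical trainer: consumes its own stream; here it only tallies rewards."""
    count, total = 0, 0.0
    while True:
        item = in_q.get()
        if item is None:
            break
        count += 1
        total += item["reward"]
    results[policy_id] = (count, total)


def run(policies=("planner", "solver"), num_trajectories=4):
    rollout_q = queue.Queue()
    train_qs = {p: queue.Queue() for p in policies}
    results = {}
    threads = [
        threading.Thread(target=rollout_service, args=(p, num_trajectories, rollout_q))
        for p in policies
    ]
    threads.append(
        threading.Thread(target=dataflow_manager, args=(rollout_q, train_qs, len(policies)))
    )
    threads += [
        threading.Thread(target=trainer, args=(p, train_qs[p], results)) for p in policies
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this toy version, adding a third policy means passing one more name to `run`; none of the three component functions changes. Whether the real system achieves the same property at scale is exactly what the paper's evaluation has to show.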

If this is right

  • Multi-policy collaborative training runs on the same system with comparable or better accuracy.
  • Elastic scaling and heterogeneous cross-region execution are supported without any system code modifications.
  • Composable data algorithms can be added directly for different RL workloads.
  • Overall training time for multi-policy setups improves by a factor of 2.7 while preserving model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling pattern may reduce engineering effort when adapting RL systems to new hardware types or training strategies.
  • Wider adoption could lower the overall compute cost of developing agentic language models by improving resource sharing across regions.
  • Similar component separation might simplify scaling for other distributed training paradigms beyond reinforcement learning.

Load-bearing premise

Decoupling the components into autonomous units creates no significant coordination overhead or integration issues that would undermine the speedups and workload support in large-scale production deployments.

What would settle it

A direct comparison of end-to-end training time for multi-policy collaborative RL on a large heterogeneous cross-region cluster, checking whether coordination costs between the decoupled components reduce the speedup below 2.7x or introduce accuracy loss.
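
The threshold in this test can be made explicit with a short derivation. The symbols are assumptions introduced here, not notation from the paper: $T_{\text{base}}$ is the baseline system's end-to-end training time, $T_{\text{c}}$ is the time the decoupled system spends on rollout and training work, and $T_{\text{o}}$ is its added coordination time (messaging, synchronization, data movement), treated as not overlapped with compute.

```latex
\begin{aligned}
S_{\text{obs}} &= \frac{T_{\text{base}}}{T_{\text{c}} + T_{\text{o}}}
               = \frac{S_{\text{ideal}}}{1 + \rho},
\qquad S_{\text{ideal}} = \frac{T_{\text{base}}}{T_{\text{c}}},
\quad \rho = \frac{T_{\text{o}}}{T_{\text{c}}} \\
S_{\text{obs}} \ge 2.7 &\iff \rho \le \frac{S_{\text{ideal}}}{2.7} - 1
\end{aligned}
```

For a purely hypothetical $S_{\text{ideal}} = 3.0$, the bound gives $\rho \le 3.0/2.7 - 1 \approx 0.11$: coordination could consume at most about 11% of compute time before the observed speedup falls below 2.7x. If coordination overlaps with compute, this simple model overstates its cost.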

Original abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces AstraFlow, a dataflow-oriented RL system for agentic LLMs. It replaces trainer-centered control with decoupled autonomous components for rollout services, dataflow management, and training. This design is claimed to natively support multi-policy collaborative training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. Evaluations across math, code, search, and AgentBench workloads report that AstraFlow achieves comparable or better accuracy than existing RL systems while delivering a 2.7x speedup in multi-policy training time.

Significance. If the speedup and flexibility claims hold under production-scale conditions, the work would be significant for scaling RL to agentic LLMs by reducing custom engineering overhead through principled component abstractions. The dataflow orientation could enable better exploitation of elastic and heterogeneous resources, addressing a practical bottleneck in current systems.

major comments (1)
  1. [Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.
minor comments (2)
  1. [Abstract] The abstract and evaluation summary omit baseline system names, exact hardware configurations, error bars, and workload definitions (e.g., trajectory lengths or AgentBench task subsets), which would strengthen verifiability of the accuracy and speedup numbers.
  2. [System Design] Notation for the dataflow abstractions (e.g., how policies and rollouts are represented in the dataflow graph) could be clarified with a small diagram or pseudocode example in the system-design section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.

    Authors: We agree that isolated measurements of inter-component messaging, synchronization, and data-movement costs would provide stronger support for the claim of negligible overhead from the decoupled architecture. The 2.7x speedup reported in the multi-policy collaborative training results is an end-to-end figure that already incorporates all such costs across the evaluated workloads, including variable-length trajectories and cross-region execution. To directly address this point, we will add component-level profiling results (e.g., per-component latency breakdowns and data-transfer overheads) to the revised evaluation section.

    revision: yes
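
As a rough illustration of what a per-component latency breakdown might involve, the sketch below accumulates wall-clock time per named section and reports each section's share of the total. The class name `ComponentProfiler`, its methods, and the section labels are hypothetical; nothing here comes from the paper or from AstraFlow's code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentProfiler:
    """Hypothetical helper: accumulates elapsed time per named component."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the profiler can be tested deterministically
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped code raises.
            self.totals[name] += self._clock() - start

    def breakdown(self):
        """Return each component's fraction of the total recorded time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

A caller would wrap each stage, for example `with profiler.section("data_transfer"): ...`, and compare the coordination-related fractions against the compute-related ones. A real cross-region measurement would also need clock synchronization and handling for overlapped work, which this sketch ignores.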

Circularity Check

0 steps flagged

No circularity: claims rest on architecture description and empirical benchmarks

Full rationale

The paper introduces AstraFlow as a dataflow-oriented system that decouples rollout services, dataflow management, and training into autonomous components to support multi-policy agentic RL workloads on elastic and heterogeneous resources. All load-bearing claims (comparable accuracy, 2.7x speedup, support for multi-policy training/elastic scaling/cross-region execution without code changes) are justified by the proposed component abstractions plus direct experimental results on math, code, search, and AgentBench workloads. No equations, fitted parameters, predictions, or uniqueness theorems appear; the evaluation is externally falsifiable via reported benchmarks and does not reduce to self-citation chains or definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the stated limitation of existing trainer-centered systems and the introduction of new autonomous component abstractions; no numerical free parameters are described in the abstract.

axioms (1)
  • [domain assumption] Existing LLM RL systems require dedicated system engineering for each new extension due to trainer-centered control architectures
    This premise is invoked in the abstract as the core motivation and limitation being addressed.
invented entities (1)
  • Autonomous rollout services, dataflow management, and training components [no independent evidence]
    purpose: To decouple the RL pipeline and enable native support for complex multi-policy and heterogeneous workloads
    These are the principal new abstractions introduced by the AstraFlow design.

pith-pipeline@v0.9.0 · 5784 in / 1491 out tokens · 108159 ms · 2026-05-20T20:05:31.533141+00:00 · methodology

