AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Beidi Chen; Haizhong Zheng; Ion Stoica; Jiahui Wang; Jiawei Zhao; Shuowei Jin; Xueshen Liu; Yizhuo Di; Yongji Wu; Z. Morley Mao

Review: 1 major objection, 2 minor, 54 references

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

AstraFlow decouples rollout, dataflow management, and training into autonomous components to handle complex agentic RL workloads on elastic heterogeneous resources without custom engineering.

2026-05-20 20:05 UTC pith:F6Z2Z74B

Load-bearing objection: AstraFlow decouples rollout, dataflow management, and training into autonomous components to support complex agentic RL without per-feature engineering, but the 2.7x speedup claim needs checks on unmeasured coordination costs.

arxiv 2605.15565 v1 pith:F6Z2Z74B submitted 2026-05-15 cs.LG cs.AI

keywords: reinforcement learning, large language models, dataflow architecture, agentic systems, multi-policy training, distributed RL, elastic scaling, heterogeneous computing

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AstraFlow, a reinforcement learning system for large language models used as agents. It claims that replacing trainer-centered control with a dataflow-oriented design, where rollout services, dataflow management, and training operate as independent components, allows the same infrastructure to support multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms. This matters to readers because scaling RL for agentic LLMs is currently expensive and requires repeated system engineering for each new capability. If the approach works, developers could train more capable models faster by reusing one flexible system across varied compute setups and workloads.

Core claim

AstraFlow replaces conventional trainer-centered control with principled component abstractions. In this system, rollout services, dataflow management, and training are decoupled into autonomous components. This enables native support for complex multi-policy agentic RL workloads and efficient exploitation of diverse compute resources. Evaluations across math, code, search, and AgentBench workloads show that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, it achieves comparable or better accuracy than existing RL systems while using

What carries the argument

Autonomous component decoupling in a dataflow-oriented architecture, which separates rollout services, dataflow management, and training to enable flexible workload handling.
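To make the decoupling idea concrete, here is a minimal Python sketch in which three stages communicate only through queues, so no stage drives another. It is an illustration under stated assumptions, not AstraFlow's implementation: the names `Trajectory`, `rollout_service`, `dataflow_manager`, `trainer`, and `run_pipeline`, the toy reward rule, and the use of threads are all hypothetical.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str  # hypothetical: which policy produced the rollout
    tokens: int     # trajectory length (variable in agentic workloads)
    reward: float


STOP = None  # sentinel marking the end of a stream


def rollout_service(out_q, n):
    """Produces trajectories without waiting on the trainer."""
    for i in range(n):
        out_q.put(Trajectory(f"policy-{i % 2}", 10 + i, float(i % 3)))
    out_q.put(STOP)


def dataflow_manager(in_q, out_q, batch_size):
    """Filters and batches trajectories; knows nothing about the model."""
    batch = []
    while (item := in_q.get()) is not STOP:
        if item.reward > 0:  # toy data rule, swappable without touching other stages
            batch.append(item)
        if len(batch) == batch_size:
            out_q.put(batch)
            batch = []
    if batch:  # flush a final partial batch
        out_q.put(batch)
    out_q.put(STOP)


def trainer(in_q, log):
    """Consumes ready batches; never controls the rollout loop."""
    while (batch := in_q.get()) is not STOP:
        log.append(len(batch))  # stand-in for a gradient step


def run_pipeline(n=12, batch_size=4):
    raw_q, batch_q, log = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=rollout_service, args=(raw_q, n)),
        threading.Thread(target=dataflow_manager, args=(raw_q, batch_q, batch_size)),
        threading.Thread(target=trainer, args=(batch_q, log)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log  # sizes of the batches the trainer consumed
```

In this toy setup, changing the filtering rule or adding a second rollout producer touches only one function, which is the kind of isolation the paper attributes to its component abstractions. Whether the real system keeps queue and transfer costs low is exactly what the review goes on to question.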

Load-bearing premise

Decoupling the components into autonomous units creates no significant coordination overhead or integration issues that would undermine the speedups and workload support in large-scale production deployments.

What would settle it

A direct comparison of end-to-end training time for multi-policy collaborative RL on a large heterogeneous cross-region cluster, checking whether coordination costs between the decoupled components reduce the speedup below 2.7x or introduce accuracy loss.
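As a rough aid for reading such a comparison, the sketch below models how additional coordination cost would erode a measured speedup. It rests on a simplifying assumption that is not from the paper: extra overhead is expressed as a fraction of the decoupled system's measured runtime, so the new speedup is the measured one divided by one plus that fraction. The function names are hypothetical.

```python
def effective_speedup(measured_speedup: float, extra_overhead: float) -> float:
    """Speedup remaining after coordination adds `extra_overhead`, given as a
    fraction of the decoupled system's measured runtime (toy model)."""
    if measured_speedup <= 0 or extra_overhead < 0:
        raise ValueError("speedup must be positive and overhead non-negative")
    return measured_speedup / (1.0 + extra_overhead)


def break_even_overhead(measured_speedup: float) -> float:
    """Extra overhead fraction at which the speedup falls to 1x."""
    return measured_speedup - 1.0
```

Under this model, a 2.7x speedup would fall to about 2.0x if coordination at larger scale added 35% to the runtime, and the advantage would vanish only if it added about 170%. The numbers are illustrative arithmetic, not measurements.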

If this is right

Multi-policy collaborative training runs on the same system with comparable or better accuracy.
Elastic scaling and heterogeneous cross-region execution are supported without any system code modifications.
Composable data algorithms can be added directly for different RL workloads.
Overall training time for multi-policy setups improves by a factor of 2.7 while preserving model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decoupling pattern may reduce engineering effort when adapting RL systems to new hardware types or training strategies.
Wider adoption could lower the overall compute cost of developing agentic language models by improving resource sharing across regions.
Similar component separation might simplify scaling for other distributed training paradigms beyond reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

1 major / 2 minor

Summary. The paper introduces AstraFlow, a dataflow-oriented RL system for agentic LLMs. It replaces trainer-centered control with decoupled autonomous components for rollout services, dataflow management, and training. This design is claimed to natively support multi-policy collaborative training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. Evaluations across math, code, search, and AgentBench workloads report that AstraFlow achieves comparable or better accuracy than existing RL systems while delivering a 2.7x speedup in multi-policy training time.

Significance. If the speedup and flexibility claims hold under production-scale conditions, the work would be significant for scaling RL to agentic LLMs by reducing custom engineering overhead through principled component abstractions. The dataflow orientation could enable better exploitation of elastic and heterogeneous resources, addressing a practical bottleneck in current systems.

major comments (1)

[Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.

minor comments (2)

[Abstract] The abstract and evaluation summary omit baseline system names, exact hardware configurations, error bars, and workload definitions (e.g., trajectory lengths or AgentBench task subsets), which would strengthen verifiability of the accuracy and speedup numbers.
[System Design] Notation for the dataflow abstractions (e.g., how policies and rollouts are represented in the dataflow graph) could be clarified with a small diagram or pseudocode example in the system-design section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.

Referee: [Evaluation] Evaluation section (multi-policy collaborative training results): the 2.7x training-time speedup is presented as evidence that autonomous decoupling of rollout services, dataflow management, and training introduces negligible overhead. However, no isolated measurements of inter-component messaging, synchronization, or data-movement costs are reported, despite the workloads involving variable-length trajectories and cross-region execution. This measurement gap is load-bearing for the central claim that the architecture supports the listed capabilities without performance penalties.

Authors: We agree that isolated measurements of inter-component messaging, synchronization, and data-movement costs would provide stronger support for the claim of negligible overhead from the decoupled architecture. The 2.7x speedup reported in the multi-policy collaborative training results is an end-to-end figure that already incorporates all such costs across the evaluated workloads, including variable-length trajectories and cross-region execution. To directly address this point, we will add component-level profiling results (e.g., per-component latency breakdowns and data-transfer overheads) to the revised evaluation section.

revision: yes
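A per-component latency breakdown of the kind discussed in this exchange could be collected with a small timing helper. The sketch below is one possible shape, not the authors' tooling: `ComponentProfiler`, its `span` and `breakdown` methods, and the component labels are hypothetical, and the clock is injectable so the helper can be checked deterministically.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentProfiler:
    """Accumulates wall-clock time per named component (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def span(self, component):
        # Time the enclosed block and add it to the component's running total.
        start = self._clock()
        try:
            yield
        finally:
            self.totals[component] += self._clock() - start

    def breakdown(self):
        # Fraction of total recorded time spent in each component.
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Wrapping each stage in a span, for example `with profiler.span("data_transfer"): ...`, yields the share of time spent on messaging and data movement relative to rollout and training, which is the isolated measurement the referee asks for.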

Circularity Check

0 steps flagged

No circularity: claims rest on architecture description and empirical benchmarks

The paper introduces AstraFlow as a dataflow-oriented system that decouples rollout services, dataflow management, and training into autonomous components to support multi-policy agentic RL workloads on elastic and heterogeneous resources. All load-bearing claims (comparable accuracy, 2.7x speedup, support for multi-policy training/elastic scaling/cross-region execution without code changes) are justified by the proposed component abstractions plus direct experimental results on math, code, search, and AgentBench workloads. No equations, fitted parameters, predictions, or uniqueness theorems appear; the evaluation is externally falsifiable via reported benchmarks and does not reduce to self-citation chains or definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the stated limitation of existing trainer-centered systems and the introduction of new autonomous component abstractions; no numerical free parameters are described in the abstract.

axioms (1)

domain assumption: Existing LLM RL systems require dedicated system engineering for each new extension due to trainer-centered control architectures.
This premise is invoked in the abstract as the core motivation and limitation being addressed.

invented entities (1)

Autonomous rollout services, dataflow management, and training components (no independent evidence)
purpose: To decouple the RL pipeline and enable native support for complex multi-policy and heterogeneous workloads
These are the principal new abstractions introduced by the AstraFlow design.

pith-pipeline@v0.9.0 · 5784 in / 1491 out tokens · 108159 ms · 2026-05-20T20:05:31.533141+00:00

Original abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.

[1] [1]

Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E

Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent.arXiv preprint arXiv:2511.16108, 2025

work page arXiv 2025

[2] [2]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, and Bo An. Dr. mas: Stable reinforcement learning for multi-agent llm systems.arXiv preprint arXiv:2602.08847, 2026

work page arXiv 2026

[4] [4]

Frontier rl is cheaper than you think.https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think , March 2026

Fireworks AI. Frontier rl is cheaper than you think.https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think , March 2026. Fireworks.ai Blog

work page 2026

[5]

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025

[6]

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025

[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[8]

LLM Multi-Agent Systems: Challenges and Open Problems

Shanshan Han, Qifan Zhang, Weizhao Jin, and Zhaozhuo Xu. Llm multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578, 2024

[9]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv preprint arXiv:2507.01663, 2025

[10]

History rhymes: Accelerating llm reinforcement learning with rhymerl

Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating llm reinforcement learning with rhymerl. arXiv preprint arXiv:2508.18588, 2025

[11]

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

Yongjun He, Shuai Zhang, Jiading Gai, Xiyuan Zhang, Boran Han, Bernie Wang, Huzefa Rangwala, and George Karypis. Hetrl: Efficient reinforcement learning for llms in heterogeneous environments. arXiv preprint arXiv:2512.12476, 2025

[12]

Art: Agent reinforcement trainer

Brad Hilton, Kyle Corbitt, David Corbitt, Saumya Gandhi, Angky William, Bohdan Kovalevskyi, and Andie Jones. Art: Agent reinforcement trainer. https://github.com/openpipe/art, 2025

[13]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 6, 2024

[14]

Prime-rl

Prime Intellect. Prime-rl, 2025. URL https://github.com/PrimeIntellect-ai/prime-rl

[15]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

[16]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[17]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

[18]

Computerrl: Scaling end-to-end online reinforcement learning for computer use agents

Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents. arXiv preprint arXiv:2508.14040, 2025

[19]

Deepcoder: A fully open-source 14b coder at o3-mini level

Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, et al. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025

[20]

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca30...

[21]

Realhf: Optimized rlhf training for large language models through parameter reallocation

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Realhf: Optimized rlhf training for large language models through parameter reallocation. arXiv preprint arXiv:2406.14088, 2024

[22]

Real: Efficient rlhf training of large language models with parameter reallocation

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. mlsys.org, 2025

[23]

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Erfan Miahi and Eugene Belilovsky. Understanding and exploiting weight update sparsity for communication-efficient distributed rl. arXiv preprint arXiv:2602.03839, 2026

[24]

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models. arXiv preprint arXiv:2410.18252, 2024

[25]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[26]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[27]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[28]

Nemo-aligner: Scalable toolkit for efficient model alignment

Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, et al. Nemo-aligner: Scalable toolkit for efficient model alignment. arXiv preprint arXiv:2405.01481, 2024

[29]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

[30]

Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[31]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

[32]

Marti-mars2: Scaling multi-agent self-search via reinforcement learning for code generation

Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, et al. Marti-mars2: Scaling multi-agent self-search via reinforcement learning for code generation. arXiv preprint arXiv:2602.07848, 2026

[33]

Rlanything: Forge environment, policy, and reward model in completely dynamic rl system

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. Rlanything: Forge environment, policy, and reward model in completely dynamic rl system. arXiv preprint arXiv:2602.02488, 2026

[34]

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025

[35]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024

[36]

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z Morley Mao, Arvind Krishnamurthy, and Ion Stoica. Rlboost: Harvesting preemptible resources for cost-efficient reinforcement learning on llms. arXiv preprint arXiv:2510.19225, 2025

[37]

Less: selecting influential data for targeted instruction tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: selecting influential data for targeted instruction tuning. In Proceedings of the 41st International Conference on Machine Learning, pages 54104–54132, 2024

[38]

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Yixuan Even Xu, Yash Savani, Fei Fang, and Zico Kolter. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818, 2025

[39]

Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus

Ran Yan, Youhe Jiang, Tianyuan Wu, Jiaxuan Gao, Zhiyu Mei, Wei Fu, Haohui Mai, Wei Wang, Yi Wu, and Binhang Yuan. Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796, 2025

[40]

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

[41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[42]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[43]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025

[44]

Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework

Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, Rui Lu, Hongning Wang, Jie Tang, and Yuxiao Dong. Agentrl: Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025. URL https://arxiv.org/abs/2510.04206

[45]

Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger-mas: Multi-agent reinforcement learning for collaborative llms. arXiv preprint arXiv:2510.11062, 2025

[46]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

[47]

Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts

Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025

[48]

Prosperity before collapse: How far can off-policy rl reach with stale data on llms?

Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy rl reach with stale data on llms? In ICLR, 2026

[49]

Deepresearcher: Scaling deep research via reinforcement learning in real-world environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, 2025

[50]

Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation

Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025

[51]

Optimizing RLHF training for large language models with stage fusion

Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025

[52]

slime: An llm post-training framework for rl scaling

Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An llm post-training framework for rl scaling. https://github.com/THUDM/slime, 2025. GitHub repository.

Appendix

Limitations. AstraFlow focuses on system abstractions for agentic RL workloads rather than proposing a new RL optimization algorithm. Our evaluation covers representative math, ...
