pith. machine review for the scientific record.

arxiv: 2605.02913 · v1 · submitted 2026-04-08 · 💻 cs.LG

Recognition: 2 Lean theorem links

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords rollout strategies · LLM reinforcement learning · GFCR taxonomy · post-training · reasoning models · trajectory sampling · adaptive compute · self-improvement

The pith

Rollout strategies for LLM reinforcement learning decompose into four modular stages: Generate, Filter, Control, and Replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines rollout pipelines in reinforcement learning for post-training large language models to improve reasoning. It introduces the GFCR taxonomy to decompose these pipelines into four stages where Generate proposes trajectories, Filter builds signals from verifiers, Control manages compute budgets, and Replay reuses artifacts without updating weights. A sympathetic reader would care because rollout design determines the data the optimizer sees yet remains underreported, so a shared structure can improve reproducibility, efficiency, and diagnosis of training problems across math, code, and agent tasks. The paper synthesizes methods from verifiable rewards to adaptive allocation and supplies case studies plus a diagnostic index.
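
To make the four-stage decomposition concrete, here is a minimal Python sketch of one GFCR iteration. Everything in it (the Rollout dataclass, the stub sampler and verifier, the function signatures) is illustrative scaffolding, not an API the paper defines.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    prompt: str
    trajectory: list                              # reasoning steps, tool calls, etc.
    signals: dict = field(default_factory=dict)   # filled in by the Filter stage


def sample_trajectory(prompt):
    # Stand-in for autoregressive sampling from the current policy.
    return Rollout(prompt, trajectory=[f"step-{i}" for i in range(3)])


def generate(prompt, group_size):
    """Generate: propose a group of candidate trajectories for one prompt."""
    return [sample_trajectory(prompt) for _ in range(group_size)]


def filter_stage(rollouts):
    """Filter: map each rollout to intermediate signals (verifier/judge score)."""
    for r in rollouts:
        r.signals["reward"] = random.random()  # stub verifier
    return rollouts


def control(rollouts, budget):
    """Control: keep the most useful rollouts under the compute budget."""
    ranked = sorted(rollouts, key=lambda r: r.signals["reward"], reverse=True)
    return ranked[:budget]


def replay(store, rollouts):
    """Replay: retain artifacts for reuse across iterations, no weight update."""
    store.extend(rollouts)
    return store


store = []
kept = control(filter_stage(generate("prove that ...", group_size=8)), budget=2)
store = replay(store, kept)
```

The point of the sketch is the modularity claim itself: each stage can be swapped independently, which is exactly the structure the survey argues the literature already follows implicitly.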

Core claim

The paper claims that rollout pipelines can be formalized with unified notation and decomposed via the Generate-Filter-Control-Replay lifecycle taxonomy into four modular stages, paired with a criterion taxonomy of reliability, coverage, and cost sensitivity, enabling synthesis of existing techniques and identification of pathologies in RL-based LLM post-training.

What carries the argument

The GFCR lifecycle taxonomy that decomposes rollout pipelines into Generate (candidate trajectories), Filter (intermediate signals), Control (compute allocation and decisions), and Replay (artifact retention and reuse) stages.

If this is right

  • Existing methods from verifiable-reward RL, process supervision, judge gating, and tree rollouts can be organized under the same four-stage structure.
  • A diagnostic index directly links common rollout failures such as poor coverage or high cost to specific GFCR modules (see the sketch after this list).
  • Self-evolving curricula arise naturally when Replay retains and reuses generated tasks across iterations.
  • Case studies in math, code, multimodal, and agent benchmarks show how the taxonomy supports skill induction and cross-task transfer.
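
A hedged sketch of what such a diagnostic index could look like in code: a lookup from pathology to the GFCR module to inspect and a mitigation lever. The specific entries and lever wordings are invented for illustration; the paper's actual index is not reproduced here.

```python
# Hypothetical entries; the paper's actual diagnostic index is not reproduced.
DIAGNOSTIC_INDEX = {
    "poor coverage":       ("Generate", "raise sampling temperature or widen branching"),
    "noisy reward signal": ("Filter",   "prefer deterministic verifiers over judges"),
    "budget overruns":     ("Control",  "enable early exit and prompt prioritization"),
    "stale curriculum":    ("Replay",   "refresh retained tasks by difficulty"),
}


def diagnose(pathology: str) -> str:
    """Map an observed rollout pathology to the GFCR module to inspect."""
    module, lever = DIAGNOSTIC_INDEX[pathology]
    return f"{pathology}: inspect {module}; candidate lever: {lever}"


print(diagnose("poor coverage"))
```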

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-stage breakdown could help organize rollout design in reinforcement learning outside language models.
  • Stronger Control mechanisms might allow more aggressive early stopping to cut post-training compute (see the sketch after this list).
  • Widespread adoption of Replay for curricula could shift training toward more autonomous task generation.
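
A minimal sketch of the kind of Control-side early stopping this extension imagines: stop sampling more rollouts for a prompt once the score estimate stabilizes. The convergence test, the thresholds, and the stub scorer are all assumptions, not mechanisms the paper specifies.

```python
import random


def score_one_rollout():
    # Stand-in for sampling one rollout and scoring it with a verifier.
    return random.gauss(0.7, 0.02)


def early_exit(scores, min_samples=4, eps=0.05):
    """Stop once the last few scores agree to within eps."""
    if len(scores) < min_samples:
        return False
    recent = scores[-min_samples:]
    return max(recent) - min(recent) < eps


scores = []
for _ in range(16):              # nominal per-prompt rollout budget
    scores.append(score_one_rollout())
    if early_exit(scores):
        break                    # reallocate the saved budget elsewhere
```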

Load-bearing premise

The assumption that rollout strategies across the literature can be uniformly mapped to the GFCR stages without oversimplifying interactions between stages.

What would settle it

An examination of published rollout methods that identifies multiple strategies which resist clean mapping into the four GFCR stages without substantial loss of their original design details.

Figures

Figures reproduced from arXiv: 2605.02913 by Bowen Jin, Chuhan Wang, Difan Jiao, Gagan Mundada, Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Kuan-Hao Huang, Nikki Kuang, Raghav Jain, Rohan Surana, Ryan Rossi, Sheldon Yu, Sizhe Zhou, Tong Yu, Xintong Li, Xunyi Jiang, Yuxin Xiong, Zhendong Chu, Zhenwei Tang, Zihan Huang.

Figure 1. An overview of the rollout lifecycle and the Generate–Filter–Control–Replay (GFCR) decomposition.
Figure 2. GFCR as an end-to-end rollout system. Given prompts x ∼ D and a compute budget B, Generate samples a rollout group T(x); Filter maps each rollout to intermediate signals ϕ and training-facing supervision S; Control uses costs and signals to adapt continuation/pruning/resampling and decide what to store; and Replay retrieves/stores artifacts that condition future generation. The overall objective is to max…
Figure 3. Rollout criterion taxonomy (at a glance). We evaluate rollout strategies along three cross-cutting desiderata: reliability (trustworthy signals via verifiers or robust judges), coverage & informativeness (diverse candidates and disagreement/uncertainty), and cost sensitivity (value under compute budgets via value-per-cost and early stopping).
Figure 4. Generate module design space. Rollout proposal mechanisms can be organized along three axes: (left) topology & interaction (single, group, tree/graph, and multi-turn/tool rollouts), (middle) guidance/scaffolding z (ICL/rubrics, plans, reflection, adaptive guidance, tool augmentation), and (right) sampling & exploration configuration κG (decoding knobs, uncertainty-aware allocation, partial rollouts with …
Figure 5. Filter module design space. Filter maps sampled rollouts (or rollout prefixes) into intermediate signals and optimizer-facing supervision. It can be organized along three components: (left) gating & verification (structural validity, schema/executability checks, deterministic correctness tests), (middle) process & comparative scoring (dense step-level process signals and relative judgments via ranking/jud…
Figure 6. Control module design space. The control layer of rollout pipelines can be organized into four orthogonal components: (left) Instance & Budget Selection (prompt prioritization by difficulty/novelty/uncertainty; adaptive allocation of rollout slots, group size, depth/width, and token budgets); (middle) Rollout Dynamics Control (decoding configuration, partial rollouts with early exit, and branching–pruning…
Figure 7. This figure summarizes the Replay layer of rollout pipelines, which governs how past rollouts …
Original abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
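
Figure 2's caption truncates the formal objective, so the following LaTeX is only a hedged reconstruction of its apparent shape from the symbols the caption does show (x ∼ D, budget B, rollout group T(x), signals ϕ, supervision S); U and c are placeholders for an unspecified training utility and per-rollout cost.

```latex
% Hedged reconstruction, not the paper's exact formula: maximize the
% expected training utility of the supervision produced by the Filter
% stage, subject to an expected compute budget.
\max_{\mathcal{T}}\;
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[\, U\big(S(\phi(\mathcal{T}(x)))\big) \,\Big]
\quad \text{subject to} \quad
\mathbb{E}_{x \sim \mathcal{D}}\big[\, c(\mathcal{T}(x)) \,\big] \le B
```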

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a comprehensive survey of rollout strategies for RL-based post-training of reasoning LLMs. It introduces unified notation for rollout pipelines and the Generate-Filter-Control-Replay (GFCR) lifecycle taxonomy that decomposes them into four modular stages—Generate (proposing trajectories/topologies), Filter (building signals via verifiers/judges/critics), Control (compute allocation and branching/stopping decisions), and Replay (retaining artifacts for reuse and self-evolving curricula)—plus a reliability-coverage-cost criterion taxonomy. The work synthesizes literature on verifiable-reward RL, process supervision, tree/segment rollouts, adaptive compute, and self-improvement; grounds the framework in case studies across math, code/SQL, multimodal, tool-use, and agentic benchmarks; and supplies a diagnostic index mapping pathologies to GFCR modules along with open challenges.

Significance. If the GFCR decomposition holds without substantial oversimplification, the survey would provide a useful synthesis tool for the field, enabling clearer comparison of rollout designs, identification of trade-offs, and guidance toward reproducible and compute-efficient pipelines. The case studies and diagnostic index add concrete value for practitioners working on skill induction and cross-task transfer in LLM RL.

major comments (1)
  1. [GFCR taxonomy and synthesis of methods] The central claim that rollout strategies across the literature map cleanly to the four GFCR stages without losing essential cross-stage dependencies is load-bearing but insufficiently substantiated. The synthesis of tree/segment rollouts and adaptive compute (mentioned in the abstract) does not explicitly discuss hybrids in which filtering (e.g., joint critic evaluation) and control (e.g., branching or early stopping) are computed together inside the same search step, as occurs in many guided tree-search methods. This leaves the modularity claim vulnerable to post-hoc categorization rather than predictive structure.
minor comments (2)
  1. The unified notation for rollout pipelines would be easier to follow if accompanied by a compact summary table of symbols and their definitions early in the manuscript.
  2. A few citations to recent tree-search and self-improvement works appear to be missing from the synthesis section; adding them would strengthen coverage of the surveyed literature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback on the GFCR taxonomy's handling of modularity and cross-stage dependencies is well-taken, and we address it directly below with a commitment to targeted revisions.

Point-by-point responses
  1. Referee: The central claim that rollout strategies across the literature map cleanly to the four GFCR stages without losing essential cross-stage dependencies is load-bearing but insufficiently substantiated. The synthesis of tree/segment rollouts and adaptive compute (mentioned in the abstract) does not explicitly discuss hybrids in which filtering (e.g., joint critic evaluation) and control (e.g., branching or early stopping) are computed together inside the same search step, as occurs in many guided tree-search methods. This leaves the modularity claim vulnerable to post-hoc categorization rather than predictive structure.

    Authors: We acknowledge that the current synthesis section on tree/segment rollouts and adaptive compute does not provide explicit examples of tightly interleaved hybrids, where Filter-stage signals (such as joint critic evaluation) and Control-stage decisions (such as branching or early stopping) are computed within the same search step. While the GFCR framework is designed to accommodate such interleaving by treating Control as the locus for integrating Filter outputs under compute budgets, the manuscript would be strengthened by concrete illustrations showing that this decomposition remains predictive rather than purely post-hoc. In the revised version, we will expand the relevant synthesis subsection to include a dedicated discussion of guided tree-search methods, clarifying how joint computation is modeled as a Control operation that consumes Filter signals without collapsing the stages. This addition will reference representative works on hybrid search and demonstrate that the taxonomy preserves essential dependencies by design.

    Revision: yes
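
To see why the rebuttal's framing is at least coherent, here is a hedged sketch of one guided tree-search expansion step in which the critic score (a Filter signal) and the prune/branch decision (a Control action) execute back-to-back inside the same step, yet remain separable on paper. All names and thresholds are illustrative, not taken from the paper.

```python
import random


def propose_children(node, k=4):
    # Stand-in for sampling k continuations of a partial trajectory.
    return [node + [f"step-{len(node)}-{i}"] for i in range(k)]


def critic(child):
    # Stand-in for a learned critic / process reward model.
    return random.uniform(-1.0, 1.0)


def expand_node(node, budget=2, threshold=0.0):
    """One guided tree-search step: Filter signals feed a Control decision."""
    children = propose_children(node)                # Generate, locally
    scored = [(c, critic(c)) for c in children]      # Filter: per-child signal
    scored.sort(key=lambda cs: cs[1], reverse=True)  # Control: rank under budget
    return [c for c, s in scored[:budget] if s > threshold]  # prune and branch


frontier = expand_node([])
```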

Circularity Check

0 steps flagged

No circularity: descriptive taxonomy with no derivations or self-referential reductions

Full rationale

The paper is a survey that introduces the GFCR taxonomy as a classification framework for existing rollout strategies in LLM RL literature. It formalizes notation and decomposes pipelines into Generate, Filter, Control, and Replay stages, then maps methods from the literature onto this structure. No mathematical derivations, equations, fitted parameters, or predictions are present. The taxonomy is defined explicitly in the abstract and used to synthesize prior work without any step reducing to a self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The central claim (uniform mapping of strategies to GFCR) is a descriptive organization, not a derivation that collapses by construction. This is the expected non-finding for a taxonomy survey.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that proposes a new taxonomy rather than introducing fitted parameters, unproven axioms, or new physical entities. The GFCR stages are conceptual categories derived from literature synthesis.

pith-pipeline@v0.9.0 · 5687 in / 1224 out tokens · 84686 ms · 2026-05-10T19:11:02.263811+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked perfo...

  2. MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives ...

Reference graph

Works this paper leans on

189 extracted references · 155 canonical work pages · cited by 2 Pith papers · 30 internal anchors
