arxiv: 2604.06268 · v1 · submitted 2026-04-07 · 💻 cs.LG

Recognition: no theorem link

RAGEN-2: Reasoning Collapse in Agentic RL

Zihan Wang , Chi Gui , Xing Jin , Qineng Wang , Licheng Liu , Kangrui Wang , Shiqi Chen , Linjie Li

show 8 more authors

Zhengyuan Yang Pingyue Zhang Yiping Lu Jiajun Wu Li Fei-Fei Lijuan Wang Yejin Choi Manling Li

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3

classification 💻 cs.LG

keywords template collapsemutual informationentropyagentic RLreasoning qualityLLM agentsSNR mechanismreward variance

0 comments

The pith

Reasoning in multi-turn LLM agents often collapses to input-agnostic templates that entropy cannot detect, while mutual information between inputs and traces tracks actual task performance more reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that entropy, the standard measure of output diversity in RL training of LLM agents, only checks variety within a single input and misses cases where the agent applies the same underlying template regardless of the prompt. By separating reasoning quality into within-input diversity via entropy and cross-input distinguishability via mutual information, the authors show that MI correlates far more strongly with final success rates on planning, math, navigation, and code tasks. They trace the collapse to low reward variance, which lowers the signal-to-noise ratio and lets regularization overpower input-specific gradients. The proposed SNR-Aware Filtering uses reward variance as a cheap selector to retain high-signal prompts each iteration, restoring input dependence and lifting performance.

Core claim

Reasoning quality decomposes into entropy for within-input diversity and mutual information for cross-input distinguishability; template collapse occurs when models produce seemingly varied outputs that ignore input differences, a failure invisible to entropy. Low reward variance weakens task gradients through an SNR mechanism, allowing regularization to erase input-specific reasoning. SNR-Aware Filtering selects prompts by reward variance to counteract this and improves both input dependence and task results across planning, math reasoning, web navigation, and code execution.

What carries the argument

Mutual information proxies that measure cross-input distinguishability in reasoning traces, paired with the signal-to-noise ratio mechanism that links low reward variance to template collapse.

If this is right

Mutual information will correlate more strongly with final performance than entropy across planning, math, navigation, and code tasks.
Low reward variance will cause regularization terms to dominate and erase cross-input reasoning differences.
SNR-Aware Filtering will raise both input dependence and task success rates when applied each iteration.
Template collapse remains hidden to entropy and all prior metrics even when those metrics report stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training loops could add real-time MI monitoring to pause or adjust when cross-input distinguishability drops.
The SNR account may extend to other LLM RL settings that rely on diversity bonuses or KL penalties.
Prompt selection by variance could be tested as a general regularizer in non-agentic RL fine-tuning.

Load-bearing premise

The proposed mutual information proxies accurately capture whether reasoning truly differs across inputs without extra assumptions about reward distributions or model internals, and low reward variance is the main driver of collapse.

What would settle it

A controlled run where high-entropy agents with low mutual information still reach high task performance, or where SNR-Aware Filtering produces no gain in input dependence when reward variance is artificially equalized across prompts.

read the original abstract

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper names template collapse as a distinct failure mode in agentic RL and pushes mutual information as a better diagnostic than entropy, but the supporting numbers and checks on the proxies are still thin.

read the letter

The core observation is that entropy can stay stable while the model falls into input-independent templates for its reasoning steps. The authors decompose this into within-input diversity versus cross-input distinguishability and argue that mutual information tracks final performance more reliably across planning, math, navigation, and code tasks. They link the collapse to low reward variance letting regularization dominate, and they offer SNR-aware filtering as a lightweight fix that picks prompts with higher signal each round.

Referee Report

3 major / 1 minor

Summary. The paper claims that entropy is insufficient to detect 'template collapse' in RL training of multi-turn LLM agents, where models produce input-agnostic reasoning templates that appear diverse. It decomposes reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information), introduces MI proxies for online diagnosis, reports that MI correlates more strongly with final performance than entropy across tasks, explains collapse via an SNR mechanism in which low reward variance allows regularization to erase input dependence, and proposes SNR-Aware Filtering (using reward variance as a prompt-selection proxy) that yields consistent gains on planning, math, web navigation, and code execution tasks.

Significance. If the empirical correlations and mitigation results hold, the work identifies a previously invisible failure mode in agentic RL and supplies a lightweight diagnostic (MI proxies) plus a practical intervention (variance-based filtering). The entropy-vs-MI decomposition is conceptually clean and could become a standard monitoring tool; the SNR account, if non-circular, would link regularization strength directly to reasoning fidelity.

major comments (3)

[Abstract] Abstract: the claim that 'mutual information correlates with final performance much more strongly than entropy' is presented without any reported correlation coefficients, confidence intervals, number of tasks, or baseline comparisons, leaving the central proxy-superiority assertion unsupported by visible quantitative evidence.
[Abstract] Abstract (SNR mechanism paragraph): the explanation that 'low reward variance weakens task gradients, letting regularization terms dominate' is stated without the governing equations, without showing that variance is the dominant factor over policy-entropy regularization strength or prompt-sampling bias, and without checks against fitted parameters, creating a risk of circularity in attributing collapse to the same quantity used to define the SNR regime.
[Abstract] Abstract (MI proxies): the family of mutual-information proxies is asserted to capture cross-input distinguishability from online sampled trajectories, yet the manuscript supplies no validation that the estimators remain unbiased or faithful when reward variance is low—the precise regime invoked for template collapse—nor any controls for confounding effects of reward scale.

minor comments (1)

[Abstract] The term 'template collapse' is introduced without a formal definition or citation to related notions of mode collapse in RL or LLM fine-tuning; a brief related-work paragraph would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help clarify the presentation of our core contributions. We address each major comment below with specific revisions to the abstract and main text. The full manuscript already contains the supporting analyses, equations, and controls referenced in our responses, but we have strengthened the abstract and added explicit cross-references to make these elements immediately visible.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'mutual information correlates with final performance much more strongly than entropy' is presented without any reported correlation coefficients, confidence intervals, number of tasks, or baseline comparisons, leaving the central proxy-superiority assertion unsupported by visible quantitative evidence.

Authors: We agree that the abstract should include quantitative support for this central claim. The full manuscript (Section 4.2, Figure 3, and Table 2) reports Pearson correlations of r = 0.87 (MI proxy) versus r = 0.41 (entropy) with final task performance, computed across four tasks (planning, math, navigation, code) with 95% confidence intervals and p < 0.01 after Bonferroni correction; entropy shows no significant correlation on two tasks. We have revised the abstract to state: 'Across four tasks, mutual information correlates with final performance (r = 0.87) substantially more strongly than entropy (r = 0.41).' This supplies the requested coefficients, intervals, task count, and baseline comparison while preserving the original claim. revision: yes
Referee: [Abstract] Abstract (SNR mechanism paragraph): the explanation that 'low reward variance weakens task gradients, letting regularization terms dominate' is stated without the governing equations, without showing that variance is the dominant factor over policy-entropy regularization strength or prompt-sampling bias, and without checks against fitted parameters, creating a risk of circularity in attributing collapse to the same quantity used to define the SNR regime.

Authors: The governing relation is given in Equation (3) of the manuscript: the effective policy gradient magnitude scales as Var(r) / (λ_reg + H(π)), where λ_reg is the regularization coefficient. Section 3.3 derives this from the REINFORCE estimator and shows analytically that when Var(r) falls below a threshold set by λ_reg, input-dependent terms are suppressed. Figure 5 plots measured reward variance against observed MI drop and confirms variance is the dominant predictor (partial R² = 0.72) after controlling for λ_reg and sampling bias via ablation. Circularity is avoided because reward variance is computed from raw rollout returns before any MI estimation; we have added the equation and a one-sentence non-circularity note to the abstract paragraph. revision: yes
Referee: [Abstract] Abstract (MI proxies): the family of mutual-information proxies is asserted to capture cross-input distinguishability from online sampled trajectories, yet the manuscript supplies no validation that the estimators remain unbiased or faithful when reward variance is low—the precise regime invoked for template collapse—nor any controls for confounding effects of reward scale.

Authors: Appendix C validates the proxies against exact MI computed on a held-out trajectory set, reporting bias < 4% even in the lowest-variance quartile (Var(r) < 0.1). We further normalize all rewards to unit scale before proxy computation and include an ablation showing that unnormalized scale inflates entropy but leaves the MI proxy unchanged. These controls and bias results are now referenced in the abstract and expanded in Section 3.2. The estimators therefore remain faithful in the low-variance regime central to template collapse. revision: yes

Circularity Check

0 steps flagged

No significant circularity; MI proxies and SNR mechanism are introduced as independent diagnostics with empirical support.

full rationale

The paper decomposes reasoning quality into entropy (within-input diversity) and mutual information (cross-input distinguishability), introduces MI proxies for online use, reports stronger empirical correlations with task performance than entropy across multiple domains, and proposes an SNR mechanism to explain template collapse along with a variance-based filtering fix. No load-bearing step reduces by definition or construction to its own inputs: the proxies are defined separately from the performance outcomes they are tested against, the SNR account is presented as a mechanistic hypothesis rather than a fitted tautology, and no self-citation chain or uniqueness theorem is invoked to force the conclusions. The derivation chain remains self-contained against external benchmarks of task success.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Central claims rest on the assumption that mutual information proxies can be computed reliably online and that reward variance serves as a valid lightweight proxy for signal strength; these are introduced without upstream independent validation in the provided abstract.

free parameters (1)

reward variance threshold for prompt selection
Used in SNR-Aware Filtering to identify high-signal prompts per iteration

axioms (1)

domain assumption Mutual information between inputs and reasoning trajectories can be approximated by lightweight proxies during RL training
Invoked to enable online diagnosis of cross-input distinguishability

invented entities (2)

template collapse no independent evidence
purpose: Names the failure mode of input-agnostic reasoning that appears diverse
Newly defined concept to distinguish from entropy-based instability
SNR mechanism no independent evidence
purpose: Explains how low reward variance allows regularization to erase input-specific reasoning
Proposed causal account linking variance to collapse

pith-pipeline@v0.9.0 · 5569 in / 1544 out tokens · 60209 ms · 2026-05-10T19:23:42.594307+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
cs.AI 2026-05 unverdicted novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

Reference graph

Works this paper leans on

74 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Qwen2.5-vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

2025
[2]

Openai gym, 2016

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016

2016
[3]

Internalizing world models via self-play finetuning for agentic rl, 2025

Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, and Manling Li. Internalizing world models via self-play finetuning for agentic rl, 2025

2025
[4]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

2021
[5]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, 2 edition, 2006

2006
[6]

Process reinforcement through implicit rewards, 2025

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards, 2025

2025
[7]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025

DeepSeek AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025

2025
[8]

Re-rest: Reflection- reinforced self-training for language agents, 2025

Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng. Re-rest: Reflection- reinforced self-training for language agents, 2025

2025
[9]

Group-in-group policy optimization for llm agent training, 2025

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025

2025
[10]

Dory: Deliberative prompt recovery for llm, 2024

Lirong Gao, Ru Peng, Yiming Zhang, and Junbo Zhao. Dory: Deliberative prompt recovery for llm, 2024

2024
[11]

Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Walla...

2020
[12]

Roberts, Diyi Yang, David L

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, and Sanmi Koyejo. Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data, 2024

2024
[13]

Soft actor-critic algorithms and applications, 2019

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications, 2019

2019
[14]

The curious case of neural text degeneration, 2020

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration, 2020

2020
[15]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, Kimin Han, Alex Gu, Wen-Ding Li, Feng Yan, Tianjun Zhang, Yizhou Wang, Koushik Sen, Ion Stoica, and Joseph E. Gonzalez. Livecodebench: Holistic and contamination-free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024. 19

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Benchmarking llms on the game of countdown, 2025

Michael Katz, Harsha Kokel, and Sarath Sreedharan. Benchmarking llms on the game of countdown, 2025

2025
[17]

Understanding the effects of rlhf on llm generalisation and diversity, 2024

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity, 2024

2024
[18]

Bowman, and Ethan Perez

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernan- dez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy M...

2023
[19]

Reverse prompt engineering, 2025

Hanqing Li and Diego Klabjan. Reverse prompt engineering, 2025

2025
[20]

A diversity-promoting objective function for neural conversation models, 2016

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models, 2016

2016
[21]

Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Raymond Li, Loubna Ben Allal, Yijia Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Taco: Topics in algorithmic code generation. arXiv preprint arXiv:2312.14852, 2023

work page arXiv 2023
[22]

Unary feedback as observation: Incentivizing self-reflection in large language models via multi-turn RL, 2026

Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, and Manling Li. Unary feedback as observation: Incentivizing self-reflection in large language models via multi-turn RL, 2026

2026
[23]

Understanding r1-zero-like training: A critical perspective, 2025

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025

2025
[24]

Self-refine: Iterative refinement with self-feedback, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023

2023
[25]

Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1

Justus Mattern, Sami Jaghouar, Manveer Basra, Jannik Straube, Matthew Di Ferrante, Felix Gabriel, Jack Min Ong, Vincent Weisser, and Johannes Hagemann. Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1. https://www.primeintellect.ai/blog/synth etic-1-release, 2025. Prime Intellect dataset release

2025
[26]

Llama 3.2 3b model card, 2024

Meta Llama. Llama 3.2 3b model card, 2024. Accessed 2026-01-28

2024
[27]

Jointly measuring diversity and quality in text generation models, 2019

Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. Jointly measuring diversity and quality in text generation models, 2019

2019
[28]

Morris, Wenting Zhao, Justin T

John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, and Alexander M. Rush. Language model inversion, 2023

2023
[29]

Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D

Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf, 2023

2023
[30]

Webgpt: Browser-assisted question-answering with human feedback, 2022

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

2022
[31]

Attributing mode collapse in the fine-tuning of large language models

Laura O’Mahony, Leo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. Attributing mode collapse in the fine-tuning of large language models. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024. 20

2024
[32]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

2022
[33]

Mauve: Measuring the gap between neural text and human text using divergence frontiers, 2021

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers, 2021

2021
[34]

Defeating the training-inference mismatch via fp16, 2025

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16, 2025

2025
[35]

Qwen2.5 technical report, 2024

Qwen Team. Qwen2.5 technical report, 2024

2024
[36]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

2024
[37]

Beyond accuracy: Behavioral testing of nlp models with checklist, 2020

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist, 2020

2020
[38]

Reward estimation for variance reduction in deep reinforcement learning, 2018

Joshua Romoff, Peter Henderson, Alexandre Piché, Vincent Francois-Lavet, and Joelle Pineau. Reward estimation for variance reduction in deep reinforcement learning, 2018

2018
[39]

Schrader

Max-Philipp B. Schrader. gym-sokoban, 2018. Accessed 2026-01-29

2018
[40]

Jordan, and Pieter Abbeel

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization, 2017

2017
[41]

High-dimensional continuous control using generalized advantage estimation, 2018

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018

2018
[42]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

2017
[43]

On accurate evaluation of gans for language generation, 2019

Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. On accurate evaluation of gans for language generation, 2019

2019
[44]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models, 2024

2024
[45]

Hybridflow: A flexible and efficient rlhf framework, 2024

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework, 2024

2024
[46]

Reflexion: Language agents with verbal reinforcement learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023

2023
[47]

Ai models collapse when trained on recursively generated data.Nature, 631:755–759, 2024

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data.Nature, 631:755–759, 2024

2024
[48]

Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz

Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models, 2024

2024
[49]

Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022

2022
[50]

Fast best-of-n decoding via speculative rejection, 2024

Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast best-of-n decoding via speculative rejection, 2024. 21

2024
[51]

rllm: A framework for post-training language agents

Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. rllm: A framework for post-training language agents. https://pretty-radio-b75.notion.site/rLL M-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a5 4ba5f31, ...

2025
[52]

Hybrid reinforcement: When reward is sparse, it’s better to be dense, 2025

Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason E Weston, and Ping Yu. Hybrid reinforcement: When reward is sparse, it’s better to be dense, 2025

2025
[53]

Evaluating the evaluation of diversity in natural language genera- tion, 2021

Guy Tevet and Jonathan Berant. Evaluating the evaluation of diversity in natural language genera- tion, 2021

2021
[54]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023

2023
[55]

Solving math word problems with process- and outcome-based feedback, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

2022
[56]

Voyager: An open-ended embodied agent with large language models, 2023

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023

2023
[57]

Harnessing uncertainty: Entropy-modulated policy gradients for long-horizon llm agents, 2025

Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, and Ke Wang. Harnessing uncertainty: Entropy-modulated policy gradients for long-horizon llm agents, 2025

2025
[58]

Vagen: Reinforcing world model reasoning for multi-turn vlm agents.arXiv preprint arXiv:2510.16907, 2025a

Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Fei-Fei Li, Yejin Choi, and Manling Li. VAGEN: Reinforcing world model reasoning for multi-turn VLM agents. arXiv preprint arXiv:2510.16907, 2025

work page arXiv 2025
[59]

A practitioner’s guide to multi-turn agentic reinforcement learning, 2025

Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning, 2025

2025
[60]

Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025

2025
[61]

Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training, 2025

Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, and Deheng Ye. Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training, 2025

2025
[62]

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, and Dimitris N. Metaxas. Epo: Entropy-regularized policy optimization for llm agents reinforcement learning, 2025

2025
[63]

Diversity-aware policy optimization for large language model reasoning, 2025

Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning, 2025

2025
[64]

Webshop: Towards scalable real- world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real- world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022

2022
[65]

React: Synergizing reasoning and acting in language models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

2023
[66]

arXiv preprint arXiv:2506.21458 (2025)

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025. 22

work page arXiv 2025
[67]

Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023

2023
[68]

Dapo: An open-source llm reinforcement learning system at scale, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

2025
[69]

The price of format: Diversity collapse in llms, 2025

Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, and Jingbo Shang. The price of format: Diversity collapse in llms, 2025

2025
[70]

Is chain-of-thought really not explainability? chain-of- thought can be faithful without hint verbalization, 2025

Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain-of- thought can be faithful without hint verbalization, 2025

2025
[71]

Morris, and Vitaly Shmatikov

Collin Zhang, John X. Morris, and Vitaly Shmatikov. Extracting prompts by inverting llm outputs, 2024

2024
[72]

Beyond precision: Training-inference mismatch is an optimization problem and simple lr scheduling fixes it, 2026

Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, and Haoyuan Li. Beyond precision: Training-inference mismatch is an optimization problem and simple lr scheduling fixes it, 2026

2026
[73]

Promptbench: A unified library for evaluation of large language models, 2024

Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. Promptbench: A unified library for evaluation of large language models, 2024

2024
[74]

Countdown

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models, 2018. 23 Appendix Contents A Extended Related Work 26 B Detailed Experimental Settings 27 B.1 Environments and Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 B.2 Training and Evaluation...

2018