ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Chengwei Qin; Heqing Zou; Hui Xiong; Jianbo Lin; Weishi Wang; Xiaomin Yu; Yifu Guo; Yi Xin; Zhongqi Yue; Zhuosong Jiang

arxiv: 2605.15224 · v1 · pith:23J2LGEKnew · submitted 2026-05-13 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-19 17:53 UTC · model grok-4.3

keywords: self-critique, reinforcement learning, joint training, solver and critic, distribution calibration, agentic tasks, mathematical reasoning, internalization

The pith

ICRL jointly trains a solver and critic from one backbone so critique gains become part of unassisted performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ICRL to solve the problem that language models lose the benefit of critique as soon as the critique is taken away. It jointly optimizes a solver and a critic from a shared backbone with reinforcement learning, giving the critic a reward tied directly to how much the solver improves afterward. A distribution-calibration re-weighting step keeps only those improvements that fit the solver's normal behavior without critique, while role-wise advantage estimates keep the joint training stable. Experiments on agentic and math benchmarks show gains over standard methods, and the resulting critic works at smaller scale than much larger alternatives. If the approach holds, models could keep improving on their own instead of staying dependent on external feedback.
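
The critic reward described above can be illustrated with a minimal sketch. The source says only that the reward is tied to how much the solver improves after the critique; the plain difference used here, and the name `critic_reward`, are assumptions made for illustration and may not match the paper's exact definition.

```python
def critic_reward(score_before: float, score_after: float) -> float:
    """Hypothetical critic reward: the solver's score gain after reading the critique."""
    return score_after - score_before


# Three illustrative cases: the critique fixes a failure, changes nothing,
# or breaks a previously correct answer.
cases = [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
rewards = [critic_reward(before, after) for before, after in cases]
print(rewards)
```

Under this reading, a critique that merely restates a correct answer earns nothing, and one that misleads the solver is penalized.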

Core claim

ICRL jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior.
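
One way to picture role-wise group advantage estimation is GRPO-style normalization applied separately to solver and critic samples, so that the two roles' reward scales do not mix. This is a sketch under that assumption; the function name, the `(role, reward)` sample layout, and the use of a single group per role (rather than per prompt and per role) are hypothetical simplifications, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def role_wise_advantages(samples, eps=1e-6):
    """Normalize each reward against the mean and std of its own role's group.

    samples: list of (role, reward) pairs, e.g. ("solver", 1.0).
    Returns one advantage per sample, in the input order.
    """
    by_role = defaultdict(list)
    for role, reward in samples:
        by_role[role].append(reward)
    # Per-role statistics; eps guards against a zero standard deviation.
    stats = {role: (mean(rs), pstdev(rs)) for role, rs in by_role.items()}
    return [(reward - stats[role][0]) / (stats[role][1] + eps)
            for role, reward in samples]


batch = [("solver", 1.0), ("solver", 0.0), ("critic", 0.5), ("critic", -0.5)]
print(role_wise_advantages(batch))
```

Normalizing within each role keeps a role with large reward magnitudes from dominating the shared backbone's gradient, which is one plausible reason such a scheme would stabilize joint training.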

What carries the argument

The distribution-calibration re-weighting ratio that selects only critique-guided improvements compatible with the solver's own prompt distribution during joint RL training of solver and critic.
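
The summary does not give the formula for the re-weighting ratio. As a rough illustration only, one plausible reading is an importance-style weight: the likelihood of a critique-guided response under the critique-free prompt, relative to its likelihood under the critique-conditioned prompt, capped at 1. The function name, the token-level log-probability inputs, and the cap are all assumptions made for this sketch.

```python
import math


def calibration_weight(logp_free, logp_conditioned, clip=1.0):
    """Hypothetical weight for transferring a critique-guided response.

    logp_free: per-token log-probs of the response given the prompt alone.
    logp_conditioned: per-token log-probs given the prompt plus critique.
    Returns the sequence-level likelihood ratio, capped at `clip`.
    """
    log_ratio = sum(logp_free) - sum(logp_conditioned)
    return min(math.exp(log_ratio), clip)


# A response the solver could plausibly produce unaided keeps most of its weight.
print(calibration_weight([-0.2, -0.1], [-0.1, -0.1]))
# A response that is only likely with the critique in context is nearly discarded.
print(calibration_weight([-5.0, -4.0], [-0.1, -0.1]))
```

Under this reading, updates are dominated by improvements the critique-free solver could already reach, which matches the stated goal of keeping only gains compatible with the solver's own prompt distribution.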

Load-bearing premise

The re-weighting ratio successfully selects critique-guided improvements that remain compatible with the solver's own prompt distribution without introducing bias or reducing performance on critique-free queries.

What would settle it

Ablating the re-weighting ratio during training and checking whether the solver's accuracy on critique-free queries drops below the GRPO baseline or the full ICRL version.

Figures

Figures reproduced from arXiv: 2605.15224 by Chengwei Qin, Heqing Zou, Hui Xiong, Jianbo Lin, Weishi Wang, Xiaomin Yu, Yifu Guo, Yi Xin, Zhongqi Yue, Zhuosong Jiang.

**Figure 2.** Overview of the ICRL framework. (1) Rollout with critique alternates solver and critic; (2)

**Figure 3.** Test-time self-improvement performance on ALFWorld, WebShop, and SearchQA.

**Figure 4.** Training dynamics on ALFWorld. (1) Reward curve of the solver agent. (2) Reward curve

Original abstract

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ICRL gets some benchmark gains from joint solver-critic RL training with a performance-based critic reward, but the distribution-calibration re-weighting lacks direct checks that it preserves unassisted solver performance.

The main point is that this paper trains a solver and critic together from one backbone so that feedback which helps the solver actually gets turned into better base behavior without the feedback present. They reward the critic for measured performance gains on the solver and add a re-weighting step to keep the updates compatible with normal prompts, plus role-wise advantage estimation for training stability. Results show average lifts of 6.4 points on agent tasks and 7.0 on math over GRPO with Qwen3-4B and 8B models, and the learned critic stays competitive with much larger ones while using fewer tokens. Code is released, which helps.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ICRL, a reinforcement learning framework that jointly trains a solver and a critic from a shared LLM backbone. The critic is rewarded according to the solver's subsequent performance gain after critique, a distribution-calibration re-weighting ratio is introduced to mitigate distribution shift between critique-conditioned and critique-free behavior, and role-wise group advantage estimation is used to stabilize joint optimization. Experiments on agentic and mathematical reasoning tasks with Qwen3-4B and Qwen3-8B backbones report average gains of 6.4 and 7.0 points over GRPO, respectively, and code is released.

Significance. If the internalization claim holds after addressing verification gaps, the work could advance methods for autonomous self-improvement in LLM agents by converting external critique signals into unassisted capability. The public code release supports reproducibility, and the reported gains across two task categories and two model sizes provide a baseline for generality.

major comments (2)

[Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.
[§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.

minor comments (1)

[Abstract] The abstract states that the learned 8B critic is 'comparable to 32B critics while using substantially fewer tokens,' but the main text would benefit from specifying the exact comparison metric and the token counts involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.

Referee: [Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.

Authors: We agree that direct verification of performance on critique-free prompts before versus after training, together with an ablation of the re-weighting component, would provide stronger support for the internalization claim. While the current experiments demonstrate overall gains and the re-weighting is motivated by mitigating distribution shift, these additional analyses were not included. In the revised manuscript we will add (i) solver accuracy on purely critique-free queries pre- and post-training and (ii) an ablation removing the distribution-calibration re-weighting, reporting its effect on critique-free performance. These results will quantify any bias and help confirm that gains transfer to unassisted behavior. revision: yes
Referee: [§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.

Authors: We acknowledge that reporting experimental robustness details is necessary for readers to assess the reliability of the reported improvements. The current manuscript presents aggregate results without these statistics. In the revision we will specify the number of independent runs, report variance or standard deviations across random seeds, include statistical significance tests (e.g., paired t-tests against GRPO), and add a brief discussion of controls for RL stochasticity and re-weighting ratio selection. These additions will allow better evaluation of whether the observed gains are attributable to the proposed mechanisms. revision: yes
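
The paired test mentioned in the rebuttal can be sketched as follows, assuming each method is run once per random seed and the two score lists are aligned by seed. The function name is hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper. The sketch stops at the t statistic; turning it into a p-value needs a t-distribution CDF from a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-seed paired scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the per-seed differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed scores for two methods (illustrative only).
method_a = [3.0, 5.0, 4.0]
method_b = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
print(t_stat, dof)
```

Pairing by seed removes seed-level variance shared by both methods, which matters when RL training runs are as noisy as the referee suggests.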

Circularity Check

0 steps flagged

No significant circularity in ICRL derivation chain.

The paper's core mechanism defines critic rewards directly from externally measured solver performance gains on benchmarks after critique is applied, which constitutes an independent training signal rather than a self-referential or fitted quantity renamed as a prediction. The distribution-calibration re-weighting ratio is presented as an explicit design choice to mitigate shift between critique-conditioned and critique-free prompts, but it does not reduce the reported benchmark improvements (6.4/7.0 points over GRPO) to the inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the internalization claim; results are empirical outcomes from joint RL training on Qwen3 backbones evaluated on agentic and mathematical tasks. The framework remains self-contained against external benchmarks without tautological reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to elements explicitly named; the re-weighting ratio and advantage estimation are treated as introduced mechanisms whose tuning details are not specified.

free parameters (1)

distribution-calibration re-weighting ratio
Introduced to handle distribution shift between critique-conditioned and critique-free behavior; value not stated in abstract and presumed tuned on validation data.

axioms (1)

domain assumption: Critique can often guide the model toward correct behavior on the same query.
Stated as the starting observation in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Role-wise group advantage estimation stabilizes joint optimization across the two roles.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Advances in neural information processing systems , volume=

Self-refine: Iterative refinement with self-feedback , author=. Advances in neural information processing systems , volume=

work page

[2] [2]

Advances in neural information processing systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=

work page

[3] [3]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

ArXiv , year=

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. ArXiv , year=

work page

[5] [5]

Training language models with language feedback at scale.arXiv preprint arXiv:2303.16755, 2023

Training language models with language feedback at scale , author=. arXiv preprint arXiv:2303.16755 , year=

work page arXiv

[6] [6]

2024 , url=

John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press , booktitle=. 2024 , url=

work page 2024

[7] [7]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Openhands: An open platform for ai software developers as generalist agents , author=. arXiv preprint arXiv:2407.16741 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Mobile-agent: Autonomous multi-modal mobile device agent with visual perception , author=. arXiv preprint arXiv:2401.16158 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Ui-tars: Pioneering automated gui interaction with native agents , author=. arXiv preprint arXiv:2501.12326 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

WebSailor: Navigating Super-human Reasoning for Web Agent

Websailor: Navigating super-human reasoning for web agent , author=. arXiv preprint arXiv:2507.02592 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

arXiv preprint arXiv:2508.13167 , year=

Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl , author=. arXiv preprint arXiv:2508.13167 , year=

work page arXiv

[12] [12]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Large language models are better reasoners with self-verification , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023

[13] [13]

Incentivizing llms to self-verify their answers.CoRR, abs/2506.01369, 2025

Incentivizing LLMs to Self-Verify Their Answers , author=. arXiv preprint arXiv:2506.01369 , year=

work page arXiv

[14] [14]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

S2r: Teaching llms to self-verify and self-correct via reinforcement learning , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[15] [15]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Group Sequence Policy Optimization

Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Minimax-m1: Scaling test-time compute efficiently with lightning attention , author=. arXiv preprint arXiv:2506.13585 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

arXiv preprint arXiv:2512.01374 , year=

Stabilizing reinforcement learning with llms: Formulation and practices , author=. arXiv preprint arXiv:2512.01374 , year=

work page arXiv

[20] [20]

Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.13445, 2025

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards , author=. arXiv preprint arXiv:2505.13445 , year=

work page arXiv

[21] [21]

arXiv preprint arXiv:2602.07594 , year=

Learning to Self-Verify Makes Language Models Better Reasoners , author=. arXiv preprint arXiv:2602.07594 , year=

work page arXiv

[22] [22]

Critique-grpo: Advancing llm reasoning with natural language and numerical feedback

Critique-grpo: Advancing llm reasoning with natural language and numerical feedback , author=. arXiv preprint arXiv:2506.03106 , year=

work page arXiv

[23] [23]

arXiv preprint arXiv:2501.05727 , year=

Self-evolving critique abilities in large language models , author=. arXiv preprint arXiv:2501.05727 , year=

work page arXiv

[24] [24]

2021 , url =

Mohit Shridhar and Xingdi Yuan and Marc-Alexandre C\^ot\'e and Yonatan Bisk and Adam Trischler and Matthew Hausknecht , booktitle =. 2021 , url =

work page 2021

[25] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. Advances in Neural Information Processing Systems.

[26] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. doi:10.18653/v1/D18-1259.

[27] Xanh Ho et al. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Proceedings of the 28th International Conference on Computational Linguistics, 2020. doi:10.18653/v1/2020.coling-main.580.

[28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Trans. Assoc. Comput. Linguistics, 2022. doi:10.1162/TACL_A_00475.

[29] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and Narrowing the Compositionality Gap in Language Models. 2023. doi:10.18653/v1/2023.findings-emnlp.378.

[30] Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs. arXiv preprint arXiv:2510.11062, 2025.

[31] Multi-Agent Tool-Integrated Policy Optimization. arXiv preprint arXiv:2510.04678.

[32] WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2602.04634, 2026.

[33] Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems. arXiv preprint arXiv:2602.08847.

[34] AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[35] AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2509.08755, 2025.

[36] Measuring Mathematical Problem Solving With the MATH Dataset. 2021.

[37] Solving Quantitative Reasoning Problems with Language Models. 2022.

[38] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Proceedings of the ... 2024. doi:10.18653/v1/2024.acl-long.211.

[39] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. GitHub repository, 2024.

[40] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[41] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.

[42] 2026.

[43] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv preprint arXiv:2505.24298.

[44] Learning to Reason under Off-Policy Guidance. 2025.

[45] Bread: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning. arXiv preprint arXiv:2506.17211.