ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
Pith reviewed 2026-05-19 17:53 UTC · model grok-4.3
The pith
ICRL jointly trains a solver and critic from one backbone so critique gains become part of unassisted performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ICRL jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior.
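The gain-based critic reward described above can be illustrated with a minimal sketch. The page does not give the paper's formula, so the form used here (mean solver reward with the critique minus mean solver reward without it) is an assumption, and the function name `critic_reward` and the toy values are hypothetical.

```python
from statistics import mean


def critic_reward(rewards_with_critique, rewards_without_critique):
    """Hypothetical gain-based reward for a single critique.

    rewards_with_critique: solver rewards on rollouts conditioned on the critique.
    rewards_without_critique: solver rewards on critique-free rollouts of the same query.
    Assumed form (not confirmed by the source): difference of the two means.
    """
    return mean(rewards_with_critique) - mean(rewards_without_critique)


# Toy values for illustration only; they are not results from the paper.
gain = critic_reward([1.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0])
print(gain)
```

Under this assumed form, a critique that leaves the solver no better off earns zero reward, and one that misleads the solver earns a negative reward.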
What carries the argument
The distribution-calibration re-weighting ratio that selects only critique-guided improvements compatible with the solver's own prompt distribution during joint RL training of solver and critic.
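One way such a re-weighting ratio might look is sketched below. The page does not state the paper's definition, so this is an assumption: a clipped sequence-level importance ratio between the solver's likelihood of a response under the critique-free prompt and under the critique-conditioned prompt. The name `calibration_weight` and the parameter `clip_max` are hypothetical.

```python
import math


def calibration_weight(logp_free, logp_critique, clip_max=1.0):
    """Hypothetical weight for one critique-guided response.

    logp_free: summed token log-probabilities of the response given the
        critique-free prompt.
    logp_critique: summed token log-probabilities of the same response given
        the critique-conditioned prompt.
    Assumed form (not confirmed by the source): exp(logp_free - logp_critique),
    clipped from above at clip_max.
    """
    ratio = math.exp(logp_free - logp_critique)
    return min(ratio, clip_max)


# A response that is much less likely without the critique is down-weighted.
low = calibration_weight(-12.0, -10.0)
# A response the solver would plausibly produce on its own keeps full weight.
full = calibration_weight(-10.0, -12.0)
print(low, full)
```

In this sketch, responses that only make sense with the critique in context contribute little to the critique-free update, which is the selective-transfer behavior the passage describes.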
Load-bearing premise
The re-weighting ratio successfully selects critique-guided improvements that remain compatible with the solver's own prompt distribution without introducing bias or reducing performance on critique-free queries.
What would settle it
Ablating the re-weighting ratio during training and checking whether the solver's accuracy on critique-free queries drops below the GRPO baseline or the full ICRL version.
Original abstract
Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ICRL, a reinforcement learning framework that jointly trains a solver and a critic from a shared LLM backbone. The critic is rewarded according to the solver's subsequent performance gain after critique, a distribution-calibration re-weighting ratio is introduced to mitigate distribution shift between critique-conditioned and critique-free behavior, and role-wise group advantage estimation is used to stabilize joint optimization. Experiments on agentic and mathematical reasoning tasks with Qwen3-4B and Qwen3-8B backbones report average gains of 6.4 and 7.0 points over GRPO, respectively, and code is released.
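The role-wise group advantage estimation mentioned in the summary can be sketched as a GRPO-style normalization applied separately to each role. This is an illustrative reading, not the paper's stated formula: the function name `role_wise_advantages`, the use of population standard deviation, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def role_wise_advantages(samples, eps=1e-6):
    """Normalize rewards within each role rather than across the whole group.

    samples: list of (role, reward) pairs, e.g. ("solver", 1.0).
    Returns advantages in the same order as the input.
    Assumed form: (reward - role mean) / (role population std + eps).
    """
    by_role = {}
    for role, reward in samples:
        by_role.setdefault(role, []).append(reward)
    stats = {role: (mean(rs), pstdev(rs)) for role, rs in by_role.items()}
    return [
        (reward - stats[role][0]) / (stats[role][1] + eps)
        for role, reward in samples
    ]


# Toy rewards for illustration only; solver and critic rewards sit on different scales.
adv = role_wise_advantages(
    [("solver", 1.0), ("solver", 0.0), ("critic", 0.5), ("critic", -0.5)]
)
print(adv)
```

Normalizing per role keeps one role's reward scale from dominating the shared backbone's gradient, which is one plausible reason such a scheme would stabilize joint training.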
Significance. If the internalization claim holds after addressing verification gaps, the work could advance methods for autonomous self-improvement in LLM agents by converting external critique signals into unassisted capability. The public code release supports reproducibility, and the reported gains across two task categories and two model sizes provide a baseline for generality.
major comments (2)
- [Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.
- [§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.
minor comments (1)
- [Abstract] The abstract states that the learned 8B critic is 'comparable to 32B critics while using substantially fewer tokens,' but the main text would benefit from specifying the exact comparison metric and the token counts involved.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.
Referee: [Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.
Authors: We agree that direct verification of performance on critique-free prompts before versus after training, together with an ablation of the re-weighting component, would provide stronger support for the internalization claim. While the current experiments demonstrate overall gains and the re-weighting is motivated by mitigating distribution shift, these additional analyses were not included. In the revised manuscript we will add (i) solver accuracy on purely critique-free queries pre- and post-training and (ii) an ablation removing the distribution-calibration re-weighting, reporting its effect on critique-free performance. These results will quantify any bias and help confirm that gains transfer to unassisted behavior. revision: yes
Referee: [§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.
Authors: We acknowledge that reporting experimental robustness details is necessary for readers to assess the reliability of the reported improvements. The current manuscript presents aggregate results without these statistics. In the revision we will specify the number of independent runs, report variance or standard deviations across random seeds, include statistical significance tests (e.g., paired t-tests against GRPO), and add a brief discussion of controls for RL stochasticity and re-weighting ratio selection. These additions will allow better evaluation of whether the observed gains are attributable to the proposed mechanisms. revision: yes
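The paired t-test the authors propose can be sketched with the standard library alone. The sketch computes only the t statistic and degrees of freedom for per-seed score pairs; the function name `paired_t_statistic` is hypothetical, and the scores below are placeholders, not numbers from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-seed scores.

    scores_a and scores_b must be aligned so that index i in both lists
    comes from the same seed.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for three seeds; purely illustrative.
t_stat, dof = paired_t_statistic([3.0, 5.0, 4.0], [1.0, 2.0, 3.0])
print(t_stat, dof)
```

The p-value would then be read from a Student t distribution with `dof` degrees of freedom. With only a handful of seeds the test has little power, so reporting the per-seed differences alongside it would be informative.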
Circularity Check
No significant circularity in ICRL derivation chain.
The paper's core mechanism defines critic rewards directly from externally measured solver performance gains on benchmarks after critique is applied, which constitutes an independent training signal rather than a self-referential or fitted quantity renamed as a prediction. The distribution-calibration re-weighting ratio is presented as an explicit design choice to mitigate shift between critique-conditioned and critique-free prompts, but it does not reduce the reported benchmark improvements (6.4/7.0 points over GRPO) to the inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the internalization claim; results are empirical outcomes from joint RL training on Qwen3 backbones evaluated on agentic and mathematical tasks. The framework remains self-contained against external benchmarks without tautological reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- distribution-calibration re-weighting ratio
axioms (1)
- domain assumption: Critique can often guide the model toward correct behavior on the same query.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Role-wise group advantage estimation stabilizes joint optimization across the two roles.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.