pith. sign in

arxiv: 2605.15224 · v1 · pith:23J2LGEKnew · submitted 2026-05-13 · 💻 cs.AI · cs.MA

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Pith reviewed 2026-05-19 17:53 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords self-critiquereinforcement learningjoint trainingsolver and criticdistribution calibrationagentic tasksmathematical reasoninginternalization
0
0 comments X

The pith

ICRL jointly trains a solver and critic from one backbone so critique gains become part of unassisted performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ICRL to solve the problem that language models lose the benefit of critique as soon as the critique is taken away. It jointly optimizes a solver and a critic from a shared backbone with reinforcement learning, giving the critic a reward tied directly to how much the solver improves afterward. A distribution-calibration re-weighting step keeps only those improvements that fit the solver's normal behavior without critique, while role-wise advantage estimates keep the joint training stable. Experiments on agentic and math benchmarks show gains over standard methods, and the resulting critic works at smaller scale than much larger alternatives. If the approach holds, models could keep improving on their own instead of staying dependent on external feedback.

Core claim

ICRL jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without externa

What carries the argument

The distribution-calibration re-weighting ratio that selects only critique-guided improvements compatible with the solver's own prompt distribution during joint RL training of solver and critic.

Load-bearing premise

The re-weighting ratio successfully selects critique-guided improvements that remain compatible with the solver's own prompt distribution without introducing bias or reducing performance on critique-free queries.

What would settle it

Ablating the re-weighting ratio during training and checking whether the solver's accuracy on critique-free queries drops below the GRPO baseline or the full ICRL version.

Figures

Figures reproduced from arXiv: 2605.15224 by Chengwei Qin, Heqing Zou, Hui Xiong, Jianbo Lin, Weishi Wang, Xiaomin Yu, Yifu Guo, Yi Xin, Zhongqi Yue, Zhuosong Jiang.

Figure 1
Figure 1. Figure 1: Critique can turn failed trajectories into successful revisions, while training should internal [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the ICRL framework. (1) Rollout with critique alternates solver and critic; (2) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Test-time self-improvement performance on ALFWorld, WebShop, and SearchQA. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training dynamics on ALFWorld. (1) Reward curve of the solver agent. (2) Reward curve [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning(ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ICRL, a reinforcement learning framework that jointly trains a solver and a critic from a shared LLM backbone. The critic is rewarded according to the solver's subsequent performance gain after critique, a distribution-calibration re-weighting ratio is introduced to mitigate distribution shift between critique-conditioned and critique-free behavior, and role-wise group advantage estimation is used to stabilize joint optimization. Experiments on agentic and mathematical reasoning tasks with Qwen3-4B and Qwen3-8B backbones report average gains of 6.4 and 7.0 points over GRPO, respectively, and code is released.

Significance. If the internalization claim holds after addressing verification gaps, the work could advance methods for autonomous self-improvement in LLM agents by converting external critique signals into unassisted capability. The public code release supports reproducibility, and the reported gains across two task categories and two model sizes provide a baseline for generality.

major comments (2)
  1. [Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.
  2. [§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.
minor comments (1)
  1. [Abstract] The abstract states that the learned 8B critic is 'comparable to 32B critics while using substantially fewer tokens,' but the main text would benefit from specifying the exact comparison metric and the token counts involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Distribution-calibration re-weighting] The section describing the distribution-calibration re-weighting ratio claims this mechanism selects critique-guided improvements compatible with the solver's own prompt distribution. However, the results do not report solver accuracy on purely critique-free queries before versus after training, nor an ablation that removes the re-weighting component to quantify any introduced bias or performance change on critique-free prompts. This verification is load-bearing for the central claim that critique-induced gains are internalized rather than remaining dependent on critique conditioning.

    Authors: We agree that direct verification of performance on critique-free prompts before versus after training, together with an ablation of the re-weighting component, would provide stronger support for the internalization claim. While the current experiments demonstrate overall gains and the re-weighting is motivated by mitigating distribution shift, these additional analyses were not included. In the revised manuscript we will add (i) solver accuracy on purely critique-free queries pre- and post-training and (ii) an ablation removing the distribution-calibration re-weighting, reporting its effect on critique-free performance. These results will quantify any bias and help confirm that gains transfer to unassisted behavior. revision: yes

  2. Referee: [§5] §5 (Experimental results): Aggregate benchmark gains over GRPO are presented, but the manuscript provides no details on the number of independent runs, statistical significance tests, variance across random seeds, or explicit controls for the stochasticity of RL training and the re-weighting ratio selection. These omissions limit assessment of whether the 6.4/7.0 point improvements reliably reflect the proposed mechanisms.

    Authors: We acknowledge that reporting experimental robustness details is necessary for readers to assess the reliability of the reported improvements. The current manuscript presents aggregate results without these statistics. In the revision we will specify the number of independent runs, report variance or standard deviations across random seeds, include statistical significance tests (e.g., paired t-tests against GRPO), and add a brief discussion of controls for RL stochasticity and re-weighting ratio selection. These additions will allow better evaluation of whether the observed gains are attributable to the proposed mechanisms. revision: yes

Circularity Check

0 steps flagged

No significant circularity in ICRL derivation chain.

full rationale

The paper's core mechanism defines critic rewards directly from externally measured solver performance gains on benchmarks after critique is applied, which constitutes an independent training signal rather than a self-referential or fitted quantity renamed as a prediction. The distribution-calibration re-weighting ratio is presented as an explicit design choice to mitigate shift between critique-conditioned and critique-free prompts, but it does not reduce the reported benchmark improvements (6.4/7.0 points over GRPO) to the inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the internalization claim; results are empirical outcomes from joint RL training on Qwen3 backbones evaluated on agentic and mathematical tasks. The framework remains self-contained against external benchmarks without tautological reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to elements explicitly named; the re-weighting ratio and advantage estimation are treated as introduced mechanisms whose tuning details are not specified.

free parameters (1)
  • distribution-calibration re-weighting ratio
    Introduced to handle distribution shift between critique-conditioned and critique-free behavior; value not stated in abstract and presumed tuned on validation data.
axioms (1)
  • domain assumption Critique can often guide the model toward correct behavior on the same query.
    Stated as the starting observation in the abstract.

pith-pipeline@v0.9.0 · 5853 in / 1322 out tokens · 39448 ms · 2026-05-19T17:53:19.786250+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 12 internal anchors

  1. [1]

    Advances in neural information processing systems , volume=

    Self-refine: Iterative refinement with self-feedback , author=. Advances in neural information processing systems , volume=

  2. [2]

    Advances in neural information processing systems , volume=

    Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=

  3. [3]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

  4. [4]

    ArXiv , year=

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. ArXiv , year=

  5. [5]

    Training language models with language feedback at scale.arXiv preprint arXiv:2303.16755, 2023

    Training language models with language feedback at scale , author=. arXiv preprint arXiv:2303.16755 , year=

  6. [6]

    2024 , url=

    John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press , booktitle=. 2024 , url=

  7. [7]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Openhands: An open platform for ai software developers as generalist agents , author=. arXiv preprint arXiv:2407.16741 , year=

  8. [8]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception , author=. arXiv preprint arXiv:2401.16158 , year=

  9. [9]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Ui-tars: Pioneering automated gui interaction with native agents , author=. arXiv preprint arXiv:2501.12326 , year=

  10. [10]

    WebSailor: Navigating Super-human Reasoning for Web Agent

    Websailor: Navigating super-human reasoning for web agent , author=. arXiv preprint arXiv:2507.02592 , year=

  11. [11]

    arXiv preprint arXiv:2508.13167 , year=

    Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl , author=. arXiv preprint arXiv:2508.13167 , year=

  12. [12]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Large language models are better reasoners with self-verification , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  13. [13]

    Incentivizing llms to self-verify their answers.CoRR, abs/2506.01369, 2025

    Incentivizing LLMs to Self-Verify Their Answers , author=. arXiv preprint arXiv:2506.01369 , year=

  14. [14]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    S2r: Teaching llms to self-verify and self-correct via reinforcement learning , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  15. [15]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  17. [17]

    Group Sequence Policy Optimization

    Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

  18. [18]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Minimax-m1: Scaling test-time compute efficiently with lightning attention , author=. arXiv preprint arXiv:2506.13585 , year=

  19. [19]

    arXiv preprint arXiv:2512.01374 , year=

    Stabilizing reinforcement learning with llms: Formulation and practices , author=. arXiv preprint arXiv:2512.01374 , year=

  20. [20]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.13445, 2025

    Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards , author=. arXiv preprint arXiv:2505.13445 , year=

  21. [21]

    arXiv preprint arXiv:2602.07594 , year=

    Learning to Self-Verify Makes Language Models Better Reasoners , author=. arXiv preprint arXiv:2602.07594 , year=

  22. [22]

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback , author=. arXiv preprint arXiv:2506.03106 , year=

  23. [23]

    arXiv preprint arXiv:2501.05727 , year=

    Self-evolving critique abilities in large language models , author=. arXiv preprint arXiv:2501.05727 , year=

  24. [24]

    2021 , url =

    Mohit Shridhar and Xingdi Yuan and Marc-Alexandre C\^ot\'e and Yonatan Bisk and Adam Trischler and Matthew Hausknecht , booktitle =. 2021 , url =

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Webshop: Towards scalable real-world web interaction with grounded language agents , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    H otpot QA : A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang and Peng Qi and Saizheng Zhang and Yoshua Bengio and William W. Cohen and Ruslan Salakhutdinov and Christopher D. Manning , editor =. HotpotQA:. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018 , pages =. 2018 , url =. doi:10.18653/V1/D18-1259 , timestamp =

  27. [27]

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Xanh Ho and Anh. Constructing. Proceedings of the 28th International Conference on Computational Linguistics,. 2020 , url =. doi:10.18653/V1/2020.COLING-MAIN.580 , timestamp =

  28. [28]

    Harsh Trivedi and Niranjan Balasubramanian and Tushar Khot and Ashish Sabharwal , title =. Trans. Assoc. Comput. Linguistics , volume =. 2022 , url =. doi:10.1162/TACL\_A\_00475 , timestamp =

  29. [29]

    and Lewis, Mike , editor =

    Ofir Press and Muru Zhang and Sewon Min and Ludwig Schmidt and Noah A. Smith and Mike Lewis , editor =. Measuring and Narrowing the Compositionality Gap in Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.FINDINGS-EMNLP.378 , timestamp =

  30. [30]

    Stronger-mas: Multi-agent reinforcement learning for collaborative llms.arXiv preprint arXiv:2510.11062, 2025

    Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs , author=. arXiv preprint arXiv:2510.11062 , year=

  31. [31]

    arXiv preprint arXiv:2510.04678 , year=

    Multi-Agent Tool-Integrated Policy Optimization , author=. arXiv preprint arXiv:2510.04678 , year=

  32. [32]

    Wideseek-r1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning.arXiv preprint arXiv:2602.04634, 2026

    WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning , author=. arXiv preprint arXiv:2602.04634 , year=

  33. [33]

    MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems , author=

    Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems , author=. arXiv preprint arXiv:2602.08847 , year=

  34. [34]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Agentgym: Evaluating and training large language model-based agents across diverse environments , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  35. [35]

    Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

    Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning , author=. arXiv preprint arXiv:2509.08755 , year=

  36. [36]

    2021 , eprint=

    Measuring Mathematical Problem Solving With the MATH Dataset , author=. 2021 , eprint=

  37. [37]

    2022 , eprint=

    Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=

  38. [38]

    2024 , address =

    He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and Liu, Jie and Qi, Lei and Liu, Zhiyuan and Sun, Maosong. O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Proceedings of the ...

  39. [39]

    GitHub repository , howpublished =

    Jia Li and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu , title =. GitHub repository , howpublished =. 2024 , publisher =

  40. [40]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  41. [41]

    2025 , eprint=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

  42. [42]

    2026 , howpublished =

  43. [43]

    AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    Areal: A large-scale asynchronous reinforcement learning system for language reasoning , author=. arXiv preprint arXiv:2505.24298 , year=

  44. [44]

    2025 , eprint=

    Learning to Reason under Off-Policy Guidance , author=. 2025 , eprint=

  45. [45]

    Ceva’s theorem

    Bread: Branched rollouts from expert anchors bridge sft & rl for reasoning , author=. arXiv preprint arXiv:2506.17211 , year=