pith. machine review for the scientific record.

arxiv: 2605.14539 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords Reinforcement Learning with Verifiable Rewards · Policy Optimization · Large Language Models · Reasoning · Error Correction · Failed Trajectories

The pith

CIPO turns failed LLM trajectories into correction signals to boost reasoning over standard RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Correction-Oriented Policy Optimization (CIPO) as an extension to Reinforcement Learning with Verifiable Rewards for large language models. Standard RLVR struggles with sparse binary rewards and weak credit assignment, leaving information in failed trajectories unused. CIPO derives correction samples directly from the model's own on-policy failures and optimizes them jointly with the usual RLVR objective. This produces measurable gains in both final reasoning accuracy and the model's ability to fix its own mistakes. Experiments across eleven math and code benchmarks show consistent outperformance plus stronger pass@K scaling, pointing to genuine capacity improvement rather than simple probability reallocation.

Core claim

CIPO converts on-policy failed trajectories into correction-oriented supervision without external signals. By jointly optimizing these correction samples together with the standard RLVR objective, the method improves learning effectiveness and explicitly strengthens the model's capacity to correct its own errors.

What carries the argument

The CIPO objective, which augments the standard RLVR loss with joint optimization of correction samples derived from the model's own failed attempts.
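How the two terms are combined is not spelled out in the material above (the referee's first minor comment asks for the exact weighting), so the following is a minimal sketch under assumptions: a REINFORCE-style RLVR term plus a weighted term over correction samples. The name `cipo_joint_loss`, the `corr_weight` knob, and the plain policy-gradient form are illustrative placeholders, not the paper's actual objective.

```python
import torch

def cipo_joint_loss(logprobs, advantages,
                    corr_logprobs, corr_advantages,
                    corr_weight: float = 0.5) -> torch.Tensor:
    """Assumed form of a CIPO-style joint objective (not the paper's exact loss).

    logprobs / advantages           : values from standard RLVR rollouts.
    corr_logprobs / corr_advantages : the same quantities for correction samples
                                      built from the model's own failed attempts.
    corr_weight                     : hypothetical weighting between the two terms.
    """
    # Standard RLVR policy-gradient term over the original rollouts.
    rlvr_term = -(advantages.detach() * logprobs).mean()
    # Added term over the correction samples, optimized in the same update.
    correction_term = -(corr_advantages.detach() * corr_logprobs).mean()
    return rlvr_term + corr_weight * correction_term
```

The only element carried over from the core claim is that both terms are optimized jointly in one update; everything else here is a stand-in.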

If this is right

  • CIPO yields consistent outperformance over strong baselines in both reasoning accuracy and correction performance.
  • Stronger pass@K gains indicate improved intrinsic reasoning capacity rather than redistribution of probability mass (the standard pass@K estimator is sketched after this list).
  • The approach works across mathematical reasoning and code generation without external signals.
  • Failed trajectories become a usable source of supervision instead of being discarded.
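The pass@K claim is easier to read with the metric made concrete. The sketch below uses the standard unbiased pass@k estimator from Chen et al. (2021); the paper's exact evaluation protocol is not restated above, and `pass_at_k` is just an illustrative helper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 8 generations per problem and 3 of them correct,
# pass_at_k(8, 3, 4) is roughly 0.93, while pass@1 is 3/8 = 0.375.
```

A method that merely reallocates probability mass toward answers the base model could already find is often read as raising pass@1 while leaving pass@K at large K roughly flat; the review's point is that CIPO reportedly lifts both.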

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same failure-to-correction pattern could extend to non-reasoning RLVR settings such as planning or tool-use tasks.
  • Adjusting the ratio of correction samples to standard samples might further tune training efficiency.
  • Combining CIPO with other failure-handling techniques could compound the gains on harder benchmarks.

Load-bearing premise

That correction samples derived from on-policy failed trajectories supply net-positive supervision without introducing harmful noise or distribution shift that degrades overall policy performance.

What would settle it

If training with the added correction term produces no accuracy gain over plain RLVR, or even lower performance, on the same eleven benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.14539 by Boxi Cao, Hongyu Lin, Jie Lou, Le Sun, Mengjie Ren, Xianpei Han, Xing Yu, Xueru Wen, Yaojie Lu.

Figure 1: Comparison of how standard RLVR and CIPO …

Figure 2: The overall framework of CIPO. First, we generate rollouts for the curated data via the policy model and verify their correctness. Subsequently, we construct replayed samples using a template governed by an adaptive mechanism, which dynamically adjusts the ratio of successful to failed rollouts in the replay. We then generate and verify rollouts for this replayed data. Finally, we perform RL on the rollout…

Figure 3: Pass@8 training dynamics on LiveCodeBench v6.
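The Figure 2 caption describes the replay pipeline only at a high level, so the following is a minimal sketch, under assumptions, of how correction prompts might be built from verified rollouts. `policy_generate`, `verify`, the fixed `fail_ratio`, and the prompt wording are placeholders: the adaptive success/failure ratio mechanism and the actual template are not specified in the material above.

```python
import random

def build_correction_prompts(prompts, policy_generate, verify,
                             n_rollouts: int = 8, fail_ratio: float = 0.5):
    """Hypothetical replay construction loosely following the Figure 2 description.

    policy_generate(prompt) -> str and verify(prompt, answer) -> bool are
    placeholder callables; fail_ratio stands in for the paper's adaptive
    mechanism that tunes the success/failure mix in the replay.
    """
    correction_prompts = []
    for prompt in prompts:
        # 1. Roll out the policy on the curated prompt and verify each attempt.
        rollouts = [policy_generate(prompt) for _ in range(n_rollouts)]
        verdicts = [verify(prompt, r) for r in rollouts]
        fails = [r for r, ok in zip(rollouts, verdicts) if not ok]
        succs = [r for r, ok in zip(rollouts, verdicts) if ok]
        if not fails:
            continue  # nothing to correct for this prompt

        # 2. Replay a mix of failed (and at most one successful) attempts.
        k_fail = max(1, round(fail_ratio * len(fails)))
        picked = random.sample(fails, k_fail) + succs[:1]

        # 3. Wrap the question and the replayed attempts in a
        #    correction-oriented template (assumed wording).
        attempts = "\n\n".join(f"Attempt {i + 1}:\n{r}" for i, r in enumerate(picked))
        correction_prompts.append(
            f"{prompt}\n\nPrevious attempts:\n{attempts}\n\n"
            "Identify the errors in the attempts above and write a corrected solution."
        )
    # These prompts would then be rolled out, verified, and optimized jointly
    # with the standard RLVR samples (the last two stages in the Figure 2 caption).
    return correction_prompts
```

Only the four stages named in the caption (rollout and verify, construct replayed samples with an adjustable success/failure mix, rollout and verify the replays, then RL) are taken from the source; everything concrete in the code is assumed.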
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Correction-Oriented Policy Optimization (CIPO) as a simple extension to RLVR. It converts on-policy failed trajectories into correction-oriented supervision and jointly optimizes them with the standard RLVR objective to better utilize information from failures and improve error correction. Experiments on 11 benchmarks for math reasoning and code generation show consistent outperformance over strong baselines in reasoning and correction tasks, along with stronger pass@K gains that the authors interpret as evidence of improved intrinsic reasoning capacity.

Significance. Should the central claims hold after addressing the data-volume confound, CIPO offers a practical way to enhance RLVR by leveraging failed attempts without external signals. This could contribute to more efficient training of reasoning models by turning sparse rewards into denser correction signals. The focus on verifiable rewards and on-policy corrections is timely given the field's interest in self-improvement techniques for LLMs.

major comments (2)
  1. [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.
  2. [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.
minor comments (2)
  1. Provide more details on the exact weighting between the RLVR loss and the correction loss in the joint objective, and how the correction samples are formatted and filtered.
  2. [Experiments] Include error bars or statistical significance tests for the reported improvements across the 11 benchmarks to strengthen the 'consistently and significantly outperforms' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to include the requested ablations, which strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.

    Authors: We agree that data volume is a potential confound and that the original interpretation in the abstract required additional controls. In the revised manuscript we add an ablation that trains the RLVR baseline on an equal total number of examples by duplicating successful on-policy trajectories. Under this matched-volume setting CIPO still yields higher pass@K, which we now report in the experiments section and reflect in a tempered statement in the abstract. We have also added a brief discussion of this control in the method and results. revision: yes

  2. Referee: [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.

    Authors: We acknowledge that isolating the benefit of correction samples from generic data augmentation is important. The revised manuscript now includes a controlled experiment that compares (i) CIPO’s correction samples against (ii) an equal volume of additional successful trajectories (oversampled on-policy correct answers) and (iii) neutral data (random unrelated prompts). The correction samples produce statistically significant gains over both alternatives, supporting that they supply net-positive supervision beyond volume or augmentation effects. These results are presented in a new subsection of the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents CIPO as a joint optimization extension to standard RLVR, converting on-policy failed trajectories into correction samples without any fitted parameters, self-referential predictions, or uniqueness theorems that reduce the claimed gains to inputs by construction. No equations are shown that equate the correction objective to a redefinition of the RLVR baseline or that rename empirical patterns as novel derivations. The pass@K gains are interpreted as evidence of improved capacity, but this interpretation rests on experimental comparison rather than a self-definitional or fitted-input reduction. The method is additive and externally benchmarked across 11 tasks, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on the standard RLVR objective plus an unstated conversion rule for failures.

pith-pipeline@v0.9.0 · 5495 in / 967 out tokens · 32750 ms · 2026-05-15T01:38:00.295122+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
