Recognition: 2 theorem links · Lean Theorem
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Pith reviewed 2026-05-15 01:38 UTC · model grok-4.3
The pith
CIPO turns failed LLM trajectories into correction signals to boost reasoning over standard RLVR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CIPO converts on-policy failed trajectories into correction-oriented supervision without external signals. By jointly optimizing these correction samples together with the standard RLVR objective, the method improves learning effectiveness and explicitly strengthens the model's capacity to correct its own errors.
What carries the argument
A CIPO objective that augments the standard RLVR loss with joint optimization of correction samples derived from the model's own failed attempts.
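The review does not reproduce the paper's loss, but the joint objective it describes can be sketched as a weighted sum; the weight $\lambda$, the symbol $y^{-}$, and the form of $\mathcal{L}_{\text{corr}}$ are illustrative notation, not taken from the paper:

```latex
\mathcal{L}_{\text{CIPO}}(\theta)
  \;=\;
  \mathcal{L}_{\text{RLVR}}(\theta)
  \;+\;
  \lambda \,\mathbb{E}_{(x,\,y^{-}) \sim \pi_\theta}
  \left[\, \mathcal{L}_{\text{corr}}\!\left(\theta;\, x,\, y^{-}\right) \right]
```

where $x$ is a prompt, $y^{-}$ a failed on-policy trajectory for $x$, and $\mathcal{L}_{\text{corr}}$ a supervision term built from the model's attempted correction of $y^{-}$.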
If this is right
- CIPO yields consistent outperformance over strong baselines in both reasoning accuracy and correction performance.
- Stronger pass@K gains indicate improved intrinsic reasoning capacity rather than redistribution of probability mass.
- The approach works across mathematical reasoning and code generation without external signals.
- Failed trajectories become a usable source of supervision instead of being discarded.
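The pass@K claim in the bullets above rests on the standard unbiased estimator computed from n generations of which c are correct; a minimal sketch (the choice of n and the K-sweep for this paper are not specified here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all misses
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A policy that merely sharpens probability mass on an already-known correct answer lifts pass@1 but gains little at large K; broader coverage of new correct answers is what moves pass@K, which is why the metric is read as evidence of intrinsic capacity.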
Where Pith is reading between the lines
- The same failure-to-correction pattern could extend to non-reasoning RLVR settings such as planning or tool-use tasks.
- Adjusting the ratio of correction samples to standard samples might further tune training efficiency.
- Combining CIPO with other failure-handling techniques could compound the gains on harder benchmarks.
Load-bearing premise
That correction samples derived from on-policy failed trajectories supply net-positive supervision without introducing harmful noise or distribution shift that degrades overall policy performance.
What would settle it
If training with the added correction term produces no accuracy gain or lower performance than plain RLVR on the same 11 benchmarks, the central claim would be falsified.
Figures
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Correction-Oriented Policy Optimization (CIPO) as a simple extension to RLVR. It converts on-policy failed trajectories into correction-oriented supervision and jointly optimizes them with the standard RLVR objective to better utilize information from failures and improve error correction. Experiments on 11 benchmarks for math reasoning and code generation show consistent outperformance over strong baselines in reasoning and correction tasks, along with stronger pass@K gains that the authors interpret as evidence of improved intrinsic reasoning capacity.
Significance. Should the central claims hold after addressing the data-volume confound, CIPO offers a practical way to enhance RLVR by leveraging failed attempts without external signals. This could contribute to more efficient training of reasoning models by turning sparse rewards into denser correction signals. The focus on verifiable rewards and on-policy corrections is timely given the field's interest in self-improvement techniques for LLMs.
Major comments (2)
- [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.
- [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.
Minor comments (2)
- Provide more details on the exact weighting between the RLVR loss and the correction loss in the joint objective, and how the correction samples are formatted and filtered.
- [Experiments] Include error bars or statistical significance tests for the reported improvements across the 11 benchmarks to strengthen the 'consistently and significantly outperforms' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to include the requested ablations, which strengthen the empirical support for our claims.
Point-by-point responses
- Referee: [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.
  Authors: We agree that data volume is a potential confound and that the original interpretation in the abstract required additional controls. In the revised manuscript we add an ablation that trains the RLVR baseline on an equal total number of examples by duplicating successful on-policy trajectories. Under this matched-volume setting CIPO still yields higher pass@K, which we now report in the experiments section and reflect in a tempered statement in the abstract. We have also added a brief discussion of this control in the method and results.
  Revision: yes
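The matched-volume control described in this response can be sketched as batch construction; the function name and data shapes are illustrative, not the authors' implementation:

```python
import random

def matched_volume_baseline(successes, num_corrections, seed=0):
    """Build an RLVR baseline batch whose size matches a CIPO batch:
    pad the on-policy successes with duplicated successes, one for
    each correction sample the CIPO run would have added."""
    rng = random.Random(seed)
    padding = [rng.choice(successes) for _ in range(num_corrections)]
    return list(successes) + padding
```

Training the baseline on this padded batch equalizes total example counts, so any remaining pass@K gap is attributable to the correction objective rather than to extra data.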
- Referee: [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.
  Authors: We acknowledge that isolating the benefit of correction samples from generic data augmentation is important. The revised manuscript now includes a controlled experiment that compares (i) CIPO's correction samples against (ii) an equal volume of additional successful trajectories (oversampled on-policy correct answers) and (iii) neutral data (random unrelated prompts). The correction samples produce statistically significant gains over both alternatives, supporting that they supply net-positive supervision beyond volume or augmentation effects. These results are presented in a new subsection of the experiments.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents CIPO as a joint optimization extension to standard RLVR, converting on-policy failed trajectories into correction samples without any fitted parameters, self-referential predictions, or uniqueness theorems that reduce the claimed gains to inputs by construction. No equations are shown that equate the correction objective to a redefinition of the RLVR baseline or that rename empirical patterns as novel derivations. The pass@K gains are interpreted as evidence of improved capacity, but this interpretation rests on experimental comparison rather than a self-definitional or fitted-input reduction. The method is additive and externally benchmarked across 11 tasks, satisfying the criteria for a self-contained derivation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "CIPO transforms on-policy failed trajectories into correction-oriented supervision... jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.