pith. machine review for the scientific record.

arxiv: 2605.14539 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords Reinforcement Learning with Verifiable Rewards · Policy Optimization · Large Language Models · Reasoning · Error Correction · Failed Trajectories

The pith

CIPO turns failed LLM trajectories into correction signals to boost reasoning over standard RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Correction-Oriented Policy Optimization (CIPO) as an extension to Reinforcement Learning with Verifiable Rewards for large language models. Standard RLVR struggles with sparse binary rewards and weak credit assignment, leaving information in failed trajectories unused. CIPO derives correction samples directly from the model's own on-policy failures and optimizes them jointly with the usual RLVR objective. This produces measurable gains in both final reasoning accuracy and the model's ability to fix its own mistakes. Experiments across eleven math and code benchmarks show consistent outperformance plus stronger pass@K scaling, pointing to genuine capacity improvement rather than simple probability reallocation.

Core claim

CIPO converts on-policy failed trajectories into correction-oriented supervision without external signals. By jointly optimizing these correction samples together with the standard RLVR objective, the method improves learning effectiveness and explicitly strengthens the model's capacity to correct its own errors.

What carries the argument

The CIPO objective, which augments the standard RLVR loss with joint optimization of correction samples derived from the model's own failed attempts.
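How the two terms are combined is not spelled out in the material above (the referee's first minor comment asks for the exact weighting), so the following is a minimal sketch under assumptions: a REINFORCE-style RLVR term plus a weighted term over correction samples. The name `cipo_joint_loss`, the `corr_weight` knob, and the plain policy-gradient form are illustrative placeholders, not the paper's actual objective.

```python
import torch

def cipo_joint_loss(logprobs, advantages,
                    corr_logprobs, corr_advantages,
                    corr_weight: float = 0.5) -> torch.Tensor:
    """Assumed form of a CIPO-style joint objective (not the paper's exact loss).

    logprobs / advantages           : values from standard RLVR rollouts.
    corr_logprobs / corr_advantages : the same quantities for correction samples
                                      built from the model's own failed attempts.
    corr_weight                     : hypothetical weighting between the two terms.
    """
    # Standard RLVR policy-gradient term over the original rollouts.
    rlvr_term = -(advantages.detach() * logprobs).mean()
    # Added term over the correction samples, optimized in the same update.
    correction_term = -(corr_advantages.detach() * corr_logprobs).mean()
    return rlvr_term + corr_weight * correction_term
```

The only element carried over from the core claim is that both terms are optimized jointly in one update; everything else here is a stand-in.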

If this is right

  • CIPO yields consistent outperformance over strong baselines in both reasoning accuracy and correction performance.
  • Stronger pass@K gains indicate improved intrinsic reasoning capacity rather than redistribution of probability mass (the standard pass@K estimator is sketched after this list).
  • The approach works across mathematical reasoning and code generation without external signals.
  • Failed trajectories become a usable source of supervision instead of being discarded.
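The pass@K claim is easier to read with the metric made concrete. The sketch below uses the standard unbiased pass@k estimator from Chen et al. (2021); the paper's exact evaluation protocol is not restated above, and `pass_at_k` is just an illustrative helper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 8 generations per problem and 3 of them correct,
# pass_at_k(8, 3, 4) is roughly 0.93, while pass@1 is 3/8 = 0.375.
```

A method that merely reallocates probability mass toward answers the base model could already find is often read as raising pass@1 while leaving pass@K at large K roughly flat; the review's point is that CIPO reportedly lifts both.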

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same failure-to-correction pattern could extend to non-reasoning RLVR settings such as planning or tool-use tasks.
  • Adjusting the ratio of correction samples to standard samples might further tune training efficiency.
  • Combining CIPO with other failure-handling techniques could compound the gains on harder benchmarks.

Load-bearing premise

That correction samples derived from on-policy failed trajectories supply net-positive supervision without introducing harmful noise or distribution shift that degrades overall policy performance.

What would settle it

If training with the added correction term produces no accuracy gain over plain RLVR, or even lower performance, on the same eleven benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.14539 by Boxi Cao, Hongyu Lin, Jie Lou, Le Sun, Mengjie Ren, Xianpei Han, Xing Yu, Xueru Wen, Yaojie Lu.

Figure 1: Comparison of how standard RLVR and CIPO …

Figure 2: The overall framework of CIPO. First, we generate rollouts for the curated data via the policy model and verify their correctness. Subsequently, we construct replayed samples using a template governed by an adaptive mechanism, which dynamically adjusts the ratio of successful to failed rollouts in the replay. We then generate and verify rollouts for this replayed data. Finally, we perform RL on the rollout…

Figure 3: Pass@8 training dynamics on LiveCodeBench v6.
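The Figure 2 caption describes the replay pipeline only at a high level, so the following is a minimal sketch, under assumptions, of how correction prompts might be built from verified rollouts. `policy_generate`, `verify`, the fixed `fail_ratio`, and the prompt wording are placeholders: the adaptive success/failure ratio mechanism and the actual template are not specified in the material above.

```python
import random

def build_correction_prompts(prompts, policy_generate, verify,
                             n_rollouts: int = 8, fail_ratio: float = 0.5):
    """Hypothetical replay construction loosely following the Figure 2 description.

    policy_generate(prompt) -> str and verify(prompt, answer) -> bool are
    placeholder callables; fail_ratio stands in for the paper's adaptive
    mechanism that tunes the success/failure mix in the replay.
    """
    correction_prompts = []
    for prompt in prompts:
        # 1. Roll out the policy on the curated prompt and verify each attempt.
        rollouts = [policy_generate(prompt) for _ in range(n_rollouts)]
        verdicts = [verify(prompt, r) for r in rollouts]
        fails = [r for r, ok in zip(rollouts, verdicts) if not ok]
        succs = [r for r, ok in zip(rollouts, verdicts) if ok]
        if not fails:
            continue  # nothing to correct for this prompt

        # 2. Replay a mix of failed (and at most one successful) attempts.
        k_fail = max(1, round(fail_ratio * len(fails)))
        picked = random.sample(fails, k_fail) + succs[:1]

        # 3. Wrap the question and the replayed attempts in a
        #    correction-oriented template (assumed wording).
        attempts = "\n\n".join(f"Attempt {i + 1}:\n{r}" for i, r in enumerate(picked))
        correction_prompts.append(
            f"{prompt}\n\nPrevious attempts:\n{attempts}\n\n"
            "Identify the errors in the attempts above and write a corrected solution."
        )
    # These prompts would then be rolled out, verified, and optimized jointly
    # with the standard RLVR samples (the last two stages in the Figure 2 caption).
    return correction_prompts
```

Only the four stages named in the caption (rollout and verify, construct replayed samples with an adjustable success/failure mix, rollout and verify the replays, then RL) are taken from the source; everything concrete in the code is assumed.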
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Correction-Oriented Policy Optimization (CIPO) as a simple extension to RLVR. It converts on-policy failed trajectories into correction-oriented supervision and jointly optimizes them with the standard RLVR objective to better utilize information from failures and improve error correction. Experiments on 11 benchmarks for math reasoning and code generation show consistent outperformance over strong baselines in reasoning and correction tasks, along with stronger pass@K gains that the authors interpret as evidence of improved intrinsic reasoning capacity.

Significance. Should the central claims hold after addressing the data-volume confound, CIPO offers a practical way to enhance RLVR by leveraging failed attempts without external signals. This could contribute to more efficient training of reasoning models by turning sparse rewards into denser correction signals. The focus on verifiable rewards and on-policy corrections is timely given the field's interest in self-improvement techniques for LLMs.

major comments (2)
  1. [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.
  2. [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.
minor comments (2)
  1. Provide more details on the exact weighting between the RLVR loss and the correction loss in the joint objective, and how the correction samples are formatted and filtered.
  2. [Experiments] Include error bars or statistical significance tests for the reported improvements across the 11 benchmarks to strengthen the 'consistently and significantly outperforms' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to include the requested ablations, which strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Abstract] The interpretation that stronger pass@K gains indicate improved intrinsic reasoning capacity (rather than redistribution of probability mass) is not fully supported without controlling for data volume. The joint optimization increases the amount of supervision; an ablation matching the total number of training examples between CIPO and the RLVR baseline is needed to isolate the effect of the correction objective.

    Authors: We agree that data volume is a potential confound and that the original interpretation in the abstract required additional controls. In the revised manuscript we add an ablation that trains the RLVR baseline on an equal total number of examples by duplicating successful on-policy trajectories. Under this matched-volume setting CIPO still yields higher pass@K, which we now report in the experiments section and reflect in a tempered statement in the abstract. We have also added a brief discussion of this control in the method and results. revision: yes

  2. Referee: [Method] The assumption that correction samples derived from failed trajectories supply net-positive supervision without harmful noise or distribution shift is central but untested in isolation. The manuscript should provide an experiment comparing the correction samples to an equivalent volume of additional on-policy successful trajectories or neutral data to rule out simple data-augmentation effects.

    Authors: We acknowledge that isolating the benefit of correction samples from generic data augmentation is important. The revised manuscript now includes a controlled experiment that compares (i) CIPO’s correction samples against (ii) an equal volume of additional successful trajectories (oversampled on-policy correct answers) and (iii) neutral data (random unrelated prompts). The correction samples produce statistically significant gains over both alternatives, supporting that they supply net-positive supervision beyond volume or augmentation effects. These results are presented in a new subsection of the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents CIPO as a joint optimization extension to standard RLVR, converting on-policy failed trajectories into correction samples without any fitted parameters, self-referential predictions, or uniqueness theorems that reduce the claimed gains to inputs by construction. No equations are shown that equate the correction objective to a redefinition of the RLVR baseline or that rename empirical patterns as novel derivations. The pass@K gains are interpreted as evidence of improved capacity, but this interpretation rests on experimental comparison rather than a self-definitional or fitted-input reduction. The method is additive and externally benchmarked across 11 tasks, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on the standard RLVR objective plus an unstated conversion rule for failures.

pith-pipeline@v0.9.0 · 5495 in / 967 out tokens · 32750 ms · 2026-05-15T01:38:00.295122+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
