Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training
Pith reviewed 2026-06-26 21:05 UTC · model grok-4.3
The pith
In REINFORCE post-training for code, models improve on pass@1 and then collapse within the same campaign because of within-task over-optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In controlled multi-seed runs on Qwen-2.5-3B and 7B models, pass@1 on competitive-programming tasks follows a consistent rise-then-collapse trajectory within each 20-step REINFORCE campaign on a fixed distribution; the collapse occurs as within-task policy over-optimization and is not prevented by standard KL or EWC constraints.
What carries the argument
The argument rests on the rise-and-collapse pattern that arises from within-campaign REINFORCE updates on a fixed binary-reward distribution. Three controls address it: CARE (a between-campaign capability posterior with a transfer gate), ES (an early-stop rule that rolls the peak checkpoint forward), and GRPO (group-relative reward normalization).
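As a rough illustration of what "group-relative reward normalization" usually means, the sketch below standardizes binary rewards within one group of completions sampled for the same task. It follows the commonly described GRPO form (subtract the group mean, divide by the group standard deviation). The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example, not details confirmed by the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same task.

    Hypothetical helper: eps and the population std are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary rewards for four sampled completions: one passes, three fail.
mixed = group_relative_advantages([1, 0, 0, 0])

# If every sample fails, all advantages are zero and the group gives no signal.
all_fail = group_relative_advantages([0, 0, 0, 0])
```

With a binary grader, any group that is all-pass or all-fail contributes zero advantage, so learning signal only comes from tasks where the samples disagree.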
If this is right
- CARE nearly doubles end-of-chain pass@1 on the 3B model from 4.9% to 9.5% with gains in 4/5 seeds.
- ES lifts 7B performance to 22.2% versus 11.8% for naive REINFORCE.
- GRPO reaches 20.7% on 7B, mainly by improving between-campaign carryover, while the within-campaign peak-to-end gap stays at roughly 17 points.
- GRPO+ES produces mixed results, with one seed showing a final cliff that lowers the mean.
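The within-campaign gap mentioned in the list above can be read off a single campaign's pass@1 curve. The following is a minimal sketch under the assumption that the gap is simply peak value minus final value; the trajectory numbers are invented for illustration and are not taken from the paper.

```python
def peak_to_end_gap(pass_at_1):
    """Return (peak_index, peak_value, end_value, gap) for one campaign curve."""
    # max() with a key returns the first index that attains the maximum.
    peak_index = max(range(len(pass_at_1)), key=lambda i: pass_at_1[i])
    peak_value = pass_at_1[peak_index]
    end_value = pass_at_1[-1]
    return peak_index, peak_value, end_value, peak_value - end_value


# Invented pass@1 values (percent) at successive evaluation points.
trajectory = [4.0, 8.0, 12.0, 16.0, 10.0, 6.0, 2.0]
index, peak, end, gap = peak_to_end_gap(trajectory)
print(f"peak at index {index}: {peak:.1f}%, end: {end:.1f}%, gap: {gap:.1f} points")
```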
Where Pith is reading between the lines
- The same within-task over-optimization could appear in self-training loops outside code generation when reward signals remain stationary.
- Effective long-horizon self-improvement may require explicit combination of within-campaign early stopping and between-campaign memory rather than relying on any single control.
- The persistence of the peak-to-end gap under both REINFORCE and GRPO suggests that reward normalization alone does not address the root instability of repeated updates on a fixed task.
Load-bearing premise
The collapse is produced by within-task over-optimization on the fixed distribution rather than grader noise, implementation artifacts, or unmeasured shifts.
What would settle it
If the same models trained with an identical fixed reward function and evaluation set show no peak-to-end drop when the reward signal is replaced by a perfectly stationary oracle label, the within-task over-optimization explanation would be falsified.
Original abstract
Self-improvement can self-regress. In REINFORCE post-training for code, a model can quickly improve on its optimized metric and then collapse within the same training campaign. We study this in a controlled multi-seed testbed using Qwen-2.5-3B and Qwen-2.5-7B, trained on competitive-programming tasks with binary CodeGrader reward across 10 sequential 20-step campaigns. Across campaigns, pass@1 shows a robust rise-then-collapse pattern: it peaks within tens of gradient steps and then falls back, sometimes to near zero. This is not cross-task catastrophic forgetting, but within-task policy over-optimization on a fixed distribution; KL- and EWC-style constraints do not prevent it. We ask where the control loop should sit. We compare three levels: CARE, a between-campaign memory mechanism with a capability posterior, transfer gate, and regression-aware belief revision; ES, a within-campaign early-stop rule that rolls forward the peak checkpoint and sets the next budget to peak_step+3; and GRPO, which changes the RL update using group-relative reward normalization. The answer is regime-dependent. On Qwen-2.5-3B, where naive REINFORCE is fragile, CARE v2 nearly doubles end-of-chain pass@1 from 4.9% to 9.5%, with paired bootstrap 95% CI [+0.4,+8.9] and gains in 4/5 seeds. On Qwen-2.5-7B, CARE reaches parity with naive REINFORCE, 13.8% vs. 11.8%, while ES reaches 22.2% [14.1,28.0]. Out-of-the-box GRPO reaches 20.7% [15.7,25.1], nearly matching REINFORCE+ES. GRPO raises the floor but does not remove the cliff. Its 7B gain mainly comes from better between-campaign carryover, while the within-campaign peak-to-end gap remains about 17 points under both REINFORCE and GRPO. GRPO+ES gives mixed evidence: 2/3 seeds improve, but one final cliff lowers the mean to 17.0% [0.0,28.1]. A Gemma-3-4B pilot shows the same signature, suggesting the phenomenon is not limited to Qwen.
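The ES rule described in the abstract (roll the peak checkpoint forward and set the next budget to peak_step+3) can be sketched as follows. This is a guess at the bookkeeping only. The 1-indexed step convention, the function name, and the returned dictionary keys are assumptions for this example, and the paper may select the peak differently (for instance on a held-out split).

```python
def early_stop_plan(pass_at_1, margin=3):
    """Choose the checkpoint to roll forward and the next campaign's budget.

    pass_at_1[i] is assumed to be the pass@1 measured after step i + 1.
    margin=3 mirrors the "peak_step+3" rule quoted in the abstract.
    """
    peak_index = max(range(len(pass_at_1)), key=lambda i: pass_at_1[i])
    peak_step = peak_index + 1  # convert to 1-indexed step count
    return {"rollforward_step": peak_step, "next_budget": peak_step + margin}


# Invented curve: rises to a peak at step 3, then collapses.
plan = early_stop_plan([0.05, 0.10, 0.20, 0.15, 0.02])
print(plan)
```

Under these assumptions the next campaign would start from the step-3 checkpoint and run for 6 steps instead of the full budget.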
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that REINFORCE-based self-training on competitive-programming code tasks produces a robust within-campaign rise-then-collapse pattern in pass@1 (peaking within tens of steps then falling, sometimes to near zero) due to within-task policy over-optimization on a fixed distribution rather than cross-task forgetting; KL/EWC constraints fail to prevent it. Three proposed controls (CARE between-campaign memory, ES early-stop rule, GRPO group-relative normalization) are evaluated across Qwen-2.5-3B/7B and a Gemma pilot, with regime-dependent gains (e.g., CARE nearly doubles 3B end-of-chain performance; ES and GRPO improve 7B).
Significance. If the rise-collapse pattern is confirmed as within-task over-optimization on a stable fixed distribution, the result identifies a previously under-appreciated failure mode in LLM RL post-training and demonstrates that control-loop placement (between- vs within-campaign) matters for self-improvement stability. The multi-seed bootstrap-CI design and cross-model replication are strengths that would make the empirical pattern a useful reference point for future work on RLHF/RLAIF stability.
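For readers unfamiliar with the multi-seed bootstrap-CI design mentioned above, the sketch below shows one common way to compute a paired percentile bootstrap interval over seeds. The paper's exact procedure is not described here, so the resampling unit (seeds), the number of resamples, and the sample values are all assumptions; the numbers are invented and are not the paper's data.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (treated - baseline)."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-seed differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Invented end-of-chain pass@1 (percent) for five seeds under two conditions.
naive = [3.0, 5.0, 4.0, 6.0, 5.0]
controlled = [8.0, 9.0, 4.5, 12.0, 10.0]
point, lower, upper = paired_bootstrap_ci(naive, controlled)
print(f"mean difference {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With only a handful of seeds the interval is necessarily coarse, which is consistent with the wide intervals quoted in the abstract.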
major comments (2)
- [Abstract / experimental setup] Abstract: the central claim that collapse reflects within-task over-optimization on a fixed competitive-programming distribution (rather than grader noise, reward instability, or small implementation artifacts in the 20-step REINFORCE loop) is load-bearing for all subsequent conclusions about CARE/ES/GRPO; however, no verification of CodeGrader consistency, input-distribution stability, or reward-signal stationarity across campaigns is described, leaving the alternative explanations unruled-out.
- [Abstract] Abstract: the reported within-campaign peak-to-end gap of ~17 points under both REINFORCE and GRPO is presented as evidence that GRPO raises the floor but does not remove the cliff; this interpretation assumes the peak checkpoint is a reliable indicator of true capability rather than transient exploitation of the binary grader, which requires the same stability checks noted above.
minor comments (2)
- [Abstract] Abstract: the description of CARE v2 (capability posterior, transfer gate, regression-aware belief revision) is too terse to allow replication or comparison with the other two controls; a short methods paragraph or pseudocode would clarify the between-campaign mechanism.
- [Abstract] Abstract: the Gemma-3-4B pilot is mentioned only in passing; stating the seed count, campaign length, and whether the same rise-collapse signature appears would strengthen the cross-model claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments emphasizing the importance of ruling out alternative explanations for the observed collapse. We respond point-by-point below.
- Referee: [Abstract / experimental setup] Abstract: the central claim that collapse reflects within-task over-optimization on a fixed competitive-programming distribution (rather than grader noise, reward instability, or small implementation artifacts in the 20-step REINFORCE loop) is load-bearing for all subsequent conclusions about CARE/ES/GRPO; however, no verification of CodeGrader consistency, input-distribution stability, or reward-signal stationarity across campaigns is described, which leaves the alternative explanations open.
  Authors: We agree that explicit verification of evaluation stability would strengthen the central claim. The setup used an unchanged task distribution and CodeGrader across all campaigns, with the rise-collapse pattern replicated across seeds, model sizes, and a Gemma pilot. To address the concern directly, the revised manuscript will add a subsection reporting repeated independent evaluations of peak and terminal checkpoints to quantify grader consistency, and per-campaign reward statistics to assess signal stationarity. These will appear as new tables in Section 3.
  Revision: yes
- Referee: [Abstract] Abstract: the reported within-campaign peak-to-end gap of ~17 points under both REINFORCE and GRPO is presented as evidence that GRPO raises the floor but does not remove the cliff; this interpretation assumes the peak checkpoint is a reliable indicator of true capability rather than transient exploitation of the binary grader, which requires the same stability checks noted above.
  Authors: We acknowledge that the peak-to-end gap interpretation depends on the peak being a stable indicator rather than transient exploitation. The multi-seed bootstrap CIs already provide some protection against evaluation noise. In revision we will extend the stability checks (repeated evaluations at peak vs. end checkpoints) to explicitly compare variance at those points and confirm the gap exceeds evaluation variability. The abstract and GRPO discussion will be updated to reference these results.
  Revision: yes
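The stability check promised in the rebuttal (repeated evaluations at the peak and end checkpoints, then a comparison of the gap against evaluation variability) could look roughly like the sketch below. The decision rule (gap larger than `z` standard errors of the difference) and all numbers are assumptions chosen for illustration; the authors do not specify their test.

```python
from statistics import mean, stdev


def gap_exceeds_noise(peak_evals, end_evals, z=2.0):
    """Compare the peak-to-end gap with the spread of repeated evaluations.

    Returns (gap, standard_error, verdict). The z threshold is illustrative.
    """
    gap = mean(peak_evals) - mean(end_evals)
    # Standard error of the difference between two independent sample means.
    se = (stdev(peak_evals) ** 2 / len(peak_evals)
          + stdev(end_evals) ** 2 / len(end_evals)) ** 0.5
    return gap, se, gap > z * se


# Invented repeated pass@1 evaluations (percent) of two checkpoints.
stable = gap_exceeds_noise([20.0, 22.0, 21.0], [9.0, 10.0, 8.0])
noisy = gap_exceeds_noise([10.0, 20.0, 0.0], [9.0, 1.0, 17.0])
print(stable, noisy)
```

In the first invented case the gap is far larger than the evaluation spread; in the second it is indistinguishable from noise, which is the scenario the referee asks the authors to rule out.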
Circularity Check
No circularity: purely empirical measurements with no derivations
The paper reports direct experimental outcomes from REINFORCE post-training runs on fixed competitive-programming tasks, measuring pass@1 across 20-step campaigns for Qwen models under different controls (CARE, ES, GRPO). No equations, first-principles derivations, or predictions are presented that could reduce to fitted parameters or self-citations by construction. All claims rest on observed rise-then-collapse patterns in the reported metrics, which are independent measurements rather than quantities defined inside the paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Binary CodeGrader reward is a stable and sufficient signal for code quality on the chosen competitive-programming distribution.