
arxiv: 2606.21090 · v1 · pith:XRUI6OAWnew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training

Pith reviewed 2026-06-26 21:05 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM self-training · REINFORCE · policy over-optimization · code generation · rise-and-collapse · RL post-training · self-improvement failure · within-task regression

The pith

In REINFORCE post-training for code, a model's pass@1 can rise and then collapse within the same campaign because of within-task over-optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-improvement loops in LLMs can produce regression on the exact task being optimized. Experiments with Qwen-2.5 models on fixed competitive-programming problems using binary CodeGrader rewards show pass@1 rising sharply and then falling back, sometimes to near zero, within 20-step campaigns. This pattern holds without cross-task forgetting or distribution change and resists KL- and EWC-style regularization. Three controls placed at different levels of the training loop (between-campaign memory, within-campaign early stopping, and altered reward normalization) yield regime-dependent gains, nearly doubling final performance on the smaller model in the best case.

Core claim

In controlled multi-seed runs on Qwen-2.5-3B and 7B models, pass@1 on competitive-programming tasks follows a consistent rise-then-collapse trajectory within each 20-step REINFORCE campaign on a fixed distribution; the collapse reflects within-task policy over-optimization and is not prevented by standard KL or EWC constraints.

What carries the argument

The rise-and-collapse pattern arises from within-campaign REINFORCE updates on a fixed binary-reward distribution. Three controls address it: CARE (a between-campaign capability posterior with a transfer gate), ES (an early-stop rule that rolls the peak checkpoint forward), and GRPO (group-relative reward normalization).
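As a rough illustration of group-relative reward normalization, the Python sketch below standardizes binary rewards within one group of samples drawn for the same problem. It follows the commonly described GRPO advantage (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the choice of population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    `eps` is an assumed stabilizer so that a group with identical rewards
    yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One passing sample among four: the pass gets a positive advantage,
# the failures get equal negative advantages.
mixed = group_relative_advantages([1, 0, 0, 0])

# All samples fail: every advantage is zero, so the group contributes
# no gradient signal under this normalization.
all_fail = group_relative_advantages([0, 0, 0, 0])
```

With a binary grader, any group in which every sample passes or every sample fails carries no learning signal under this scheme, which is one reason the behavior on very hard or very easy problems can differ from plain REINFORCE.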

If this is right

  • CARE nearly doubles end-of-chain pass@1 on the 3B model from 4.9% to 9.5% with gains in 4/5 seeds.
  • ES lifts 7B performance to 22.2% versus 11.8% for naive REINFORCE.
  • GRPO reaches 20.7% on 7B mainly by improving between-campaign carryover while the within-campaign gap stays roughly 17 points.
  • GRPO+ES produces mixed results, with one seed showing a final cliff that lowers the mean.
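The peak-to-end gap and the ES budget rule referred to above can be sketched in a few lines of Python. The abstract states that ES rolls the peak checkpoint forward and sets the next budget to peak_step+3; everything else here (the function name, 1-indexed steps, first-peak tie-breaking, and the trajectory values) is a hypothetical illustration rather than the paper's code or data.

```python
def summarize_campaign(pass_at_1, margin=3):
    """Summarize one campaign's pass@1 trajectory.

    pass_at_1[i] is assumed to be the pass@1 (in percent) measured after
    step i + 1. Ties are broken in favor of the earliest peak (assumption).
    """
    if not pass_at_1:
        raise ValueError("need at least one evaluation")
    peak_index = max(range(len(pass_at_1)), key=pass_at_1.__getitem__)
    peak_step = peak_index + 1
    return {
        "peak_step": peak_step,           # checkpoint ES would roll forward
        "peak": pass_at_1[peak_index],
        "end": pass_at_1[-1],
        "peak_to_end_gap": pass_at_1[peak_index] - pass_at_1[-1],
        "next_budget": peak_step + margin,  # peak_step + 3 per the abstract
    }


# Made-up trajectory for illustration only, not data from the paper.
trajectory = [5.0, 12.0, 21.0, 24.0, 18.0, 9.0, 10.0]
summary = summarize_campaign(trajectory)
# summary["peak_step"] == 4, summary["peak_to_end_gap"] == 14.0,
# summary["next_budget"] == 7
```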

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same within-task over-optimization could appear in self-training loops outside code generation when reward signals remain stationary.
  • Effective long-horizon self-improvement may require explicit combination of within-campaign early stopping and between-campaign memory rather than relying on any single control.
  • The persistence of the peak-to-end gap under both REINFORCE and GRPO suggests that reward normalization alone does not address the root instability of repeated updates on a fixed task.

Load-bearing premise

The collapse is produced by within-task over-optimization on the fixed distribution rather than grader noise, implementation artifacts, or unmeasured shifts.

What would settle it

If the same models trained with an identical fixed reward function and evaluation set show no peak-to-end drop when the reward signal is replaced by a perfectly stationary oracle label, the within-task over-optimization explanation would be falsified.

Original abstract

Self-improvement can self-regress. In REINFORCE post-training for code, a model can quickly improve on its optimized metric and then collapse within the same training campaign. We study this in a controlled multi-seed testbed using Qwen-2.5-3B and Qwen-2.5-7B, trained on competitive-programming tasks with binary CodeGrader reward across 10 sequential 20-step campaigns. Across campaigns, pass@1 shows a robust rise-then-collapse pattern: it peaks within tens of gradient steps and then falls back, sometimes to near zero. This is not cross-task catastrophic forgetting, but within-task policy over-optimization on a fixed distribution; KL- and EWC-style constraints do not prevent it. We ask where the control loop should sit. We compare three levels: CARE, a between-campaign memory mechanism with a capability posterior, transfer gate, and regression-aware belief revision; ES, a within-campaign early-stop rule that rolls forward the peak checkpoint and sets the next budget to peak_step+3; and GRPO, which changes the RL update using group-relative reward normalization. The answer is regime-dependent. On Qwen-2.5-3B, where naive REINFORCE is fragile, CARE v2 nearly doubles end-of-chain pass@1 from 4.9% to 9.5%, with paired bootstrap 95% CI [+0.4,+8.9] and gains in 4/5 seeds. On Qwen-2.5-7B, CARE reaches parity with naive REINFORCE, 13.8% vs. 11.8%, while ES reaches 22.2% [14.1,28.0]. Out-of-the-box GRPO reaches 20.7% [15.7,25.1], nearly matching REINFORCE+ES. GRPO raises the floor but does not remove the cliff. Its 7B gain mainly comes from better between-campaign carryover, while the within-campaign peak-to-end gap remains about 17 points under both REINFORCE and GRPO. GRPO+ES gives mixed evidence: 2/3 seeds improve, but one final cliff lowers the mean to 17.0% [0.0,28.1]. A Gemma-3-4B pilot shows the same signature, suggesting the phenomenon is not limited to Qwen.
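The abstract reports paired bootstrap 95% confidence intervals over seeds. One common way to compute such an interval is the percentile bootstrap on per-seed differences, sketched below in Python. The paper's exact procedure is not described here, so the resampling scheme, the function name, the resample count, and the example numbers are all assumptions.

```python
import random


def paired_bootstrap_ci(treated, baseline, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference across seeds.

    treated[i] and baseline[i] are assumed to be end-of-chain pass@1 values
    for the same seed i under two conditions.
    """
    if len(treated) != len(baseline) or not treated:
        raise ValueError("need equally sized, non-empty paired samples")
    diffs = [t - b for t, b in zip(treated, baseline)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample the per-seed differences with replacement and record the mean.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Made-up per-seed values for illustration only, not data from the paper.
mean_diff, lower, upper = paired_bootstrap_ci(
    treated=[9.0, 12.0, 8.0, 11.0, 7.5],
    baseline=[5.0, 6.0, 4.0, 5.5, 4.0],
)
```

With only five seeds, an interval of this kind is necessarily coarse, which is consistent with the wide ranges quoted in the abstract.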

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that REINFORCE-based self-training on competitive-programming code tasks produces a robust within-campaign rise-then-collapse pattern in pass@1 (peaking within tens of steps then falling, sometimes to near zero) due to within-task policy over-optimization on a fixed distribution rather than cross-task forgetting; KL/EWC constraints fail to prevent it. Three proposed controls (CARE between-campaign memory, ES early-stop rule, GRPO group-relative normalization) are evaluated across Qwen-2.5-3B/7B and a Gemma pilot, with regime-dependent gains (e.g., CARE nearly doubles 3B end-of-chain performance; ES and GRPO improve 7B).

Significance. If the rise-collapse pattern is confirmed as within-task over-optimization on a stable fixed distribution, the result identifies a previously under-appreciated failure mode in LLM RL post-training and demonstrates that control-loop placement (between- vs within-campaign) matters for self-improvement stability. The multi-seed bootstrap-CI design and cross-model replication are strengths that would make the empirical pattern a useful reference point for future work on RLHF/RLAIF stability.

major comments (2)
  1. [Abstract / experimental setup] Abstract: the central claim that collapse reflects within-task over-optimization on a fixed competitive-programming distribution (rather than grader noise, reward instability, or small implementation artifacts in the 20-step REINFORCE loop) is load-bearing for all subsequent conclusions about CARE/ES/GRPO; however, no verification of CodeGrader consistency, input-distribution stability, or reward-signal stationarity across campaigns is described, so the alternative explanations are not ruled out.
  2. [Abstract] Abstract: the reported within-campaign peak-to-end gap of ~17 points under both REINFORCE and GRPO is presented as evidence that GRPO raises the floor but does not remove the cliff; this interpretation assumes the peak checkpoint is a reliable indicator of true capability rather than transient exploitation of the binary grader, which requires the same stability checks noted above.
minor comments (2)
  1. [Abstract] Abstract: the description of CARE v2 (capability posterior, transfer gate, regression-aware belief revision) is too terse to allow replication or comparison with the other two controls; a short methods paragraph or pseudocode would clarify the between-campaign mechanism.
  2. [Abstract] Abstract: the Gemma-3-4B pilot is mentioned only in passing; stating the seed count, campaign length, and whether the same rise-collapse signature appears would strengthen the cross-model claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the importance of ruling out alternative explanations for the observed collapse. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Abstract / experimental setup] Abstract: the central claim that collapse reflects within-task over-optimization on a fixed competitive-programming distribution (rather than grader noise, reward instability, or small implementation artifacts in the 20-step REINFORCE loop) is load-bearing for all subsequent conclusions about CARE/ES/GRPO; however, no verification of CodeGrader consistency, input-distribution stability, or reward-signal stationarity across campaigns is described, so the alternative explanations are not ruled out.

    Authors: We agree that explicit verification of evaluation stability would strengthen the central claim. The setup used an unchanged task distribution and CodeGrader across all campaigns, with the rise-collapse pattern replicated across seeds, model sizes, and a Gemma pilot. To directly address the concern, the revised manuscript will add a subsection reporting: repeated independent evaluations of peak and terminal checkpoints to quantify grader consistency, and per-campaign reward statistics to assess signal stationarity. These will appear as new tables in Section 3.

    revision: yes

  2. Referee: [Abstract] Abstract: the reported within-campaign peak-to-end gap of ~17 points under both REINFORCE and GRPO is presented as evidence that GRPO raises the floor but does not remove the cliff; this interpretation assumes the peak checkpoint is a reliable indicator of true capability rather than transient exploitation of the binary grader, which requires the same stability checks noted above.

    Authors: We acknowledge that the peak-to-end gap interpretation depends on the peak being a stable indicator rather than transient exploitation. The multi-seed bootstrap CIs already provide some protection against evaluation noise. In revision we will extend the stability checks (repeated evaluations at peak vs. end checkpoints) to explicitly compare variance at those points and confirm the gap exceeds evaluation variability. The abstract and GRPO discussion will be updated to reference these results.

    revision: yes
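The stability check proposed in this response (repeated evaluations at the peak and end checkpoints, then comparing the gap with evaluation variability) could look roughly like the Python sketch below. It is only one possible reading: the standard-error formula, the threshold `k`, the function name, and the example values are assumptions, not the authors' planned analysis.

```python
from statistics import mean, stdev


def gap_exceeds_noise(peak_evals, end_evals, k=2.0):
    """Compare the peak-to-end gap with the noise of repeated evaluations.

    peak_evals and end_evals are assumed to hold pass@1 values (percent)
    from independent re-evaluations of the same two checkpoints. Each list
    needs at least two values so that a sample standard deviation exists.
    """
    gap = mean(peak_evals) - mean(end_evals)
    # Standard error of the difference between the two means.
    noise = (
        stdev(peak_evals) ** 2 / len(peak_evals)
        + stdev(end_evals) ** 2 / len(end_evals)
    ) ** 0.5
    return gap, noise, gap > k * noise


# Made-up repeated evaluations for illustration only, not data from the paper.
gap, noise, is_real = gap_exceeds_noise(
    peak_evals=[24.0, 22.5, 25.0, 23.5],
    end_evals=[7.0, 8.5, 6.0, 7.5],
)
```

A check of this kind addresses grader noise but not the referee's second concern, whether the peak itself reflects transient exploitation of the binary grader; that would need a separate held-out evaluation.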

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations

Full rationale

The paper reports direct experimental outcomes from REINFORCE post-training runs on fixed competitive-programming tasks, measuring pass@1 across 20-step campaigns for Qwen models under different controls (CARE, ES, GRPO). No equations, first-principles derivations, or predictions are presented that could reduce to fitted parameters or self-citations by construction. All claims rest on observed rise-then-collapse patterns in the reported metrics, which are independent measurements rather than quantities defined inside the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study; it relies on standard RL assumptions and the reliability of the binary CodeGrader signal but introduces no new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption: Binary CodeGrader reward is a stable and sufficient signal for code quality on the chosen competitive-programming distribution.
    Used as the sole reward in all REINFORCE updates across campaigns.


