Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
Pith reviewed 2026-05-22 06:22 UTC · model grok-4.3
The pith
A single loop of writing, retrieving, curating and retiring lets a frozen LLM build an effective skill library for self-evolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a frozen LLM agent can autonomously manage its natural-language skills through a loop that combines outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100, this raises held-out pass@1 from a 0.258 baseline to a late-window rolling mean of 0.584 (peak 0.658), while the no-skill control stays essentially flat (+0.002 ± 0.005). The gains transfer to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Ablations indicate that retirement and meta-skill guidance are the load-bearing elements and that explicit deduplication is subsumed by the meta-skill; a non-divergence proposition argues that the bounded cap and retirement threshold together keep expected performance from drifting below the no-skills floor.
What carries the argument
The Ratchet single-agent loop that uses outcome-driven retirement and meta-skill authoring guidance to curate and retire natural-language skills while maintaining a bounded active set.
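As a rough illustration of how a bounded active set and outcome-driven retirement can interact, here is a minimal sketch. It is not the paper's implementation: the class and method names (`Skill`, `SkillLibrary`, `write`, `retrieve`, `record`, `curate`), the win-rate scoring, the eviction rule, and all default values are assumptions made for this example. Retrieval here simply ranks by win rate, standing in for whatever retrieval the paper uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Skill:
    # Hypothetical minimal record; the paper's schema has more fields.
    id: str
    text: str
    uses: int = 0
    wins: int = 0
    retired: bool = False

    @property
    def win_rate(self) -> float:
        return self.wins / self.uses if self.uses else 0.0


class SkillLibrary:
    def __init__(self, active_cap: int = 3, retire_below: float = 0.3, min_uses: int = 4):
        self.active_cap = active_cap      # assumed bound on the active set
        self.retire_below = retire_below  # assumed retirement threshold
        self.min_uses = min_uses          # evidence required before retiring
        self.skills: Dict[str, Skill] = {}

    def active(self) -> List[Skill]:
        return [s for s in self.skills.values() if not s.retired]

    def write(self, skill: Skill) -> None:
        # Keep the active set bounded: evict the weakest active skill first.
        current = self.active()
        if len(current) >= self.active_cap:
            min(current, key=lambda s: s.win_rate).retired = True
        self.skills[skill.id] = skill

    def retrieve(self, k: int = 2) -> List[Skill]:
        # Stand-in for real retrieval: top-k active skills by win rate.
        return sorted(self.active(), key=lambda s: s.win_rate, reverse=True)[:k]

    def record(self, skill_id: str, passed: bool) -> None:
        s = self.skills[skill_id]
        s.uses += 1
        s.wins += int(passed)

    def curate(self) -> List[str]:
        # Outcome-driven retirement: drop skills with enough evidence
        # and a win rate under the threshold.
        retired = []
        for s in self.active():
            if s.uses >= self.min_uses and s.win_rate < self.retire_below:
                s.retired = True
                retired.append(s.id)
        return retired
```

In this sketch the cap is enforced at write time and retirement at curation time, so the active set can never exceed `active_cap` regardless of how many skills the agent authors.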
Load-bearing premise
The non-divergence claim relies on the bounded active-cap and the outcome-driven retirement threshold continuing to prevent performance from drifting below baseline across task distributions.
What would settle it
Running Ratchet for many more rounds and checking whether its rolling-mean performance stays above or falls below the no-skill control would test the non-divergence proposition.
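A check of this kind could be sketched as follows. The function names are hypothetical, the trailing-window definition of the rolling mean is an assumption, and the default window of 20 rounds is taken from the simulated rebuttal further down this page rather than from the paper itself.

```python
from statistics import mean
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing rolling mean; entry i averages the last `window` values up to i."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(mean(values[lo:i + 1]))
    return out


def late_window_gap(ratchet: Sequence[float], control: Sequence[float], window: int = 20) -> float:
    """Final rolling mean of the Ratchet run minus that of the no-skill control."""
    return rolling_mean(ratchet, window)[-1] - rolling_mean(control, window)[-1]


def stays_above_control(ratchet: Sequence[float], control: Sequence[float], window: int = 20) -> bool:
    """True if the Ratchet rolling mean never drops below the control's."""
    r = rolling_mean(ratchet, window)
    c = rolling_mean(control, window)
    return all(a >= b for a, b in zip(r, c))
```

Here `ratchet` and `control` are per-round pass@1 series for a single seed; averaging across seeds, and how to treat early rounds where the window is not yet full, are left open.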
Original abstract
Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver +0.0pp over no-skill baselines while human-curated ones deliver +16.2pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where the no-skill control drifts at +0.002 ± 0.005; the same recipe transfers to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1–A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ratchet, a single-agent loop for self-evolving LLM skill libraries that integrates outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. It reports large held-out pass@1 gains on MBPP+ hard-100 (baseline 0.258 ± 0.047 to late-window rolling mean 0.584, peak 0.658 ± 0.042) over 100 rounds and 3 seeds with Claude Opus 4.7, a +0.328 ± 0.018 rolling-mean lift versus near-zero drift in the no-skill control, with transfer to an agentic solver on SWE-bench Verified (+0.22 peak). Eight ablations identify retirement and the meta-skill prior as load-bearing; a non-divergence proposition asserts that the bounded cap plus retirement threshold keeps expected performance from falling below the no-skills floor.
Significance. If the central claims hold, the work is significant for scalable self-improving agents because it isolates a minimal hygiene recipe that delivers concrete gains without weight updates and supplies an explicit non-divergence argument. Credit is due for the reproducible protocol (3 seeds, error bars, eight ablations A1–A8) and for showing that explicit deduplication is subsumed by the meta-skill prior.
major comments (2)
- [§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.
- [§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.
minor comments (2)
- [Table 1] Table 1 (ablation summary) would benefit from an explicit column indicating which components remain active in each A1–A8 condition.
- [§4] Notation for pass@1 and the precise definition of 'held-out' tasks should be stated once in the experimental setup rather than assumed from prior work.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we plan to make to improve clarity and strengthen the claims.
- Referee: [§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.
  Authors: We agree that the manuscript does not explicitly state the window size, the definition of the late-window, or the rule used to select the peak value. This omission hinders reproducibility. In the revised manuscript, we will update §5.2 to specify that the rolling mean uses a window size of 20 rounds, the late-window consists of rounds 81 to 100, and the peak is the maximum pass@1 achieved in any single round within the late-window across the three seeds. We will also include the exact computation in the code release to ensure the reported +0.328 ± 0.018 gain can be exactly reproduced.
  Revision: yes
- Referee: [§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.
  Authors: The non-divergence proposition is presented as a heuristic argument based on the interaction between the bounded active-cap and the retirement mechanism, rather than a formal theorem with analytic bounds. We acknowledge that the empirical support is limited to the two coding benchmarks reported. To address this, we will add a sensitivity analysis varying the retirement threshold (e.g., 0.2, 0.4, 0.6) and show that performance remains above the no-skills baseline on MBPP+ in an appendix. We will also explicitly discuss the proposition's scope as applying to the evaluated domains and note the lack of cross-domain validation as a limitation. A full analytic bound for arbitrary distributions is beyond the current scope but we believe the mechanism provides a practical safeguard.
  Revision: partial
- An analytic bound proving that the non-divergence holds for arbitrary task distributions beyond the evaluated coding benchmarks.
Circularity Check
Derivation chain is self-contained with no circular reductions
The paper's central results consist of empirical performance measurements on held-out benchmarks (MBPP+ hard-100 and SWE-bench Verified) across multiple rounds and seeds, supported by ablations (A1-A8) that isolate the contributions of retirement and meta-skill authoring. The non-divergence proposition is an independent grounding argument showing that the bounded cap and retirement threshold prevent performance drift below the no-skills baseline, without reducing to a redefinition or fit of the measured outcomes. No load-bearing step relies on self-citation, fitted inputs renamed as predictions, or ansatzes smuggled via prior work. The evaluation uses standard pass@1 metrics on coding tasks, keeping the claims falsifiable and externally verifiable.
Axiom & Free-Parameter Ledger
free parameters (2)
- retirement threshold
- active-cap size
axioms (2)
- domain assumption: An LLM can reliably evaluate the downstream utility of its own previously authored skills on held-out tasks.
- domain assumption: Meta-skill authoring guidance produces higher-quality skills than unguided generation.
Reference graph
Works this paper leans on
- [1] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [2] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.
- [3] Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Experience compression spectrum: Unifying memory, skills, and rules in LLM agents. arXiv preprint arXiv:2604.15877, 2026.
- [4] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.
- [5] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- [6] Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Do agent rules shape or distort? Guardrails beat guidance in coding agents. arXiv preprint arXiv:2604.11088, 2026.
- [7] Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019. Blog post, March 13, 2019.
- [8] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
- [9] Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, and Xiaofei He. AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems, volume 37, 2024.
- [10] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. International Conference on Machine Learning, 1999.
- [11] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2023.
- [12] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2023.
- [13] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
- [14] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [15] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024.
- [16] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.
- [17] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
- [18] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.
- [19] Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Parallel inductive skill distillation for LLM agents. arXiv preprint arXiv:2603.25158, 2026.
- [20] Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. CASCADE: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025.
- [21] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.
- [22] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
- [23] Qiliang Liang, Hansi Wang, Zhong Liang, and Yang Liu. From skill text to skill structure: The scheduling-structural-logical representation for agent skills. arXiv preprint arXiv:2604.24026, 2026.
- [24] Junjie Wang, Yiming Ren, and Haoyang Zhang. From procedural skills to strategy genes: Towards experience-driven test-time evolution. arXiv preprint arXiv:2604.15097, 2026.
- [25] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.
- [26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- [28] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [29] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [30] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2024.
A Hyperparameters and runtime configuration
Each ablation (A1–A8) overrides exactly the knob named after it and holds ev...
- [31] For every task in the 378-task split, run Claude Opus 4.7 under a no-skill baseline across 5 probe seeds (independent of the 3 experiment seeds), recording each seed’s pass/fail.
- [32] Discard tasks the baseline solves on all 5 seeds (∼273 tasks), since a skill library cannot possibly improve on an already-saturated task. Retain the remainder (those that fail at least once); these are the tasks where a skill library could plausibly help.
- [33] Randomly sample 100 tasks from the retained pool (fixed random seed for reproducibility); split 60/40 into train and eval subsets. The resulting subset is 100 tasks (60 train, 40 eval) and is consumed verbatim by every run in this paper. Reporting on a fixed hard subset lets round-0 capsules be directly comparable across conditions; reporting on the full ...
- [34] Run the Claude Code agent (no skills) on all 500 tasks across 5 probe seeds (independent of the 3 experiment seeds); discard tasks solved on every seed.
- [35] to avoid pitfall P on tasks where X, do Y and verify Z
From the retained pool, sample 150 tasks stratified by repository (10 repos) and difficulty, using a fixed random seed; split 90/60 into train and eval subsets.
C Skill and meta-skill schemas
A skill is a dataclass with the following fields:
  id: str  # snake_case, LLM-proposed
  name: str  # short human label
  version: str  # incremented on resynth
  intent: str ...
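The MBPP+ hard-subset construction described in items [31] to [33] can be sketched roughly as below. This is an illustrative reading, not the authors' script: the function name `build_hard_subset`, the `probe_results` layout (task id mapped to per-seed pass/fail), and the seed value are all assumptions. The stratified sampling used for SWE-bench Verified is not covered.

```python
import random
from typing import Dict, List, Tuple


def build_hard_subset(
    probe_results: Dict[str, List[bool]],
    n_sample: int,
    n_train: int,
    seed: int = 0,  # hypothetical; the source only says the seed is fixed
) -> Tuple[List[str], List[str]]:
    """Drop saturated tasks, sample a fixed subset, and split train/eval."""
    # Retain tasks the no-skill baseline fails on at least one probe seed.
    retained = sorted(t for t, runs in probe_results.items() if not all(runs))
    if len(retained) < n_sample:
        raise ValueError("retained pool is smaller than the requested sample")
    # A dedicated RNG with a fixed seed keeps the subset reproducible.
    rng = random.Random(seed)
    sampled = rng.sample(retained, n_sample)
    return sampled[:n_train], sampled[n_train:]
```

With `n_sample=100` and `n_train=60` this yields the 60/40 split described above; sorting the retained ids before sampling makes the result independent of dictionary ordering.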