arxiv: 2605.22148 · v1 · pith:WLIAPTDEnew · submitted 2026-05-21 · 💻 cs.AI · cs.CL

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Pith reviewed 2026-05-22 06:22 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords self-evolving LLM agents · skill libraries · lifecycle management · outcome-driven retirement · meta-skill guidance · bounded active cap · agent hygiene mechanisms

The pith

A single loop of writing, retrieving, curating and retiring lets a frozen LLM build an effective skill library for self-evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem for self-evolving LLM agents is not generating new skills but managing their lifecycle to avoid bloat and degradation. In recent evaluation, LLM-authored skills delivered +0.0 pp over no-skill baselines while human-curated ones delivered +16.2 pp, which points to hygiene as the missing piece. Ratchet provides a minimal recipe: outcome-driven retirement, a bounded active-skill cap, meta-skill guidance, and canonicalisation, all within one agent loop. On MBPP+ hard-100, held-out pass@1 rises from 0.258 to a late-window rolling mean of 0.584, and the recipe transfers to SWE-bench Verified (+0.22 peak lift); ablations show that retirement and meta-skill guidance are the essential parts. A supporting proposition argues that the design keeps expected performance from dropping below the no-skills floor.

Core claim

Ratchet claims that a frozen LLM agent can autonomously manage its natural-language skills through a loop incorporating outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. This approach raises held-out pass@1 on MBPP+ hard-100 from a 0.258 baseline to a late-window rolling mean of 0.584, with a peak of 0.658, while the no-skill control stays essentially flat (+0.002 ± 0.005). The gains transfer to an agentic solver on SWE-bench Verified. Ablations indicate that retirement and meta-skill guidance are the load-bearing elements and that explicit deduplication is subsumed by the meta-skill. The non-divergence proposition holds that the bounded cap plus the retirement threshold keep expected performance from drifting below the no-skills floor.

What carries the argument

The Ratchet single-agent loop that uses outcome-driven retirement and meta-skill authoring guidance to curate and retire natural-language skills while maintaining a bounded active set.
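
The loop described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Skill` record, the `ratchet_round` function, and the `solve`, `grade`, `synthesize`, and `curate` callables are hypothetical names. The sketch also drops retired skills from a list, whereas Figure 1 describes append-only stores, and it hands every active skill to the solver instead of routing per task.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Skill:
    """Hypothetical skill record; the paper's real schema has more fields."""
    text: str
    uses: int = 0  # tasks attempted while this skill was active
    wins: int = 0  # of those, how many the grader passed


def ratchet_round(
    tasks: Sequence[str],
    library: List[Skill],
    solve: Callable[[str, List[Skill]], str],
    grade: Callable[[str, str], bool],
    synthesize: Callable[[List[str]], Skill],
    curate: Callable[[List[Skill]], List[Skill]],
) -> float:
    """Run one round (tasks must be non-empty) and return its pass rate."""
    passed, failures = 0, []
    for task in tasks:
        # Inference: solve with the current active skills, then grade.
        ok = grade(task, solve(task, library))
        for skill in library:  # evidence log: credit every active skill
            skill.uses += 1
            skill.wins += int(ok)
        if ok:
            passed += 1
        else:
            failures.append(task)
    # Reflection: write one new skill from this round's failures ...
    if failures:
        library.append(synthesize(failures))
    # ... then let the curator retire under-performers and enforce the cap.
    library[:] = curate(library)
    return passed / len(tasks)
```

Passing the curator in as a callable keeps the retirement rule and the cap swappable, which is the shape an ablation such as "no retirement" would need.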

Load-bearing premise

The non-divergence proposition relies on the bounded active-cap and the outcome-driven retirement threshold remaining effective at preventing performance drift below the baseline across task distributions.

What would settle it

Running the Ratchet system for many more rounds and observing if the rolling mean performance stays above or falls below the no-skill control would test the non-divergence proposition.

Figures

Figures reproduced from arXiv: 2605.22148 by Bing Zhu, Guanghui Wang, Peiyang He, Wei Qiu, Xing Zhang, Yanwei Cui, Ziyuan Li.

Figure 1: The Ratchet loop. Inference (top): each task flows through Router → Solver → Grader → Capsule. Memory (middle): three append-only stores (Skill Bank, Meta-Skill, Evidence Log). Reflection (bottom): every round the Critic labels failures, the Synthesizer writes new skills from failure clusters, and the Curator retires under-performers. Solid arrows = data flow; dashed = memory reads/writes.
Figure 2: Held-out pass@1 by round on MBPP+ hard-100, averaged over 3 seeds (…
Original abstract

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver +0.0 pp over no-skill baselines while human-curated ones deliver +16.2 pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where the no-skill control drifts at +0.002 ± 0.005; the same recipe transfers to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1–A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ratchet, a single-agent loop for self-evolving LLM skill libraries that integrates outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. It reports large held-out pass@1 gains on MBPP+ hard-100 (baseline 0.258 ± 0.047 to late-window rolling mean 0.584, peak 0.658 ± 0.042) over 100 rounds and 3 seeds with Claude Opus 4.7, a +0.328 ± 0.018 rolling-mean lift versus near-zero drift in the no-skill control, with transfer to an agentic solver on SWE-bench Verified (+0.22 peak). Eight ablations identify retirement and the meta-skill prior as load-bearing; a non-divergence proposition asserts that the bounded cap plus retirement threshold keeps expected performance from falling below the no-skills floor.

Significance. If the central claims hold, the work is significant for scalable self-improving agents because it isolates a minimal hygiene recipe that delivers concrete gains without weight updates and supplies an explicit non-divergence argument. Credit is due for the reproducible protocol (3 seeds, error bars, eight ablations A1–A8) and for showing that explicit deduplication is subsumed by the meta-skill prior.

major comments (2)
  1. [§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.
  2. [§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.
minor comments (2)
  1. [Table 1] Table 1 (ablation summary) would benefit from an explicit column indicating which components remain active in each A1–A8 condition.
  2. [§4] Notation for pass@1 and the precise definition of 'held-out' tasks should be stated once in the experimental setup rather than assumed from prior work.
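
The pass@1 notation flagged in the second minor comment is conventionally defined through the unbiased pass@k estimator from the code-generation literature. Whether the paper uses exactly this estimator, and how many samples n it draws per task, is an assumption, since the page does not say. With n samples per task, of which c pass the tests:

```latex
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\qquad
\text{pass@}1 \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{n-c}{n} \,\right]
\;=\; \mathbb{E}_{\text{tasks}}\!\left[\, \frac{c}{n} \,\right].
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over tasks, so with one sample per task it is simply the share of held-out tasks solved.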

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we plan to make to improve clarity and strengthen the claims.

Point-by-point responses
  1. Referee: [§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.

    Authors: We agree that the manuscript does not explicitly state the window size, the definition of the late-window, or the rule used to select the peak value. This omission hinders reproducibility. In the revised manuscript, we will update §5.2 to specify that the rolling mean uses a window size of 20 rounds, the late-window consists of rounds 81 to 100, and the peak is the maximum pass@1 achieved in any single round within the late-window across the three seeds. We will also include the exact computation in the code release to ensure the reported +0.328 ± 0.018 gain can be exactly reproduced. revision: yes
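
The windowing rules promised in this response can be sketched as follows. This is one reading under stated assumptions, not the authors' released computation: rounds are taken as 1-indexed, the rolling mean is a trailing window, and the peak is read as the maximum of the seed-averaged curve inside the late-window. The function names and the `per_seed` input layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def rolling_mean(values: Sequence[float], window: int = 20) -> List[float]:
    """Trailing mean: entry i averages the last `window` values up to i."""
    return [
        mean(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def late_window_summary(
    per_seed: Sequence[Sequence[float]],
    start: int = 81,
    end: int = 100,
) -> Dict[str, float]:
    """Summarise rounds start..end (1-indexed, inclusive).

    per_seed[s][r] is pass@1 for seed s in round r + 1.
    """
    n_rounds = len(per_seed[0])
    # Average the seeds round by round to get a single curve.
    seed_avg = [mean(seed[r] for seed in per_seed) for r in range(n_rounds)]
    late = seed_avg[start - 1:end]
    peak = max(late)
    return {
        "late_mean": mean(late),
        "peak": peak,
        "peak_round": start + late.index(peak),
    }
```

With a window of 20, the trailing rolling mean at round 100 covers exactly rounds 81 to 100, so it coincides with `late_mean`. A peak chosen as a maximum over rounds is still a selected statistic, which is the referee's point about selection effects.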

  2. Referee: [§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.

    Authors: The non-divergence proposition is presented as a heuristic argument based on the interaction between the bounded active-cap and the retirement mechanism, rather than a formal theorem with analytic bounds. We acknowledge that the empirical support is limited to the two coding benchmarks reported. To address this, we will add a sensitivity analysis varying the retirement threshold (e.g., 0.2, 0.4, 0.6) and show that performance remains above the no-skills baseline on MBPP+ in an appendix. We will also explicitly discuss the proposition's scope as applying to the evaluated domains and note the lack of cross-domain validation as a limitation. A full analytic bound for arbitrary distributions is beyond the current scope but we believe the mechanism provides a practical safeguard. revision: partial
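
The sensitivity analysis proposed in this response could take roughly the following shape. It is a sketch only: `threshold_sweep` and the `run` callable are hypothetical, `run(threshold, seed)` is assumed to return a late-window pass@1 for one full Ratchet run, and the default floor of 0.258 is the MBPP+ baseline mean quoted in the report.

```python
from statistics import mean
from typing import Callable, Dict, Sequence


def threshold_sweep(
    run: Callable[[float, int], float],
    thresholds: Sequence[float] = (0.2, 0.4, 0.6),
    seeds: Sequence[int] = (0, 1, 2),
    floor: float = 0.258,
) -> Dict[float, Dict[str, object]]:
    """For each retirement threshold, report the mean and worst seed and
    whether the worst seed stays at or above the no-skills floor."""
    report: Dict[float, Dict[str, object]] = {}
    for threshold in thresholds:
        scores = [run(threshold, seed) for seed in seeds]
        report[threshold] = {
            "mean": mean(scores),
            "worst": min(scores),
            "above_floor": min(scores) >= floor,
        }
    return report
```

Checking the worst seed rather than the mean is a deliberately strict choice here; the proposition as summarised concerns expected performance, so a mean-based check would be the closer match.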

standing simulated objections not resolved
  • An analytic bound proving that the non-divergence holds for arbitrary task distributions beyond the evaluated coding benchmarks.

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

Full rationale

The paper's central results consist of empirical performance measurements on held-out benchmarks (MBPP+ hard-100 and SWE-bench Verified) across multiple rounds and seeds, supported by ablations (A1-A8) that isolate the contributions of retirement and meta-skill authoring. The non-divergence proposition is an independent grounding argument showing that the bounded cap and retirement threshold prevent performance drift below the no-skills baseline, without reducing to a redefinition or fit of the measured outcomes. No load-bearing step relies on self-citation, fitted inputs renamed as predictions, or ansatzes smuggled via prior work. The evaluation uses standard pass@1 metrics on coding tasks, keeping the claims falsifiable and externally verifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on a small number of tunable hygiene parameters and domain assumptions about LLM self-evaluation; no new physical entities are introduced.

free parameters (2)
  • retirement threshold
    Outcome-driven retirement criterion whose exact numeric or statistical trigger is required for the non-divergence guarantee and observed gains.
  • active-cap size
    Bounded number of simultaneously active skills whose value interacts with retirement to prevent drift.
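
How the two free parameters interact can be sketched as a single curation step. This is a minimal sketch, not the paper's Curator: the function name, the `(uses, wins)` statistics layout, the win-rate criterion, and the extra `min_uses` evidence guard are all assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple


def curate(
    stats: Dict[str, Tuple[int, int]],
    threshold: float,
    cap: int,
    min_uses: int = 5,
) -> Tuple[List[str], List[str]]:
    """Return (active_ids, retired_ids) from per-skill (uses, wins) counts."""

    def win_rate(skill_id: str) -> float:
        uses, wins = stats[skill_id]
        return wins / uses if uses else 0.0

    # Outcome-driven retirement: enough evidence and a win rate below threshold.
    retired = [
        s for s in stats
        if stats[s][0] >= min_uses and win_rate(s) < threshold
    ]
    retired_set = set(retired)
    # Bounded active cap: keep at most `cap` surviving skills, best first.
    survivors = sorted(
        (s for s in stats if s not in retired_set),
        key=win_rate,
        reverse=True,
    )
    return survivors[:cap], retired
```

Survivors that fall outside the cap are neither active nor retired in this sketch; they simply wait until a slot frees up.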
axioms (2)
  • domain assumption: An LLM can reliably evaluate the downstream utility of its own previously authored skills on held-out tasks.
    Required for outcome-driven retirement to function without external labels.
  • domain assumption: Meta-skill authoring guidance produces higher-quality skills than unguided generation.
    Invoked to explain why the meta-skill component is load-bearing in ablations.

pith-pipeline@v0.9.0 · 5856 in / 1663 out tokens · 38823 ms · 2026-05-22T06:22:50.867976+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 14 internal anchors

  1. [1]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  2. [2]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026

  3. [3]

    Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Experience compression spectrum: Unifying memory, skills, and rules in LLM agents. arXiv preprint arXiv:2604.15877, 2026

  4. [4]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

  5. [5]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023

  6. [6]

    Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

    Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Do agent rules shape or distort? Guardrails beat guidance in coding agents. arXiv preprint arXiv:2604.11088, 2026

  7. [7]

    Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019. Blog post, March 13, 2019

  8. [8]

    ExpeL: LLM agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  9. [9]

    AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning

    Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, and Xiaofei He. AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems, volume 37, 2024

  10. [10]

    Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping

    Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. International Conference on Machine Learning, 1999

  11. [11]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2023

  12. [12]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2023

  13. [13]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  14. [14]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

  15. [15]

    Large Language Models as Optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024

  16. [16]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

  17. [17]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

  18. [18]

    EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025

  19. [19]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Parallel inductive skill distillation for LLM agents. arXiv preprint arXiv:2603.25158, 2026

  20. [20]

    CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution

    Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. CASCADE: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025

  21. [21]

    AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026

  22. [22]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026

  23. [23]

    From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    Qiliang Liang, Hansi Wang, Zhong Liang, and Yang Liu. From skill text to skill structure: The scheduling-structural-logical representation for agent skills. arXiv preprint arXiv:2604.24026, 2026

  24. [24]

    From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

    Junjie Wang, Yiming Ren, and Haoyang Zhang. From procedural skills to strategy genes: Towards experience-driven test-time evolution. arXiv preprint arXiv:2604.15097, 2026

  25. [25]

    Agent Workflow Memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

  26. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023

  27. [27]

    Overcoming Catastrophic Forgetting in Neural Networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  28. [28]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  29. [29]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  30. [30]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024

    A Hyperparameters and runtime configuration

    Each ablation (A1–A8) overrides exactly the knob named after it and holds ev...

  31. [31]

    For every task in the 378-task split, run Claude Opus 4.7 under a no-skill baseline across 5 probe seeds (independent of the 3 experiment seeds), recording each seed’s pass/fail

  32. [32]

    Discard tasks the baseline solves on all 5 seeds (∼273 tasks), since a skill library cannot possibly improve on an already-saturated task. Retain the remainder (those that fail at least once); these are the tasks where a skill library could plausibly help

  33. [33]

    Randomly sample 100 tasks from the retained pool (fixed random seed for reproducibility); split 60/40 into train and eval subsets. The resulting subset is 100 tasks (60 train, 40 eval) and is consumed verbatim by every run in this paper. Reporting on a fixed hard subset lets round-0 capsules be directly comparable across conditions; reporting on the full ...
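
    The three construction steps above (probe with the no-skill baseline, discard saturated tasks, sample and split) can be sketched in Python. This is a minimal sketch under stated assumptions: `build_hard_subset`, the `probe_results` layout, and the seed value 0 are hypothetical, since the excerpt says only that a fixed random seed is used.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def build_hard_subset(
    probe_results: Dict[Hashable, Sequence[bool]],
    sample_size: int = 100,
    train_size: int = 60,
    seed: int = 0,  # assumed value; the excerpt does not state the seed
) -> Tuple[List[Hashable], List[Hashable]]:
    """probe_results maps task id -> pass/fail for each baseline probe seed."""
    # Keep only tasks the baseline fails on at least one probe seed.
    retained = sorted(
        task for task, runs in probe_results.items() if not all(runs)
    )
    # Fixed-seed sample so every run consumes the same subset.
    sample = random.Random(seed).sample(retained, sample_size)
    return sample[:train_size], sample[train_size:]
```

    Sorting the retained ids before sampling keeps the result independent of dictionary insertion order, so the same seed always yields the same train and eval subsets.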

  34. [34]

    Run the Claude Code agent (no skills) on all 500 tasks across 5 probe seeds (independent of the 3 experiment seeds); discard tasks solved on every seed

  35. [35]

    to avoid pitfall P on tasks where X, do Y and verify Z

    From the retained pool, sample 150 tasks stratified by repository (10 repos) and difficulty, using a fixed random seed; split 90/60 into train and eval subsets.

    C Skill and meta-skill schemas

    A skill is a dataclass with the following fields:

      id: str       # snake_case, LLM-proposed
      name: str     # short human label
      version: str  # incremented on resynth
      intent: str ...