Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Bing Zhu; Guanghui Wang; Peiyang He; Wei Qiu; Xing Zhang; Yanwei Cui; Ziyuan Li

arxiv: 2605.22148 · v1 · pith:WLIAPTDEnew · submitted 2026-05-21 · 💻 cs.AI · cs.CL

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Pith reviewed 2026-05-22 06:22 UTC · model grok-4.3

keywords: self-evolving LLM agents, skill libraries, lifecycle management, outcome-driven retirement, meta-skill guidance, bounded active cap, agent hygiene mechanisms

The pith

A single loop of writing, retrieving, curating and retiring lets a frozen LLM build an effective skill library for self-evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem for self-evolving LLM agents is not generating new skills but managing their lifecycle to avoid bloat and degradation. LLM-authored skills have shown no gain over baselines, while human ones do, pointing to hygiene as the missing piece. Ratchet provides a minimal recipe using outcome-driven retirement, a bounded active skill cap, meta-skill guidance, and canonicalisation within one agent loop. Experiments on coding benchmarks demonstrate large lifts in success rates, with ablations showing which parts are essential. A supporting proposition indicates the design prevents performance from dropping below the starting point.

Core claim

Ratchet establishes that a frozen LLM agent can autonomously manage its natural-language skills through a loop incorporating outcome-driven retirement, bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. This approach raises held-out pass@1 on MBPP+ hard-100 from 0.258 baseline to 0.584 rolling mean with peaks of 0.658, outperforming no-skill controls that show no drift. The gains transfer to an agentic solver on SWE-bench Verified. Ablations confirm that retirement and meta-skill guidance are the load-bearing elements, with deduplication subsumed by the meta-skill, and the bounded cap plus retirement ensure non-divergence from the baseline.

What carries the argument

The Ratchet single-agent loop that uses outcome-driven retirement and meta-skill authoring guidance to curate and retire natural-language skills while maintaining a bounded active set.
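As a rough illustration of how outcome-driven retirement and a bounded active cap could interact, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Skill` fields, the `curate` function, the `min_uses`, `margin`, and `cap` parameters, and the rule "retire a skill whose pass rate falls below the no-skill baseline" are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    # Hypothetical bookkeeping; the paper's real schema is not reproduced here.
    skill_id: str
    uses: int = 0       # tasks on which this skill was retrieved
    passes: int = 0     # of those tasks, how many the grader passed
    active: bool = True

    @property
    def pass_rate(self) -> float:
        return self.passes / self.uses if self.uses else 0.0


def curate(skills, baseline, min_uses=5, margin=0.0, cap=20):
    """Retire under-performers, then enforce the active cap.

    Assumed rule: a skill with at least `min_uses` observations is retired
    when its pass rate is below `baseline + margin`. If more than `cap`
    skills are still active, the lowest-ranked ones are evicted.
    Returns the ids deactivated in this call.
    """
    retired = []
    for s in skills:
        if s.active and s.uses >= min_uses and s.pass_rate < baseline + margin:
            s.active = False
            retired.append(s.skill_id)

    # Rank survivors by observed pass rate, breaking ties by evidence count.
    survivors = sorted(
        (s for s in skills if s.active),
        key=lambda s: (s.pass_rate, s.uses),
        reverse=True,
    )
    for s in survivors[cap:]:
        s.active = False
        retired.append(s.skill_id)
    return retired
```

In this sketch, untested skills rank last and are the first to be evicted when the cap is exceeded; a real system would likely give new skills a grace period, which is a design choice the sketch leaves out.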

Load-bearing premise

The non-divergence relies on the bounded active-cap and outcome-driven retirement threshold remaining effective at preventing performance drift below baseline across task distributions.

What would settle it

Running the Ratchet system for many more rounds and observing if the rolling mean performance stays above or falls below the no-skill control would test the non-divergence proposition.

Figures

Figures reproduced from arXiv: 2605.22148 by Bing Zhu, Guanghui Wang, Peiyang He, Wei Qiu, Xing Zhang, Yanwei Cui, Ziyuan Li.

**Figure 1.** The Ratchet loop. Inference (top): each task flows through Router → Solver → Grader → Capsule. Memory (middle): three append-only stores (Skill Bank, Meta-Skill, Evidence Log). Reflection (bottom): every round the Critic labels failures, the Synthesizer writes new skills from failure clusters, and the Curator retires under-performers. Solid arrows = data flow; dashed = memory reads/writes.

**Figure 2.** Held-out pass@1 by round on MBPP+ hard-100, averaged over 3 seeds (...

Original abstract

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver +0.0 pp over no-skill baselines while human-curated ones deliver +16.2 pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where the no-skill control drifts at +0.002 ± 0.005; the same recipe transfers to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1–A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Ratchet shows that retirement plus meta-skill guidance form a minimal working recipe for keeping self-evolved LLM skill libraries from degrading on coding tasks.

Ratchet's main takeaway is that two hygiene rules, outcome-driven retirement and meta-skill authoring guidance, let a frozen LLM agent accumulate useful skills without the performance collapse seen in earlier self-evolution loops. The other mechanisms the authors tried turn out to be redundant once these two are active. This is a narrower but more precise claim than the broader Voyager-style literature usually makes.

The eight ablations are the clearest part of the work; they directly test which components carry the gains and show that explicit deduplication is largely handled by the meta-skill step itself. The reported lift on MBPP+ hard-100 (0.258 baseline to 0.584 late-window mean, flat no-skill control) and the transfer to SWE-bench are concrete enough to be worth noting, with error bars and multiple seeds included. The non-divergence proposition supplies an independent argument that bounded cap plus retirement keeps expected performance above the floor. The paper is honest about what it measured and what it did not.

The limitation is scope. Both the empirical gains and the non-divergence claim rest on retirement thresholds and cap sizes tuned to coding distributions. The paper supplies no evidence that the same thresholds remain sufficient when task difficulty, skill overlap, or LLM judgment noise changes. If the retirement criterion becomes too lenient or too strict on a new domain, the active set can accumulate low-value skills and the guarantee no longer holds. The rolling-mean computation rules are also not fully detailed in the abstract, though the ablations mitigate some of that concern.

This is useful reading for anyone building long-horizon agent systems that need to maintain skill libraries without constant external curation. Researchers focused on practical lifecycle management rather than new architectures will get the most out of the ablations and the minimal recipe. It deserves a serious referee because the controls and the attempt to ground the safety claim are substantive enough to review, even if reviewers will rightly press on generalization.

Referee Report

2 major / 2 minor

Summary. The paper introduces Ratchet, a single-agent loop for self-evolving LLM skill libraries that integrates outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. It reports large held-out pass@1 gains on MBPP+ hard-100 (baseline 0.258 ± 0.047 to late-window rolling mean 0.584, peak 0.658 ± 0.042) over 100 rounds and 3 seeds with Claude Opus 4.7, a +0.328 ± 0.018 rolling-mean lift versus near-zero drift in the no-skill control, with transfer to an agentic solver on SWE-bench Verified (+0.22 peak). Eight ablations identify retirement and the meta-skill prior as load-bearing; a non-divergence proposition asserts that the bounded cap plus retirement threshold keeps expected performance from falling below the no-skills floor.

Significance. If the central claims hold, the work is significant for scalable self-improving agents because it isolates a minimal hygiene recipe that delivers concrete gains without weight updates and supplies an explicit non-divergence argument. Credit is due for the reproducible protocol (3 seeds, error bars, eight ablations A1–A8) and for showing that explicit deduplication is subsumed by the meta-skill prior.

major comments (2)

[§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.
[§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.

minor comments (2)

[Table 1] Table 1 (ablation summary) would benefit from an explicit column indicating which components remain active in each A1–A8 condition.
[§4] Notation for pass@1 and the precise definition of 'held-out' tasks should be stated once in the experimental setup rather than assumed from prior work.
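
For orientation on the second minor comment, the following is a minimal sketch of the standard unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1. Whether the paper draws one or several completions per held-out task is not stated on this page, so the per-task sample counts and the function names `pass_at_k` and `benchmark_pass_at_1` are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Average pass@1 over tasks; results is a list of (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With a single sample per task, `benchmark_pass_at_1` is simply the share of held-out tasks solved.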

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below, indicating the revisions we plan to make to improve clarity and strengthen the claims.

Referee: [§5.2] §5.2 (MBPP+ results) and the associated rolling-mean description: the exact window size, start of the 'late-window', and selection rule for the reported peak (0.658 ± 0.042) are not specified; without these rules the +0.328 ± 0.018 rolling-mean gain cannot be reproduced or distinguished from selection effects.

Authors: We agree that the manuscript does not explicitly state the window size, the definition of the late-window, or the rule used to select the peak value. This omission hinders reproducibility. In the revised manuscript, we will update §5.2 to specify that the rolling mean uses a window size of 20 rounds, the late-window consists of rounds 81 to 100, and the peak is the maximum pass@1 achieved in any single round within the late-window across the three seeds. We will also include the exact computation in the code release to ensure the reported +0.328 ± 0.018 gain can be exactly reproduced.

revision: yes

Referee: [§3] Non-divergence proposition (stated in §3 and invoked in §5): the claim that bounded active-cap plus outcome-driven retirement keeps expected performance above the no-skills floor rests on the retirement criterion remaining effective across task distributions, yet the paper supplies no analytic bound and only demonstrates the effect on two coding benchmarks (MBPP+ hard-100 and SWE-bench Verified); a cross-domain check or sensitivity analysis on retirement threshold is required to support the safety guarantee.

Authors: The non-divergence proposition is presented as a heuristic argument based on the interaction between the bounded active-cap and the retirement mechanism, rather than a formal theorem with analytic bounds. We acknowledge that the empirical support is limited to the two coding benchmarks reported. To address this, we will add a sensitivity analysis varying the retirement threshold (e.g., 0.2, 0.4, 0.6) and show that performance remains above the no-skills baseline on MBPP+ in an appendix. We will also explicitly discuss the proposition's scope as applying to the evaluated domains and note the lack of cross-domain validation as a limitation. A full analytic bound for arbitrary distributions is beyond the current scope but we believe the mechanism provides a practical safeguard.

revision: partial
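
The rolling-mean rules named in the first response (20-round window, late window = rounds 81 to 100, peak = best single round in the late window) can be sketched as follows. This is one possible reading, not the authors' released computation: it assumes each seed's curve is a list whose index equals the round number (index 0 = round 0), and it takes the peak on the seed-averaged curve. The function names are hypothetical.

```python
from statistics import mean


def rolling_mean(series, window=20):
    """Trailing mean: entry i averages the last `window` values up to i."""
    return [
        mean(series[max(0, i - window + 1): i + 1])
        for i in range(len(series))
    ]


def late_window_stats(per_seed, start=81, end=100):
    """Late-window mean and peak of the seed-averaged pass@1 curve.

    per_seed: one list per seed; index r holds held-out pass@1 at round r
    (assumption: round 0 is the initial, no-skill state).
    """
    # Average across seeds round by round.
    curve = [mean(values) for values in zip(*per_seed)]
    late = curve[start: end + 1]
    return mean(late), max(late)
```

Under the alternative reading, where the peak is the best single round of any individual seed, `max(late)` would be replaced by a maximum over the per-seed late windows; the two readings can give different numbers, which is the selection effect the referee is asking about.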

standing simulated objections not resolved

An analytic bound proving that the non-divergence holds for arbitrary task distributions beyond the evaluated coding benchmarks.

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

The paper's central results consist of empirical performance measurements on held-out benchmarks (MBPP+ hard-100 and SWE-bench Verified) across multiple rounds and seeds, supported by ablations (A1-A8) that isolate the contributions of retirement and meta-skill authoring. The non-divergence proposition is an independent grounding argument showing that the bounded cap and retirement threshold prevent performance drift below the no-skills baseline, without reducing to a redefinition or fit of the measured outcomes. No load-bearing step relies on self-citation, fitted inputs renamed as predictions, or ansatzes smuggled via prior work. The evaluation uses standard pass@1 metrics on coding tasks, keeping the claims falsifiable and externally verifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on a small number of tunable hygiene parameters and domain assumptions about LLM self-evaluation; no new physical entities are introduced.

free parameters (2)

retirement threshold: Outcome-driven retirement criterion whose exact numeric or statistical trigger is required for the non-divergence guarantee and observed gains.
active-cap size: Bounded number of simultaneously active skills whose value interacts with retirement to prevent drift.

axioms (2)

domain assumption: An LLM can reliably evaluate the downstream utility of its own previously authored skills on held-out tasks. Required for outcome-driven retirement to function without external labels.
domain assumption: Meta-skill authoring guidance produces higher-quality skills than unguided generation. Invoked to explain why the meta-skill component is load-bearing in ablations.

A Hyperparameters and runtime configuration

Each ablation (A1–A8) overrides exactly the knob named after it and holds ev...

1. For every task in the 378-task split, run Claude Opus 4.7 under a no-skill baseline across 5 probe seeds (independent of the 3 experiment seeds), recording each seed's pass/fail.
2. Discard tasks the baseline solves on all 5 seeds (∼273 tasks), since a skill library cannot possibly improve on an already-saturated task. Retain the remainder (those that fail at least once); these are the tasks where a skill library could plausibly help.
3. Randomly sample 100 tasks from the retained pool (fixed random seed for reproducibility); split 60/40 into train and eval subsets. The resulting subset is 100 tasks (60 train, 40 eval) and is consumed verbatim by every run in this paper. Reporting on a fixed hard subset lets round-0 capsules be directly comparable across conditions; reporting on the full ...

1. Run the Claude Code agent (no skills) on all 500 tasks across 5 probe seeds (independent of the 3 experiment seeds); discard tasks solved on every seed.
2. From the retained pool, sample 150 tasks stratified by repository (10 repos) and difficulty, using a fixed random seed; split 90/60 into train and eval subsets.

C Skill and meta-skill schemas

A skill is a dataclass with the following fields:

    id: str        # snake_case, LLM-proposed
    name: str      # short human label
    version: str   # incremented on resynth
    intent: str ...

... to avoid pitfall P on tasks where X, do Y and verify Z
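
The hard-subset construction described in the appendix excerpt (probe the no-skill baseline on several seeds, drop tasks solved on every seed, sample a fixed-size subset with a fixed random seed, split into train and eval) can be sketched as below. The function name `build_hard_subset`, its parameters, and the input layout are assumptions for illustration; the paper's actual sampling code and seed value are not shown on this page.

```python
import random


def build_hard_subset(probe_results, n_total=100, n_train=60, seed=0):
    """Select a fixed hard subset and split it into train and eval lists.

    probe_results: {task_id: [bool, ...]} with one pass/fail entry per
    probe seed for the no-skill baseline (hypothetical input format).
    """
    # Keep only tasks the baseline fails at least once; sort for determinism.
    retained = sorted(
        task for task, outcomes in probe_results.items() if not all(outcomes)
    )
    if len(retained) < n_total:
        raise ValueError(f"only {len(retained)} unsaturated tasks available")

    # A dedicated RNG keeps the sample reproducible for a given seed.
    rng = random.Random(seed)
    sample = rng.sample(retained, n_total)
    return sample[:n_train], sample[n_train:]
```

Because the retained pool is sorted before sampling, the same `probe_results` and `seed` always yield the same split, which is what lets every run consume the subset verbatim.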

[1] [1]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Ex- perience compression spectrum: Unifying memory, skills, and rules in LLM agents.arXiv preprint arXiv:2604.15877, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[5] [5]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Do agent rules shape or distort? guardrails beat guidance in coding agents.arXiv preprint arXiv:2604.11088, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson. html, 2019. Blog post, March 13, 2019

work page 2019

[8] [8]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, 2024. 9

work page 2024

[9] [9]

AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning

Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, and Xiaofei He. AutoManual: Generating instruction manuals by LLM agents via interactive environmental learning. InAdvances in Neural Information Processing Systems, volume 37, 2024

work page 2024

[10] [10]

Ng, Daishi Harada, and Stuart Russell

Andrew Y . Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping.International Conference on Machine Learning, 1999

work page 1999

[11] [11]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems, 36, 2023

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[12] [12]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[13] [13]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

work page 2023

[14] [14]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Le, Denny Zhou, and Xinyun Chen

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V . Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InInternational Conference on Learning Representations, 2024

work page 2024

[16] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.

[17] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.

[18] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.

[19] Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Parallel inductive skill distillation for LLM agents. arXiv preprint arXiv:2603.25158, 2026.

[20] Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. CASCADE: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025.

[21] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.

[22] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.

[23] Qiliang Liang, Hansi Wang, Zhong Liang, and Yang Liu. From skill text to skill structure: The scheduling-structural-logical representation for agent skills. arXiv preprint arXiv:2604.24026, 2026.

[24] Junjie Wang, Yiming Ren, and Haoyang Zhang. From procedural skills to strategy genes: Towards experience-driven test-time evolution. arXiv preprint arXiv:2604.15097, 2026.

[25] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.

[26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023.

[27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[28] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

[29] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[30] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.

A Hyperparameters and runtime configuration

Each ablation (A1–A8) overrides exactly the knob named after it and holds ev...

1. For every task in the 378-task split, run Claude Opus 4.7 under a no-skill baseline across 5 probe seeds (independent of the 3 experiment seeds), recording each seed’s pass/fail.

2. Discard tasks the baseline solves on all 5 seeds (∼273 tasks), since a skill library cannot improve on an already-saturated task. Retain the remainder (those that fail at least once); these are the tasks where a skill library could plausibly help.

3. Randomly sample 100 tasks from the retained pool (fixed random seed for reproducibility); split 60/40 into train and eval subsets.

The resulting subset is 100 tasks (60 train, 40 eval) and is consumed verbatim by every run in this paper. Reporting on a fixed hard subset lets round-0 capsules be directly comparable across conditions; reporting on the full ...
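The hard-subset construction described above (probe with a no-skill baseline, discard saturated tasks, sample with a fixed seed, split into train and eval) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_hard_subset`, the `probe_results` mapping, and the seed value are hypothetical, and the toy data below is synthetic.

```python
import random
from typing import Dict, List, Tuple


def build_hard_subset(
    probe_results: Dict[str, List[bool]],  # hypothetical: task id -> pass/fail per probe seed
    n_sample: int = 100,
    n_train: int = 60,
    seed: int = 0,  # assumed value; the paper only states that the seed is fixed
) -> Tuple[List[str], List[str]]:
    """Drop saturated tasks, then sample and split a fixed hard subset."""
    # Keep only tasks the no-skill baseline fails on at least one probe seed.
    retained = sorted(
        task_id for task_id, passes in probe_results.items() if not all(passes)
    )
    if len(retained) < n_sample:
        raise ValueError("retained pool is smaller than the requested sample")
    rng = random.Random(seed)  # fixed seed so every run consumes the same subset
    sampled = rng.sample(retained, n_sample)
    return sampled[:n_train], sampled[n_train:]


# Synthetic probe results: 273 saturated tasks and 105 that fail on one seed.
probe = {f"task_{i:03d}": [True] * 5 for i in range(273)}
probe.update(
    {f"task_{i:03d}": [False, True, True, True, True] for i in range(273, 378)}
)
train_tasks, eval_tasks = build_hard_subset(probe)
print(len(train_tasks), len(eval_tasks))
```

Sorting the retained pool before sampling keeps the result independent of dictionary insertion order, so the same seed always yields the same train and eval lists.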

1. Run the Claude Code agent (no skills) on all 500 tasks across 5 probe seeds (independent of the 3 experiment seeds); discard tasks solved on every seed.

2. From the retained pool, sample 150 tasks stratified by repository (10 repos) and difficulty, using a fixed random seed; split 90/60 into train and eval subsets.

C Skill and meta-skill schemas

A skill is a dataclass with the following fields:

id: str # snake_case, LLM-proposed
name: str # short human label
version: str # incremented on resynth
intent: str ...

to avoid pitfall P on tasks where X, do Y and verify Z