pith. machine review for the scientific record.

arxiv: 2605.10500 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

SkillEvolver: Skill Learning as a Meta-Skill

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords skill learning · meta-skill · agent skills · online refinement · deployment failures · skill evolution · AI agents

The pith

SkillEvolver lets a meta-skill author, deploy, and refine other skills using signals from their real-world failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SkillEvolver to address the limitation of static skills in AI agents. A single meta-skill handles the full cycle of creating domain-specific skills, putting them into use, auditing for issues like overfitting, and refining them based on actual performance failures. This process targets the skill's description and code rather than retraining the model, allowing seamless integration into any compatible agent. The approach yields measurable improvements on broad task benchmarks and specialized optimization problems compared to fixed human skills or no skills at all.

Core claim

SkillEvolver establishes that skill learning can function as a meta-skill which iteratively authors, deploys, and refines domain-specific skills. Refinement occurs only after deployment, drawing the learning signal from failures experienced by other agents using the skill. A fresh-agent overfit audit prevents leakage and detects cases where the skill is bypassed at runtime, ensuring the updates address genuine issues.

What carries the argument

The meta-skill itself, loaded like any other skill, which manages the iterative process of authoring skills, deploying them, auditing them with fresh agents, and refining them from failure signals.
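To make the loop's shape concrete, here is a minimal sketch in Python. The paper describes the meta-skill only in prose; every name below (`Skill`, `run_trials`, `synthesize_revision`, `fresh_agent_audit`) is a hypothetical stand-in, and the real artifact is a skill executed by a CLI agent, not a Python library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Skill:
    """A deployable skill artifact: prose description plus code, no weights."""
    description: str
    code: str

def evolve_skill(
    skill: Skill,
    train_tasks: Sequence[str],
    run_trials: Callable,           # deploy the skill on K training trials
    synthesize_revision: Callable,  # mine traces, emit a targeted patch
    fresh_agent_audit: Callable,    # approve/reject in an independent session
    R: int = 2,
    K: int = 5,
) -> Skill:
    """Hedged sketch of the refinement loop shown in Figure 2."""
    for _ in range(R):
        # Traces are assumed to be dicts with at least a 'passed' flag.
        traces = run_trials(skill, train_tasks, k=K)
        failures = [t for t in traces if not t["passed"]]  # learning signal
        if not failures:
            break  # nothing left to fix
        candidate = synthesize_revision(skill, traces)
        # Per Figure 2, failed audits trigger another targeted patch,
        # not a rollback to the previous revision.
        while not fresh_agent_audit(candidate, train_tasks):
            candidate = synthesize_revision(candidate, traces)
        skill = candidate
    return skill
```

The design choice the paper emphasizes is that the revision step edits only the artifact's description and code, never model weights, which is what makes the result portable across any compatible agent.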

Load-bearing premise

The fresh-agent overfit audit provides a reliable way to detect problems like leakage and silent bypass without missing real issues or creating new biases in the learning signal.
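The paper gives no implementation details for this audit (the referee's first major comment below). As one plausible operationalization, an audit might at minimum check invocation and held-out performance; the threshold and the `spawn_fresh_agent` hook here are assumptions, not the authors' procedure.

```python
MIN_VAL_PASS_RATE = 0.5  # assumed threshold; the paper specifies none

def fresh_agent_audit(skill, val_tasks, spawn_fresh_agent) -> bool:
    """Hedged sketch of a fresh-agent overfit audit.

    A fresh agent, with no memory of the authoring session, runs held-out
    tasks with the candidate skill installed. The revision is rejected if
    the skill is never actually invoked (the silent-bypass mode) or if
    held-out performance collapses (a leakage/overfit symptom).
    """
    agent = spawn_fresh_agent()  # independent session, no shared context
    results = [agent.run(task, skill=skill) for task in val_tasks]
    invoked = any(r.skill_was_invoked for r in results)  # silent-bypass check
    pass_rate = sum(r.passed for r in results) / len(results)
    return invoked and pass_rate >= MIN_VAL_PASS_RATE
```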

What would settle it

A controlled test where skills refined by SkillEvolver show no improvement or even degrade when the overfit audit is removed or bypassed, revealing that the audit is necessary for the claimed gains.

Figures

Figures reproduced from arXiv: 2605.10500 by Caiyan Jia, Erle Zhu, Genrui Zhang, Hongning Wang, Jinfeng Zhou.

Figure 1
Figure 1. SkillEvolver as a portable meta-skill. SkillEvolver is a meta-skill that any skill-loading CLI agent (Claude Code, Codex, …) can load through the same interface used for any domain skill. Given a new task T = (T_train, T_val) with a held-out validation split, the CLI agent uses the meta-skill to iteratively construct, test, and update a deployment-ready domain skill v*. The learned object is itself… view at source ↗
Figure 2
Figure 2. One iteration of the SkillEvolver loop. At iterations r = 0, …, R − 1, SkillEvolver observes only T_train. Starting from the current skill v_r, the agent explores K training-time trials, analyzes success and failure traces, synthesizes a targeted revision v_{r+1}, and audits it in an independent fresh session. Approved revisions continue through the loop; failed audits trigger another targeted patch. Afte… view at source ↗
Figure 3
Figure 3. Per-category avg@5 across the SkillsBench skill-utility taxonomy. Evolver wins biggest where curated skills hurt (B3) or fail entirely (C and D categories). On the A bucket the agent already solves the task without a skill, so the pipeline is not invoked and the bar repeats the no-skill rate. Categories: A = already easy (n=20), B1/B2/B3 = curated helps/is neutral/hurts, C1/C2 = curated unlocks (strong/wea… view at source ↗
Figure 4
Figure 4. Per-task Pass@5 on the 83-task paper scope under four Opus 4.6 conditions. Rows sorted by Curated descending. No-Skill: Opus 4.6 with no skill installed. Human Curated: the SkillsBench curated skill. SkillEvolver R=1: the non-refining ablation. SkillEvolver R=2: the full Evolver loop (§3.1). view at source ↗
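The captions report avg@5 and Pass@5 without defining them in this excerpt; the conventional reading, assumed here, is the mean success rate over five independent runs of a task and whether any of the five passed.

```python
def avg_at_k(outcomes: list[bool]) -> float:
    """avg@k: mean success rate across k independent runs of one task."""
    return sum(outcomes) / len(outcomes)

def pass_at_k(outcomes: list[bool]) -> float:
    """pass@k: 1.0 if any of the k runs passed, else 0.0."""
    return float(any(outcomes))

# Example: a task that passes on 2 of 5 runs scores avg@5 = 0.4, pass@5 = 1.0.
runs = [True, False, False, True, False]
assert avg_at_k(runs) == 0.4 and pass_at_k(runs) == 1.0
```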
read the original abstract

Agent skills today are static artifacts: authored once -- by human curation or one-shot generation from parametric knowledge -- and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill's prose and code, not model weights, so that the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI-agent. Unlike trace-distillation, the meta-skill refines only after deploying the learned skill, such that the learning signal comes from failures another agent encounters while using it -- not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On 83 SkillsBench tasks spanning 15+ domains, SkillEvolver reaches 56.8% accuracy versus 43.6% for curated human skills and 29.9% for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from 1.16 to 1.51 on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SkillEvolver, a lightweight meta-skill that iteratively authors, deploys, and refines domain-specific skills (as prose and code artifacts) using post-deployment failure signals from separate agents, rather than exploratory traces. A fresh-agent overfit audit is introduced to detect leakage and silent-bypass modes. On 83 SkillsBench tasks across 15+ domains, it reports 56.8% accuracy (vs. 43.6% curated human skills and 29.9% no-skill baseline); on three KernelBench GPU kernel tasks, mean speedup rises from 1.16 to 1.51.

Significance. If the central claims hold, the work offers a plug-and-play mechanism for continuous skill improvement without weight updates, which could enable more adaptive agents. The explicit separation of the learning target (deployed skill artifacts) from model parameters and the use of external failure signals are strengths that distinguish it from trace-distillation approaches.

major comments (2)
  1. [Abstract and Experimental Evaluation] The headline performance comparison (56.8% vs. 43.6% human skills) depends on the fresh-agent overfit audit reliably preventing leakage and silent-bypass contamination of the learning signal. The abstract and experimental description provide no implementation details, pseudocode, false-negative rate analysis, or empirical validation of the audit, leaving open the possibility that reported gains arise from indirect task contamination rather than the meta-learning loop.
  2. [Experimental Evaluation] No information is given on experimental protocols, statistical significance tests, error bars, variance across runs, or exact baseline implementations (e.g., how human-curated skills were selected and deployed). This absence directly undermines verifiability of the central claim that SkillEvolver outperforms static baselines.
minor comments (1)
  1. [Abstract] The benchmarks SkillsBench and KernelBench are referenced without citations or brief descriptions of their construction and task distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the audit mechanism and experimental verifiability. We address each point below and will revise the manuscript to incorporate additional details, pseudocode, and analyses as outlined.

read point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The headline performance comparison (56.8% vs. 43.6% human skills) depends on the fresh-agent overfit audit reliably preventing leakage and silent-bypass contamination of the learning signal. The abstract and experimental description provide no implementation details, pseudocode, false-negative rate analysis, or empirical validation of the audit, leaving open the possibility that reported gains arise from indirect task contamination rather than the meta-learning loop.

    Authors: We agree that the abstract and current experimental section are concise and omit full implementation details for the overfit audit. The manuscript describes the audit's purpose in detecting leakage and silent-bypass modes via fresh-agent instances, but lacks pseudocode and validation. In revision we will add a dedicated subsection with pseudocode for the audit procedure, a false-negative rate analysis on controlled contamination tests, and an ablation showing performance when the audit is removed. These additions will directly address concerns about indirect task contamination and confirm that gains derive from the meta-learning loop. revision: yes

  2. Referee: [Experimental Evaluation] No information is given on experimental protocols, statistical significance tests, error bars, variance across runs, or exact baseline implementations (e.g., how human-curated skills were selected and deployed). This absence directly undermines verifiability of the central claim that SkillEvolver outperforms static baselines.

    Authors: We acknowledge that the current manuscript reports aggregate results without full protocol details, statistical tests, or variance measures. We will revise to include a detailed experimental protocol section specifying task selection, run counts, and deployment procedures; results with error bars and run-to-run variance; statistical significance tests (e.g., paired t-tests) against baselines; and precise descriptions of how human-curated skills were sourced, selected, and deployed. These changes will enable full verification and reproduction of the performance claims. revision: yes
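The promised tests are mechanical once per-task scores exist. A minimal sketch of a paired comparison, assuming aligned per-task score arrays; this is an editor's illustration, not the authors' analysis code:

```python
from scipy import stats

def compare_conditions(evolver_scores, baseline_scores):
    """Paired comparison of per-task scores under two conditions.

    Each array holds one score per task (e.g., avg@5 on the 83 SkillsBench
    tasks), aligned by task so the pairing removes task-difficulty variance.
    """
    t, p_t = stats.ttest_rel(evolver_scores, baseline_scores)
    # Wilcoxon signed-rank is the safer default for bounded [0, 1] scores
    # whose paired differences are unlikely to be normally distributed.
    w, p_w = stats.wilcoxon(evolver_scores, baseline_scores)
    return {"paired_t": (t, p_t), "wilcoxon": (w, p_w)}
```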

Circularity Check

0 steps flagged

No circularity: empirical refinement loop is externally grounded

full rationale

The paper's core claim is an empirical performance gain on SkillsBench and KernelBench from a meta-skill that authors, deploys, and refines prose/code artifacts solely from post-deployment failure signals produced by a separate agent. The fresh-agent overfit audit is described as an external guardrail against leakage and silent-bypass, not as a fitted parameter or self-referential equation. No equations, fitted predictions, self-citations, or ansatzes are presented that would reduce the reported accuracy or speedup numbers to the method's own inputs by construction; the learning target remains outside the meta-skill's generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that failure signals from deployed skills are a sufficient and unbiased learning signal for iterative refinement, and that the overfit audit functions as described; the abstract offers no further evidence for either.

axioms (1)
  • domain assumption Failure signals from agents using the deployed skill provide a reliable learning target for the meta-skill.
    This is invoked as the core mechanism distinguishing the method from trace-distillation.

pith-pipeline@v0.9.0 · 5573 in / 1210 out tokens · 54331 ms · 2026-05-12T04:51:17.140699+00:00 · methodology

