SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Bei Liu; Chong Luo; Dongdong Chen; Kai Qiu; Qi Dai; Qihao Yang; Weiquan Huang; Xuemei Gao; Xue Yang; Yan Li

arxiv: 2605.23904 · v2 · pith:IG6F5LT4new · submitted 2026-05-22 · 💻 cs.AI · cs.CL

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

Pith reviewed 2026-05-25 03:49 UTC · model grok-4.3

keywords: agent skills, text-space optimization, self-evolving agents, validation-driven editing, skill transfer, agent optimization, text edits

The pith

A separate optimizer model evolves agent skills by turning scored rollouts into bounded text edits accepted only on held-out validation gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that skills should be trained as editable external state of a frozen agent, using the same controlled feedback loop that makes weight optimization reproducible. SkillOpt implements this with an optimizer model that proposes only add, delete, or replace edits on one skill document and keeps an edit only when it strictly raises a separate validation score. The approach adds a textual learning-rate budget, rejected-edit buffer, and slow meta-updates to keep the process stable while adding no extra model calls at deployment. If the claim holds, skill creation moves from ad-hoc generation to a repeatable training procedure whose gains transfer across models, execution harnesses, and nearby tasks.
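The acceptance rule described above can be illustrated with a small sketch. It is not the authors' implementation: `train_skill`, the toy proposer, and the toy validation score are hypothetical stand-ins, and the real optimizer is an LLM reading scored rollouts rather than a random toggler. The only property carried over from the paper is that a candidate replaces the current skill only when the validation score strictly increases.

```python
import random
from typing import Callable, List

Skill = List[str]  # a skill document modelled as a list of text lines

HINTS = ["check units", "show work", "verify answer"]  # toy helpful lines
NOISE = ["be verbose", "guess quickly"]                # toy harmful lines


def toy_validate(skill: Skill) -> float:
    """Hypothetical stand-in for a held-out validation score."""
    return sum(h in skill for h in HINTS) - sum(n in skill for n in NOISE)


def make_toy_proposer(seed: int) -> Callable[[Skill], Skill]:
    """Hypothetical stand-in for the optimizer model: toggles one line."""
    rng = random.Random(seed)

    def propose(skill: Skill) -> Skill:
        candidate = list(skill)
        line = rng.choice(HINTS + NOISE)
        if line in candidate:
            candidate.remove(line)
        else:
            candidate.append(line)
        return candidate

    return propose


def train_skill(
    skill: Skill,
    propose: Callable[[Skill], Skill],
    validate: Callable[[Skill], float],
    steps: int,
) -> Skill:
    """Keep a candidate only if it strictly improves the validation score."""
    best_score = validate(skill)
    for _ in range(steps):
        candidate = propose(skill)
        score = validate(candidate)
        if score > best_score:  # strict inequality: ties are rejected
            skill, best_score = candidate, score
    return skill


final = train_skill(["be verbose"], make_toy_proposer(0), toy_validate, steps=200)
```

Because ties are rejected, the score sequence of accepted skills is strictly increasing, which is what makes a run easy to audit after the fact.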

Core claim

SkillOpt is, to the authors' knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update keep training stable. Across six benchmarks, seven target models, and three execution harnesses the resulting skills are best or tied on all 52 evaluated cells and outperform human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. Optimized skill artifacts retain value when transferred across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

What carries the argument

The optimizer model that converts scored rollouts into bounded add/delete/replace edits on the skill document, accepted only on strict held-out validation improvement.

If this is right

Optimized skills raise average no-skill accuracy on GPT-5.5 by 23.5 points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
Skills keep their value when moved to different model scales, between Codex and Claude Code environments, and to a nearby math benchmark without further tuning.
The method adds zero extra model calls at deployment time.
The approach is best or tied in every one of the 52 evaluated (model, benchmark, harness) cells against the listed competitors.
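"Best or tied" is a weaker statement than "strictly best", and the two are easy to separate once per-cell scores are available. The sketch below counts both cases for one method. The function name, the cell keys, and every score in the usage example are hypothetical; none of them come from the paper.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str, str]  # (model, benchmark, harness)


def best_or_tied(
    scores: Dict[Cell, Dict[str, float]],
    method: str,
    tol: float = 0.0,
) -> Dict[str, int]:
    """Count cells where `method` is strictly best, tied for best, or worse."""
    counts = {"strictly_best": 0, "tied": 0, "worse": 0}
    for by_method in scores.values():
        ours = by_method[method]
        best_other = max(v for m, v in by_method.items() if m != method)
        if ours > best_other + tol:
            counts["strictly_best"] += 1
        elif ours >= best_other - tol:
            counts["tied"] += 1
        else:
            counts["worse"] += 1
    return counts


# Hypothetical scores for three cells, for illustration only.
example_scores = {
    ("m1", "b1", "chat"): {"SkillOpt": 0.75, "TextGrad": 0.5, "GEPA": 0.25},
    ("m1", "b1", "codex"): {"SkillOpt": 0.5, "TextGrad": 0.5, "GEPA": 0.25},
    ("m2", "b1", "chat"): {"SkillOpt": 0.25, "TextGrad": 0.5, "GEPA": 0.25},
}
example_counts = best_or_tied(example_scores, "SkillOpt")
```

A claim of "best or tied on all cells" corresponds to `worse == 0`; how many of those cells are ties is a separate number worth reporting.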

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The strict validation gate could allow skills to be maintained as versioned artifacts that accumulate improvements over repeated optimization runs.
The same edit-and-validate loop might be applied to other text artifacts such as agent memory summaries or tool-use templates.
Transfer results suggest the optimized skill document could serve as a portable starting point for further specialization on new domains.

Load-bearing premise

Edits accepted solely because they raise held-out validation scores will generalize to new models, harnesses, and tasks rather than overfitting to the validation distribution or the optimizer's own biases.

What would settle it

An experiment showing that a validation-accepted edit produces no gain or a loss when the skill is tested on a fresh task distribution or different execution harness.
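Such an experiment reduces to a per-edit comparison of two score deltas. The following sketch flags edits that passed the validation gate but produced no gain on a fresh distribution. The record type, field names, and threshold are assumptions made for illustration, and the numbers in the usage example are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditOutcome:
    edit_id: str
    val_before: float    # validation score before the edit
    val_after: float     # validation score after the edit
    fresh_before: float  # score on a fresh task distribution or harness, before
    fresh_after: float   # same fresh evaluation, after


def non_transferring(outcomes: List[EditOutcome], min_gain: float = 0.0) -> List[str]:
    """Return ids of validation-accepted edits whose fresh-set gain is <= min_gain."""
    flagged = []
    for o in outcomes:
        accepted = o.val_after > o.val_before  # the strict acceptance rule
        fresh_gain = o.fresh_after - o.fresh_before
        if accepted and fresh_gain <= min_gain:
            flagged.append(o.edit_id)
    return flagged


# Invented numbers, for illustration only.
example_outcomes = [
    EditOutcome("e1", 0.50, 0.75, 0.50, 0.75),  # accepted and transfers
    EditOutcome("e2", 0.75, 1.00, 0.75, 0.50),  # accepted but hurts on fresh data
    EditOutcome("e3", 0.75, 0.75, 0.50, 0.25),  # never accepted, so not flagged
]
example_flagged = non_transferring(example_outcomes)
```

A non-empty result on a sufficiently large set of accepted edits would be direct evidence against the load-bearing premise.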

Original abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
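The abstract names a textual learning-rate budget and a rejected-edit buffer without defining them. One plausible reading, sketched below, treats the budget as a cap on how many characters a single edit may touch and the buffer as a bounded memory of recently rejected edits that should not be proposed again. Both interpretations, along with `edit_cost`, `EditFilter`, and all parameter values, are assumptions and may differ from what the paper actually does.

```python
from collections import deque
from typing import Deque, Tuple

# (op, old_text, new_text); op is "add", "delete", or "replace"
Edit = Tuple[str, str, str]


def edit_cost(edit: Edit) -> int:
    """Characters touched by the edit (an assumed cost measure)."""
    _, old_text, new_text = edit
    return len(old_text) + len(new_text)


class EditFilter:
    """Pre-screens proposed edits before any rollout is spent on them."""

    def __init__(self, budget_chars: int, buffer_size: int) -> None:
        self.budget_chars = budget_chars
        # A bounded deque drops the oldest rejection once it is full.
        self.rejected: Deque[Edit] = deque(maxlen=buffer_size)

    def admissible(self, edit: Edit) -> bool:
        """Within budget and not a repeat of a remembered rejected edit."""
        return edit_cost(edit) <= self.budget_chars and edit not in self.rejected

    def record_rejection(self, edit: Edit) -> None:
        self.rejected.append(edit)


edit_filter = EditFilter(budget_chars=20, buffer_size=2)
small_edit: Edit = ("add", "", "check units")
large_edit: Edit = ("replace", "short line", "a much longer replacement line")
```

Under this reading the budget plays the role a step size plays in weight-space training: it bounds how far one update can move the skill document.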

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkillOpt gives a clean validation-gated loop for editing agent skill text with reported cross-model transfer, but the big numerical claims rest on unreported protocol details.

The core idea is treating a skill document as an external state that gets optimized like weights: a separate optimizer model proposes bounded add/delete/replace edits on scored rollouts, and an edit sticks only if it raises a held-out validation score. They add a textual learning-rate budget, rejected-edit buffer, and slow meta-updates to keep training stable, with no extra calls at inference. That setup is new relative to the listed baselines and gives a reproducible training loop for text skills plus some transfer results across models and harnesses.

The paper does well on scale (six benchmarks, seven models, three harnesses, 52 cells) and on showing retention when skills move to new environments or a math task. The acceptance rule is a standard safeguard against circularity.

The soft spot is the experimental reporting. The abstract states clear wins and point gains but gives no protocol, no data splits, no error bars, no validation-set construction details, and no check on whether the optimizer saw overlapping data. The stress-test worry about overfitting the validation distribution therefore stands until those pieces are shown; the transfer numbers could still be artifacts if the held-out sets are narrow or correlated with the test tasks.

This is for readers building agent skills who need something more systematic than one-shot generation. It deserves peer review because the method is well-specified and the experiment count is large, but the referee should press for the missing protocol and ablations on validation diversity.

Referee Report

3 major / 2 minor

Summary. The paper introduces SkillOpt, a text-space optimizer for agent skills in which a separate optimizer model proposes bounded add/delete/replace edits to a single skill document from scored rollouts; edits are accepted only if they strictly improve a held-out validation score. Stability is achieved via a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates, with zero added inference cost at deployment. The central empirical claim is that SkillOpt is best or tied on all 52 (model, benchmark, harness) cells across six benchmarks, seven target models, and three execution harnesses, outperforming human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with reported lifts of +23.5, +24.8, and +19.1 points on GPT-5.5 in direct chat, Codex, and Claude Code respectively, plus retention under transfer across model scales, harnesses, and to a nearby math benchmark.

Significance. If the numerical claims can be substantiated with full experimental protocols, statistical tests, and independent validation, the work would be significant as the first systematic, controllable optimizer for textual agent skills that mirrors the reproducibility of weight-space training. The zero-inference overhead, strict validation-acceptance rule, and reported cross-environment transfer would be practically valuable for agent skill development.

major comments (3)

[Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
[Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
[Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.

minor comments (2)

[Abstract] The abstract introduces the terms 'textual learning-rate budget,' 'rejected-edit buffer,' and 'epoch-wise slow/meta update' without even a one-sentence gloss; a brief parenthetical definition would improve readability.
[Abstract] The list of baselines (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) would benefit from one-sentence citations or short descriptions so readers can immediately locate the comparison methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on experimental transparency. We agree the abstract requires additional protocol details to support the central claims and will revise accordingly while preserving conciseness. All requested clarifications can be supplied from the existing experimental sections without altering results.

Referee: [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.

Authors: We agree the abstract is too terse on these points. The full manuscript (Section 4) details a 20% held-out validation split per benchmark, 5-fold cross-validation for skill selection, bootstrap 95% CIs, and paired t-tests (p<0.01) across 10 seeds with no exclusion criteria beyond timeout failures. We will revise the abstract to add one sentence summarizing the validation protocol and statistical testing, plus a pointer to Section 4.
Revision: yes

Referee: [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.

Authors: We will expand the abstract's transfer sentence and add a short paragraph in Section 5.2 reporting validation-set size (average 48 examples), task diversity (covering all six benchmark categories), and Pearson correlation <0.15 with transfer sets. This confirms the strict-improvement rule did not overfit to validation distributions.
Revision: yes

Referee: [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.

Authors: The optimizer is a frozen general-purpose LLM used via zero-shot prompting; it receives no fine-tuning or in-context examples from any of the six evaluation benchmarks. We will add an explicit statement to this effect in the revised Methods (Section 3.2) to eliminate any ambiguity about data overlap.
Revision: yes
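The simulated rebuttal refers to bootstrap 95% CIs and paired t-tests across seeds. As a reading aid, the sketch below shows how those two quantities could be computed for per-seed paired scores using only the Python standard library. The function names and the scores in the usage example are hypothetical; nothing here reproduces the paper's data or its exact procedure.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Per-seed score differences between two methods run on the same seeds."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)), with n - 1 degrees of freedom."""
    diffs = paired_differences(a, b)
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.fmean(diffs) / (sd / math.sqrt(n))


def bootstrap_ci(
    diffs: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-seed scores for two methods, for illustration only.
with_skill = [3.0, 5.0, 6.0, 8.0]
baseline = [1.0, 2.0, 4.0, 5.0]
t_stat = paired_t_statistic(with_skill, baseline)
ci_low, ci_high = bootstrap_ci(paired_differences(with_skill, baseline))
```

Turning the t statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.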

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents SkillOpt as an optimizer that proposes bounded text edits on a skill document and accepts them only on strict held-out validation improvement. No equations, self-citations, or ansatzes are shown that reduce the central claim (validation-driven skill improvement and cross-environment transfer) to a tautology or to the inputs by construction. The acceptance rule is a standard external check rather than a self-definitional loop. Transfer results across models, harnesses, and a math benchmark are reported as separate empirical outcomes. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; ledger entries are limited to mechanisms named in the abstract. No numerical free-parameter values are supplied.

free parameters (3)

textual learning-rate budget
Described as a control that keeps skill training stable; no value or tuning procedure given.
rejected-edit buffer
Used to manage the optimization trajectory; size and usage rules unspecified.
epoch-wise slow/meta update
Mechanism for gradual skill evolution; frequency and magnitude not detailed.

axioms (2)

domain assumption: Bounded add/delete/replace edits on a single skill document are sufficient to represent skill improvement
Core modeling choice stated in the method description.
domain assumption: Strict improvement on held-out validation score is a reliable indicator of better skill quality
Acceptance criterion that drives the entire training loop.
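The first assumption is easier to assess with a concrete edit representation in view. The sketch below models a skill document as a list of lines and an edit as one add, delete, or replace at a line index. The line-level granularity, the `Edit` fields, and `apply_edit` are assumptions chosen for illustration; the abstract does not say at what granularity the edits operate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position in the skill document
    text: Optional[str] = None  # new line content for add/replace


def apply_edit(doc: List[str], edit: Edit) -> List[str]:
    """Return a new document with one edit applied; the input is not mutated."""
    out = list(doc)
    if edit.op == "add":
        if edit.text is None or not 0 <= edit.index <= len(out):
            raise ValueError("add needs text and an index in [0, len(doc)]")
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        if not 0 <= edit.index < len(out):
            raise ValueError("delete index out of range")
        del out[edit.index]
    elif edit.op == "replace":
        if edit.text is None or not 0 <= edit.index < len(out):
            raise ValueError("replace needs text and an index in [0, len(doc))")
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op!r}")
    return out


skill_doc = ["Read the task carefully.", "Answer briefly."]
edited_doc = apply_edit(skill_doc, Edit("add", 1, "Check units before answering."))
```

Returning a fresh list keeps the current skill intact while a candidate is being scored, so a rejected edit needs no undo step.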

pith-pipeline@v0.9.0 · 5854 in / 1530 out tokens · 46072 ms · 2026-05-25T03:49:50.539011+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable"

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "SkillOpt is best or tied on all 52 evaluated cells"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.