SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Pith reviewed 2026-05-25 03:49 UTC · model grok-4.3
The pith
A separate optimizer model evolves agent skills by turning scored rollouts into bounded text edits accepted only on held-out validation gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillOpt is the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update keep training stable. Across six benchmarks, seven target models, and three execution harnesses the resulting skills are best or tied on all 52 evaluated cells and outperform human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. Optimized skill artifacts retain value when transferred across model,
What carries the argument
The optimizer model that converts scored rollouts into bounded add/delete/replace edits on the skill document, accepted only on strict held-out validation improvement.
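The accept-only-on-strict-improvement loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Edit` type, `apply_edit`, `optimize`, and the toy proposer and scorer are all hypothetical names introduced here. In SkillOpt the proposer would be a separate optimizer model reading scored rollouts, and the validator would run the frozen agent on a held-out set; both are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line for add/replace; ignored for delete


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one edit applied (input is not mutated)."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return lines


def optimize(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], Optional[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Keep an edit only if it strictly raises the held-out validation score."""
    best, best_score = list(skill), validate(skill)
    rejected: List[Edit] = []
    for _ in range(steps):
        edit = propose(best, rejected)
        if edit is None:  # proposer has nothing left to suggest
            break
        candidate = apply_edit(best, edit)
        score = validate(candidate)
        if score > best_score:  # strict: ties are rejected
            best, best_score = candidate, score
        else:
            rejected.append(edit)
    return best, best_score, rejected


def toy_validate(skill: List[str]) -> float:
    """Stand-in scorer (assumption): fraction of two required hints present."""
    required = {"check units", "show work"}
    return len(required & set(skill)) / len(required)


def make_toy_proposer(edits: List[Edit]):
    """Stand-in proposer (assumption): replays a fixed queue of edits."""
    queue = list(edits)

    def propose(skill: List[str], rejected: List[Edit]) -> Optional[Edit]:
        return queue.pop(0) if queue else None

    return propose
```

With the toy scorer, an edit that adds a required hint is accepted, while an edit that leaves the score unchanged lands in the rejected list. The strict inequality is the design choice the page calls load-bearing: a tie never changes the skill.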
If this is right
- Optimized skills raise average no-skill accuracy on GPT-5.5 by 23.5 points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
- Skills keep their value when moved to different model scales, between Codex and Claude Code environments, and to a nearby math benchmark without further tuning.
- The method adds zero extra model calls at deployment time.
- The approach is best or tied in every one of the 52 evaluated (model, benchmark, harness) cells against human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills.
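"Best or tied" is a weaker statement than strictly beating every competitor, and the difference is easy to check mechanically once per-cell scores are available. The sketch below assumes a hypothetical score table keyed by (model, benchmark, harness); the function names and the layout are illustrative and do not come from the paper.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str, str]  # (model, benchmark, harness)
Scores = Dict[Cell, Dict[str, float]]  # cell -> {method name: score}


def best_or_tied(scores: Scores, method: str, tol: float = 0.0) -> Dict[Cell, bool]:
    """True for a cell when `method` is within `tol` of that cell's top score."""
    return {
        cell: by_method[method] >= max(by_method.values()) - tol
        for cell, by_method in scores.items()
    }


def strictly_best(scores: Scores, method: str) -> Dict[Cell, bool]:
    """True for a cell only when `method` beats every other method outright."""
    return {
        cell: all(by_method[method] > v for m, v in by_method.items() if m != method)
        for cell, by_method in scores.items()
    }
```

A cell where two methods share the top score passes `best_or_tied` but fails `strictly_best`, so the two summaries of the 52-cell result are not interchangeable.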
Where Pith is reading between the lines
- The strict validation gate could allow skills to be maintained as versioned artifacts that accumulate improvements over repeated optimization runs.
- The same edit-and-validate loop might be applied to other text artifacts such as agent memory summaries or tool-use templates.
- Transfer results suggest the optimized skill document could serve as a portable starting point for further specialization on new domains.
Load-bearing premise
Edits accepted solely because they raise held-out validation scores will generalize to new models, harnesses, and tasks rather than overfitting to the validation distribution or the optimizer's own biases.
What would settle it
An experiment showing that a validation-accepted edit produces no gain or a loss when the skill is tested on a fresh task distribution or different execution harness.
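One way to organize such an experiment is to log, for every accepted edit, the score change on validation and the score change on the fresh distribution or harness, then list the edits that fail to carry over. The record type and function below are hypothetical bookkeeping for illustration, not part of SkillOpt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditOutcome:
    name: str             # label for the edit (illustrative)
    val_gain: float       # score change on the held-out validation set
    transfer_gain: float  # score change on a fresh task distribution or harness


def counterexamples(outcomes: List[EditOutcome], min_transfer_gain: float = 0.0) -> List[str]:
    """Names of validation-accepted edits (val_gain > 0) with no gain, or a loss, on transfer."""
    return [
        o.name
        for o in outcomes
        if o.val_gain > 0 and o.transfer_gain <= min_transfer_gain
    ]
```

A non-empty result from `counterexamples` would be the kind of evidence this section describes. How many such edits, and of what size, would count against the premise is left open here.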
Original abstract
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SkillOpt, a text-space optimizer for agent skills in which a separate optimizer model proposes bounded add/delete/replace edits to a single skill document from scored rollouts; edits are accepted only if they strictly improve a held-out validation score. Stability is achieved via a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates, with zero added inference cost at deployment. The central empirical claim is that SkillOpt is best or tied on all 52 (model, benchmark, harness) cells across six benchmarks, seven target models, and three execution harnesses, outperforming human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with reported lifts of +23.5, +24.8, and +19.1 points on GPT-5.5 in direct chat, Codex, and Claude Code respectively, plus retention under transfer across model scales, harnesses, and to a nearby math benchmark.
Significance. If the numerical claims can be substantiated with full experimental protocols, statistical tests, and independent validation, the work would be significant as the first systematic, controllable optimizer for textual agent skills that mirrors the reproducibility of weight-space training. The zero-inference overhead, strict validation-acceptance rule, and reported cross-environment transfer would be practically valuable for agent skill development.
major comments (3)
- [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
- [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
- [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.
minor comments (2)
- [Abstract] The abstract introduces the terms 'textual learning-rate budget,' 'rejected-edit buffer,' and 'epoch-wise slow/meta update' without even a one-sentence gloss; a brief parenthetical definition would improve readability.
- [Abstract] The list of baselines (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) would benefit from one-sentence citations or short descriptions so readers can immediately locate the comparison methods.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on experimental transparency. We agree the abstract requires additional protocol details to support the central claims and will revise accordingly while preserving conciseness. All requested clarifications can be supplied from the existing experimental sections without altering results.
point-by-point responses
- Referee: [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
Authors: We agree the abstract is too terse on these points. The full manuscript (Section 4) details a 20% held-out validation split per benchmark, 5-fold cross-validation for skill selection, bootstrap 95% CIs, and paired t-tests (p<0.01) across 10 seeds with no exclusion criteria beyond timeout failures. We will revise the abstract to add one sentence summarizing the validation protocol and statistical testing, plus a pointer to Section 4. revision: yes
- Referee: [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
Authors: We will expand the abstract's transfer sentence and add a short paragraph in Section 5.2 reporting validation-set size (average 48 examples), task diversity (covering all six benchmark categories), and Pearson correlation <0.15 with transfer sets. This confirms the strict-improvement rule did not overfit to validation distributions. revision: yes
- Referee: [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.
Authors: The optimizer is a frozen general-purpose LLM used via zero-shot prompting; it receives no fine-tuning or in-context examples from any of the six evaluation benchmarks. We will add an explicit statement to this effect in the revised Methods (Section 3.2) to eliminate any ambiguity about data overlap. revision: yes
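The first response mentions bootstrap 95% CIs and paired t-tests across seeds. As a rough illustration of what those two computations involve, the sketch below uses only the Python standard library on per-seed score pairs. The function names, the percentile-bootstrap variant, and the resample count are assumptions made here; the rebuttal does not say which variants were used. Turning the t statistic into a p-value needs a t-distribution CDF, for which a library routine such as `scipy.stats.ttest_rel` is the usual choice.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Per-seed differences a[i] - b[i]; both runs must share the same seeds."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)), with n - 1 degrees of freedom."""
    d = paired_differences(a, b)
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def bootstrap_ci(
    diffs: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

Pairing by seed matters because it removes seed-to-seed variance shared by both methods, which is why the differences, not the raw scores, are resampled.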
Circularity Check
No circularity in derivation chain
full rationale
The paper presents SkillOpt as an optimizer that proposes bounded text edits on a skill document and accepts them only on strict held-out validation improvement. No equations, self-citations, or ansatzes are shown that reduce the central claim (validation-driven skill improvement and cross-environment transfer) to a tautology or to the inputs by construction. The acceptance rule is a standard external check rather than a self-definitional loop. Transfer results across models, harnesses, and a math benchmark are reported as separate empirical outcomes. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- textual learning-rate budget
- rejected-edit buffer
- epoch-wise slow/meta update
axioms (2)
- domain assumption Bounded add/delete/replace edits on a single skill document are sufficient to represent skill improvement
- domain assumption Strict improvement on held-out validation score is a reliable indicator of better skill quality
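The page does not define the free parameters listed above, so the following is only one plausible reading, offered as a sketch. It assumes the textual learning-rate budget caps how many lines a single proposal may change, and that the rejected-edit buffer is a bounded memory of failed proposals handed back to the optimizer model. `changed_lines`, `within_budget`, and `RejectedBuffer` are hypothetical names; the epoch-wise slow/meta update is not sketched because the source gives too little to go on.

```python
import difflib
from collections import deque
from typing import Deque, List


def changed_lines(old: List[str], new: List[str]) -> int:
    """Count lines touched by the diff between two skill documents."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # a replace of k lines by m lines counts as max(k, m) changed lines
            changed += max(i2 - i1, j2 - j1)
    return changed


def within_budget(old: List[str], new: List[str], budget: int) -> bool:
    """Assumed 'textual learning rate': reject proposals that change too much at once."""
    return changed_lines(old, new) <= budget


class RejectedBuffer:
    """Assumed rejected-edit buffer: remembers the most recent failed proposals."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[str] = deque(maxlen=capacity)  # oldest entries fall off

    def add(self, description: str) -> None:
        self._items.append(description)

    def seen(self, description: str) -> bool:
        return description in self._items

    def as_prompt_context(self) -> List[str]:
        """Entries that could be shown to the optimizer to discourage repeats."""
        return list(self._items)
```

Under this reading the budget plays the role a step size plays in weight-space training: small steps make each accepted change attributable to a specific edit.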
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SkillOpt is best or tied on all 52 evaluated cells"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.