pith. machine review for the scientific record.

arxiv: 2603.02766 · v1 · submitted 2026-03-03 · 💻 cs.AI · cs.MA

Recognition: 2 theorem links · Lean Theorem

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:18 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords automated skill discovery · agent skills · failure analysis · multi-agent systems · Pareto frontier · zero-shot transfer · self-evolving agents · domain expertise
0 comments

The pith

EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoSkill as a framework that lets coding agents build domain expertise on their own. It examines execution failures, proposes new skills or edits to existing ones, and materializes them as structured folders. A Pareto frontier then keeps only those skills that raise accuracy on held-out validation data, all while the base model stays frozen. On OfficeQA the method lifts exact-match accuracy from 60.6 percent to 67.9 percent; on SealQA it rises from 26.6 percent to 38.7 percent. Skills evolved on SealQA also transfer zero-shot to BrowseComp and add another 5.3 percent.

Core claim

EvoSkill analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and uses a Pareto frontier of agent programs to retain only those that improve held-out validation performance. The underlying model remains frozen throughout. This yields a 7.3 percent gain on OfficeQA and a 12.1 percent gain on SealQA, with skills evolved on SealQA transferring zero-shot to BrowseComp for a 5.3 percent improvement.

What carries the argument

Iterative failure analysis that proposes skills, followed by Pareto frontier selection that keeps only those improving validation performance.
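A minimal sketch of this loop, under stated assumptions: the helper callables (run_agent, analyze_failures, propose_skill_edit, apply_proposal, evaluate) are hypothetical stand-ins supplied by the caller, not the paper's interfaces, and the single validation score collapses the paper's Pareto frontier of agent programs to one objective.

```python
# Hedged sketch of failure-driven skill evolution with validation-gated retention.
# All callables are injected, hypothetical stand-ins; the base model is never updated.

def evolve_skills(run_agent, analyze_failures, propose_skill_edit,
                  apply_proposal, evaluate, train_tasks, val_tasks,
                  iterations=10):
    skills = set()                                      # start with no skill folders
    best_val = evaluate(skills, val_tasks)              # baseline, frozen base model

    for _ in range(iterations):
        traces = run_agent(skills, train_tasks)         # execute; model stays frozen
        failures = analyze_failures(traces)             # inspect execution failures
        proposal = propose_skill_edit(failures, skills) # new skill or edit to existing
        candidate = apply_proposal(skills, proposal)    # materialize as a skill folder
        val_score = evaluate(candidate, val_tasks)      # held-out validation accuracy
        if val_score > best_val:                        # retain only improvements
            skills, best_val = candidate, val_score
        # otherwise the proposal is discarded and recorded as feedback

    return skills, best_val
```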

If this is right

  • Skills evolved on one benchmark transfer zero-shot to another and raise accuracy by 5.3 percent without modification.
  • Performance gains occur while the base model stays completely frozen.
  • Only skills that measurably improve held-out validation accuracy are retained through Pareto selection.
  • The same process works for both grounded reasoning over structured data and search-augmented QA with noisy retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support building cumulative libraries of reusable skills across repeated tasks over time.
  • Failure-driven evolution might extend to other agent domains such as code generation or planning.
  • Validation-based selection may reduce the need for hand-crafted prompts when specializing agents.

Load-bearing premise

That skills generated from failure analysis and selected on held-out validation data will deliver genuine generalization rather than benchmark-specific artifacts or quirks of the proposal step.

What would settle it

Measure whether the evolved skills raise accuracy on a fresh benchmark never seen during skill proposal or selection, or whether removing the skills returns performance to the original baseline.
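Both checks reduce to the same harness: run the frozen agent on a benchmark that played no role in proposal or selection, once with the evolved skill folders and once without, and compare exact-match accuracy. A minimal sketch under the same hypothetical interface as the loop above:

```python
def ablation_check(run_agent, exact_match, evolved_skills, fresh_benchmark):
    """Accuracy with vs. without the evolved skills on an unseen benchmark.
    If dropping the skills returns accuracy to baseline, the gain is attributable
    to the skills rather than to the base agent."""
    with_skills = sum(exact_match(run_agent(evolved_skills, task), task.answer)
                      for task in fresh_benchmark)
    without_skills = sum(exact_match(run_agent(set(), task), task.answer)
                         for task in fresh_benchmark)
    n = len(fresh_benchmark)
    return with_skills / n, without_skills / n
```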

read the original abstract

Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts and code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved from SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces EvoSkill, a self-evolving framework for automated skill discovery in multi-agent systems. It analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and selects via a Pareto frontier of agent programs, retaining only skills that improve held-out validation performance while keeping the base model frozen. Evaluations report exact-match accuracy gains of 7.3% on OfficeQA (60.6% to 67.9%) and 12.1% on SealQA (26.6% to 38.7%), plus a 5.3% zero-shot transfer improvement when SealQA-evolved skills are applied to BrowseComp.

Significance. If the gains are shown to arise from the evolutionary mechanism rather than unablated implementation choices, the work would be significant for reducing reliance on hand-crafted skills and enabling transferable agent capabilities across tasks. The zero-shot transfer result, in particular, could influence research on skill-level optimization if supported by clearer mechanistic evidence and controls.

major comments (3)
  1. Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.
  2. Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.
  3. Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.
minor comments (2)
  1. The abstract and method would benefit from a concrete example or pseudocode snippet showing how a failure trace is converted into a proposed skill edit (an illustrative sketch follows this list).
  2. Figure or table clarity: if a diagram of the Pareto frontier selection process exists, ensure axis labels and legend explicitly distinguish validation performance from training performance.
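Minor comment 1 asks for exactly such a snippet. The sketch below is an editorial illustration of the general shape rather than the paper's code: the trace fields and the llm callable are assumptions, while the create-or-edit action and the target_skill, proposed_skill, and justification fields echo the proposer prompt quoted in the paper's appendix.

```python
import json

# Hypothetical proposer prompt; only the output fields mirror the paper's appendix.
PROPOSER_PROMPT = """You are analyzing an agent execution failure.
Existing skills: {skills}
Failure trace (question, steps taken, final answer, ground truth):
{trace}
Respond with JSON: {{"action": "create" or "edit", "target_skill": null or name,
"proposed_skill": description, "justification": reasoning}}"""

def propose_skill_edit(llm, trace, existing_skills):
    """Turn one failure trace into a structured skill proposal.
    `llm` is any callable mapping a prompt string to a completion string."""
    prompt = PROPOSER_PROMPT.format(
        skills=", ".join(existing_skills) or "none",
        trace=json.dumps(trace, indent=2),
    )
    proposal = json.loads(llm(prompt))
    assert proposal["action"] in {"create", "edit"}
    return proposal
```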

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive comments, which highlight opportunities to strengthen the empirical support for EvoSkill's claims. We address each major comment below with clarifications and commit to specific revisions that enhance the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [—] Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.

    Authors: We agree that targeted ablations would more rigorously isolate the evolutionary mechanism. The Pareto frontier is central because it enforces retention of only those skills that improve held-out validation performance with the base model frozen, but we acknowledge that the proposal step's design choices warrant explicit testing. In the revised manuscript we will add an ablation study that (i) fixes the Pareto selection while varying the proposer prompt template and failure summarization format, and (ii) varies the number of candidates per iteration while holding other components constant. These results will be reported alongside the main experiments to demonstrate that gains are not attributable solely to prompt engineering. revision: yes

  2. Referee: [—] Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.

    Authors: We accept that the current reporting lacks the statistical controls needed to substantiate the gains. In the revision we will rerun all experiments across at least five independent random seeds, report mean accuracy with standard deviation and 95% confidence intervals, and apply paired statistical tests (e.g., Wilcoxon signed-rank) to the improvements on OfficeQA and SealQA. We will also add controls that vary iteration count while measuring validation performance and examine sensitivity to validation-set composition by reporting results on multiple random splits. revision: yes

  3. Referee: [—] Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.

    Authors: We agree that the transfer section requires additional mechanistic detail. In the revised manuscript we will explicitly describe the skill representation as structured folders containing executable code, prompt templates, and metadata; explain the agent program's dynamic loading mechanism that registers transferred folders into its runtime skill library; and provide a short analysis of the evolved skills (e.g., robust retrieval-query reformulation patterns) that explains their applicability to BrowseComp's noisy-retrieval setting. These additions will clarify how skill-level optimization produces transferable capabilities. revision: yes
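A minimal sketch of the representation and loading step this response describes; the folder layout, file names, and registry shape are assumptions (only SKILL.md appears verbatim in the paper's appendix).

```python
from pathlib import Path

# Assumed layout per skill folder (hypothetical):
#   skills/<name>/SKILL.md    instructions / prompt template
#   skills/<name>/run.py      executable helper code
#   skills/<name>/meta.json   metadata (provenance, trigger conditions)

def load_skill_library(library_root: str) -> dict:
    """Register every skill folder, including zero-shot transferred ones,
    into the agent program's runtime skill library."""
    registry = {}
    for folder in sorted(Path(library_root).iterdir()):
        if folder.is_dir() and (folder / "SKILL.md").exists():
            registry[folder.name] = {
                "instructions": (folder / "SKILL.md").read_text(),
                "code_path": str(folder / "run.py"),
                "metadata": (folder / "meta.json").read_text()
                            if (folder / "meta.json").exists() else "{}",
            }
    return registry

# Usage (hypothetical): folders evolved on SealQA are copied unchanged into the
# BrowseComp agent's library directory before evaluation.
```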

Circularity Check

0 steps flagged

No circularity: performance gains measured on held-out validation with frozen model

full rationale

The paper describes an iterative failure-analysis loop that proposes and materializes skills, followed by Pareto-frontier selection that retains only those improving held-out validation accuracy while the base model stays frozen. The reported lifts (7.3% on OfficeQA, 12.1% on SealQA, 5.3% zero-shot transfer) are therefore external measurements on separate test distributions rather than quantities defined by or fitted to the proposal step itself. No equations, self-citations, or ansatzes are shown that would reduce the claimed generalization to the input failure traces or prompt templates by construction. The derivation chain is self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on standard concepts of evolutionary search and validation-based selection without new postulated objects.

pith-pipeline@v0.9.0 · 5594 in / 1264 out tokens · 54583 ms · 2026-05-17T02:18:59.569796+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  3. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  5. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  6. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  7. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  8. SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.

  9. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  10. Evidence Over Plans: Online Trajectory Verification for Skill Distillation

    cs.AI 2026-05 unverdicted novelty 6.0

    PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.

  11. SkillGen: Verified Inference-Time Agent Skill Synthesis

    cs.LG 2026-05 unverdicted novelty 6.0

    SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

  12. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  13. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  14. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  15. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  16. SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.

  17. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  18. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  19. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Agent skills specification, 2025

    Agent Skills. Agent skills specification, 2025. URL https://agentskills.io/specification

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning, 2026. URL https://a...

  3. [3]

    Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026

    Salaheddin Alzu'bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, and Sewoong Oh. Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026. URL https://arxiv.org/abs/2602.01848

  4. [4]

    Anthropic skills documentation, 2025

    Anthropic. Anthropic skills documentation, 2025. URL https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview

  5. [5]

    Claude code overview, 2026

    Anthropic. Claude code overview, 2026. URL https://code.claude.com/docs/en/overview

  6. [6]

    Feedback descent: Open-ended text optimization via pairwise comparison, 2025

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison, 2025. URL https://arxiv.org/abs/2511.07919

  7. [7]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651

  8. [8]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  9. [9]

    Codex, 2026

    OpenAI. Codex, 2026. URL https://openai.com/codex/

  10. [10]

    SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

    Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models, 2025. URL https://arxiv.org/abs/2506.01062

  11. [11]

    Introducing officeqa: A benchmark for end-to-end grounded reasoning, December 2025

    The Mosaic Research Team. Introducing officeqa: A benchmark for end-to-end grounded reasoning, December 2025. URL https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning. Accessed: 2026-02-20

  12. [12]

    Foundational autoraters: Taming large language models for better automatic evaluation, 2024

    Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung. Foundational autoraters: Taming large language models for better automatic evaluation, 2024. URL https://arxiv.org/abs/2407.10817

  13. [13]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291

  15. [15]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  16. [16]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

  17.–55. [17]–[55]

    These entries are not citations; they are fragments of the paper's appendices picked up by the reference extractor. Appendix A (EvoSkill Generated Skills) contributes a CPI inflation-adjustment skill: gather nominal and CPI values with consistent YYYY-MM periods, apply Real = Nominal × (CPI_base / CPI_current), run a linear regression on the adjusted series, and report [slope, intercept] rounded to 2 decimal places. It also contributes a multi-source verification skill: check at least three independent sources, retry with 3+ query formulations and related searches, attempt derivation from adjacent data before reporting "unable to find", and cross-check enumerations against additional sources for missing items. Appendix B (Agent Prompts) contributes the Proposer prompt, which inventories existing skills, reviews the execution trace, gap analysis, and feedback history (including DISCARDED proposals), and emits a create-or-edit proposal with action, target_skill, proposed_skill, justification, and related_iterations fields, plus implementer guidance to follow the skill-creator conventions in .claude/skills/skill-creator/SKILL.md, integrate with the Claude Code SDK, handle edge cases, and keep each skill compact because the context window is a shared resource.
    **Implement and Validate**: Build, test, and package the skill following skill- creator guidelines </implementation_steps> ## Quality Reminder The context window is a shared resource. Every token in your skill competes with conversation history, other skills, and user requests. Challenge each piece of content: "Does Claude really need this?" Keep skills c...