Recognition: 2 Lean theorem links
EvoSkill: Automated Skill Discovery for Multi-Agent Systems
Pith reviewed 2026-05-17 02:18 UTC · model grok-4.3
The pith
EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoSkill analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and uses a Pareto frontier of agent programs to retain only those that improve held-out validation performance. The underlying model remains frozen throughout. This yields a 7.3 percent gain on OfficeQA (60.6 to 67.9) and a 12.1 percent gain on SealQA (26.6 to 38.7), with skills evolved on SealQA transferring zero-shot to BrowseComp for a 5.3 percent improvement.
What carries the argument
Iterative failure analysis that proposes skills, followed by Pareto frontier selection that keeps only those improving validation performance.
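The selection step can be sketched in a few lines. This is illustrative only: the paper specifies held-out validation performance as the retention criterion but not the frontier's full objective set, so the two-objective shape (validation accuracy vs. cost) and all names below are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """An agent program paired with its evolved skill set (hypothetical shape)."""
    skills: tuple        # names of the skill folders attached to this agent
    val_accuracy: float  # held-out validation accuracy
    cost: float          # assumed second objective, e.g. tokens per task

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.val_accuracy >= b.val_accuracy and a.cost <= b.cost
            and (a.val_accuracy > b.val_accuracy or a.cost < b.cost))

def pareto_frontier(candidates: list) -> list:
    """Retain only non-dominated candidates; dominated skill sets are discarded."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]
```

Under this sketch, a candidate whose new skill lowers validation accuracy without saving cost is dominated by the incumbent and dropped, which is the behavior the review attributes to EvoSkill's selection step.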
If this is right
- Skills evolved on one benchmark transfer zero-shot to another and raise accuracy by 5.3 percent without modification.
- Performance gains occur while the base model stays completely frozen.
- Only skills that measurably improve held-out validation accuracy are retained through Pareto selection.
- The same process works for both grounded reasoning over structured data and search-augmented QA with noisy retrieval.
Where Pith is reading between the lines
- The method could support building cumulative libraries of reusable skills across repeated tasks over time.
- Failure-driven evolution might extend to other agent domains such as code generation or planning.
- Validation-based selection may reduce the need for hand-crafted prompts when specializing agents.
Load-bearing premise
That skills generated from failure analysis and selected on held-out validation data deliver genuine generalization, rather than benchmark-specific artifacts or byproducts of the proposal step.
What would settle it
Measure whether the evolved skills raise accuracy on a fresh benchmark never seen during skill proposal or selection, or whether removing the skills returns performance to the original baseline.
read the original abstract
Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts & code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EvoSkill, a self-evolving framework for automated skill discovery in multi-agent systems. It analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and selects via a Pareto frontier of agent programs that retain only those improving held-out validation performance while keeping the base model frozen. Evaluations report exact-match accuracy gains of 7.3% on OfficeQA (60.6% to 67.9%) and 12.1% on SealQA (26.6% to 38.7%), plus a 5.3% zero-shot transfer improvement when SealQA-evolved skills are applied to BrowseComp.
Significance. If the gains are shown to arise from the evolutionary mechanism rather than unablated implementation choices, the work would be significant for reducing reliance on hand-crafted skills and enabling transferable agent capabilities across tasks. The zero-shot transfer result, in particular, could influence research on skill-level optimization if supported by clearer mechanistic evidence and controls.
major comments (3)
- Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.
- Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.
- Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.
minor comments (2)
- The abstract and method would benefit from a concrete example or pseudocode snippet showing how a failure trace is converted into a proposed skill edit.
- Figure or table clarity: if a diagram of the Pareto frontier selection process exists, ensure axis labels and legend explicitly distinguish validation performance from training performance.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive comments, which highlight opportunities to strengthen the empirical support for EvoSkill's claims. We address each major comment below with clarifications and commit to specific revisions that enhance the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.
Authors: We agree that targeted ablations would more rigorously isolate the evolutionary mechanism. The Pareto frontier is central because it enforces retention of only those skills that improve held-out validation performance with the base model frozen, but we acknowledge that the proposal step's design choices warrant explicit testing. In the revised manuscript we will add an ablation study that (i) fixes the Pareto selection while varying the proposer prompt template and failure summarization format, and (ii) varies the number of candidates per iteration while holding other components constant. These results will be reported alongside the main experiments to demonstrate that gains are not attributable solely to prompt engineering. revision: yes
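The one-factor-at-a-time ablation design the authors commit to can be sketched as a config generator. The factor names and levels below are hypothetical placeholders, not values from the paper; only the design (vary one proposal-step factor while Pareto selection is held fixed) comes from the response above.

```python
from itertools import product

# Hypothetical ablation factors for the proposal step; Pareto selection stays fixed.
PROMPT_TEMPLATES = ["default", "minimal", "cot"]
SUMMARY_FORMATS = ["raw_trace", "error_only", "structured"]
CANDIDATES_PER_ITER = [1, 3, 5]

def ablation_configs(base=("default", "raw_trace", 3)):
    """Yield every config that differs from the base run in exactly one factor,
    which is the standard one-factor-at-a-time ablation grid."""
    for cfg in product(PROMPT_TEMPLATES, SUMMARY_FORMATS, CANDIDATES_PER_ITER):
        if sum(a != b for a, b in zip(cfg, base)) == 1:
            yield cfg
```

With three levels per factor this yields six ablation runs beyond the base configuration, each isolating one design choice of the proposer.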
-
Referee: Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.
Authors: We accept that the current reporting lacks the statistical controls needed to substantiate the gains. In the revision we will rerun all experiments across at least five independent random seeds, report mean accuracy with standard deviation and 95% confidence intervals, and apply paired statistical tests (e.g., Wilcoxon signed-rank) to the improvements on OfficeQA and SealQA. We will also add controls that vary iteration count while measuring validation performance and examine sensitivity to validation-set composition by reporting results on multiple random splits. revision: yes
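The reporting protocol promised above (mean, spread, and a 95% interval over seeds) can be sketched with the standard library. The authors propose a Wilcoxon signed-rank test, which is available as scipy.stats.wilcoxon; the normal-approximation interval and the crude paired win count below are stdlib stand-ins for illustration, not the paper's procedure.

```python
import statistics as st

def summarize_runs(scores):
    """Mean, sample stdev, and a normal-approximation 95% CI over seed runs."""
    n = len(scores)
    mean = st.mean(scores)
    sd = st.stdev(scores)
    half = 1.96 * sd / n ** 0.5  # normal approximation; small-n would use a t quantile
    return mean, sd, (mean - half, mean + half)

def paired_wins(baseline, treated):
    """Crude paired check: how many seeds does the treated system win outright?
    A real analysis would use a signed-rank test on the per-seed differences."""
    wins = sum(t > b for b, t in zip(baseline, treated))
    return wins, len(baseline)
```

For example, five seeds scoring around 0.67 against a flat 0.60 baseline would report 5/5 wins alongside the interval, which is the kind of evidence the referee asks for.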
-
Referee: Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.
Authors: We agree that the transfer section requires additional mechanistic detail. In the revised manuscript we will explicitly describe the skill representation as structured folders containing executable code, prompt templates, and metadata; explain the agent program's dynamic loading mechanism that registers transferred folders into its runtime skill library; and provide a short analysis of the evolved skills (e.g., robust retrieval-query reformulation patterns) that explains their applicability to BrowseComp's noisy-retrieval setting. These additions will clarify how skill-level optimization produces transferable capabilities. revision: yes
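The promised "dynamic loading" of structured skill folders can be sketched as follows. The folder layout (a SKILL.md manifest plus code files) mirrors the Claude agent-skills convention the paper cites, but the loader itself and its return shape are assumptions for illustration.

```python
from pathlib import Path

def load_skill_folders(library_dir: str) -> dict:
    """Register each skill folder into a runtime skill library keyed by folder name.
    Assumed layout per skill: <name>/SKILL.md (instructions) plus optional *.py files."""
    skills = {}
    for folder in Path(library_dir).iterdir():
        manifest = folder / "SKILL.md"
        if folder.is_dir() and manifest.exists():
            skills[folder.name] = {
                "instructions": manifest.read_text(),
                "code": sorted(p.name for p in folder.glob("*.py")),
            }
    return skills
```

Under this sketch, zero-shot transfer is just pointing the loader at the SealQA-evolved library before running BrowseComp tasks; no skill contents change.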
Circularity Check
No circularity: performance gains measured on held-out validation with frozen model
full rationale
The paper describes an iterative failure-analysis loop that proposes and materializes skills, followed by Pareto-frontier selection that retains only those improving held-out validation accuracy while the base model stays frozen. The reported lifts (7.3% on OfficeQA, 12.1% on SealQA, 5.3% zero-shot transfer) are therefore external measurements on separate test distributions rather than quantities defined by or fitted to the proposal step itself. No equations, self-citations, or ansatzes are shown that would reduce the claimed generalization to the input failure traces or prompt templates by construction. The derivation chain is self-contained against the external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced — tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen.
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi — tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Skills evolved from SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3%.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
Evidence Over Plans: Online Trajectory Verification for Skill Distillation
PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.
-
SkillGen: Verified Inference-Time Agent Skill Synthesis
SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering
SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.
Reference graph
Works this paper leans on
-
[1]
Agent skills specification, 2025
Agent Skills. Agent skills specification, 2025. URL https://agentskills.io/specification
work page 2025
-
[2]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning, 2026. URL https://a...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026
Salaheddin Alzu’bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, and Sewoong Oh. Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026. URL https://arxiv.org/abs/2602.01848
-
[4]
Anthropic skills documentation, 2025
Anthropic. Anthropic skills documentation, 2025. URL https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
work page 2025
-
[5]
Anthropic. Claude code overview, 2026. URL https://code.claude.com/docs/en/overview
work page 2026
-
[6]
Feedback descent: Open-ended text optimization via pairwise comparison, 2025
Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison, 2025. URL https://arxiv.org/abs/2511.07919
-
[7]
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [9]
-
[10]
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models, 2025. URL https://arxiv.org/abs/2506.01062
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Introducing OfficeQA: A benchmark for end-to-end grounded reasoning, December 2025
The Mosaic Research Team. Introducing OfficeQA: A benchmark for end-to-end grounded reasoning, December 2025. URL https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning. Accessed: 2026-02-20
work page 2025
-
[12]
Foundational autoraters: Taming large language models for better automatic evaluation, 2024
Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung. Foundational autoraters: Taming large language models for better automatic evaluation, 2024. URL https://arxiv.org/abs/2407.10817
-
[13]
Voyager: An open-ended embodied agent with large language models,
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models,
-
[14]
URL https://arxiv.org/abs/2305.16291
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Browsecomp: A simple yet challenging benchmark for browsing agents, 2025
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516
work page 2025