pith. machine review for the scientific record.

arxiv: 2603.02766 · v1 · submitted 2026-03-03 · 💻 cs.AI · cs.MA

Recognition: 2 theorem links · Lean Theorem

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:18 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords automated skill discovery · agent skills · failure analysis · multi-agent systems · Pareto frontier · zero-shot transfer · self-evolving agents · domain expertise
0 comments

The pith

EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoSkill as a framework that lets coding agents build domain expertise on their own. It examines execution failures, proposes new skills or edits to existing ones, and materializes them as structured folders. A Pareto frontier then keeps only those skills that raise accuracy on held-out validation data, all while the base model stays frozen. On OfficeQA the method lifts exact-match accuracy from 60.6 percent to 67.9 percent; on SealQA it rises from 26.6 percent to 38.7 percent. Skills evolved on SealQA also transfer zero-shot to BrowseComp and add another 5.3 percent.

Core claim

EvoSkill analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and uses a Pareto frontier of agent programs to retain only those that improve held-out validation performance. The underlying model remains frozen throughout. This yields a 7.3 percent gain on OfficeQA and a 12.1 percent gain on SealQA, with skills evolved on SealQA transferring zero-shot to BrowseComp for a 5.3 percent improvement.

What carries the argument

Iterative failure analysis that proposes skills, followed by Pareto frontier selection that keeps only those improving validation performance.
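A minimal sketch of this loop, under stated assumptions: the helper callables (run_agent, analyze_failures, propose_skill_edit, apply_proposal, evaluate) are hypothetical stand-ins supplied by the caller, not the paper's interfaces, and the single validation score collapses the paper's Pareto frontier of agent programs to one objective.

```python
# Hedged sketch of failure-driven skill evolution with validation-gated retention.
# All callables are injected, hypothetical stand-ins; the base model is never updated.

def evolve_skills(run_agent, analyze_failures, propose_skill_edit,
                  apply_proposal, evaluate, train_tasks, val_tasks,
                  iterations=10):
    skills = set()                                      # start with no skill folders
    best_val = evaluate(skills, val_tasks)              # baseline, frozen base model

    for _ in range(iterations):
        traces = run_agent(skills, train_tasks)         # execute; model stays frozen
        failures = analyze_failures(traces)             # inspect execution failures
        proposal = propose_skill_edit(failures, skills) # new skill or edit to existing
        candidate = apply_proposal(skills, proposal)    # materialize as a skill folder
        val_score = evaluate(candidate, val_tasks)      # held-out validation accuracy
        if val_score > best_val:                        # retain only improvements
            skills, best_val = candidate, val_score
        # otherwise the proposal is discarded and recorded as feedback

    return skills, best_val
```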

If this is right

  • Skills evolved on one benchmark transfer zero-shot to another and raise accuracy by 5.3 percent without modification.
  • Performance gains occur while the base model stays completely frozen.
  • Only skills that measurably improve held-out validation accuracy are retained through Pareto selection.
  • The same process works for both grounded reasoning over structured data and search-augmented QA with noisy retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support building cumulative libraries of reusable skills across repeated tasks over time.
  • Failure-driven evolution might extend to other agent domains such as code generation or planning.
  • Validation-based selection may reduce the need for hand-crafted prompts when specializing agents.

Load-bearing premise

That skills generated from failure analysis and selected on held-out validation data will deliver genuine generalization rather than benchmark-specific artifacts or quirks of the proposal step.

What would settle it

Measure whether the evolved skills raise accuracy on a fresh benchmark never seen during skill proposal or selection, or whether removing the skills returns performance to the original baseline.
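Both checks reduce to the same harness: run the frozen agent on a benchmark that played no role in proposal or selection, once with the evolved skill folders and once without, and compare exact-match accuracy. A minimal sketch under the same hypothetical interface as the loop above:

```python
def ablation_check(run_agent, exact_match, evolved_skills, fresh_benchmark):
    """Accuracy with vs. without the evolved skills on an unseen benchmark.
    If dropping the skills returns accuracy to baseline, the gain is attributable
    to the skills rather than to the base agent."""
    with_skills = sum(exact_match(run_agent(evolved_skills, task), task.answer)
                      for task in fresh_benchmark)
    without_skills = sum(exact_match(run_agent(set(), task), task.answer)
                         for task in fresh_benchmark)
    n = len(fresh_benchmark)
    return with_skills / n, without_skills / n
```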

read the original abstract

Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts and code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved from SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces EvoSkill, a self-evolving framework for automated skill discovery in multi-agent systems. It analyzes execution failures to propose new skills or edits to existing ones, materializes them into structured reusable skill folders, and selects via a Pareto frontier of agent programs, retaining only skills that improve held-out validation performance while keeping the base model frozen. Evaluations report exact-match accuracy gains of 7.3% on OfficeQA (60.6% to 67.9%) and 12.1% on SealQA (26.6% to 38.7%), plus a 5.3% zero-shot transfer improvement when SealQA-evolved skills are applied to BrowseComp.

Significance. If the gains are shown to arise from the evolutionary mechanism rather than unablated implementation choices, the work would be significant for reducing reliance on hand-crafted skills and enabling transferable agent capabilities across tasks. The zero-shot transfer result, in particular, could influence research on skill-level optimization if supported by clearer mechanistic evidence and controls.

major comments (3)
  1. Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.
  2. Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.
  3. Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.
minor comments (2)
  1. The abstract and method would benefit from a concrete example or pseudocode snippet showing how a failure trace is converted into a proposed skill edit (an illustrative sketch follows this list).
  2. Figure or table clarity: if a diagram of the Pareto frontier selection process exists, ensure axis labels and legend explicitly distinguish validation performance from training performance.
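Minor comment 1 asks for exactly such a snippet. The sketch below is an editorial illustration of the general shape rather than the paper's code: the trace fields and the llm callable are assumptions, while the create-or-edit action and the target_skill, proposed_skill, and justification fields echo the proposer prompt quoted in the paper's appendix.

```python
import json

# Hypothetical proposer prompt; only the output fields mirror the paper's appendix.
PROPOSER_PROMPT = """You are analyzing an agent execution failure.
Existing skills: {skills}
Failure trace (question, steps taken, final answer, ground truth):
{trace}
Respond with JSON: {{"action": "create" or "edit", "target_skill": null or name,
"proposed_skill": description, "justification": reasoning}}"""

def propose_skill_edit(llm, trace, existing_skills):
    """Turn one failure trace into a structured skill proposal.
    `llm` is any callable mapping a prompt string to a completion string."""
    prompt = PROPOSER_PROMPT.format(
        skills=", ".join(existing_skills) or "none",
        trace=json.dumps(trace, indent=2),
    )
    proposal = json.loads(llm(prompt))
    assert proposal["action"] in {"create", "edit"}
    return proposal
```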

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive comments, which highlight opportunities to strengthen the empirical support for EvoSkill's claims. We address each major comment below with clarifications and commit to specific revisions that enhance the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [—] Method section describing the iterative failure analysis and skill proposal: the central claim that EvoSkill produces reusable, generalizable skills rests on the proposal step generating edits from execution traces, yet no ablation isolates the contribution of the proposer LLM prompt template, failure summarization format, or number of candidates per iteration. Without these, the reported 7.3% and 12.1% lifts (and 5.3% transfer) could reflect prompt engineering tuned to the validation distribution rather than the Pareto selection mechanism.

    Authors: We agree that targeted ablations would more rigorously isolate the evolutionary mechanism. The Pareto frontier is central because it enforces retention of only those skills that improve held-out validation performance with the base model frozen, but we acknowledge that the proposal step's design choices warrant explicit testing. In the revised manuscript we will add an ablation study that (i) fixes the Pareto selection while varying the proposer prompt template and failure summarization format, and (ii) varies the number of candidates per iteration while holding other components constant. These results will be reported alongside the main experiments to demonstrate that gains are not attributable solely to prompt engineering. revision: yes

  2. Referee: [—] Experiments section reporting benchmark results: the accuracy improvements on OfficeQA and SealQA are presented without statistical testing, confidence intervals, number of independent runs, or controls for confounding factors such as iteration count or validation set composition. This leaves the link between the failure-analysis loop and the observed gains unverified, as required for the soundness of the transfer claim.

    Authors: We accept that the current reporting lacks the statistical controls needed to substantiate the gains. In the revision we will rerun all experiments across at least five independent random seeds, report mean accuracy with standard deviation and 95% confidence intervals, and apply paired statistical tests (e.g., Wilcoxon signed-rank) to the improvements on OfficeQA and SealQA. We will also add controls that vary iteration count while measuring validation performance and examine sensitivity to validation-set composition by reporting results on multiple random splits. revision: yes

  3. Referee: [—] Transfer experiment description: the zero-shot application of SealQA-evolved skills to BrowseComp is reported as a 5.3% gain without specifying the exact skill representation, how the agent program incorporates the transferred folders, or why these particular skills generalize, undermining the claim that skill-level optimization yields capabilities beyond the training task.

    Authors: We agree that the transfer section requires additional mechanistic detail. In the revised manuscript we will explicitly describe the skill representation as structured folders containing executable code, prompt templates, and metadata; explain the agent program's dynamic loading mechanism that registers transferred folders into its runtime skill library; and provide a short analysis of the evolved skills (e.g., robust retrieval-query reformulation patterns) that explains their applicability to BrowseComp's noisy-retrieval setting. These additions will clarify how skill-level optimization produces transferable capabilities. revision: yes
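A minimal sketch of the representation and loading step this response describes; the folder layout, file names, and registry shape are assumptions (only SKILL.md appears verbatim in the paper's appendix).

```python
from pathlib import Path

# Assumed layout per skill folder (hypothetical):
#   skills/<name>/SKILL.md    instructions / prompt template
#   skills/<name>/run.py      executable helper code
#   skills/<name>/meta.json   metadata (provenance, trigger conditions)

def load_skill_library(library_root: str) -> dict:
    """Register every skill folder, including zero-shot transferred ones,
    into the agent program's runtime skill library."""
    registry = {}
    for folder in sorted(Path(library_root).iterdir()):
        if folder.is_dir() and (folder / "SKILL.md").exists():
            registry[folder.name] = {
                "instructions": (folder / "SKILL.md").read_text(),
                "code_path": str(folder / "run.py"),
                "metadata": (folder / "meta.json").read_text()
                            if (folder / "meta.json").exists() else "{}",
            }
    return registry

# Usage (hypothetical): folders evolved on SealQA are copied unchanged into the
# BrowseComp agent's library directory before evaluation.
```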

Circularity Check

0 steps flagged

No circularity: performance gains measured on held-out validation with frozen model

full rationale

The paper describes an iterative failure-analysis loop that proposes and materializes skills, followed by Pareto-frontier selection that retains only those improving held-out validation accuracy while the base model stays frozen. The reported lifts (7.3% on OfficeQA, 12.1% on SealQA, 5.3% zero-shot transfer) are therefore external measurements on separate test distributions rather than quantities defined by or fitted to the proposal step itself. No equations, self-citations, or ansatzes are shown that would reduce the claimed generalization to the input failure traces or prompt templates by construction. The derivation chain is self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on standard concepts of evolutionary search and validation-based selection without new postulated objects.

pith-pipeline@v0.9.0 · 5594 in / 1264 out tokens · 54583 ms · 2026-05-17T02:18:59.569796+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  3. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  5. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  6. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  7. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  8. SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.

  9. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  10. Evidence Over Plans: Online Trajectory Verification for Skill Distillation

    cs.AI 2026-05 unverdicted novelty 6.0

    PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.

  11. SkillGen: Verified Inference-Time Agent Skill Synthesis

    cs.LG 2026-05 unverdicted novelty 6.0

    SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

  12. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  13. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  14. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  15. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  16. SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.

  17. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  18. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  19. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Agent skills specification, 2025

    Agent Skills. Agent skills specification, 2025. URL https://agentskills.io/specification

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning, 2026. URL https://a...

  3. [3]

    Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026

    Salaheddin Alzu'bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, and Sewoong Oh. Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026. URL https://arxiv.org/abs/2602.01848

  4. [4]

    Anthropic skills documentation, 2025

    Anthropic. Anthropic skills documentation, 2025. URL https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview

  5. [5]

    Claude code overview, 2026

    Anthropic. Claude code overview, 2026. URL https://code.claude.com/docs/en/overview

  6. [6]

    Feedback descent: Open-ended text optimization via pairwise comparison, 2025

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison, 2025. URL https://arxiv.org/abs/2511.07919

  7. [7]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651

  8. [8]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  9. [9]

    Codex, 2026

    OpenAI. Codex, 2026. URL https://openai.com/codex/

  10. [10]

    SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

    Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models, 2025. URL https://arxiv.org/abs/2506.01062

  11. [11]

    Introducing officeqa: A benchmark for end-to-end grounded reasoning, December 2025

    The Mosaic Research Team. Introducing officeqa: A benchmark for end-to-end grounded reasoning, December 2025. URL https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning. Accessed: 2026-02-20

  12. [12]

    Foundational autoraters: Taming large language models for better automatic evaluation, 2024

    Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung. Foundational autoraters: Taming large language models for better automatic evaluation, 2024. URL https://arxiv.org/abs/2407.10817

  13. [13]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291

  15. [15]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  16. [16]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

  17.–55. [17]–[55]

    These entries are not citations; they are fragments of the paper's appendices picked up by the reference extractor. Appendix A (EvoSkill Generated Skills) contributes a CPI inflation-adjustment skill: gather nominal and CPI values with consistent YYYY-MM periods, apply Real = Nominal × (CPI_base / CPI_current), run a linear regression on the adjusted series, and report [slope, intercept] rounded to 2 decimal places. It also contributes a multi-source verification skill: check at least three independent sources, retry with 3+ query formulations and related searches, attempt derivation from adjacent data before reporting "unable to find", and cross-check enumerations against additional sources for missing items. Appendix B (Agent Prompts) contributes the Proposer prompt, which inventories existing skills, reviews the execution trace, gap analysis, and feedback history (including DISCARDED proposals), and emits a create-or-edit proposal with action, target_skill, proposed_skill, justification, and related_iterations fields, plus implementer guidance to follow the skill-creator conventions in .claude/skills/skill-creator/SKILL.md, integrate with the Claude Code SDK, handle edge cases, and keep each skill compact because the context window is a shared resource.
    **Implement and Validate**: Build, test, and package the skill following skill- creator guidelines </implementation_steps> ## Quality Reminder The context window is a shared resource. Every token in your skill competes with conversation history, other skills, and user requests. Challenge each piece of content: "Does Claude really need this?" Keep skills c...