{"paper":{"title":"EvoSkill: Automated Skill Discovery for Multi-Agent Systems","license":"http://creativecommons.org/licenses/by/4.0/","headline":"EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model.","cross_cats":["cs.MA"],"primary_cat":"cs.AI","authors_text":"Jaydon Bingham, Noah Provenzano, Salaheddin Alzubi, Tu Vu, Weiyuan Chen","submitted_at":"2026-03-03T09:07:22Z","abstract_excerpt":"Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through \\textit{agent skills}: reusable workflows, and code, that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g. prompts \\& code) that are tightly coupled to specific models and tasks. We introduce \\textbf{EvoSkill}, a self-evolving framework that automatically discovers and refines agent s"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA... improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA... yields a 12.1% gain (26.6% → 38.7%). ... skills evolved from SealQA transfers zero-shot to BrowseComp, improving accuracy by 5.3%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That iterative failure analysis can reliably generate skills whose benefits on held-out validation reflect genuine generalization rather than benchmark-specific artifacts or unstated implementation choices in the skill proposal step.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"EvoSkill evolves agent skills via failure analysis and Pareto frontier selection, raising exact-match accuracy 7.3% on OfficeQA and 12.1% on SealQA with 5.3% zero-shot transfer to BrowseComp.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b5830b8b98826cb57710aab3eaed2c4f6e69c83294287853f53afdba0fb7884e"},"source":{"id":"2603.02766","kind":"arxiv","version":1},"verdict":{"id":"78deba2c-4398-4263-b6ca-bcc698571435","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T02:18:59.569796Z","strongest_claim":"EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA... improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA... yields a 12.1% gain (26.6% → 38.7%). ... skills evolved from SealQA transfers zero-shot to BrowseComp, improving accuracy by 5.3%.","one_line_summary":"EvoSkill evolves agent skills via failure analysis and Pareto frontier selection, raising exact-match accuracy 7.3% on OfficeQA and 12.1% on SealQA with 5.3% zero-shot transfer to BrowseComp.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That iterative failure analysis can reliably generate skills whose benefits on held-out validation reflect genuine generalization rather than benchmark-specific artifacts or unstated implementation choices in the skill proposal step.","pith_extraction_headline":"EvoSkill automatically discovers reusable agent skills through failure analysis to improve performance without changing the model."},"references":{"count":55,"sample":[{"doi":"","year":2025,"title":"Agent skills specification, 2025","work_id":"cc370620-94eb-4f6b-8c2b-a40f797b4ead","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning","work_id":"40b60d06-dc1c-4799-b75d-ff1eca653049","ref_index":2,"cited_arxiv_id":"2507.19457","is_internal_anchor":true},{"doi":"","year":2026,"title":"Roma: Recursive open meta-agent framework for long-horizon multi-agent systems, 2026","work_id":"f9efd630-2181-4456-941f-bc772bac6627","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Anthropic skills documentation, 2025","work_id":"04711a91-a683-427a-a581-6222c2c3b80d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Claude code overview, 2026","work_id":"56e50904-fb0b-4d6f-a71f-44d583dec220","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":55,"snapshot_sha256":"ed781fa73596c9e64980e686db68b19e93c8cece7c614defac0549434068392a","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"fff1fb79c4f29ec0c1e34e4ec4f045feabae2ad949bdc77b2c2f4e7d1168c483"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}