pith. machine review for the scientific record.

arxiv: 2602.12670 · v3 · submitted 2026-02-13 · 💻 cs.AI

Recognition: 3 theorem links · Lean Theorem

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords Agent Skills · LLM Agents · Benchmarking · Procedural Knowledge · Task Performance · Self-Generated Skills · Domain Variation

The pith

Curated Skills improve LLM agent success rates by 16.2 percentage points on average, while self-generated Skills provide no benefit

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether structured packages of procedural knowledge called Skills actually help LLM agents complete tasks. It builds SkillsBench, a benchmark of 86 tasks across 11 domains, each paired with human-curated Skills and automatic success verifiers. Agents run in three conditions: no Skills, curated Skills, and Skills generated by the model itself. Curated Skills deliver an average 16.2-point lift in pass rates, though gains range from 4.5 points in software engineering to 51.9 points in healthcare and turn negative on 16 tasks. On average, models fail to create useful Skills for their own use, while concise Skills with only 2-3 modules outperform longer documentation and let smaller models match larger ones without Skills.
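As a rough illustration of the evaluation design, the sketch below aggregates hypothetical trajectory records into per-condition pass rates and the curated-versus-baseline lift in percentage points. The record shape and condition names are assumptions made for this page, not the paper's data format.

```python
from collections import defaultdict

# Hypothetical trajectory records: (task_id, condition, passed).
# Record shape and condition names are illustrative assumptions.
def pass_rates(trials):
    """Aggregate per-condition pass rates over all trials."""
    tally = defaultdict(lambda: [0, 0])  # condition -> [passes, total]
    for _task_id, condition, passed in trials:
        tally[condition][0] += int(passed)
        tally[condition][1] += 1
    return {cond: p / n for cond, (p, n) in tally.items()}

def curated_lift_pp(trials):
    """Curated-minus-baseline lift in percentage points (the headline metric)."""
    rates = pass_rates(trials)
    return 100.0 * (rates["curated_skills"] - rates["no_skills"])

# Toy example: two tasks, one trial per condition each.
demo = [
    ("task_a", "no_skills", False), ("task_a", "curated_skills", True),
    ("task_b", "no_skills", True),  ("task_b", "curated_skills", True),
]
print(f"lift: {curated_lift_pp(demo):+.1f}pp")  # lift: +50.0pp on this toy data
```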

Core claim

The paper establishes that agent Skills, structured packages of procedural knowledge, raise average task pass rates by 16.2 percentage points when curated by humans across 86 tasks and 7,308 trajectories. Self-generated Skills yield no average improvement, showing models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills limited to 2-3 modules outperform comprehensive documentation, and smaller models equipped with Skills match the performance of larger models without them. Effects vary sharply by domain but are measured consistently with deterministic verifiers.

What carries the argument

Skills, defined as structured packages of procedural knowledge that augment LLM agents at inference time, are the central construct. The SkillsBench benchmark carries the argument by measuring their effect on agent trajectories under controlled conditions with deterministic verifiers.

If this is right

  • In high-gain domains like healthcare, adding curated Skills could substantially increase agent reliability on practical tasks.
  • Agent systems should not rely on models generating their own Skills since that approach shows no average benefit.
  • Skill design should prioritize concise versions with 2-3 modules over full documentation for stronger results.
  • Smaller models can substitute for larger ones when given Skills, lowering compute needs in some settings.
  • Skills sometimes reduce performance on certain tasks, so validation remains necessary before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • External skill libraries could become a practical way to boost agents without further model scaling or retraining.
  • The wide domain variation suggests studying task features that predict where Skills will help most to guide curation efforts.
  • Extending evaluation to open-ended or multi-turn real-world scenarios would test whether benefits persist beyond the benchmark's deterministic checks.
  • Pairing Skills with other inference-time techniques might produce additive gains beyond what the isolated tests show.

Load-bearing premise

That the 86 tasks and their deterministic verifiers represent typical real-world agent use cases without bias, and that the curated Skills contain effective procedural knowledge without introducing errors.

What would settle it

Re-running the 7 agent-model configurations on a new set of tasks, either outside the original 11 domains or with human-judged outcomes instead of deterministic verifiers, and checking whether the 16.2-point average gain and the self-generation failure still hold.

read the original abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces SkillsBench, a benchmark of 86 tasks across 11 domains, each with curated Skills and deterministic verifiers. It evaluates 7 agent-model configurations over 7,308 trajectories under three conditions (no Skills, curated Skills, self-generated Skills). The central claims are that curated Skills raise average pass rates by 16.2 percentage points (with domain variation from +4.5pp in Software Engineering to +51.9pp in Healthcare), self-generated Skills provide no average benefit, focused Skills (2-3 modules) outperform comprehensive documentation, and smaller models augmented with Skills can match larger models without them. Negative effects are noted on 16 tasks.

Significance. If the results hold after addressing methodological gaps, SkillsBench offers a reproducible empirical tool for quantifying the value of procedural knowledge in LLM agents; the large trajectory count and deterministic verifiers are clear strengths that enable direct measurement. The finding that models cannot reliably author beneficial Skills they can consume has implications for agent architectures. The work is a solid step toward standardizing skill evaluation, but its impact depends on demonstrating that the 16.2pp aggregate is robust rather than sensitive to task or skill construction choices.

major comments (4)
  1. [Abstract] The 16.2pp average improvement and domain-specific deltas are reported without error bars, confidence intervals, or any statistical significance tests across the 7,308 trajectories. This omission is load-bearing for the central claim, as the large domain variance (+4.5pp to +51.9pp) and negative effects on 16 tasks make it impossible to determine whether the aggregate result is reliable or driven by a subset of tasks.
  2. [Abstract] The benchmark is described as containing 86 tasks, yet negative deltas are reported for '16 of 84 tasks.' This numerical inconsistency directly affects the interpretation of the pass-rate statistics and the claim that curated Skills are net beneficial; the curation and exclusion criteria must be clarified to ensure the reported averages are not artifacts of unstated filtering.
  3. [Methodology] Task and Skill construction: No details are provided on how the 86 tasks were selected, how the deterministic verifiers were implemented, or the process used to create the curated Skills. Because the central result (that curated Skills improve performance while self-generated ones do not) rests on the assumption that these tasks and Skills are representative and unbiased, the absence of this information prevents assessment of generalizability or construction bias.
  4. [Results] Domain-level analysis: The paper notes negative effects on 16 tasks and wide domain variance but provides no breakdown or analysis of which domains or task types exhibit harm. This is load-bearing for the claim of overall benefit, as it leaves open the possibility that the 16.2pp average is sensitive to the particular mix of domains chosen.
minor comments (2)
  1. [Abstract] The abbreviation 'pp' for percentage points is used without an initial definition, which reduces clarity for readers outside the immediate subfield.
  2. [Abstract] The description of 'focused Skills with 2-3 modules' would benefit from an explicit definition or example of what constitutes a 'module' versus 'comprehensive documentation' to make the comparison reproducible.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, with revisions planned where they strengthen the manuscript without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract] The 16.2pp average improvement and domain-specific deltas are reported without error bars, confidence intervals, or any statistical significance tests across the 7,308 trajectories. This omission is load-bearing for the central claim, as the large domain variance (+4.5pp to +51.9pp) and negative effects on 16 tasks make it impossible to determine whether the aggregate result is reliable or driven by a subset of tasks.

    Authors: We agree that statistical measures would improve interpretability given the domain variance and negative cases. In the revision we will add standard errors (or bootstrap confidence intervals) to the 16.2pp aggregate and all domain-level deltas in both the abstract and results. We will also report paired statistical tests (e.g., McNemar or bootstrap) across the 7,308 trajectories to assess whether the observed improvements are significant; a minimal sketch of such a test appears after this list. revision: yes

  2. Referee: [Abstract] The benchmark is described as containing 86 tasks, yet negative deltas are reported for '16 of 84 tasks.' This numerical inconsistency directly affects the interpretation of the pass-rate statistics and the claim that curated Skills are net beneficial; the curation and exclusion criteria must be clarified to ensure the reported averages are not artifacts of unstated filtering.

    Authors: This is a typographical error in the abstract. The benchmark contains 86 tasks and negative effects appear on exactly 16 of them. We will correct the abstract to '16 of 86 tasks' and add a short clarification in the methodology that no post-hoc filtering occurred; all 86 tasks are included in the reported averages. revision: yes

  3. Referee: [Methodology] Task and Skill construction: No details are provided on how the 86 tasks were selected, how the deterministic verifiers were implemented, or the process used to create the curated Skills. Because the central result (that curated Skills improve performance while self-generated ones do not) rests on the assumption that these tasks and Skills are representative and unbiased, the absence of this information prevents assessment of generalizability or construction bias.

    Authors: We acknowledge that additional construction details are needed for reproducibility and bias assessment. The revised manuscript will include an expanded Methodology section with: (i) explicit task-selection criteria and domain sourcing, (ii) the exact implementation of the deterministic verifiers (rule-based success predicates; a sketch of such a predicate appears after this list), and (iii) the expert curation protocol for the Skills (modular procedural knowledge authored per domain). These additions will allow readers to evaluate representativeness directly. revision: yes

  4. Referee: [Results] Domain-level analysis: The paper notes negative effects on 16 tasks and wide domain variance but provides no breakdown or analysis of which domains or task types exhibit harm. This is load-bearing for the claim of overall benefit, as it leaves open the possibility that the 16.2pp average is sensitive to the particular mix of domains chosen.

    Authors: We agree that characterizing the negative cases is essential. We will add a dedicated subsection (or appendix) that tabulates the 16 tasks showing negative deltas, grouped by domain, and discusses observable patterns (e.g., task complexity or Skill-task mismatch); a small grouping sketch appears after this list. This analysis will directly address whether the aggregate gain is robust to domain composition. revision: yes
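The sketches below are editorial illustrations of the fixes promised above, not the authors' code. First, for response 1, one plausible instantiation of the promised interval estimate: a percentile bootstrap over per-task pass-rate deltas. The function name, interface, and choice to resample tasks rather than trajectories are assumptions.

```python
import random

def bootstrap_mean_ci(task_deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean per-task delta (in pp).

    Resampling whole tasks (rather than individual trajectories)
    respects the paired with/without-Skills design.
    """
    rng = random.Random(seed)
    n = len(task_deltas)
    boot_means = sorted(
        sum(rng.choice(task_deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = boot_means[int(n_boot * alpha / 2)]
    hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(task_deltas) / n, (lo, hi)

# Usage: one curated-minus-baseline delta per benchmark task, e.g.
# mean, (lo, hi) = bootstrap_mean_ci([12.5, -3.0, 51.9, 4.5, 16.2])
```

If the resulting interval excludes zero, the 16.2pp aggregate survives this check; negative-delta tasks simply widen it.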
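For response 3, a minimal sketch of a rule-based success predicate of the kind the rebuttal describes. The file name and checks are invented; the point is the property the benchmark relies on: the same agent output always yields the same pass/fail verdict, with no LLM judge in the loop.

```python
import json
from pathlib import Path

def verify_output(workdir: Path) -> bool:
    """Deterministic verifier for a hypothetical data-processing task.

    Every rule is an exact, replayable check on the produced artifact.
    """
    artifact = workdir / "result.json"  # invented artifact name
    if not artifact.exists():
        return False
    try:
        rows = json.loads(artifact.read_text())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(rows, list)
        and len(rows) > 0
        and all(isinstance(r, dict) and "id" in r for r in rows)
    )
```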
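And for response 4, a small grouping sketch of the promised negative-delta tabulation, again over an assumed per-task data shape.

```python
def negative_delta_breakdown(per_task):
    """Tabulate tasks harmed by curated Skills, grouped by domain.

    `per_task` maps task_id -> (domain, delta_pp); the shape is assumed
    for illustration.
    """
    harmed = {}
    for task_id, (domain, delta_pp) in per_task.items():
        if delta_pp < 0:
            harmed.setdefault(domain, []).append((task_id, delta_pp))
    for domain in sorted(harmed):
        worst = min(d for _, d in harmed[domain])
        print(f"{domain}: {len(harmed[domain])} task(s) harmed, worst {worst:+.1f}pp")
    return harmed
```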

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements

full rationale

The paper reports results from running 7,308 trajectories across 86 tasks under three explicit conditions (no Skills, curated Skills, self-generated Skills) and tabulates observed pass rates, domain variances, and per-task deltas. All central claims (16.2pp average lift, zero average benefit from self-generated Skills, focused vs. comprehensive module comparison) are direct aggregates of these measurements. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked to derive or predict any quantity; the reported numbers are the measurements themselves. The derivation chain is therefore self-contained and contains no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about task representativeness and verifier accuracy; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The 86 tasks across 11 domains are representative of real-world LLM agent applications.
    Benchmark validity depends on the chosen tasks being fair and comprehensive proxies for agent performance.
  • domain assumption: Deterministic verifiers accurately measure task success without systematic bias or error.
    All reported pass rates and deltas rely on these verifiers being reliable ground truth.

pith-pipeline@v0.9.0 · 5620 in / 1417 out tokens · 88413 ms · 2026-05-12T00:12:22.972654+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence (law_of_existence) · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  3. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  4. Counterfactual Trace Auditing of LLM Agent Skills

    cs.AI 2026-05 unverdicted novelty 7.0

    CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

  5. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  6. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

  7. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  8. SkillCom: Decomposing LLM-based Semantic Communication into Task and Channel Aware Skills

    eess.SY 2026-05 unverdicted novelty 7.0

    SkillCom decomposes LLM semantic communication into four skills connected by structured semantic-unit interfaces and outperforms monolithic LLM baselines in robustness on multi-hop QA and dialogue state tracking tasks.

  9. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

  10. TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

    cs.LG 2026-04 unverdicted novelty 7.0

    TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

  11. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  12. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  13. SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses

    cs.SE 2026-04 unverdicted novelty 7.0

    SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.

  14. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  15. SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.

  16. Evidence Over Plans: Online Trajectory Verification for Skill Distillation

    cs.AI 2026-05 unverdicted novelty 6.0

    PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.

  17. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.

  18. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.

  19. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  20. Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

    cs.CL 2026-05 unverdicted novelty 6.0

    GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.

  21. PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

    cs.AI 2026-05 unverdicted novelty 6.0

    PrefixGuard induces typed step adapters from agent traces offline then trains prefix-risk scorers on terminal outcomes, reaching 0.900/0.710/0.533/0.557 AUPRC on four benchmarks and beating raw-text baselines by 0.137...

  22. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  23. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  24. MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

    cs.AI 2026-04 unverdicted novelty 6.0

    MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).

  25. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  26. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  27. When Agent Markets Arrive

    cs.CE 2026-04 unverdicted novelty 6.0

    DIAGON simulation shows agent markets produce 3.2 times more wealth than isolated agents, but institutional choices like transparency and competitive selection can reduce rather than increase performance.

  28. ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

    cs.SE 2026-03 unverdicted novelty 6.0

    ContractSkill converts draft web agent skills into explicit executable contracts that enable deterministic verification, fault localization, and minimal local repair, improving stability on benchmarks like VisualWebArena.

  29. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  30. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  31. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  32. EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 5.0

    EvoAgent is an evolvable LLM agent framework using structured skill learning, user-feedback loops, and hierarchical delegation that boosts GPT5.2 performance by about 28% in real-world trade scenarios under LLM-as-Jud...

  33. Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

    cs.AI 2026-04 unverdicted novelty 5.0

    Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.

  34. From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

    cs.SE 2026-04 unverdicted novelty 5.0

    Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...

  35. SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.

  36. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  37. Gradual Cognitive Externalization: From Modeling Cognition to Constituting It

    cs.AI 2026-04 unverdicted novelty 5.0

    Ambient AI systems transition from modeling cognition to constituting part of users' cognitive architectures through sustained causal coupling, under a functionalist view and the no behaviorally invisible residual hypothesis.

  38. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  39. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  40. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  41. Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs

    cs.AI 2026-04 unverdicted novelty 4.0

    MESA-S framework translates human metacognitive control into LLMs via delayed procedural probes and Metacognitive Skill Cards to separate parametric certainty from source trust and reduce overthinking.

  42. Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

    cs.CL 2026-03 unverdicted novelty 4.0

    Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.

  43. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 39 Pith papers

  1. [1]

    Automated CI: Structural validation (harbor tasks check), oracle execution (harbor run -a oracle, must pass 100%), and AI-detection screening (GPTZero) on instruction.md

  2. [2]

    Reviewers run benchmark experiments with and without Skills across multiple agents

    Maintainer review: Evaluates data validity, task realism, oracle quality, Skill quality, and anti-cheating robustness. Reviewers run benchmark experiments with and without Skills across multiple agents. [figure: number of files per task, excluding metadata.json]

  3. [3]

    Of 322 candidate submissions from 105 contributors, 86 tasks passed all review stages and were included in the final benchmark (26.7% acceptance rate)

    Benchmark report: For each task, reviewers produce a structured report documenting oracle results, agent pass rates with and without Skills, failure analysis, and a final verdict (approve, major changes needed, or reject). Of 322 candidate submissions from 105 contributors, 86 tasks passed all review stages and were included in the final benchmark (26.7% ...

  4. [4]

    PRs with intentional grammar errors designed to circumvent AI detectors are closed

    AI detection: Verify instruction.md and task.toml are manually written using GPTZero and human review. PRs with intentional grammar errors designed to circumvent AI detectors are closed. 2. Data quality: Data must be real-world and appropriately complex. AI-generated or toy data is rejected. 3. Task validity: Tasks must be grounded in realistic professional...

  5. [5]

    5. Author history: Authors flagged multiple times across PRs are closed automatically

    Oracle quality: Simple solutions (e.g., an Excel formula or short script) are preferred over over-engineered oracle implementations. 5. Author history: Authors flagged multiple times across PRs are closed automatically

  6. [6]

    Test parsimony: Fewer than 10 test cases unless justified; tests should cover distinct criteria rather than repeat similar checks. [figure: number of files by file extension]

  7. [7]

    Multimodal verification: For multimodal tasks (audio, PPTX, video, PDF), maintainers personally inspect agent output to verify correctness beyond programmatic assertions. B.7. Automated CI Pipeline: The CI pipeline performs the following checks on each PR: • Structural validation (harbor tasks check): Verifies required files exist, TOML schema is valid, D...

  8. [8]

    expected output, root cause, and evidence from trajectories

    Failure analysis: Per-test breakdown of failures including actual vs. expected output, root cause, and evidence from trajectories. 6. Recommendation: One of: APPROVE, APPROVE WITH CAVEATS, MAJOR CHANGES NEEDED, or REJECT. B.9. Review Lifecycle: PRs progress through a defined label-based lifecycle: 1. WIP → Need review: Author signals readiness for initial review. ...

  9. [9]

    ""obs: terminal output -> action

    Reviewing → Change requested / Major change needed / Critical change needed: Issues identified; author must address. 4. Change requested → Take another look: Author responds after changes. 5. Ready to merge → Good task: All reviews passed; task included in benchmark. Critical changes include unrealistic task scenarios, AI-generated instructions, or synthetic da...

  10. [10]

    Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed

  11. [11]

    Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks

  12. [12]

    Save each skill as a markdown file in the environment/skills/ directory with a descriptive name

  13. [13]

    Unaware of Termination Conditions

    Then solve the task using the skills you created as reference. The environment/skills/ directory is empty at the start; agents must populate it before solving the task. No curated Skills are provided. The self-generated condition is evaluated on Claude Code (all four models) and C...