
arxiv: 2605.19604 · v1 · pith:JQ25MPVUnew · submitted 2026-05-19 · 💻 cs.AI

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords Formal Skill · LLM agents · runtime skills · token efficiency · executable state machines · hook policies · agent workflows · Harness-Bench

The pith

Formal Skill encodes reusable LLM agent procedures as executable state machines and hook policies instead of repeated prompt text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agent skills mostly take the form of long informal Markdown documents or instruction packs that consume many tokens and leave workflow control outside the skill definition. The paper introduces Formal Skill as a runtime-native structure that uses JSON metadata, action schemas, Python executors, and hook-governed logic to hold reusable procedures. Shifting the logic into executable state machines and skill-local runtime state gives agents a control surface that is both token-efficient and enforceable at runtime. Implementation in the FairyClaw event-driven runtime delivers competitive Harness-Bench scores while using substantially fewer tokens, especially on tasks that highlight the structured skill role. A sympathetic reader would care because the change could make reliable agent behavior in real workspaces cheaper and more predictable without expanding the main prompt.

Core claim

Formal Skill is a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. The FairyClaw implementation obtains highly competitive average scores on Harness-Bench while using substantially fewer tokens.
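As a rough illustration of how these pieces could fit together, the sketch below bundles JSON metadata with an action schema, a Python executor, and skill-local state in one object. It is not FairyClaw's actual API: the `FormalSkill` class, the `record_failure` action, and every field name are hypothetical, chosen only to make the structure concrete.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical JSON metadata with one action schema (field names are assumptions).
SKILL_METADATA = json.loads("""
{
  "name": "code_repair",
  "description": "Reproduce a failing test, then patch it.",
  "actions": {
    "record_failure": {
      "type": "object",
      "properties": {"test_id": {"type": "string"}},
      "required": ["test_id"]
    }
  }
}
""")

Executor = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


@dataclass
class FormalSkill:
    """Illustrative container: metadata, executors, and skill-local state."""
    metadata: Dict[str, Any]
    executors: Dict[str, Executor]
    state: Dict[str, Any] = field(default_factory=dict)  # never sent to the prompt

    def call(self, action: str, args: Dict[str, Any]) -> Dict[str, Any]:
        # Check required arguments against the action schema before executing.
        schema = self.metadata["actions"][action]
        missing = [k for k in schema.get("required", []) if k not in args]
        if missing:
            return {"ok": False, "error": f"missing arguments: {missing}"}
        return self.executors[action](self.state, args)


def record_failure(state: Dict[str, Any], args: Dict[str, Any]) -> Dict[str, Any]:
    # The executor mutates skill-local state and returns a compact result.
    state.setdefault("failures", []).append(args["test_id"])
    return {"ok": True, "failure_count": len(state["failures"])}


skill = FormalSkill(SKILL_METADATA, {"record_failure": record_failure})
result = skill.call("record_failure", {"test_id": "test_parse"})
bad = skill.call("record_failure", {})
```

In this sketch the model only ever sees the compact return value, while the accumulated `failures` list stays in `skill.state`, which is one plausible reading of how skill-local state keeps workflow context out of the prompt.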

What carries the argument

Formal Skill, the runtime abstraction that combines JSON metadata, Python executors, hook-governed control logic, and skill-local runtime state to turn reusable procedures into executable, observable, and composable units.

If this is right

  • Reusable procedures no longer need to be re-described in every prompt, directly lowering token counts for repeated workflows.
  • Hook policies and state-machine logic supply enforceable completion discipline and error handling inside the skill itself.
  • Skill-local runtime state keeps workflow context outside the agent's main context window.
  • Skills become modular and observable, enabling easier composition and debugging through the event-driven runtime.
  • Tasks that depend on structured procedures show improved efficiency while retaining competitive accuracy.
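The completion-discipline point can be pictured with a small hook sketch. The hook names, their signatures, and the `tests_passed` flag are assumptions made for illustration; the summary does not describe FairyClaw's real hook interface.

```python
from typing import Any, Dict, List


def visible_tools_hook(state: Dict[str, Any], all_tools: List[str]) -> List[str]:
    """Hypothetical pre-call hook: hide 'finish' until the tests have passed."""
    if state.get("tests_passed"):
        return list(all_tools)
    return [t for t in all_tools if t != "finish"]


def before_tool_hook(state: Dict[str, Any], tool_name: str) -> Dict[str, Any]:
    """Hypothetical call-time hook: reject 'finish' even if the model emits it."""
    if tool_name == "finish" and not state.get("tests_passed"):
        return {"allow": False, "reason": "run_tests must pass before finish"}
    return {"allow": True, "reason": ""}


TOOLS = ["run_tests", "apply_patch", "finish"]
state = {"tests_passed": False}

early = before_tool_hook(state, "finish")         # blocked by policy
tools_before = visible_tools_hook(state, TOOLS)   # 'finish' is hidden

state["tests_passed"] = True                      # an executor would set this
late = before_tool_hook(state, "finish")          # now allowed
tools_after = visible_tools_hook(state, TOOLS)    # full tool list again
```

The rule lives in code rather than in prompt text, so it costs no tokens per turn and cannot be skipped by the model, which is the sense of "enforceable" used above.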

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure could reduce the need for elaborate model-context-protocol servers by handling more workflow inside executable skills.
  • Long-running agent sessions might become more robust because skill-local state survives across multiple model calls without prompt inflation.
  • Adoption could shift agent engineering effort from prompt tuning toward writing and verifying hook policies and executors.
  • Integration with existing function-calling interfaces might become simpler once skills expose standardized action schemas at runtime.

Load-bearing premise

The token and accuracy advantages observed on Harness-Bench tasks that expose the role of Formal Skill will translate to broader real-world agent workflows without introducing execution overhead or compatibility problems.

What would settle it

A side-by-side run on a wider set of real-world agent tasks that shows Formal Skill versions using more tokens or achieving lower success rates than equivalent informal prompt-based skills would falsify the efficiency and reliability claims.
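A minimal sketch of such a side-by-side check follows, assuming per-task token counts and pass/fail outcomes are available for both skill styles. The `TaskResult` type, the `compare` function, and the numbers are illustrative placeholders, not data from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    task: str
    tokens: int
    success: bool


def compare(formal: List[TaskResult], informal: List[TaskResult]) -> Dict[str, object]:
    """Paired comparison; both lists must cover the same tasks in the same order."""
    if [r.task for r in formal] != [r.task for r in informal]:
        raise ValueError("task lists must match")
    n = len(formal)
    formal_tokens = sum(r.tokens for r in formal)
    informal_tokens = sum(r.tokens for r in informal)
    formal_rate = sum(r.success for r in formal) / n
    informal_rate = sum(r.success for r in informal) / n
    return {
        "token_ratio": formal_tokens / informal_tokens,
        "success_delta": formal_rate - informal_rate,
        # The claim is contradicted if Formal Skill costs more or succeeds less.
        "claim_contradicted": formal_tokens > informal_tokens
        or formal_rate < informal_rate,
    }


# Placeholder numbers for illustration only; not results from the paper.
formal = [TaskResult("t1", 1000, True), TaskResult("t2", 3000, True)]
informal = [TaskResult("t1", 2000, True), TaskResult("t2", 6000, False)]
report = compare(formal, informal)
```

A real evaluation would also need repeated runs and a significance test before reading anything into `claim_contradicted`.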

Figures

Figures reproduced from arXiv: 2605.19604 by Dingsiyi, Feiyu Wang, Meijun Gao, Tong Yang, Xinyu Tan, Xi Zhang, Yanshu Wang, Yilun Yao, Yuntian Zhao.

Figure 1: Formal Skill hook pipeline. Hooks convert compact skill-local runtime state into tool ...
Figure 2: CODEREPAIROPS state vocabulary and main transitions. In the current executor, evidence collection from reproduce may advance directly to patch; diagnose remains a routed phase for extra evidence or contextual patching.
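The transitions named in the Figure 2 caption can be sketched as a small state machine. Only the three phases mentioned in the caption (reproduce, diagnose, patch) are modeled; the transition table, the `RepairMachine` class, and the treatment of patch as terminal are assumptions, and the actual figure may contain further states.

```python
from enum import Enum
from typing import Dict, List, Set


class Phase(Enum):
    REPRODUCE = "reproduce"
    DIAGNOSE = "diagnose"
    PATCH = "patch"


# Assumed from the caption: reproduce may go straight to patch,
# or be routed through diagnose first. Patch is treated as terminal here.
TRANSITIONS: Dict[Phase, Set[Phase]] = {
    Phase.REPRODUCE: {Phase.PATCH, Phase.DIAGNOSE},
    Phase.DIAGNOSE: {Phase.PATCH},
    Phase.PATCH: set(),
}


class RepairMachine:
    """Hypothetical executor-side state machine with a transition log."""

    def __init__(self) -> None:
        self.phase = Phase.REPRODUCE
        self.history: List[Phase] = [Phase.REPRODUCE]

    def advance(self, target: Phase) -> bool:
        # Refuse any transition that the table does not list.
        if target not in TRANSITIONS[self.phase]:
            return False
        self.phase = target
        self.history.append(target)
        return True


direct = RepairMachine()
direct_ok = direct.advance(Phase.PATCH)            # reproduce -> patch

routed = RepairMachine()
routed.advance(Phase.DIAGNOSE)                     # reproduce -> diagnose
routed_ok = routed.advance(Phase.PATCH)            # diagnose -> patch
backwards_ok = routed.advance(Phase.REPRODUCE)     # not in the table
```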
Original abstract

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Formal Skill, a runtime-native abstraction for LLM agent skills that uses JSON metadata, action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. Implemented in the open-source FairyClaw event-driven runtime, the approach moves reusable procedures from repeated prompt text into executable state machines and hook policies to achieve token-efficient and enforceable control. On Harness-Bench, FairyClaw reports highly competitive average scores while using substantially fewer tokens, with stronger results on tasks that highlight Formal Skill's role.

Significance. If the performance claims hold under more rigorous scrutiny, this work could advance LLM agent design by supplying a programmable, observable control surface that reduces reliance on verbose natural-language instructions. The open-source FairyClaw implementation and emphasis on composable, executable skills constitute concrete strengths that could support reproducibility and extension by the community.

major comments (2)
  1. §5 Experiments (Harness-Bench evaluation): The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the data cannot be verified to support the efficiency and accuracy claims.
  2. §6 Discussion or Conclusion: The central claim that advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with frameworks such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.

minor comments (2)
  1. §3 Formal Skill Definition: The definition of Formal Skill could more explicitly separate the roles of hook policies and skill-local runtime state to avoid reader confusion.
  2. Figure 2: The captions in the FairyClaw architecture diagram would benefit from additional detail on how state machines interact with hook policies during execution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our submission. We address each major comment below and describe the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: §5 Experiments (Harness-Bench evaluation): The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the data cannot be verified to support the efficiency and accuracy claims.

    Authors: We agree that the experimental section would benefit from greater statistical detail. The current manuscript reports Harness-Bench average scores for FairyClaw but does not present explicit baseline comparisons, the number of evaluation runs, variance, or error analysis. In the revision we will add a dedicated baselines subsection comparing against standard LLM-agent and skill-framework approaches, state that all reported scores are means over five independent runs, include standard deviations, and provide a brief error analysis focused on token-consumption variance.
    Revision: yes

  2. Referee: §6 Discussion or Conclusion: The central claim that advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with frameworks such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.

    Authors: We acknowledge the limitation. While FairyClaw’s design supports integration with function-calling and MCP interfaces through its action-schema layer, the manuscript does not quantify executor overhead or systematically test for state-corruption or policy-conflict failures. In the revised Discussion we will add a qualitative analysis of these potential failure modes drawn from our implementation experience, describe compatibility mechanisms, and report preliminary overhead measurements obtained during FairyClaw development. Comprehensive real-world workflow benchmarks remain future work and will be noted as such.
    Revision: partial
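The reporting promised in the first response (means over five independent runs with standard deviations) could be computed along these lines. The scores below are placeholders for illustration only and are not results from the paper; `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(runs: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation over independent runs."""
    return {"n": len(runs), "mean": mean(runs), "std": stdev(runs)}


# Placeholder scores for five hypothetical runs; NOT taken from the paper.
placeholder_scores = [70.0, 72.0, 68.0, 71.0, 69.0]
summary = summarize(placeholder_scores)
```

`statistics.stdev` uses the sample (n - 1) denominator, which is the usual choice when a handful of runs stands in for a larger population of possible runs.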

Circularity Check

0 steps flagged

No significant circularity; self-contained implementation and benchmark paper


The paper introduces Formal Skill as a new runtime abstraction for LLM agents, describes its JSON metadata, Python executors, hook policies, and implementation in FairyClaw, then reports empirical results on Harness-Bench showing competitive scores with fewer tokens. No mathematical derivations, equations, parameter fitting presented as predictions, or self-referential definitions appear in the provided text. The central claim rests on the implementation description and benchmark outcomes rather than reducing to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the result. This matches the default expectation for a systems paper evaluated against external benchmarks and warrants a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper's contribution centers on the newly introduced Formal Skill concept and its runtime implementation. No numerical free parameters are described. The key domain assumption is that current skills lack workflow state and policy enforcement.

axioms (1)
  • Domain assumption: Existing skills for LLM agents remain largely informal and leave workflow state, policy enforcement, and completion discipline outside the skill itself.
    Explicitly stated in the abstract as the motivation for introducing Formal Skill.

invented entities (2)
  • Formal Skill (no independent evidence)
    Purpose: A runtime-native abstraction representing reusable capability via JSON metadata, action schemas, Python executors, hook-governed control logic, routing, and skill-local state.
    Newly defined in the paper as the central contribution.
  • FairyClaw (no independent evidence)
    Purpose: Open-source event-driven runtime supporting executable, observable, and composable Formal Skills.
    Introduced as the concrete implementation vehicle for the abstraction.

pith-pipeline@v0.9.0 · 5738 in / 1377 out tokens · 61116 ms · 2026-05-20T05:49:00.855343+00:00 · methodology


