Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3
The pith
Formal Skill encodes reusable LLM agent procedures as executable state machines and hook policies instead of repeated prompt text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formal Skill is a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. The FairyClaw implementation obtains highly competitive average scores on Harness-Bench while using substantially fewer tokens.
What carries the argument
Formal Skill, the runtime abstraction that combines JSON metadata, Python executors, hook-governed control logic, and skill-local runtime state to turn reusable procedures into executable, observable, and composable units.
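The combination described above can be sketched in a few lines of Python. This is an illustrative sketch only, not FairyClaw's actual API: the `FormalSkill` class, the `file_patch` skill, the metadata field names, the `TRANSITIONS` table, and the `no_absolute_paths` hook are all hypothetical names chosen for the example. It shows JSON metadata declaring actions, Python executors doing the work, a state machine plus a hook policy deciding whether an action may run, and skill-local state kept outside the prompt.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Hypothetical JSON metadata and action schema for one skill.
SKILL_METADATA = json.loads("""
{
  "name": "file_patch",
  "actions": {
    "apply":  {"params": {"path": "string", "diff": "string"}},
    "verify": {"params": {}},
    "finish": {"params": {}}
  }
}
""")

# Hypothetical state machine: state -> {action: next_state}.
TRANSITIONS = {
    "idle": {"apply": "applied"},
    "applied": {"apply": "applied", "verify": "verified"},
    "verified": {"finish": "done"},
}

Executor = Callable[[Dict[str, Any], Dict[str, Any]], str]
Hook = Callable[[str, Dict[str, Any], Dict[str, Any]], Optional[str]]


@dataclass
class FormalSkill:
    metadata: Dict[str, Any]
    executors: Dict[str, Executor]
    pre_hooks: List[Hook] = field(default_factory=list)
    state: str = "idle"
    local: Dict[str, Any] = field(default_factory=dict)  # skill-local state

    def dispatch(self, action: str, params: Dict[str, Any]) -> str:
        spec = self.metadata["actions"].get(action)
        if spec is None:
            return f"rejected: unknown action {action!r}"
        missing = sorted(set(spec["params"]) - set(params))
        if missing:
            return f"rejected: missing params {missing}"
        # State machine check: is this action legal in the current state?
        next_state = TRANSITIONS.get(self.state, {}).get(action)
        if next_state is None:
            return f"rejected: {action!r} not allowed in state {self.state!r}"
        # Hook policies: each hook returns a rejection reason or None.
        for hook in self.pre_hooks:
            reason = hook(action, params, self.local)
            if reason:
                return f"rejected: {reason}"
        result = self.executors[action](params, self.local)
        self.state = next_state
        return result


def no_absolute_paths(action, params, local):
    """Example policy hook: refuse to patch absolute paths."""
    if action == "apply" and params["path"].startswith("/"):
        return "absolute paths are not allowed"
    return None


def apply_executor(params, local):
    local.setdefault("patched", []).append(params["path"])
    return "applied"


def verify_executor(params, local):
    local["verified"] = True
    return "verified"


def finish_executor(params, local):
    return f"done: {len(local.get('patched', []))} file(s)"


EXECUTORS = {
    "apply": apply_executor,
    "verify": verify_executor,
    "finish": finish_executor,
}

skill = FormalSkill(SKILL_METADATA, EXECUTORS, [no_absolute_paths])
log = [
    skill.dispatch("finish", {}),  # rejected: not verified yet
    skill.dispatch("apply", {"path": "a.py", "diff": ""}),
    skill.dispatch("verify", {}),
    skill.dispatch("finish", {}),
]
```

In this sketch the completion discipline lives in the transition table rather than in prompt text: the model can request `finish` at any time, but the runtime only honors it once the skill has reached the `verified` state.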
If this is right
- Reusable procedures no longer need to be re-described in every prompt, directly lowering token counts for repeated workflows.
- Hook policies and state-machine logic supply enforceable completion discipline and error handling inside the skill itself.
- Skill-local runtime state keeps workflow context outside the agent's main context window.
- Skills become modular and observable, enabling easier composition and debugging through the event-driven runtime.
- Tasks that depend on structured procedures show improved efficiency while retaining competitive accuracy.
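The first implication is simple arithmetic, sketched below under loud assumptions: token counts are approximated by whitespace-separated words (not a real tokenizer), both texts are assumed to be re-sent on every model call, and the two strings are invented for illustration rather than taken from the paper.

```python
def approx_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


def total_tokens(per_call_text: str, calls: int) -> int:
    """Tokens spent if the same text is re-sent on every model call."""
    return approx_tokens(per_call_text) * calls


# Hypothetical informal skill: the procedure is restated in the prompt.
PROCEDURE_PROMPT = (
    "First apply the patch. Then run the verifier. "
    "Only call finish after verification succeeds. "
    "If verification fails, reapply."
)

# Hypothetical formal skill: the prompt only lists the available actions,
# while ordering and retry rules live in the runtime.
ACTION_SUMMARY = "actions: apply, verify, finish"

calls = 10
prompt_total = total_tokens(PROCEDURE_PROMPT, calls)
skill_total = total_tokens(ACTION_SUMMARY, calls)
saved = prompt_total - skill_total
print(prompt_total, skill_total, saved)
```

The saving grows linearly with the number of calls and with the length of the procedure that no longer has to be restated, which is why the benefit would be expected to concentrate in long, repeated workflows.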
Where Pith is reading between the lines
- The same structure could reduce the need for elaborate Model Context Protocol (MCP) servers by handling more workflow inside executable skills.
- Long-running agent sessions might become more robust because skill-local state survives across multiple model calls without prompt inflation.
- Adoption could shift agent engineering effort from prompt tuning toward writing and verifying hook policies and executors.
- Integration with existing function-calling interfaces might become simpler once skills expose standardized action schemas at runtime.
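The last point can be made concrete with a small adapter. The sketch below assumes a hypothetical skill metadata layout (the `name`, `actions`, `description`, and `params` fields are invented for this example) and converts it into the general tool-definition shape used by OpenAI-style function calling, where each tool carries a name, a description, and a JSON Schema object for its parameters. Whether FairyClaw exposes skills this way is not stated in the source.

```python
from typing import Any, Dict, List


def to_tool_definitions(skill: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map each skill action to one function-calling tool definition."""
    tools = []
    for action, spec in skill["actions"].items():
        params = spec.get("params", {})
        tools.append({
            "type": "function",
            "function": {
                # Prefix with the skill name so actions from different
                # skills cannot collide.
                "name": f"{skill['name']}__{action}",
                "description": spec.get("description", ""),
                "parameters": {
                    "type": "object",
                    "properties": {k: {"type": t} for k, t in params.items()},
                    "required": sorted(params),
                },
            },
        })
    return tools


# Hypothetical skill metadata, for illustration only.
example_skill = {
    "name": "file_patch",
    "actions": {
        "apply": {
            "description": "Apply a diff to a file.",
            "params": {"path": "string", "diff": "string"},
        },
        "verify": {"description": "Run the verifier.", "params": {}},
    },
}

tools = to_tool_definitions(example_skill)
```

The adapter only translates the action schema. State transitions and hook checks would still have to run inside the skill runtime when the model calls one of these tools.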
Load-bearing premise
The token and accuracy advantages observed on Harness-Bench tasks that expose the role of Formal Skill will translate to broader real-world agent workflows without introducing execution overhead or compatibility problems.
What would settle it
A side-by-side run on a wider set of real-world agent tasks would settle it: if the Formal Skill versions used more tokens or achieved lower success rates than equivalent informal prompt-based skills, the efficiency and reliability claims would be falsified.
Original abstract
Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Formal Skill, a runtime-native abstraction for LLM agent skills that uses JSON metadata, action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. Implemented in the open-source FairyClaw event-driven runtime, the approach moves reusable procedures from repeated prompt text into executable state machines and hook policies to achieve token-efficient and enforceable control. On Harness-Bench, FairyClaw reports highly competitive average scores while using substantially fewer tokens, with stronger results on tasks that highlight Formal Skill's role.
Significance. If the performance claims hold under more rigorous scrutiny, this work could advance LLM agent design by supplying a programmable, observable control surface that reduces reliance on verbose natural-language instructions. The open-source FairyClaw implementation and emphasis on composable, executable skills constitute concrete strengths that could support reproducibility and extension by the community.
major comments (2)
- [§5 Experiments] Harness-Bench evaluation: The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the efficiency and accuracy claims cannot be verified.
- [§6 Discussion] Discussion or Conclusion: The central claim that the advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with interfaces such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.
minor comments (2)
- [§3 Formal Skill Definition] The definition of Formal Skill in §3 could more explicitly separate the roles of hook policies versus skill-local runtime state to avoid potential reader confusion.
- [Figure 2] The caption of the FairyClaw architecture diagram would benefit from additional detail on how state machines interact with hook policies during execution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our submission. We address each major comment below and describe the changes planned for the revised manuscript.
- Referee: [§5 Experiments] Harness-Bench evaluation: The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the efficiency and accuracy claims cannot be verified.
  Authors: We agree that the experimental section would benefit from greater statistical detail. The current manuscript reports Harness-Bench average scores for FairyClaw but does not present explicit baseline comparisons, the number of evaluation runs, variance, or error analysis. In the revision we will add a dedicated baselines subsection comparing against standard LLM-agent and skill-framework approaches, state that all reported scores are means over five independent runs, include standard deviations, and provide a brief error analysis focused on token-consumption variance.
  Revision: yes
- Referee: [§6 Discussion] Discussion or Conclusion: The central claim that the advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with interfaces such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.
  Authors: We acknowledge the limitation. While FairyClaw’s design supports integration with function-calling and MCP interfaces through its action-schema layer, the manuscript does not quantify executor overhead or systematically test for state-corruption or policy-conflict failures. In the revised Discussion we will add a qualitative analysis of these potential failure modes drawn from our implementation experience, describe compatibility mechanisms, and report preliminary overhead measurements obtained during FairyClaw development. Comprehensive real-world workflow benchmarks remain future work and will be noted as such.
  Revision: partial
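The two measurements promised in the rebuttal, mean and standard deviation over independent runs and per-call executor overhead, could be reported along the following lines. This is a minimal sketch using only the Python standard library. The scores are placeholder values, not results from the paper, and `dummy_executor` is a hypothetical stand-in for a real skill executor.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def mean_call_seconds(executor: Callable[[], object], repeats: int = 1000) -> float:
    """Average wall-clock time of one executor call over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        executor()
    return (time.perf_counter() - start) / repeats


def dummy_executor() -> int:
    """Hypothetical stand-in for a skill executor."""
    return sum(range(100))


# Placeholder scores for five runs; NOT taken from the paper.
placeholder_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean_score, std_score = summarize_runs(placeholder_scores)
overhead = mean_call_seconds(dummy_executor)

print(f"score: {mean_score:.3f} +/- {std_score:.3f} over {len(placeholder_scores)} runs")
print(f"executor overhead: {overhead * 1e6:.2f} microseconds per call")
```

Reporting the sample standard deviation alongside the mean lets readers judge whether a gap between two systems exceeds run-to-run noise. The timing loop measures only executor cost, so it would need to be compared against model latency to say whether the overhead matters in practice.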
Circularity Check
No significant circularity; self-contained implementation and benchmark paper
The paper introduces Formal Skill as a new runtime abstraction for LLM agents, describes its JSON metadata, Python executors, hook policies, and implementation in FairyClaw, then reports empirical results on Harness-Bench showing competitive scores with fewer tokens. No mathematical derivations, equations, parameter fitting presented as predictions, or self-referential definitions appear in the provided text. The central claim rests on the implementation description and benchmark outcomes rather than reducing to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the result. This matches the default expectation for a systems paper evaluated against external benchmarks and warrants a score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing skills for LLM agents remain largely informal and leave workflow state, policy enforcement, and completion discipline outside the skill itself.
invented entities (2)
- Formal Skill: no independent evidence
- FairyClaw: no independent evidence
Lean theorems connected to this paper
- Theorem: reality_from_one_distinction (IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean)
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Formal Skill converts a reusable agent capability from a purely natural-language artifact into a structured executable object with five components: (1) JSON metadata and action schemas... (3) lifecycle hooks... (4) skill-local runtime state... (5) routing metadata
- Theorem: washburn_uniqueness_aczel (IndisputableMonolith/Cost/FunctionalEquation.lean)
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: By moving reusable procedure from repeated prompt text into executable state machines and hook policies
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.