Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Dingsiyi; Feiyu Wang; Meijun Gao; Tong Yang; Xinyu Tan; Xi Zhang; Yanshu Wang; Yilun Yao; Yuntian Zhao

arXiv: 2605.19604 · v1 · pith:JQ25MPVU · submitted 2026-05-19 · cs.AI

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification: cs.AI

keywords: Formal Skill · LLM agents · runtime skills · token efficiency · executable state machines · hook policies · agent workflows · Harness-Bench

The pith

Formal Skill encodes reusable LLM agent procedures as executable state machines and hook policies instead of repeated prompt text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agent skills mostly take the form of long informal Markdown documents or instruction packs that consume many tokens and leave workflow control outside the skill definition. The paper introduces Formal Skill as a runtime-native structure that uses JSON metadata, action schemas, Python executors, and hook-governed logic to hold reusable procedures. Shifting the logic into executable state machines and skill-local runtime state gives agents a control surface that is both token-efficient and enforceable at runtime. Implementation in the FairyClaw event-driven runtime delivers competitive Harness-Bench scores while using substantially fewer tokens, especially on tasks that highlight the structured skill role. A sympathetic reader would care because the change could make reliable agent behavior in real workspaces cheaper and more predictable without expanding the main prompt.

Core claim

Formal Skill is a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. The FairyClaw implementation obtains highly competitive average scores on Harness-Bench while using substantially fewer tokens.

What carries the argument

Formal Skill, the runtime abstraction that combines JSON metadata, Python executors, hook-governed control logic, and skill-local runtime state to turn reusable procedures into executable, observable, and composable units.
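
To make the components named above concrete, here is a minimal Python sketch of how metadata, executors, hooks, and skill-local state might sit together in one object. Every name in it (FormalSkillSketch, run, the text_stats skill, the metadata layout) is an assumption made for illustration; FairyClaw's actual interfaces are not shown on this page and may look quite different.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FormalSkillSketch:
    """Hypothetical container, not FairyClaw's real API."""
    metadata: dict          # JSON metadata plus action schemas
    executors: dict         # action name -> Python callable
    hooks: list = field(default_factory=list)   # callables run before each action
    state: dict = field(default_factory=dict)   # skill-local runtime state

    def run(self, action: str, **args: Any) -> Any:
        schema = self.metadata["actions"][action]
        missing = [name for name in schema["required"] if name not in args]
        if missing:
            raise ValueError(f"{action} is missing arguments: {missing}")
        for hook in self.hooks:
            hook(action, args, self.state)      # a hook may raise to veto the call
        result = self.executors[action](**args)
        self.state["last_action"] = action      # kept outside the model's context
        return result


def count_words(text: str) -> int:
    return len(text.split())


# Assumed metadata layout: one action with a list of required argument names.
metadata = json.loads(
    '{"name": "text_stats", "actions": {"count_words": {"required": ["text"]}}}'
)
skill = FormalSkillSketch(metadata=metadata, executors={"count_words": count_words})
word_total = skill.run("count_words", text="formal skills are executable")
print(word_total)  # prints 4
```

The point of the sketch is only that the procedure lives in code and data rather than in prompt text: the model would need to name an action and its arguments, not restate the workflow.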

If this is right

Reusable procedures no longer need to be re-described in every prompt, directly lowering token counts for repeated workflows.
Hook policies and state-machine logic supply enforceable completion discipline and error handling inside the skill itself.
Skill-local runtime state keeps workflow context outside the agent's main context window.
Skills become modular and observable, enabling easier composition and debugging through the event-driven runtime.
Tasks that depend on structured procedures show improved efficiency while retaining competitive accuracy.
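
The second and third consequences above can be illustrated with a small sketch of a hook policy that enforces completion discipline using skill-local state. The action names (verify, finish), the PolicyViolation exception, and the dispatch function are all hypothetical; the page does not describe FairyClaw's hook signatures.

```python
class PolicyViolation(Exception):
    """Raised by a hook to block an action (hypothetical name)."""


def require_verified_before_finish(action: str, state: dict) -> None:
    # Hook policy: completion is refused until the skill's own state
    # records that verification happened.
    if action == "finish" and not state.get("verified", False):
        raise PolicyViolation("finish blocked: run 'verify' first")


def dispatch(action: str, state: dict, hooks: list) -> str:
    for hook in hooks:
        hook(action, state)          # any hook may veto by raising
    if action == "verify":
        state["verified"] = True     # skill-local state, not prompt text
    return f"{action}: ok"


state = {}
hooks = [require_verified_before_finish]

try:
    dispatch("finish", state, hooks)
except PolicyViolation as exc:
    print(exc)                       # the premature finish is rejected

print(dispatch("verify", state, hooks))
print(dispatch("finish", state, hooks))
```

Because the rule is checked in code, it holds regardless of what the model was told or remembers, which is the sense of "enforceable" used above.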

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structure could reduce the need for elaborate model-context-protocol servers by handling more workflow inside executable skills.
Long-running agent sessions might become more robust because skill-local state survives across multiple model calls without prompt inflation.
Adoption could shift agent engineering effort from prompt tuning toward writing and verifying hook policies and executors.
Integration with existing function-calling interfaces might become simpler once skills expose standardized action schemas at runtime.

Load-bearing premise

The token and accuracy advantages observed on Harness-Bench tasks that expose the role of Formal Skill will translate to broader real-world agent workflows without introducing execution overhead or compatibility problems.

What would settle it

A side-by-side run on a wider set of real-world agent tasks that shows Formal Skill versions using more tokens or achieving lower success rates than equivalent informal prompt-based skills would falsify the efficiency and reliability claims.
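
The falsification condition above reduces to a simple comparison, sketched below. RunSummary and contradicts_claims are hypothetical names, and the numbers are placeholders chosen only to exercise the function; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    tokens: int       # total tokens consumed across the task set
    successes: int    # tasks completed successfully
    attempts: int     # tasks attempted

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts


def contradicts_claims(formal: RunSummary, informal: RunSummary) -> bool:
    # The claims fail if the Formal Skill variant uses more tokens
    # or succeeds less often than the informal prompt-based variant.
    return (formal.tokens > informal.tokens
            or formal.success_rate < informal.success_rate)


# Placeholder inputs, not real results.
formal = RunSummary(tokens=1000, successes=8, attempts=10)
informal = RunSummary(tokens=1500, successes=8, attempts=10)
print(contradicts_claims(formal, informal))  # prints False
```

A real comparison would also need matched tasks, the same underlying model, and repeated runs, so that a single noisy trial is not read as a refutation.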

Figures

Figures reproduced from arXiv: 2605.19604 by Dingsiyi, Feiyu Wang, Meijun Gao, Tong Yang, Xinyu Tan, Xi Zhang, Yanshu Wang, Yilun Yao, Yuntian Zhao.

**Figure 2.** CODEREPAIROPS state vocabulary and main transitions. In the current executor, evidence collection from reproduce may advance directly to patch; diagnose remains a routed phase for extra evidence or contextual patching.
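
The transitions described in the Figure 2 caption can be sketched as a small table-driven state machine. Only the three phases the caption names (reproduce, diagnose, patch) are included; the full state vocabulary in the figure is not reproduced here, and the PhaseMachine class and its methods are assumptions for illustration rather than the paper's executor.

```python
# Allowed moves between the phases named in the caption.
# Assumption: phases after "patch" are omitted because the caption does not list them.
TRANSITIONS = {
    "reproduce": {"patch", "diagnose"},  # may go straight to patch
    "diagnose": {"patch"},               # routed phase for extra evidence
    "patch": set(),
}


class PhaseMachine:
    """Hypothetical executor-side state machine."""

    def __init__(self, start: str = "reproduce") -> None:
        self.phase = start
        self.history = [start]

    def advance(self, target: str) -> None:
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition: {self.phase} -> {target}")
        self.phase = target
        self.history.append(target)


direct = PhaseMachine()
direct.advance("patch")
print(direct.history)    # ['reproduce', 'patch']

routed = PhaseMachine()
routed.advance("diagnose")
routed.advance("patch")
print(routed.history)    # ['reproduce', 'diagnose', 'patch']
```

Keeping the legal moves in a table means an illegal jump is rejected by the runtime rather than discouraged by prompt wording.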

Original abstract

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Formal Skill, a runtime-native abstraction for LLM agent skills that uses JSON metadata, action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. Implemented in the open-source FairyClaw event-driven runtime, the approach moves reusable procedures from repeated prompt text into executable state machines and hook policies to achieve token-efficient and enforceable control. On Harness-Bench, FairyClaw reports highly competitive average scores while using substantially fewer tokens, with stronger results on tasks that highlight Formal Skill's role.

Significance. If the performance claims hold under more rigorous scrutiny, this work could advance LLM agent design by supplying a programmable, observable control surface that reduces reliance on verbose natural-language instructions. The open-source FairyClaw implementation and emphasis on composable, executable skills constitute concrete strengths that could support reproducibility and extension by the community.

major comments (2)

[§5 Experiments] Harness-Bench evaluation: The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the data cannot be verified to support the efficiency and accuracy claims.
[§6 Discussion or Conclusion] The central claim that advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with frameworks such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.
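
One way the executor-overhead measurement requested above could be approached is sketched here with the standard library only. The two functions are stand-ins: bare_executor represents a plain tool call and wrapped_executor adds a toy argument check and state update in place of real schema validation and hooks. Neither reflects FairyClaw code, and timings will vary by machine.

```python
import statistics
import time


def time_call(fn, *args, repeats: int = 200) -> float:
    """Median wall-clock seconds for one call to fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def bare_executor(text: str) -> str:
    return text.upper()


def wrapped_executor(text: str) -> str:
    # Stand-in for the extra work a skill runtime might do per call.
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    state = {"calls": 1}   # toy skill-local state update
    return bare_executor(text)


bare_seconds = time_call(bare_executor, "hello")
wrapped_seconds = time_call(wrapped_executor, "hello")
print(f"bare: {bare_seconds:.2e} s, wrapped: {wrapped_seconds:.2e} s")
```

The median is used rather than the mean so that occasional scheduler pauses do not dominate such short measurements.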

minor comments (2)

[§3 Formal Skill Definition] The definition of Formal Skill in §3 could more explicitly separate the roles of hook policies versus skill-local runtime state to avoid potential reader confusion.
[Figure 2] Figure captions in the FairyClaw architecture diagram would benefit from additional detail on how state machines interact with hook policies during execution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our submission. We address each major comment below and describe the changes planned for the revised manuscript.

Referee: [§5 Experiments] Harness-Bench evaluation: The abstract and results claim competitive scores with substantially fewer tokens, yet the section supplies no baselines, number of runs, statistical tests, variance measures, or error analysis. Without these, the data cannot be verified to support the efficiency and accuracy claims.

Authors: We agree that the experimental section would benefit from greater statistical detail. The current manuscript reports Harness-Bench average scores for FairyClaw but does not present explicit baseline comparisons, the number of evaluation runs, variance, or error analysis. In the revision we will add a dedicated baselines subsection comparing against standard LLM-agent and skill-framework approaches, state that all reported scores are means over five independent runs, include standard deviations, and provide a brief error analysis focused on token-consumption variance.

revision: yes

Referee: [§6 Discussion or Conclusion] The central claim that advantages generalize to real-world workflows requires evidence on runtime overhead from Python executors, compatibility with frameworks such as function calling or MCP, and avoidance of new failure modes (state corruption, policy conflicts). No such measurements or tests are reported, leaving the token-efficiency and enforceability advantages potentially benchmark-specific.

Authors: We acknowledge the limitation. While FairyClaw’s design supports integration with function-calling and MCP interfaces through its action-schema layer, the manuscript does not quantify executor overhead or systematically test for state-corruption or policy-conflict failures. In the revised Discussion we will add a qualitative analysis of these potential failure modes drawn from our implementation experience, describe compatibility mechanisms, and report preliminary overhead measurements obtained during FairyClaw development. Comprehensive real-world workflow benchmarks remain future work and will be noted as such.

revision: partial
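
The reporting promised in the first response (means over five independent runs with standard deviations) is straightforward to compute. The sketch below uses Python's statistics module; the summarize helper is a hypothetical name and the scores are placeholder values, not results from the paper.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder values only; these are NOT results from the paper.
placeholder_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean_score, std_score = summarize(placeholder_scores)
print(f"mean={mean_score:.3f} std={std_score:.3f} over {len(placeholder_scores)} runs")
```

statistics.stdev computes the sample standard deviation (dividing by n - 1), which is the usual choice when five runs are treated as a sample of possible outcomes.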

Circularity Check

0 steps flagged

No significant circularity; self-contained implementation and benchmark paper

The paper introduces Formal Skill as a new runtime abstraction for LLM agents, describes its JSON metadata, Python executors, hook policies, and implementation in FairyClaw, then reports empirical results on Harness-Bench showing competitive scores with fewer tokens. No mathematical derivations, equations, parameter fitting presented as predictions, or self-referential definitions appear in the provided text. The central claim rests on the implementation description and benchmark outcomes rather than reducing to its own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the result. This matches the default expectation for a systems paper evaluated against external benchmarks and warrants a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper's contribution centers on the newly introduced Formal Skill concept and its runtime implementation. No numerical free parameters are described. The key domain assumption is that current skills lack workflow state and policy enforcement.

axioms (1)

domain assumption: Existing skills for LLM agents remain largely informal and leave workflow state, policy enforcement, and completion discipline outside the skill itself.
Explicitly stated in the abstract as the motivation for introducing Formal Skill.

invented entities (2)

Formal Skill (no independent evidence)
purpose: A runtime-native abstraction representing reusable capability via JSON metadata, action schemas, Python executors, hook-governed control logic, routing, and skill-local state.
Newly defined in the paper as the central contribution.

FairyClaw (no independent evidence)
purpose: Open-source event-driven runtime supporting executable, observable, and composable Formal Skills.
Introduced as the concrete implementation vehicle for the abstraction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited theorem: unclear
Paper passage: "Formal Skill converts a reusable agent capability from a purely natural-language artifact into a structured executable object with five components: (1) JSON metadata and action schemas... (3) lifecycle hooks... (4) skill-local runtime state... (5) routing metadata"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited theorem: unclear
Paper passage: "By moving reusable procedure from repeated prompt text into executable state machines and hook policies"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

