Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
Pith reviewed 2026-05-19 18:20 UTC · model grok-4.3
The pith
A skill is untrusted code until verified, so the runtime enforces verification before granting trust instead of relying on signatures or origins.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. The paper defines a trust schema carrying an explicit verification level on every skill manifest, a capability gate whose human-in-the-loop policy is a function of that level, a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise, and a portable runtime profile distilled into ten normative guidelines from an open-source reference implementation.
What carries the argument
Trust schema with explicit verification level on each skill manifest, which controls when the capability gate requires human-in-the-loop review for unverified skills.
If this is right
- Human-in-the-loop gates activate only for unverified skills rather than on every irreversible action.
- Agent runtimes avoid operational collapse into rubber-stamping at non-trivial scale.
- Verification occurs as a distinct gated process before the main runtime loads the skill.
- Trust decisions no longer depend on signatures, clearances, or origin registries.
Where Pith is reading between the lines
- The same verification requirement could apply to other forms of dynamic instruction loading in LLM-based systems.
- Standardizing the biconditional criterion might allow skills to move between different agent runtimes with retained verification status.
- Adversarial testing focused on LLM-specific behaviors could surface failure modes unique to agent skills.
Load-bearing premise
A verification procedure exists that can satisfy the biconditional correctness criterion when tested on an adversarial-ensemble exercise.
What would settle it
A concrete adversarial-ensemble test in which every candidate verification procedure either fails the biconditional criterion or allows an unsafe skill to pass as verified.
read the original abstract
Agent skills - structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself - have moved from convenience to first-class deployment artifact. The runtime that loads them inherits the same problem package managers and operating systems have always faced: a piece of content claims a behavior; the runtime must decide whether to believe it. We argue this paper's central thesis up front: a skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. Without skill verification, a human-in-the-loop (HITL) gate must fire on every irreversible call - which is operationally untenable and degrades into rubber-stamping at any non-trivial scale. With skill verification treated as a separate, gated process, HITL fires only for what is unverified, and the system becomes sustainable. We give a trust schema that includes an explicit verification level on every skill manifest; a capability gate whose HITL policy is a function of that verification level; a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise; and a portable runtime profile with ten normative guidelines abstracted from a working open-source reference implementation. The contribution is harness- and model-agnostic; nothing here requires retraining, fine-tuning, or proprietary infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that agent skills—structured packages of instructions and scripts for LLMs—must be treated as untrusted code until verified by the runtime. It introduces a trust schema with explicit verification levels on skill manifests, a capability gate whose human-in-the-loop (HITL) policy depends on those levels, a biconditional correctness criterion that any verification procedure must satisfy on adversarial-ensemble exercises, and ten normative guidelines for a portable runtime profile derived from an open-source reference implementation. The central claim is that this approach makes HITL sustainable by restricting interventions to unverified skills rather than requiring them on every irreversible action.
Significance. If a practical verification procedure satisfying the biconditional criterion can be shown to exist, the framework would provide a principled separation between verification and other trust signals such as signatures or origin registries. This could meaningfully reduce operational overhead in agent deployments while improving security posture. The grounding in a working reference implementation and the model-agnostic stance are positive features that distinguish the proposal from purely theoretical treatments.
major comments (2)
- [Abstract / Biconditional Correctness Criterion] Abstract and the section defining the biconditional correctness criterion: the central thesis requires that verification procedures exist which satisfy the biconditional criterion on adversarial-ensemble exercises, thereby allowing the capability gate to restrict HITL to unverified skills only. No concrete procedure, proof of satisfiability, or evaluation against the adversarial ensemble is supplied; the reference implementation is referenced solely for the ten runtime guidelines. This leaves the claim that verification enables sustainable HITL as an untested assumption and is load-bearing for the paper's contribution.
- [Trust Schema and Capability Gate] Section on the trust schema and capability gate: the HITL policy is defined as a function of the verification level, yet no formal semantics, decision procedure, or example policy table is provided to show how the gate would behave under the biconditional criterion. Without this, it is unclear whether the schema can be implemented without reintroducing the rubber-stamping problem it seeks to avoid.
minor comments (2)
- The abstract introduces several new terms (verification level, capability gate, biconditional correctness criterion) in rapid succession; a summary table or diagram relating these components would improve readability.
- [Portable Runtime Profile] The ten normative guidelines are presented as abstracted from the reference implementation; including a brief mapping or excerpt from the implementation for at least two guidelines would help readers assess their practicality.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive critique of the manuscript. We respond to each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract / Biconditional Correctness Criterion] Abstract and the section defining the biconditional correctness criterion: the central thesis requires that verification procedures exist which satisfy the biconditional criterion on adversarial-ensemble exercises, thereby allowing the capability gate to restrict HITL to unverified skills only. No concrete procedure, proof of satisfiability, or evaluation against the adversarial ensemble is supplied; the reference implementation is referenced solely for the ten runtime guidelines. This leaves the claim that verification enables sustainable HITL as an untested assumption and is load-bearing for the paper's contribution.
Authors: We agree that the manuscript does not present a concrete verification procedure, a proof that the biconditional criterion is satisfiable, or an empirical evaluation on an adversarial ensemble. The paper's contribution is the definition of the criterion that any verification procedure must meet, the trust schema that records verification levels, and the ten normative guidelines for a portable runtime profile derived from the reference implementation. We do not claim to have constructed or validated a specific procedure; rather, we argue that separating verification from other trust signals is a prerequisite for sustainable HITL. To address the concern, we will revise the abstract and the relevant section to state explicitly that the framework assumes the existence of procedures satisfying the criterion and will add a short discussion of candidate directions for constructing such procedures as future work. revision: partial
-
Referee: [Trust Schema and Capability Gate] Section on the trust schema and capability gate: the HITL policy is defined as a function of the verification level, yet no formal semantics, decision procedure, or example policy table is provided to show how the gate would behave under the biconditional criterion. Without this, it is unclear whether the schema can be implemented without reintroducing the rubber-stamping problem it seeks to avoid.
Authors: The referee is correct that the manuscript defines the capability gate's policy as depending on verification level but does not supply a concrete policy table or formal semantics. We will add an illustrative policy table in the revised manuscript that shows, for each verification level, which capability requests trigger mandatory HITL review versus automated allowance. This table will be accompanied by a brief explanation of how the biconditional criterion informs the policy, making clear that lower verification levels require human intervention on irreversible actions and thereby avoiding indiscriminate approval. revision: yes
Circularity Check
No significant circularity; definitions and criterion are independent requirements
full rationale
The paper presents a trust schema, explicit verification levels, a capability gate, and a biconditional correctness criterion as new constructs that any verification procedure must satisfy on an adversarial-ensemble exercise. These are framed as requirements and policy functions rather than results derived from or equivalent to fitted parameters, self-citations, or prior definitions within the paper. The reference to an open-source implementation applies only to the ten normative runtime guidelines and is not load-bearing for the central thesis that skills remain untrusted until verified. No equations, reductions by construction, or self-referential derivations appear in the provided abstract or described content. The framework is a self-contained conceptual proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A skill is untrusted code until it is verified
invented entities (3)
-
Verification level on skill manifest
no independent evidence
-
Capability gate with HITL policy
no independent evidence
-
Biconditional correctness criterion
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InInternational Conference on Learning Representations (ICLR)
work page 2025
-
[2]
D. Elliott Bell and Leonard J. LaPadula. 1976.Secure Computer System: Unified Exposition and Multics Interpretation. Technical Report MTR-2997. The MITRE Corporation
work page 1976
-
[3]
Alex Birsan. 2021. Dependency Confusion: How I Hacked Into Apple, Mi- crosoft and Dozens of Other Companies. https://medium.com/@alex.birsan/ 15 dependency-confusion-4a5d60fec610. Disclosure of a supply-chain class of attacks against npm, PyPI, and RubyGems
work page 2021
-
[4]
Justin Cappos, Justin Samuel, Scott Baker, and John H. Hartman. 2008. A Look in the Mirror: Attacks on Package Managers. InProceedings of the 15th ACM Conference on Computer and Communications Security (CCS). 565–574
work page 2008
-
[5]
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red- Teaming LLM Agents via Poisoning Memory or Knowledge Bases. InAdvances in Neural Information Processing Systems (NeurIPS)
work page 2024
-
[6]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track
work page 2023
-
[7]
Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner. 2026. Towards Verifiably Safe Tool Use for LLM Agents. InProceedings of the 48th IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE NIER). Preprint: arXiv:2601.08012
-
[8]
Petros Efstathopoulos, Maxwell Krohn, Steve VanDeBogart, Cliff Frey, David Ziegler, Eddie Kohler, David Mazières, Frans Kaashoek, and Robert Morris. 2005. Labels and Event Processes in the Asbestos Operating System. InProceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP). 17–30
work page 2005
-
[9]
Emilio Ferrara. 2024. GenAI Against Humanity: Nefarious Applications of Generative Artificial Intelligence and Large Language Models.Journal of Computational Social Science7 (2024), 549–569
work page 2024
-
[10]
Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang
-
[11]
Skillprobe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration
SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. arXiv:2603.21019 [cs.CR]
-
[12]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard: LLM-Based Input–Output Safeguard for Human–AI Conversations. arXiv:2312.06674
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Lakera AI. 2024. Lakera Guard: Real-Time LLM Security.https://www.lakera.ai/
work page 2024
-
[14]
Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. 2024. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. InAdvances in Neural Information Processing Systems (NeurIPS)
work page 2024
-
[15]
Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui, and Tianyue Luo. 2026. Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis. arXiv:2604.02837 [cs.CR]
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[16]
Hailin Liu, Eugene Ilyushin, Jie Ni, and Min Zhu. 2026. SafeAgent: A Runtime Protection Architecture for Agentic Systems. arXiv:2604.17562 [cs.CR] Stateful runtime controller mediating actions around the agent loop with a context-aware decision core over persistent session state. 16
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Peter Loscocco and Stephen Smalley. 2001. Integrating Flexible Support for Security Policies into the Linux Operating System. InProceedings of the FREENIX Track of the USENIX Annual Technical Conference. 29–42
work page 2001
-
[18]
Alfredo Metere. 2026. enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways. arXiv:2604.16838 [cs.CR] Open-source reference imple- mentation:https://github.com/metereconsulting/enclawed
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[19]
Mark S. Miller. 2006.Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control. Ph.D. Dissertation. Johns Hopkins University
work page 2006
-
[20]
Miller, Ka-Ping Yee, and Jonathan S
Mark S. Miller, Ka-Ping Yee, and Jonathan S. Shapiro. 2003. The Object-Capability Model: A Survey. Technical report, Hewlett-Packard Laboratories
work page 2003
-
[21]
NVIDIA. 2024. NeMo Guardrails: A Toolkit for Programmable LLM Safety.https://github. com/NVIDIA/NeMo-Guardrails
work page 2024
-
[22]
OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications.https: //genai.owasp.org/llm-top-10/. Accessed 2026
work page 2025
-
[23]
Justin Samuel, Nick Mathewson, Justin Cappos, and Roger Dingledine. 2010. Survivable Key Compromise in Software Update Systems. InProceedings of the 17th ACM Conference on Computer and Communications Security (CCS). 61–72
work page 2010
-
[24]
Bruce Schneier and John Kelsey. 1999. Secure Audit Logs to Support Computer Forensics. In ACM Transactions on Information and System Security, Vol. 2. 159–176
work page 1999
-
[25]
The MITRE Corporation. 2025. MITRE ATLAS: Adversarial Threat Landscape for AI Systems. https://atlas.mitre.org/. Accessed 2026
work page 2025
-
[26]
Ken Thompson. 1984. Reflections on Trusting Trust.Commun. ACM27, 8 (1984), 761–763
work page 1984
-
[27]
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Haoyu Wang, Christopher M. Poskitt, and Jun Sun. 2025. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents. arXiv:2503.18666 [cs.CR]
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Renjun Xu and Yang Yan. 2026. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. arXiv:2602.12430 [cs.AI] Proposes a four-tier (T1–T4) gate-based (G1–G4) skill trust and lifecycle governance framework over theSKILL.md ecosystem
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong
-
[30]
InProceedings of the 33rd USENIX Security Symposium
AnLLM-AssistedEasy-to-TriggerBackdoorAttackonCodeCompletionModels: Injecting Disguised Vulnerabilities Against Strong Detection. InProceedings of the 33rd USENIX Security Symposium
-
[31]
Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. 2006. Making Information Flow Explicit in HiStar. InProceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 263–278
work page 2006
- [32]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.