Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

arxiv: 2605.00424 · v2 · pith:FZ7AO632new · submitted 2026-05-01 · 💻 cs.CR · cs.AI· cs.MA· cs.SE

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere This is my paper

Pith reviewed 2026-05-19 18:20 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.MAcs.SE

keywords agent skillstrust schemahuman-in-the-loopverification criterioncapability gateruntime securityadversarial testingLLM agents

0 comments

The pith

A skill is untrusted code until verified, so the runtime enforces verification before granting trust instead of relying on signatures or origins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agent skills, as packages extending large language models, must be treated as untrusted until a verification process confirms their behavior. This default prevents the runtime from assuming trust based on external markers and makes human-in-the-loop oversight sustainable by triggering it only for unverified skills. The authors supply a trust schema that records verification level on each skill manifest, a capability gate that ties intervention policy to that level, and a biconditional correctness criterion any verification procedure must meet when tested against adversarial ensembles. They also abstract ten normative guidelines from a working implementation into a portable runtime profile. The result separates verification as its own gated activity, avoiding the operational collapse that occurs when every irreversible action requires human approval.

Core claim

A skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. The paper defines a trust schema carrying an explicit verification level on every skill manifest, a capability gate whose human-in-the-loop policy is a function of that level, a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise, and a portable runtime profile distilled into ten normative guidelines from an open-source reference implementation.

What carries the argument

Trust schema with explicit verification level on each skill manifest, which controls when the capability gate requires human-in-the-loop review for unverified skills.

If this is right

Human-in-the-loop gates activate only for unverified skills rather than on every irreversible action.
Agent runtimes avoid operational collapse into rubber-stamping at non-trivial scale.
Verification occurs as a distinct gated process before the main runtime loads the skill.
Trust decisions no longer depend on signatures, clearances, or origin registries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification requirement could apply to other forms of dynamic instruction loading in LLM-based systems.
Standardizing the biconditional criterion might allow skills to move between different agent runtimes with retained verification status.
Adversarial testing focused on LLM-specific behaviors could surface failure modes unique to agent skills.

Load-bearing premise

A verification procedure exists that can satisfy the biconditional correctness criterion when tested on an adversarial-ensemble exercise.

What would settle it

A concrete adversarial-ensemble test in which every candidate verification procedure either fails the biconditional criterion or allows an unsafe skill to pass as verified.

read the original abstract

Agent skills - structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself - have moved from convenience to first-class deployment artifact. The runtime that loads them inherits the same problem package managers and operating systems have always faced: a piece of content claims a behavior; the runtime must decide whether to believe it. We argue this paper's central thesis up front: a skill is untrusted code until it is verified, and the runtime that loads it must enforce that default rather than infer trust from a signature, a clearance, or a registry of origin. Without skill verification, a human-in-the-loop (HITL) gate must fire on every irreversible call - which is operationally untenable and degrades into rubber-stamping at any non-trivial scale. With skill verification treated as a separate, gated process, HITL fires only for what is unverified, and the system becomes sustainable. We give a trust schema that includes an explicit verification level on every skill manifest; a capability gate whose HITL policy is a function of that verification level; a biconditional correctness criterion that any candidate verification procedure must satisfy on an adversarial-ensemble exercise; and a portable runtime profile with ten normative guidelines abstracted from a working open-source reference implementation. The contribution is harness- and model-agnostic; nothing here requires retraining, fine-tuning, or proprietary infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames agent skills as untrusted code by default and offers a trust schema plus biconditional criterion to limit HITL to unverified cases, but supplies no working verification method that meets the criterion.

read the letter

The main thing here is a proposal to treat skills in LLM agent runtimes as untrusted code until they pass explicit verification, using a trust schema with verification levels on the manifest and a biconditional correctness criterion that any verification procedure must satisfy on adversarial-ensemble exercises. This is meant to let the capability gate skip human oversight for verified skills and keep HITL only for the rest, avoiding rubber-stamping at scale. The ten normative guidelines for the runtime profile come from their open-source reference implementation, which gives the ideas some grounding in practice. The paper does a clear job spelling out why signatures, clearances, or origin registries are not enough for these skills and why constant HITL gates become operationally impossible. That diagnosis of the scaling problem is useful on its own. What is new is the specific combination of the verification level, the HITL policy tied to it, and the biconditional criterion that demands correct performance on both sides of an adversarial test. The contribution is presented as harness- and model-agnostic, which keeps the focus on the runtime layer. The soft spot is that no concrete verification procedure is shown that actually satisfies the biconditional criterion, and there is no derivation, proof, or test result demonstrating that such a procedure can be built or that it would hold up. The central claim therefore rests on the assumption that the criterion is achievable rather than on evidence that it is. The citation pattern is light, but that is secondary to the missing demonstration. This paper is for engineers and researchers who build or secure runtime environments for LLM agents, especially in settings where skills are deployed as first-class artifacts. A reader looking for a structured way to think about trust and oversight in agent systems would get value from the schema and the guidelines. It deserves a serious referee because the operational problem it identifies is real and the framework is coherent enough to be worth testing and refining. I would send it to peer review rather than desk reject, with the expectation that reviewers will press for a concrete verification example or satisfiability argument.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that agent skills—structured packages of instructions and scripts for LLMs—must be treated as untrusted code until verified by the runtime. It introduces a trust schema with explicit verification levels on skill manifests, a capability gate whose human-in-the-loop (HITL) policy depends on those levels, a biconditional correctness criterion that any verification procedure must satisfy on adversarial-ensemble exercises, and ten normative guidelines for a portable runtime profile derived from an open-source reference implementation. The central claim is that this approach makes HITL sustainable by restricting interventions to unverified skills rather than requiring them on every irreversible action.

Significance. If a practical verification procedure satisfying the biconditional criterion can be shown to exist, the framework would provide a principled separation between verification and other trust signals such as signatures or origin registries. This could meaningfully reduce operational overhead in agent deployments while improving security posture. The grounding in a working reference implementation and the model-agnostic stance are positive features that distinguish the proposal from purely theoretical treatments.

major comments (2)

[Abstract / Biconditional Correctness Criterion] Abstract and the section defining the biconditional correctness criterion: the central thesis requires that verification procedures exist which satisfy the biconditional criterion on adversarial-ensemble exercises, thereby allowing the capability gate to restrict HITL to unverified skills only. No concrete procedure, proof of satisfiability, or evaluation against the adversarial ensemble is supplied; the reference implementation is referenced solely for the ten runtime guidelines. This leaves the claim that verification enables sustainable HITL as an untested assumption and is load-bearing for the paper's contribution.
[Trust Schema and Capability Gate] Section on the trust schema and capability gate: the HITL policy is defined as a function of the verification level, yet no formal semantics, decision procedure, or example policy table is provided to show how the gate would behave under the biconditional criterion. Without this, it is unclear whether the schema can be implemented without reintroducing the rubber-stamping problem it seeks to avoid.

minor comments (2)

The abstract introduces several new terms (verification level, capability gate, biconditional correctness criterion) in rapid succession; a summary table or diagram relating these components would improve readability.
[Portable Runtime Profile] The ten normative guidelines are presented as abstracted from the reference implementation; including a brief mapping or excerpt from the implementation for at least two guidelines would help readers assess their practicality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive critique of the manuscript. We respond to each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract / Biconditional Correctness Criterion] Abstract and the section defining the biconditional correctness criterion: the central thesis requires that verification procedures exist which satisfy the biconditional criterion on adversarial-ensemble exercises, thereby allowing the capability gate to restrict HITL to unverified skills only. No concrete procedure, proof of satisfiability, or evaluation against the adversarial ensemble is supplied; the reference implementation is referenced solely for the ten runtime guidelines. This leaves the claim that verification enables sustainable HITL as an untested assumption and is load-bearing for the paper's contribution.

Authors: We agree that the manuscript does not present a concrete verification procedure, a proof that the biconditional criterion is satisfiable, or an empirical evaluation on an adversarial ensemble. The paper's contribution is the definition of the criterion that any verification procedure must meet, the trust schema that records verification levels, and the ten normative guidelines for a portable runtime profile derived from the reference implementation. We do not claim to have constructed or validated a specific procedure; rather, we argue that separating verification from other trust signals is a prerequisite for sustainable HITL. To address the concern, we will revise the abstract and the relevant section to state explicitly that the framework assumes the existence of procedures satisfying the criterion and will add a short discussion of candidate directions for constructing such procedures as future work. revision: partial
Referee: [Trust Schema and Capability Gate] Section on the trust schema and capability gate: the HITL policy is defined as a function of the verification level, yet no formal semantics, decision procedure, or example policy table is provided to show how the gate would behave under the biconditional criterion. Without this, it is unclear whether the schema can be implemented without reintroducing the rubber-stamping problem it seeks to avoid.

Authors: The referee is correct that the manuscript defines the capability gate's policy as depending on verification level but does not supply a concrete policy table or formal semantics. We will add an illustrative policy table in the revised manuscript that shows, for each verification level, which capability requests trigger mandatory HITL review versus automated allowance. This table will be accompanied by a brief explanation of how the biconditional criterion informs the policy, making clear that lower verification levels require human intervention on irreversible actions and thereby avoiding indiscriminate approval. revision: yes

Circularity Check

0 steps flagged

No significant circularity; definitions and criterion are independent requirements

full rationale

The paper presents a trust schema, explicit verification levels, a capability gate, and a biconditional correctness criterion as new constructs that any verification procedure must satisfy on an adversarial-ensemble exercise. These are framed as requirements and policy functions rather than results derived from or equivalent to fitted parameters, self-citations, or prior definitions within the paper. The reference to an open-source implementation applies only to the ten normative runtime guidelines and is not load-bearing for the central thesis that skills remain untrusted until verified. No equations, reductions by construction, or self-referential derivations appear in the provided abstract or described content. The framework is a self-contained conceptual proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The paper introduces several new conceptual entities and assumptions to address the trust problem in agent skill loading, without referencing specific fitted parameters or external proofs.

axioms (1)

domain assumption A skill is untrusted code until it is verified
This is the central premise stated upfront in the abstract.

invented entities (3)

Verification level on skill manifest no independent evidence
purpose: To indicate the trust status of a skill for gating decisions
New element in the trust schema to control HITL policy.
Capability gate with HITL policy no independent evidence
purpose: To enforce human oversight based on verification level
Component of the proposed runtime.
Biconditional correctness criterion no independent evidence
purpose: To provide a standard that verification procedures must meet against adversarial tests
Core new criterion for the framework.

pith-pipeline@v0.9.0 · 5797 in / 1418 out tokens · 71744 ms · 2026-05-19T18:20:20.231776+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 6 internal anchors

[1]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InInternational Conference on Learning Representations (ICLR)

work page 2025
[2]

Elliott Bell and Leonard J

D. Elliott Bell and Leonard J. LaPadula. 1976.Secure Computer System: Unified Exposition and Multics Interpretation. Technical Report MTR-2997. The MITRE Corporation

work page 1976
[3]

Alex Birsan. 2021. Dependency Confusion: How I Hacked Into Apple, Mi- crosoft and Dozens of Other Companies. https://medium.com/@alex.birsan/ 15 dependency-confusion-4a5d60fec610. Disclosure of a supply-chain class of attacks against npm, PyPI, and RubyGems

work page 2021
[4]

Justin Cappos, Justin Samuel, Scott Baker, and John H. Hartman. 2008. A Look in the Mirror: Attacks on Package Managers. InProceedings of the 15th ACM Conference on Computer and Communications Security (CCS). 565–574

work page 2008
[5]

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red- Teaming LLM Agents via Poisoning Memory or Knowledge Bases. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2024
[6]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

work page 2023
[7]

Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner. 2026. Towards Verifiably Safe Tool Use for LLM Agents. InProceedings of the 48th IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE NIER). Preprint: arXiv:2601.08012

work page arXiv 2026
[8]

Petros Efstathopoulos, Maxwell Krohn, Steve VanDeBogart, Cliff Frey, David Ziegler, Eddie Kohler, David Mazières, Frans Kaashoek, and Robert Morris. 2005. Labels and Event Processes in the Asbestos Operating System. InProceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP). 17–30

work page 2005
[9]

Emilio Ferrara. 2024. GenAI Against Humanity: Nefarious Applications of Generative Artificial Intelligence and Large Language Models.Journal of Computational Social Science7 (2024), 549–569

work page 2024
[10]

Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang

work page
[11]

Skillprobe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration

SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. arXiv:2603.21019 [cs.CR]

work page arXiv
[12]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard: LLM-Based Input–Output Safeguard for Human–AI Conversations. arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Lakera AI. 2024. Lakera Guard: Real-Time LLM Security.https://www.lakera.ai/

work page 2024
[14]

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. 2024. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2024
[15]

Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui, and Tianyue Luo. 2026. Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis. arXiv:2604.02837 [cs.CR]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Hailin Liu, Eugene Ilyushin, Jie Ni, and Min Zhu. 2026. SafeAgent: A Runtime Protection Architecture for Agentic Systems. arXiv:2604.17562 [cs.CR] Stateful runtime controller mediating actions around the agent loop with a context-aware decision core over persistent session state. 16

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Peter Loscocco and Stephen Smalley. 2001. Integrating Flexible Support for Security Policies into the Linux Operating System. InProceedings of the FREENIX Track of the USENIX Annual Technical Conference. 29–42

work page 2001
[18]

Alfredo Metere. 2026. enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways. arXiv:2604.16838 [cs.CR] Open-source reference imple- mentation:https://github.com/metereconsulting/enclawed

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Mark S. Miller. 2006.Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control. Ph.D. Dissertation. Johns Hopkins University

work page 2006
[20]

Miller, Ka-Ping Yee, and Jonathan S

Mark S. Miller, Ka-Ping Yee, and Jonathan S. Shapiro. 2003. The Object-Capability Model: A Survey. Technical report, Hewlett-Packard Laboratories

work page 2003
[21]

NVIDIA. 2024. NeMo Guardrails: A Toolkit for Programmable LLM Safety.https://github. com/NVIDIA/NeMo-Guardrails

work page 2024
[22]

OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications.https: //genai.owasp.org/llm-top-10/. Accessed 2026

work page 2025
[23]

Justin Samuel, Nick Mathewson, Justin Cappos, and Roger Dingledine. 2010. Survivable Key Compromise in Software Update Systems. InProceedings of the 17th ACM Conference on Computer and Communications Security (CCS). 61–72

work page 2010
[24]

Bruce Schneier and John Kelsey. 1999. Secure Audit Logs to Support Computer Forensics. In ACM Transactions on Information and System Security, Vol. 2. 159–176

work page 1999
[25]

The MITRE Corporation. 2025. MITRE ATLAS: Adversarial Threat Landscape for AI Systems. https://atlas.mitre.org/. Accessed 2026

work page 2025
[26]

Ken Thompson. 1984. Reflections on Trusting Trust.Commun. ACM27, 8 (1984), 761–763

work page 1984
[27]

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Haoyu Wang, Christopher M. Poskitt, and Jun Sun. 2025. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents. arXiv:2503.18666 [cs.CR]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Renjun Xu and Yang Yan. 2026. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. arXiv:2602.12430 [cs.AI] Proposes a four-tier (T1–T4) gate-based (G1–G4) skill trust and lifecycle governance framework over theSKILL.md ecosystem

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong

work page
[30]

InProceedings of the 33rd USENIX Security Symposium

AnLLM-AssistedEasy-to-TriggerBackdoorAttackonCodeCompletionModels: Injecting Disguised Vulnerabilities Against Strong Detection. InProceedings of the 33rd USENIX Security Symposium

work page
[31]

Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. 2006. Making Information Flow Explicit in HiStar. InProceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 263–278

work page 2006
[32]

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2024. Weak-to-Strong Jailbreaking on Large Language Models. arXiv:2401.17256 17

work page arXiv 2024

[1] [1]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InInternational Conference on Learning Representations (ICLR)

work page 2025

[2] [2]

Elliott Bell and Leonard J

D. Elliott Bell and Leonard J. LaPadula. 1976.Secure Computer System: Unified Exposition and Multics Interpretation. Technical Report MTR-2997. The MITRE Corporation

work page 1976

[3] [3]

Alex Birsan. 2021. Dependency Confusion: How I Hacked Into Apple, Mi- crosoft and Dozens of Other Companies. https://medium.com/@alex.birsan/ 15 dependency-confusion-4a5d60fec610. Disclosure of a supply-chain class of attacks against npm, PyPI, and RubyGems

work page 2021

[4] [4]

Justin Cappos, Justin Samuel, Scott Baker, and John H. Hartman. 2008. A Look in the Mirror: Attacks on Package Managers. InProceedings of the 15th ACM Conference on Computer and Communications Security (CCS). 565–574

work page 2008

[5] [5]

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. AgentPoison: Red- Teaming LLM Agents via Poisoning Memory or Knowledge Bases. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2024

[6] [6]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

work page 2023

[7] [7]

Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner. 2026. Towards Verifiably Safe Tool Use for LLM Agents. InProceedings of the 48th IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results (ICSE NIER). Preprint: arXiv:2601.08012

work page arXiv 2026

[8] [8]

Petros Efstathopoulos, Maxwell Krohn, Steve VanDeBogart, Cliff Frey, David Ziegler, Eddie Kohler, David Mazières, Frans Kaashoek, and Robert Morris. 2005. Labels and Event Processes in the Asbestos Operating System. InProceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP). 17–30

work page 2005

[9] [9]

Emilio Ferrara. 2024. GenAI Against Humanity: Nefarious Applications of Generative Artificial Intelligence and Large Language Models.Journal of Computational Social Science7 (2024), 549–569

work page 2024

[10] [10]

Zihan Guo, Zhiyu Chen, Xiaohang Nie, Jianghao Lin, Yuanjian Zhou, and Weinan Zhang

work page

[11] [11]

Skillprobe: Security auditing for emerging agent skill marketplaces via multi-agent collaboration

SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. arXiv:2603.21019 [cs.CR]

work page arXiv

[12] [12]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard: LLM-Based Input–Output Safeguard for Human–AI Conversations. arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Lakera AI. 2024. Lakera Guard: Real-Time LLM Security.https://www.lakera.ai/

work page 2024

[14] [14]

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. 2024. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2024

[15] [15]

Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui, and Tianyue Luo. 2026. Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis. arXiv:2604.02837 [cs.CR]

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

Hailin Liu, Eugene Ilyushin, Jie Ni, and Min Zhu. 2026. SafeAgent: A Runtime Protection Architecture for Agentic Systems. arXiv:2604.17562 [cs.CR] Stateful runtime controller mediating actions around the agent loop with a context-aware decision core over persistent session state. 16

work page internal anchor Pith review Pith/arXiv arXiv 2026

[17] [17]

Peter Loscocco and Stephen Smalley. 2001. Integrating Flexible Support for Security Policies into the Linux Operating System. InProceedings of the FREENIX Track of the USENIX Annual Technical Conference. 29–42

work page 2001

[18] [18]

Alfredo Metere. 2026. enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways. arXiv:2604.16838 [cs.CR] Open-source reference imple- mentation:https://github.com/metereconsulting/enclawed

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Mark S. Miller. 2006.Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control. Ph.D. Dissertation. Johns Hopkins University

work page 2006

[20] [20]

Miller, Ka-Ping Yee, and Jonathan S

Mark S. Miller, Ka-Ping Yee, and Jonathan S. Shapiro. 2003. The Object-Capability Model: A Survey. Technical report, Hewlett-Packard Laboratories

work page 2003

[21] [21]

NVIDIA. 2024. NeMo Guardrails: A Toolkit for Programmable LLM Safety.https://github. com/NVIDIA/NeMo-Guardrails

work page 2024

[22] [22]

OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications.https: //genai.owasp.org/llm-top-10/. Accessed 2026

work page 2025

[23] [23]

Justin Samuel, Nick Mathewson, Justin Cappos, and Roger Dingledine. 2010. Survivable Key Compromise in Software Update Systems. InProceedings of the 17th ACM Conference on Computer and Communications Security (CCS). 61–72

work page 2010

[24] [24]

Bruce Schneier and John Kelsey. 1999. Secure Audit Logs to Support Computer Forensics. In ACM Transactions on Information and System Security, Vol. 2. 159–176

work page 1999

[25] [25]

The MITRE Corporation. 2025. MITRE ATLAS: Adversarial Threat Landscape for AI Systems. https://atlas.mitre.org/. Accessed 2026

work page 2025

[26] [26]

Ken Thompson. 1984. Reflections on Trusting Trust.Commun. ACM27, 8 (1984), 761–763

work page 1984

[27] [27]

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Haoyu Wang, Christopher M. Poskitt, and Jun Sun. 2025. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents. arXiv:2503.18666 [cs.CR]

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Renjun Xu and Yang Yan. 2026. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. arXiv:2602.12430 [cs.AI] Proposes a four-tier (T1–T4) gate-based (G1–G4) skill trust and lifecycle governance framework over theSKILL.md ecosystem

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong

work page

[30] [30]

InProceedings of the 33rd USENIX Security Symposium

AnLLM-AssistedEasy-to-TriggerBackdoorAttackonCodeCompletionModels: Injecting Disguised Vulnerabilities Against Strong Detection. InProceedings of the 33rd USENIX Security Symposium

work page

[31] [31]

Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. 2006. Making Information Flow Explicit in HiStar. InProceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 263–278

work page 2006

[32] [32]

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2024. Weak-to-Strong Jailbreaking on Large Language Models. arXiv:2401.17256 17

work page arXiv 2024