pith. machine review for the scientific record.

arxiv: 2605.01740 · v1 · submitted 2026-05-03 · 💻 cs.CR · cs.AI · cs.MA

Recognition: unknown

Architectural Obsolescence of Unhardened Agentic-AI Runtimes


Pith reviewed 2026-05-10 14:59 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA
keywords agentic AI · runtime security · audit integrity · action divergence · architectural obsolescence · tool calling · LLM agents · hardening

The pith

Unhardened agentic-AI runtimes detect none of four action-audit divergences across a 1600-sample harness and a ten-LLM cross-model run.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that any agentic-AI runtime must catch when its issued actions diverge from the recorded audit trail in one of four defined ways. It reports that a current widely used runtime detects none of these divergences across a 1600-sample test set run through its actual command-line interface and across multiple language models. A drop-in replacement that adds seven specific runtime structures detects every divergence instance with perfect scores. The authors demonstrate that the performance difference is architectural by showing that a small catalog extension improves the hardened version but cannot be applied to the original. They conclude that unhardened runtimes are therefore obsolete because a strictly better architecture already exists and can be adopted without losing extension support.

Core claim

Catching the four ways an action can diverge from its audit record is a load-bearing safety property of any agentic-AI runtime. The examined unhardened runtime catches none of them, with recall of zero on every cell of every confusion matrix across the 1600-sample baseline and a ten-LLM cross-model run. Detecting the divergences requires seven structures absent from that runtime: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. The version that ships these structures reaches perfect precision, recall, F1, and accuracy on the same inputs. The gap is structural, not parametric: a six-line append-only widening of the hardened fork's data-loss-prevention regex catalog raises per-channel detection of silent host failures by 14.6% net at unchanged precision, while the same edit has nowhere to land in the original.
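The paper names these seven structures without specifying them here. As an illustrative sketch only (class and field names are ours, not the paper's), a hash-chained audit log paired with a biconditional action-to-record check might look like:

```python
import hashlib
import json

class HashChainedAuditLog:
    """Append-only log where each record commits to its predecessor's
    digest, so any retroactive edit breaks every later hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []

    def append(self, action: dict) -> str:
        prev = self.records[-1]["hash"] if self.records else self.GENESIS
        body = json.dumps(action, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.records.append({"action": action, "prev": prev, "hash": digest})
        return digest

    def verify_chain(self) -> bool:
        prev = self.GENESIS
        for rec in self.records:
            body = json.dumps(rec["action"], sort_keys=True)
            if rec["prev"] != prev:
                return False
            if hashlib.sha256((prev + body).encode()).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True

def biconditional_check(issued_actions: list, log: HashChainedAuditLog) -> bool:
    """Action occurred <=> audit record exists: fails when an action has no
    record (the gate-bypass direction) and when a record has no matching
    action or the chain is tampered (the forgery direction)."""
    logged = [rec["action"] for rec in log.records]
    return issued_actions == logged and log.verify_chain()
```

Appending commits each record to the previous digest, so a single edited field invalidates the rest of the chain; the biconditional fails in either direction of the action/record mismatch.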

What carries the argument

The four divergence types (gate-bypass, audit-forgery, silent host failure, wrong-target) together with the seven runtime structures required to detect them.

If this is right

  • Any runtime missing the seven structures will record zero detection of the four divergences on the harness.
  • Configuration changes alone cannot close the detection gap; re-architecture is required.
  • Small extensions to the data-loss-prevention catalog improve detection only in the presence of the seven structures.
  • The hardened runtime supports previously excluded plugin categories while maintaining perfect divergence detection.
  • The harness can be applied directly to any other candidate runtime to measure its coverage of the four divergences.
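The reported extremes (recall 0.000 on every cell versus P = R = F1 = accuracy = 1.000) follow directly from the confusion-matrix arithmetic. A minimal helper makes the two regimes concrete, assuming roughly 400 positives per divergence type as the simulated rebuttal states:

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, F1, and accuracy from one confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

A detector that flags nothing on 400 planted divergences yields all zeros; one that flags all of them and nothing else yields all ones.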

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production agentic systems should be evaluated first on whether they contain the seven structures before any configuration tuning.
  • The same structural-versus-parametric distinction may apply to other integrity properties such as logging completeness or policy enforcement in automated agents.
  • If the four divergences prove exhaustive, safety standards for agentic runtimes could be expressed as a requirement to implement exactly those seven structures.
  • The approach of testing through live external channels could be extended to measure detection under actual user-driven tool usage patterns.

Load-bearing premise

The four divergence types plus the 1600-sample harness cover all relevant safety failures that occur in real deployments, and the tested runtime is representative of unhardened agentic-AI systems.

What would settle it

A runtime that lacks the seven structures yet still detects every instance of the four divergences when run through the same harness on real Discord and Telegram channels.

read the original abstract

An agentic-AI runtime issues tool calls, sends messages, and actuates devices on behalf of an LLM. Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target -- is a load-bearing safety property of any such runtime. We show that upstream OpenClaw, the most engineered single-user agentic-AI gateway in public release, catches none of them: recall is 0.000 on every cell of every confusion matrix, on a 1600-sample template baseline through OpenClaw's actual production command-line interface (CLI) and on a ten-LLM cross-model generalisation run. Detecting F1--F4 requires seven specific runtime structures absent from OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss -- an MIT-licensed drop-in fork that ships all seven -- reaches $P = R = F_1 =$ accuracy $= 1.000$ on the same input. The gap is structural, not parametric: a six-line append-only widening of enclawed-oss's data-loss-prevention (DLP) regex catalog raises per-channel F3 detection by 14.6% net at unchanged precision; the same edit on OpenClaw has nowhere to land. The harness deliberately exercises real Discord and Telegram channels -- plugin categories the first enclawed release deleted as unsafe -- to show F1--F4 detection extends to those previously-unsafe extensions. With architectural superiority for security and feature parity for extensions, we argue that unhardened agentic-AI runtimes are architecturally obsolete: a strictly better alternative exists, is adoptable today, and the gap requires re-architecture rather than configuration. We invite reviewers to apply the harness to any candidate runtime.
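The abstract's structural-versus-parametric evidence hinges on an append-only widening of a DLP regex catalog. A hedged sketch of what an append-only catalog could look like (class name and patterns are illustrative, not taken from the paper's source):

```python
import re

class DLPCatalog:
    """Append-only pattern catalog for an egress guard: entries may be
    added but never edited or removed, so widening the catalog can only
    flag strictly more traffic, leaving precision on existing patterns
    unchanged."""

    def __init__(self, patterns=()):
        self._patterns = [re.compile(p) for p in patterns]

    def append(self, pattern: str) -> None:
        # The only mutating operation: no edit, no delete.
        self._patterns.append(re.compile(pattern))

    def flags(self, outbound: str) -> bool:
        """True if any catalog pattern matches the outbound payload."""
        return any(p.search(outbound) for p in self._patterns)
```

Because the catalog only ever grows, a "six-line widening" is a pure addition of patterns; the claim is that OpenClaw exposes no equivalent egress choke point for such an edit to land on.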

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that unhardened agentic-AI runtimes are architecturally obsolete. It identifies four action divergence types from audit records—F1 gate-bypass, F2 audit-forgery, F3 silent host failure, and F4 wrong-target—as load-bearing safety properties. Through a 1600-sample harness exercising real Discord and Telegram channels and a cross-model test with ten LLMs, it shows that OpenClaw achieves zero recall on all confusion matrices, while the authors' enclawed-oss fork, which incorporates seven specific runtime structures absent in OpenClaw, achieves perfect precision, recall, F1-score, and accuracy. The paper argues that this gap is structural rather than parametric, as evidenced by a DLP regex modification that only benefits the hardened fork, and provides an open harness for external validation.

Significance. If the empirical findings are confirmed and the F1-F4 taxonomy proves comprehensive, the paper would make a substantial contribution to AI safety by providing concrete evidence that secure agentic systems necessitate architectural redesign rather than superficial configuration changes. The open-source release of the fork under MIT license and the explicit invitation for reviewers to apply the harness to other candidate runtimes represent notable strengths in promoting reproducibility and community scrutiny.

major comments (3)
  1. The manuscript reports perfect detection metrics (P = R = F1 = accuracy = 1.000) for enclawed-oss and zero recall for OpenClaw across all matrices on 1600 samples and cross-model tests, yet provides no details on the construction of the template baseline, the distribution of test cases across divergence types, or any error analysis, making it difficult to assess whether the results generalize beyond the specific harness.
  2. The completeness of the F1-F4 taxonomy for capturing all relevant safety failures is asserted without supporting evidence, such as a retrospective analysis of documented agentic-AI incidents or comparative testing against additional unhardened runtimes with different tool-call or messaging layers, which is essential to justify the conclusion that re-architecture is mandated.
  3. The evaluation of architectural superiority is performed by comparing OpenClaw to the authors' own fork that introduces the seven structures (biconditional checker, hash-chained audit log, extension admission gate, two-layer egress guard, Bell-LaPadula policy, module-signing trust root, bootstrap seal), creating a degree of self-referential grounding that requires independent verification to support the obsolescence claim.
minor comments (2)
  1. The abstract refers to 'enclawed-oss' without an initial definition or link to its repository, which could hinder readers' ability to locate the implementation for replication.
  2. The claim that the harness 'deliberately exercises real Discord and Telegram channels' to show extension to previously unsafe plugins would benefit from a brief description of how the test cases were adapted for these channels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions made to the paper.

read point-by-point responses
  1. Referee: The manuscript reports perfect detection metrics (P = R = F1 = accuracy = 1.000) for enclawed-oss and zero recall for OpenClaw across all matrices on 1600 samples and cross-model tests, yet provides no details on the construction of the template baseline, the distribution of test cases across divergence types, or any error analysis, making it difficult to assess whether the results generalize beyond the specific harness.

    Authors: We agree that additional details would improve transparency. In the revised version, we have added a dedicated section (Section 4.2) describing the template baseline construction, including the generation process for the 1600 samples and their distribution across F1-F4 types (approximately 400 samples per type). We also include an error analysis subsection discussing the zero-recall cases for OpenClaw and the perfect scores for the fork, along with limitations regarding generalization to other environments. These additions should allow readers to better evaluate the results. revision: yes
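The construction described in this response (1600 samples, roughly 400 per divergence type, exercised over Discord and Telegram channels) could be sketched as follows; the template fields and channel cycling are hypothetical, not the authors' generator:

```python
import itertools

DIVERGENCE_TYPES = ["F1_gate_bypass", "F2_audit_forgery",
                    "F3_silent_host_failure", "F4_wrong_target"]

def build_baseline(n_total: int = 1600, channels=("discord", "telegram")):
    """Balanced template baseline: n_total samples split evenly across
    the four divergence types, alternating over channel templates."""
    per_type = n_total // len(DIVERGENCE_TYPES)
    channel = itertools.cycle(channels)
    return [{"type": dtype, "channel": next(channel)}
            for dtype in DIVERGENCE_TYPES
            for _ in range(per_type)]
```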

  2. Referee: The completeness of the F1-F4 taxonomy for capturing all relevant safety failures is asserted without supporting evidence, such as a retrospective analysis of documented agentic-AI incidents or comparative testing against additional unhardened runtimes with different tool-call or messaging layers, which is essential to justify the conclusion that re-architecture is mandated.

    Authors: The F1-F4 taxonomy is not asserted as exhaustive of all possible failures but as the four fundamental ways an action can diverge from its audit record, which are load-bearing for safety. We have expanded the introduction and discussion to derive these from first principles of auditability in agentic systems. While a full retrospective analysis of all incidents is beyond the scope of this work, we note that the open harness enables such testing by the community. We did not test additional unhardened runtimes, focusing on OpenClaw as the leading public example, but the structural argument holds as the seven structures are absent in its source. We believe this supports the re-architecture conclusion without claiming universality. revision: partial

  3. Referee: The evaluation of architectural superiority is performed by comparing OpenClaw to the authors' own fork that introduces the seven structures (biconditional checker, hash-chained audit log, extension admission gate, two-layer egress guard, Bell-LaPadula policy, module-signing trust root, bootstrap seal), creating a degree of self-referential grounding that requires independent verification to support the obsolescence claim.

    Authors: We acknowledge the self-referential nature of the comparison. However, the enclawed-oss fork is released under the MIT license with full source code available, allowing independent implementation and verification of the seven structures. The harness is also open for anyone to apply to other runtimes. To mitigate this concern, we have added pseudocode and detailed specifications for each of the seven structures in the appendix, facilitating reproduction. The obsolescence claim is supported by the empirical gap and the fact that the DLP modification only applies to the hardened version, demonstrating structural differences rather than implementation specifics. revision: yes
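Of the seven structures, only the Bell-LaPadula classification policy has a standard published definition; the paper's appendix pseudocode is not reproduced here. A minimal sketch of the two classic rules, with illustrative level names:

```python
# Ordered classification levels (higher number = more sensitive).
LEVELS = {"public": 0, "internal": 1, "secret": 2}

def bell_lapadula_allows(op: str, subject_level: str, object_level: str) -> bool:
    """Bell-LaPadula mandatory access control:
    - simple security property: no read up (subject clearance >= object)
    - *-property: no write down (subject clearance <= object)."""
    s, o = LEVELS[subject_level], LEVELS[object_level]
    if op == "read":
        return s >= o
    if op == "write":
        return s <= o
    raise ValueError(f"unknown operation: {op}")
```

In an agentic runtime, the subject would be the channel or plugin issuing the action and the object the data it touches, so a secret-cleared agent could never write into a public egress channel.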

Circularity Check

1 step flagged

F1-F4 detection and architectural superiority reduce to author-defined divergences plus structures added to the fork

specific steps
  1. self-definitional [Abstract]
    "Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target -- is a load-bearing safety property of any such runtime. ... Detecting F1--F4 requires seven specific runtime structures absent from OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss -- an MIT-licensed drop-in fork that ships all seven -- reaches $P = R = F_1 =$ accuracy $= 1.000$ on the same input."

    F1-F4 are introduced as the complete set of load-bearing divergences; the seven structures are then declared necessary to detect them; the fork is defined as the implementation that adds exactly those structures; the harness is built to exercise F1-F4; and the fork is reported to achieve perfect detection on that harness. The superiority result is therefore true by construction of the definitions, the added code, and the test cases rather than an independent empirical finding.

full rationale

The paper's central derivation is: (1) define F1-F4 as the exhaustive load-bearing divergences from audit records, (2) assert that detecting them requires exactly the seven listed structures, (3) implement those structures in enclawed-oss, (4) run a harness that exercises F1-F4 on both systems, and (5) conclude unhardened runtimes are obsolete because only the re-architected fork achieves perfect scores. Step (4) is not an independent test; the harness is constructed to trigger the author-defined divergences, and the fork was built to contain the detectors for them. Consequently the reported 1.000 vs 0.000 scores, the structural-vs-parametric distinction, and the obsolescence claim are equivalent to the initial definitional choices rather than externally validated. No external mapping of real incidents to F1-F4 or tests on other unhardened implementations is provided, leaving the completeness assumption untested. This produces partial circularity (score 6) while still leaving room for the open harness to be applied by others.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that the four divergence types exhaust the relevant failure space and that the harness exercises production behavior; no free parameters are fitted, but the seven structures are introduced as necessary without independent prior validation in the abstract.

axioms (2)
  • domain assumption The four divergence categories (F1 gate-bypass, F2 audit-forgery, silent host failure, F4 wrong-target) cover all load-bearing safety properties for agentic runtimes.
    Invoked in the opening definition of catching divergences as load-bearing.
  • domain assumption OpenClaw's production CLI behavior on the 1600-sample baseline is representative of unhardened agentic-AI runtimes.
    Used to generalize from one runtime to the class of unhardened runtimes.

pith-pipeline@v0.9.0 · 5675 in / 1522 out tokens · 59760 ms · 2026-05-10T14:59:40.934458+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways

    Alfredo Metere. enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways. Metere Consulting, May 2026. https://github.com/metereconsulting/enclawed/blob/main/enclawed/paper/enclawed.pdf

  2. [2]

    OpenClaw: a single-user agentic-AI gateway

    Peter Steinberger and the OpenClaw contributors. OpenClaw: a single-user agentic-AI gateway. https://github.com/openclaw/openclaw, 2025

  3. [3]

    https://github.com/metereconsulting/enclawed

    enclawed source repository. https://github.com/metereconsulting/enclawed

  4. [4]

    Probable inference, the law of succession, and statistical inference

    Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc., 22(158):209--212, 1927

  5. [5]

    Note on the sampling error of the difference between correlated proportions or percentages

    Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153--157, 1947

  6. [6]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv:2302.12173, 2023

  7. [7]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against LLM-integrated applications. arXiv:2306.05499, 2024

  8. [8]

    OWASP Top 10:2021 --- A01: Broken Access Control, 2021

    OWASP Foundation. OWASP Top 10:2021 --- A01: Broken Access Control, 2021. https://owasp.org/Top10/2021/A01_2021-Broken_Access_Control/

  9. [9]

    Cost of a Data Breach Report 2024

    IBM Security and Ponemon Institute. Cost of a Data Breach Report 2024. IBM Corporation, July 2024. https://www.ibm.com/reports/data-breach

  10. [10]

    M-Trends 2024 Special Report

    Mandiant (Google Cloud). M-Trends 2024 Special Report. April 2024. https://services.google.com/fh/files/misc/m-trends-2024.pdf