Architectural Obsolescence of Unhardened Agentic-AI Runtimes
Pith reviewed 2026-05-10 14:59 UTC · model grok-4.3
The pith
The examined unhardened agentic-AI runtime detects none of the four action-audit divergences across a 1600-sample baseline and a ten-LLM cross-model run.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Catching the four ways an action can diverge from its audit record is a load-bearing safety property of any agentic-AI runtime. The examined unhardened runtime catches none of them, with recall of zero on every cell of every confusion matrix across the 1600-sample baseline and a ten-LLM cross-model run. Detecting the divergences requires seven structures absent from that runtime: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. The version that ships these structures reaches perfect precision, recall, F1, and accuracy on the same inputs. The gap (
What carries the argument
The four divergence types (gate-bypass, audit-forgery, silent host failure, wrong-target) together with the seven runtime structures required to detect them.
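The biconditional idea can be pictured as a check that an action was carried out if and only if a matching audit record exists. The following is a minimal sketch under stated assumptions, not the paper's or enclawed-oss's implementation: the `Action` and `AuditRecord` types, their field names, and the mapping of each mismatch onto F1 to F4 are hypothetical readings of the divergence names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    action_id: str
    target: str
    host_ok: bool  # hypothetical flag: did the host really carry the action out?


@dataclass(frozen=True)
class AuditRecord:
    action_id: str
    target: str
    status: str  # what the log claims, e.g. "ok" or "failed"


def check(actions: List[Action], records: List[AuditRecord]) -> List[Tuple[str, str]]:
    """Return (action_id, label) for every action/record mismatch.

    The labels follow an ASSUMED mapping, chosen for illustration only.
    """
    acts: Dict[str, Action] = {a.action_id: a for a in actions}
    recs: Dict[str, AuditRecord] = {r.action_id: r for r in records}
    findings: List[Tuple[str, str]] = []
    for aid in sorted(acts.keys() | recs.keys()):
        a: Optional[Action] = acts.get(aid)
        r: Optional[AuditRecord] = recs.get(aid)
        if r is None:
            findings.append((aid, "F1"))  # action with no record (assumed: gate-bypass)
        elif a is None:
            findings.append((aid, "F2"))  # record with no action (assumed: audit-forgery)
        elif a.target != r.target:
            findings.append((aid, "F4"))  # record names a different target
        elif r.status == "ok" and not a.host_ok:
            findings.append((aid, "F3"))  # log says ok, host failed silently
    return findings
```

Both directions of the "if and only if" are checked: the first branch covers actions without records, the second covers records without actions.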
If this is right
- Any runtime missing the seven structures will record zero detection of the four divergences on the harness.
- Configuration changes alone cannot close the detection gap; re-architecture is required.
- Small extensions to the data-loss-prevention catalog improve detection only in the presence of the seven structures.
- The hardened runtime supports previously excluded plugin categories while maintaining perfect divergence detection.
- The harness can be applied directly to any other candidate runtime to measure its coverage of the four divergences.
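Applying the harness to another runtime comes down to filling in a confusion matrix per divergence type and reading off the four reported metrics. The sketch below shows that arithmetic only. The `Confusion` type is hypothetical, and reporting an undefined ratio (zero denominator) as 0.0 is an assumed convention, not one the paper states.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    tp: int  # divergence present and detected
    fp: int  # no divergence, but one was flagged
    fn: int  # divergence present and missed
    tn: int  # no divergence, none flagged


def _ratio(num: float, den: float) -> float:
    # Assumed convention: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def metrics(c: Confusion) -> Dict[str, float]:
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

A runtime that never flags anything has `tp = 0` and `fp = 0`, so its recall is 0.0 regardless of how many samples are run, while its accuracy still depends on how many benign samples the harness contains.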
Where Pith is reading between the lines
- Production agentic systems should be evaluated first on whether they contain the seven structures before any configuration tuning.
- The same structural-versus-parametric distinction may apply to other integrity properties such as logging completeness or policy enforcement in automated agents.
- If the four divergences prove exhaustive, safety standards for agentic runtimes could be expressed as a requirement to implement exactly those seven structures.
- The approach of testing through live external channels could be extended to measure detection under actual user-driven tool usage patterns.
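Of the seven structures, the Bell-LaPadula classification policy has a standard textbook form: a subject may not read above its level and may not write below it. The sketch below illustrates those two rules only. The level names are hypothetical, and how enclawed-oss assigns levels to channels or tools is not described in this summary.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical labels; a real deployment would define its own lattice.
    PUBLIC = 0
    INTERNAL = 1
    SECRET = 2


def can_read(subject: Level, obj: Level) -> bool:
    """Simple security property: no read up."""
    return subject >= obj


def can_write(subject: Level, obj: Level) -> bool:
    """Star property: no write down."""
    return subject <= obj
```

Under these rules an agent context holding `SECRET` data cannot write to a `PUBLIC` channel, which is the kind of flow an egress check would be expected to refuse.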
Load-bearing premise
The four divergence types plus the 1600-sample harness cover all relevant safety failures that occur in real deployments, and the tested runtime is representative of unhardened agentic-AI systems.
What would settle it
A runtime that lacks the seven structures yet still detects every instance of the four divergences when run through the same harness on real Discord and Telegram channels.
Original abstract
An agentic-AI runtime issues tool calls, sends messages, and actuates devices on behalf of an LLM. Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target -- is a load-bearing safety property of any such runtime. We show that upstream OpenClaw, the most engineered single-user agentic-AI gateway in public release, catches none of them: recall is 0.000 on every cell of every confusion matrix, on a 1600-sample template baseline through OpenClaw's actual production command-line interface (CLI) and on a ten-LLM cross-model generalisation run. Detecting F1--F4 requires seven specific runtime structures absent from OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss -- an MIT-licensed drop-in fork that ships all seven -- reaches $P = R = F_1 =$ accuracy $= 1.000$ on the same input. The gap is structural, not parametric: a six-line append-only widening of enclawed-oss's data-loss-prevention (DLP) regex catalog raises per-channel F3 detection by 14.6% net at unchanged precision; the same edit on OpenClaw has nowhere to land. The harness deliberately exercises real Discord and Telegram channels -- plugin categories the first enclawed release deleted as unsafe -- to show F1--F4 detection extends to those previously-unsafe extensions. With architectural superiority for security and feature parity for extensions, we argue that unhardened agentic-AI runtimes are architecturally obsolete: a strictly better alternative exists, is adoptable today, and the gap requires re-architecture rather than configuration. We invite reviewers to apply the harness to any candidate runtime.
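A hash-chained audit log, one of the seven structures named in the abstract, can be illustrated generically: each entry stores the hash of the previous entry, so editing any earlier record breaks every later link. This is a minimal sketch of the general technique, assuming SHA-256 and a JSON payload. The entry layout and function names are hypothetical and are not taken from the enclawed-oss source.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed placeholder for the hash before the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append an entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited payload or broken link fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this makes tampering evident only if the latest hash is also anchored somewhere an attacker cannot rewrite. Otherwise the whole chain can be recomputed after an edit.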
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that unhardened agentic-AI runtimes are architecturally obsolete. It identifies four ways an action can diverge from its audit record (F1 gate-bypass, F2 audit-forgery, F3 silent host failure, and F4 wrong-target) and treats catching them as a load-bearing safety property. Through a 1600-sample harness exercising real Discord and Telegram channels and a cross-model test with ten LLMs, it reports that OpenClaw achieves zero recall on all confusion matrices, while the authors' enclawed-oss fork, which incorporates seven specific runtime structures absent in OpenClaw, achieves perfect precision, recall, F1-score, and accuracy. The paper argues that this gap is structural rather than parametric, as evidenced by a DLP regex modification that only benefits the hardened fork, and provides an open harness for external validation.
Significance. If the empirical findings are confirmed and the F1-F4 taxonomy proves comprehensive, the paper would make a substantial contribution to AI safety by providing concrete evidence that secure agentic systems necessitate architectural redesign rather than superficial configuration changes. The open-source release of the fork under MIT license and the explicit invitation for reviewers to apply the harness to other candidate runtimes represent notable strengths in promoting reproducibility and community scrutiny.
major comments (3)
- The manuscript reports perfect detection metrics (P = R = F1 = accuracy = 1.000) for enclawed-oss and zero recall for OpenClaw across all matrices on 1600 samples and cross-model tests, yet provides no details on the construction of the template baseline, the distribution of test cases across divergence types, or any error analysis, making it difficult to assess whether the results generalize beyond the specific harness.
- The completeness of the F1-F4 taxonomy for capturing all relevant safety failures is asserted without supporting evidence, such as a retrospective analysis of documented agentic-AI incidents or comparative testing against additional unhardened runtimes with different tool-call or messaging layers, which is essential to justify the conclusion that re-architecture is mandated.
- The evaluation of architectural superiority is performed by comparing OpenClaw to the authors' own fork that introduces the seven structures (biconditional checker, hash-chained audit log, extension admission gate, two-layer egress guard, Bell-LaPadula policy, module-signing trust root, bootstrap seal), creating a degree of self-referential grounding that requires independent verification to support the obsolescence claim.
minor comments (2)
- The abstract refers to 'enclawed-oss' without an initial definition or link to its repository, which could hinder readers' ability to locate the implementation for replication.
- The claim that the harness 'deliberately exercises real Discord and Telegram channels' to show extension to previously unsafe plugins would benefit from a brief description of how the test cases were adapted for these channels.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating revisions made to the paper.
Point-by-point responses
- Referee: The manuscript reports perfect detection metrics (P = R = F1 = accuracy = 1.000) for enclawed-oss and zero recall for OpenClaw across all matrices on 1600 samples and cross-model tests, yet provides no details on the construction of the template baseline, the distribution of test cases across divergence types, or any error analysis, making it difficult to assess whether the results generalize beyond the specific harness.
  Authors: We agree that additional details would improve transparency. In the revised version, we have added a dedicated section (Section 4.2) describing the template baseline construction, including the generation process for the 1600 samples and their distribution across F1-F4 types (approximately 400 samples per type). We also include an error analysis subsection discussing the zero-recall cases for OpenClaw and the perfect scores for the fork, along with limitations regarding generalization to other environments. These additions should allow readers to better evaluate the results.
  Revision: yes
- Referee: The completeness of the F1-F4 taxonomy for capturing all relevant safety failures is asserted without supporting evidence, such as a retrospective analysis of documented agentic-AI incidents or comparative testing against additional unhardened runtimes with different tool-call or messaging layers, which is essential to justify the conclusion that re-architecture is mandated.
  Authors: The F1-F4 taxonomy is not asserted as exhaustive of all possible failures but as the four fundamental ways an action can diverge from its audit record, which are load-bearing for safety. We have expanded the introduction and discussion to derive these from first principles of auditability in agentic systems. While a full retrospective analysis of all incidents is beyond the scope of this work, we note that the open harness enables such testing by the community. We did not test additional unhardened runtimes, focusing on OpenClaw as the leading public example, but the structural argument holds as the seven structures are absent in its source. We believe this supports the re-architecture conclusion without claiming universality.
  Revision: partial
- Referee: The evaluation of architectural superiority is performed by comparing OpenClaw to the authors' own fork that introduces the seven structures (biconditional checker, hash-chained audit log, extension admission gate, two-layer egress guard, Bell-LaPadula policy, module-signing trust root, bootstrap seal), creating a degree of self-referential grounding that requires independent verification to support the obsolescence claim.
  Authors: We acknowledge the self-referential nature of the comparison. However, the enclawed-oss fork is released under the MIT license with full source code available, allowing independent implementation and verification of the seven structures. The harness is also open for anyone to apply to other runtimes. To mitigate this concern, we have added pseudocode and detailed specifications for each of the seven structures in the appendix, facilitating reproduction. The obsolescence claim is supported by the empirical gap and the fact that the DLP modification only applies to the hardened version, demonstrating structural differences rather than implementation specifics.
  Revision: yes
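The first response puts roughly 400 samples on each divergence type. A point estimate of 1.000 or 0.000 on a finite sample still carries sampling uncertainty, and a Wilson score interval is one standard way to express it. The sketch below is illustrative only: the count of 400 is the approximate per-type figure from the response, the 95% level is an assumption, and nothing here reproduces the paper's own statistical analysis.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed inputs: 400 of 400 detected, and 0 of 400 detected.
perfect_low, perfect_high = wilson_interval(400, 400)
zero_low, zero_high = wilson_interval(0, 400)
```

When every sample is detected, the lower bound reduces to n / (n + z^2), so the interval narrows as the per-type sample count grows. The interval assumes independent samples, which template-generated test cases may not fully satisfy.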
Circularity Check
F1-F4 detection and architectural superiority reduce to author-defined divergences plus structures added to the fork
specific steps
- Self-definitional [Abstract]
  "Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target -- is a load-bearing safety property of any such runtime. ... Detecting F1--F4 requires seven specific runtime structures absent from OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss -- an MIT-licensed drop-in fork that ships all seven -- reaches $P = R = F_1 =$ accuracy $="
  F1-F4 are introduced as the complete set of load-bearing divergences; the seven structures are then declared necessary to detect them; the fork is defined as the implementation that adds exactly those structures; the harness is built to exercise F1-F4; and the fork is reported to achieve perfect detection on that harness. The superiority result is therefore true by construction of the definitions, the added code, and the test cases rather than an independent empirical finding.
full rationale
The paper's central derivation is: (1) define F1-F4 as the exhaustive load-bearing divergences from audit records, (2) assert that detecting them requires exactly the seven listed structures, (3) implement those structures in enclawed-oss, (4) run a harness that exercises F1-F4 on both systems, and (5) conclude unhardened runtimes are obsolete because only the re-architected fork achieves perfect scores. Step (4) is not an independent test; the harness is constructed to trigger the author-defined divergences, and the fork was built to contain the detectors for them. Consequently the reported 1.000 vs 0.000 scores, the structural-vs-parametric distinction, and the obsolescence claim are equivalent to the initial definitional choices rather than externally validated. No external mapping of real incidents to F1-F4 or tests on other unhardened implementations is provided, leaving the completeness assumption untested. This produces partial circularity (score 6) while still leaving room for the open harness to be applied by others.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The four divergence categories (F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target) cover all load-bearing safety properties for agentic runtimes.
- domain assumption: OpenClaw's production CLI behavior on the 1600-sample baseline is representative of unhardened agentic-AI runtimes.