
arxiv: 2605.16265 · v1 · pith:3534NDQInew · submitted 2026-03-24 · 💻 cs.AI · cs.CR

AgentWall: A Runtime Safety Layer for Local AI Agents

Pith reviewed 2026-05-21 10:01 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords runtime safety · AI agents · policy enforcement · action interception · human-in-the-loop · local environments · observability · declarative policies

The pith

AgentWall intercepts every AI agent action to enforce declarative safety policies with human approval for sensitive tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As AI agents gain the ability to execute real actions like running commands and editing files on local machines, the risk of unsafe behavior grows. AgentWall inserts a runtime check that evaluates each proposed action against an explicit policy before it executes. For sensitive operations it requires human approval, and it records a complete execution trail of what happened. It integrates with existing agent platforms as an MCP proxy installed with a single command. The paper reports 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests.

Core claim

AgentWall demonstrates that intercepting agent actions at the runtime layer and evaluating them against a declarative policy, with mandatory human approval for sensitive operations and complete logging, provides effective safety for local AI agents, as evidenced by 92.9% policy enforcement accuracy and sub-millisecond overhead in 14 benchmark tests.
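
The headline accuracy can be sanity-checked with simple arithmetic. The summary here does not state how many of the 14 tests passed, so the sketch below only enumerates which pass counts would round to the reported 92.9%; the pass count it finds is an inference, not a figure taken from the paper.

```python
from fractions import Fraction

TOTAL_TESTS = 14   # reported in the paper
REPORTED = 92.9    # reported accuracy, in percent

# Enumerate every possible pass count and keep those whose percentage,
# rounded to one decimal place, equals the reported figure.
consistent = [
    passed
    for passed in range(TOTAL_TESTS + 1)
    if round(float(Fraction(passed, TOTAL_TESTS) * 100), 1) == REPORTED
]

# Each single test shifts the accuracy by 100/14 percentage points.
step = float(Fraction(100, TOTAL_TESTS))

print(consistent, round(step, 2))
```

With only 14 tests, one test outcome moves the accuracy by roughly seven percentage points, which is worth keeping in mind when reading the 92.9% figure.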

What carries the argument

A policy-enforcing MCP proxy implemented as a native OpenClaw plugin that evaluates proposed actions against explicit declarative policies before allowing execution.
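
To make the idea of evaluating a proposed action against a declarative policy concrete, here is a minimal sketch. It is not AgentWall's actual policy format or API: the `Rule` fields, the tool names (`fs.read`, `shell.exec`), the paths, and the first-match-wins semantics are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    tool: str      # glob over the tool name, e.g. "fs.*" (hypothetical naming)
    argument: str  # glob over the action's main argument
    decision: str  # "allow", "deny", or "ask" (ask = human approval)


# Hypothetical policy expressed as data, so it can change without touching
# the model. Rules are checked in order and the first match wins.
POLICY = [
    Rule("fs.read", "/home/dev/project/*", "allow"),
    Rule("fs.write", "/home/dev/project/*", "ask"),
    Rule("fs.*", "*", "deny"),
    Rule("shell.exec", "rm *", "deny"),
    Rule("shell.exec", "git status*", "allow"),
]


def evaluate(tool: str, argument: str, policy=POLICY, default: str = "ask") -> str:
    """Return the decision of the first matching rule, else the default."""
    for rule in policy:
        if fnmatchcase(tool, rule.tool) and fnmatchcase(argument, rule.argument):
            return rule.decision
    return default


print(evaluate("fs.write", "/etc/passwd"))
print(evaluate("http.post", "https://example.com"))
```

Falling back to "ask" for unmatched actions is one possible design choice; a stricter deployment might default to "deny". Note that plain glob matching does not normalize paths, so a real enforcement layer would need to resolve sequences such as `..` before matching.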

If this is right

  • Local AI agents can be used more confidently in developer environments without fear of unintended system changes.
  • Execution trails allow for post-incident analysis and replay of agent sessions.
  • Safety features become available across multiple agent interfaces like Claude Desktop and Cursor with one installation.
  • Policy updates can be made declaratively without modifying the underlying AI models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interception layers could be developed for agents operating in cloud or enterprise settings.
  • Combining this with model-based safety checks might create more comprehensive protection against both misaligned and adversarially manipulated agents.
  • Expanding the benchmark tests to include a wider variety of real-world agent scenarios would help validate broader applicability.

Load-bearing premise

The 14 benchmark tests and declarative policy model capture the majority of unsafe or adversarial behaviors that local AI agents could produce in practice.

What would settle it

A demonstration of an AI agent performing an unsafe action, such as deleting critical files or exfiltrating data, that AgentWall fails to intercept, block, or require approval for would challenge the effectiveness of its enforcement.

Figures

Figures reproduced from arXiv: 2605.16265 by Ashwin Aravind.

Figure 1: High-Level Architecture of AgentWall.
Figure 2: Runtime Execution Flow in AgentWall. The final steps of the flow, as captured with the caption:
  6. If approved or allowed, the execution adapter performs the action.
  7. The outcome is logged with relevant metadata.
  8. The user can inspect the trace during or after the run.
The value of this flow is that it introduces a structured decision point into what would otherwise be a direct path from model output to machine action.
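
The runtime flow captured with Figure 2 (approve or allow, execute through an adapter, log the outcome, inspect the trace) can be sketched as a single function. This is an illustrative outline under stated assumptions, not the paper's implementation: `run_action`, the callback parameters, and the trace record fields are hypothetical names.

```python
import time
from typing import Any, Callable, Optional


def run_action(
    action: dict,
    decide: Callable[[dict], str],     # policy check: "allow", "deny", or "ask"
    approve: Callable[[dict], bool],   # human approval prompt (hypothetical)
    execute: Callable[[dict], Any],    # execution adapter (hypothetical)
    trace: list,
) -> Optional[Any]:
    """Gate one proposed action, then append the outcome to the trace."""
    policy_decision = decide(action)
    final_decision = policy_decision
    if policy_decision == "ask":
        # Sensitive operation: defer to the human before anything runs.
        final_decision = "allow" if approve(action) else "deny"

    record = {
        "timestamp": time.time(),
        "action": action,
        "policy_decision": policy_decision,
        "final_decision": final_decision,
    }
    result = None
    if final_decision == "allow":
        try:
            result = execute(action)
            record["status"] = "ok"
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
    else:
        record["status"] = "blocked"

    # Every action is logged, whether it ran, failed, or was blocked.
    trace.append(record)
    return result
```

Because blocked and failed actions are logged alongside successful ones, the trace list can be inspected during or after a run, for example by filtering on `status == "blocked"`.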
Original abstract

The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent's intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy-enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests. AgentWall is open-source at https://github.com/agentwall/Agentwall.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment using a policy-enforcing MCP proxy and native OpenClaw plugin. It evaluates actions against an explicit declarative policy, requires human approval for sensitive operations, records a complete execution trail, and is compatible with tools like Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw. The authors report 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests and provide an open-source implementation.

Significance. Should the interception mechanism prove complete and the benchmark results generalizable, this work contributes a practical runtime safety solution for local AI agents that complements existing alignment research by focusing on execution-time controls. The single-install compatibility across multiple agent environments and the open-source release are notable strengths that enhance the potential impact and reproducibility of the approach.

major comments (2)
  1. [Evaluation] The manuscript reports 92.9% policy enforcement accuracy on 14 benchmark tests but provides insufficient details on how these tests were constructed, what specific threat models or unsafe behaviors they cover, baseline comparisons, or error analysis. This limits the ability to assess the robustness of the performance claim.
  2. [Threat Model] The central claim of intercepting every proposed agent action relies on the MCP proxy capturing all execution paths. However, the threat model does not discuss or rule out potential bypass mechanisms, such as direct invocations or alternative paths in environments like Claude Desktop and Cursor, which would render the policy evaluation and approval steps ineffective.
minor comments (2)
  1. [Abstract] The abstract mentions '14 benchmark tests' without specifying their nature or source, which could be clarified for readers.
  2. [Implementation] The description of the single install command is promising but lacks specifics on dependencies or setup requirements that might be useful in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and the recommendation for major revision. We appreciate the focus on strengthening the evaluation details and threat model. We respond to each major comment below and will revise the manuscript to address the points raised.

Point-by-point responses
  1. Referee: [Evaluation] The manuscript reports 92.9% policy enforcement accuracy on 14 benchmark tests but provides insufficient details on how these tests were constructed, what specific threat models or unsafe behaviors they cover, baseline comparisons, or error analysis. This limits the ability to assess the robustness of the performance claim.

    Authors: We agree that the evaluation section would benefit from greater detail to support the reported accuracy. The 14 benchmark tests were constructed to cover representative unsafe behaviors including unauthorized file system modifications, execution of privileged shell commands, and external API calls that violate declarative policies. In the revised manuscript we will add an expanded evaluation subsection describing the test construction process, the specific threat models and unsafe behaviors addressed, baseline comparisons against no-enforcement and simple heuristic approaches, and an error analysis of the failure cases. These additions will allow readers to more fully assess the 92.9% enforcement accuracy result. revision: yes

  2. Referee: [Threat Model] The central claim of intercepting every proposed agent action relies on the MCP proxy capturing all execution paths. However, the threat model does not discuss or rule out potential bypass mechanisms, such as direct invocations or alternative paths in environments like Claude Desktop and Cursor, which would render the policy evaluation and approval steps ineffective.

    Authors: The threat model in the manuscript is predicated on AgentWall being installed as the interception layer through the MCP proxy and native OpenClaw plugin, which are the standard integration points for the supported agent environments. We acknowledge that explicit discussion of potential bypass mechanisms such as direct invocations was not included. In the revision we will expand the threat model section to address these concerns, clarifying that bypasses outside the proxy would require compromising the host environment itself and are therefore considered out of scope for the local-agent runtime safety setting targeted by this work. We will also note the single-install compatibility across Claude Desktop, Cursor, and the other listed tools as the mechanism that ensures the proxy is the sole execution path in practice. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description with benchmark measurements

full rationale

The paper introduces AgentWall as an implemented runtime safety layer, detailing its architecture as an MCP proxy and OpenClaw plugin, threat model, policy model, and reports direct empirical measurements of 92.9% policy enforcement accuracy and sub-millisecond overhead on 14 benchmark tests. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on explicit implementation details and external benchmark execution rather than any reduction of results to inputs by construction. The work is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a declarative policy plus human-in-the-loop approval can meaningfully constrain agent behavior in practice; no new mathematical axioms or fitted parameters are introduced beyond standard security engineering practices.

axioms (1)
  • domain assumption: Local AI agents can execute shell commands, modify files, call APIs, and browse the web on the host machine.
    Stated directly in the abstract as the motivation for needing runtime controls.
invented entities (1)
  • AgentWall (no independent evidence)
    purpose: Policy-enforcing runtime proxy and observability layer for local AI agents
    New system introduced and implemented in the paper; no independent evidence outside this work is provided.

pith-pipeline@v0.9.0 · 5769 in / 1284 out tokens · 50657 ms · 2026-05-21T10:01:17.731571+00:00 · methodology


