pith. machine review for the scientific record.

arxiv: 2604.03131 · v1 · submitted 2026-04-03 · 💻 cs.CR · cs.AI

Recognition: no theorem link

A Systematic Security Evaluation of OpenClaw and Its Variants


Pith reviewed 2026-05-13 19:29 UTC · model grok-4.3

keywords AI agents · security evaluation · LLM agents · vulnerability assessment · tool-augmented models · attack benchmark · privilege escalation · lifecycle security

The pith

Agent systems built on language models expose far more security risks than the models do when used alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six OpenClaw-based agent frameworks with 205 cases that track attack paths from the first step through to final actions. It finds every framework leaks credentials, allows lateral movement, or escalates privileges in ways the base models avoid. These failures arise because agents combine model output with tool calls, multi-step plans, and ongoing runtime memory. The work shows that prompt safety checks miss how early reconnaissance steps turn into concrete system breaches once execution rights are granted. The authors argue that security must address the full agent lifecycle rather than isolated model responses.

Core claim

Agentized systems are significantly riskier than their underlying models used in isolation because the coupling of model capability, tool use, multi-step planning, and runtime orchestration amplifies weaknesses; reconnaissance and discovery behaviors prove the most common entry points, while individual frameworks display distinct profiles such as credential leakage, lateral movement, privilege escalation, and resource development.

What carries the argument

A benchmark of 205 test cases that evaluate representative attack behaviors across the full agent execution lifecycle, applied uniformly to six OpenClaw variants under multiple backbone models.
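
A uniform benchmark run of this kind could be tabulated roughly as follows. This is a minimal sketch, not the paper's harness: it assumes each test case carries a tactic label and each run reduces to a boolean "attack succeeded" outcome. The names `AttackCase`, `Runner`, and `attack_success_rates` are hypothetical, and the runner is passed in because the actual execution environment is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class AttackCase:
    """One benchmark case (hypothetical schema, not taken from the paper)."""
    case_id: str
    tactic: str   # e.g. "reconnaissance", "discovery"
    prompt: str   # adversarial instruction handed to the agent


# Hypothetical runner: (framework, backbone, case) -> True if the attack succeeded.
Runner = Callable[[str, str, AttackCase], bool]


def attack_success_rates(
    cases: Iterable[AttackCase],
    frameworks: Iterable[str],
    backbones: Iterable[str],
    run_case: Runner,
) -> Dict[Tuple[str, str, str], float]:
    """Return the attack success rate keyed by (framework, backbone, tactic)."""
    cases = list(cases)          # reused for every framework/backbone pair
    backbones = list(backbones)  # reused for every framework
    successes: Dict[Tuple[str, str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for framework in frameworks:
        for backbone in backbones:
            for case in cases:
                key = (framework, backbone, case.tactic)
                totals[key] += 1
                successes[key] += int(run_case(framework, backbone, case))
    return {key: successes[key] / totals[key] for key in totals}
```

Keeping the case list identical across every framework and backbone is what makes the per-tactic rates comparable; it is also why the representativeness of the cases matters so much to the conclusions.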

If this is right

  • Reconnaissance steps become the dominant weakness that later stages amplify into system-level failures.
  • Each framework carries its own high-risk profile, such as credential leakage in one and privilege escalation in another.
  • Granting execution capability plus persistent runtime context turns early-stage gaps into concrete breaches.
  • Security depends on the joint behavior of model, tools, planning, and orchestration rather than model safety properties alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of new agent frameworks should simulate full execution traces rather than isolated prompt tests.
  • Risk levels may increase further as agents receive more tools or longer-running persistent contexts.
  • The same lifecycle-wide gaps are likely to appear in agent systems built on other frameworks beyond the OpenClaw series.
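
The difference between an isolated check and a full-trace check can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the tool names (`list_dir`, `read_file`, `http_post`), the sensitive-path markers, and the single rule "a sensitive read followed by an outbound send" are hypothetical and are not taken from the paper or from any OpenClaw framework.

```python
from dataclasses import dataclass
from typing import List

# Illustrative markers only; a real policy would be far more careful.
SENSITIVE_MARKERS = (".ssh", ".env", "credentials")


@dataclass(frozen=True)
class Step:
    tool: str    # hypothetical tool name
    target: str  # path or URL the tool acted on


def step_is_allowed(step: Step) -> bool:
    """Isolated check: each tool is permitted when judged on its own."""
    return step.tool in {"list_dir", "read_file", "http_post"}


def trace_is_safe(trace: List[Step]) -> bool:
    """Trace-level check: reject a sensitive read followed by an outbound send."""
    touched_secret = False
    for step in trace:
        if step.tool == "read_file" and any(m in step.target for m in SENSITIVE_MARKERS):
            touched_secret = True
        elif step.tool == "http_post" and touched_secret:
            return False
    return True


trace = [
    Step("list_dir", "/home/user"),                       # reconnaissance
    Step("read_file", "/home/user/.ssh/id_rsa"),          # credential access
    Step("http_post", "https://example.invalid/upload"),  # exfiltration
]
print(all(step_is_allowed(s) for s in trace))  # per-step view of the trace
print(trace_is_safe(trace))                    # whole-trace view of the same steps
```

Every step passes the isolated check, yet the sequence as a whole is rejected, which is the gap the paper attributes to prompt-level safeguards.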

Load-bearing premise

The 205 test cases accurately capture representative real-world attack behaviors across the full agent execution lifecycle without selection bias or incomplete coverage.

What would settle it

An independent evaluation that runs the same six frameworks against a substantially larger or differently constructed set of attack cases and records no substantial vulnerabilities would disprove the central finding.

Figures

Figures reproduced from arXiv: 2604.03131 by Haichang Gao, Shiguo Lian, Wenjing Zhang, Xiang Wang, Yuhang Wang, Zhaoxiang Liu, Zhenxing Niu.

Figure 1: OpenClaw system architecture and workflow

Figure 2: KimiClaw System Architecture and Workflow

Figure 3: ArkClaw System Architecture and Workflow

Figure 4: System Architecture of QClaw

    The encapsulation layer sits before the core layer, assuming the critical roles of security isolation and capability integration. This layer leverages Tencent’s Security Shield environment and the AI security sandbox technology from PC Manager 18.0 to achieve strict isolation and protection of the local runtime environment, effectively blocking external risks from penetrating t…

Figure 5: Workflow of QClaw

    The gateway distributes commands to three collaborative modules (Pi Agent, Skills, and Browser), where the inference engine parses semantics, the skills module executes business logic, and the browser module completes automated operations, thereby achieving a complete closed-loop command execution. These modules can further invoke underlying resources such as the local file system, terminal …

Figure 6: AutoClaw core system architecture

    AutoClaw extends large-model capabilities into practical automation through task orchestration, tool execution, and multi-model access. It follows a pipeline mechanism of input reception → task …

Figure 7: Core Architecture of the MaxClaw System

    The workflow of MaxClaw follows a cyclic process consisting of input reception, task parsing, model decision-making, tool execution, state updating, and result return. The system first accepts the user’s request, extracts objectives and constraints, and decomposes complex tasks. It then selects an appropriate model for reasoning and planning according to the task typ…

Figure 8: Comparison of Attack Success Rates Across OpenClaw Base Models

Figure 9: Success and Failure Counts by Attack Type

Figure 10: KimiClaw Security Test: Risk Level Analysis

Figure 11: ArkClaw Attack-Type Success Rate Distribution

Figure 12: Composition of successful attacks by attack type in AutoClaw

Figure 13: Attack success rate by attack type in AutoClaw

Figure 14: Attack Success Rates by Category

    The core issues currently exposed by MaxClaw are mainly reflected in its insufficient sensitivity to exploratory, preparatory, and environment-aware requests at the early stage of an attack. Among them, Reconnaissance has the highest attack success rate at 50.00%, followed by Discovery at 48.28%. Defense Evasion and Execution reached 17.86% and 16.67%, respectively. These …
Abstract

Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a systematic empirical security evaluation of six OpenClaw-series agent frameworks (OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, ArkClaw) under multiple backbone models. It constructs a benchmark of 205 test cases spanning reconnaissance, credential leakage, lateral movement, privilege escalation, and resource development across the full agent execution lifecycle. The central claims are that all evaluated agents exhibit substantial vulnerabilities, agentized systems are significantly riskier than their underlying models used in isolation, reconnaissance behaviors are the most common weakness, and different frameworks show distinct high-risk profiles; the work concludes that security depends on coupling among model capability, tool use, planning, and orchestration, calling for lifecycle-wide governance.

Significance. If the benchmark is representative, the results provide concrete evidence that agent frameworks amplify risks beyond model-only safety properties, particularly through multi-step execution and persistent context. This is a timely contribution to AI agent security literature, offering a unified evaluation framework that could guide development of more robust orchestration mechanisms and inform deployment policies.

major comments (1)
  1. [Benchmark construction (Section 3/4)] The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.
minor comments (2)
  1. [Results] Clarify the exact definition and measurement of 'substantial security vulnerabilities' and 'significantly riskier' (e.g., specific failure rate thresholds or statistical tests used for the agent vs. model-only comparison).
  2. [Methodology] Provide more detail on how the six frameworks were selected as 'representative' and whether any exclusion criteria were applied during case construction.
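
Because the agent and model-only conditions are run on identical cases, one statistical test that would fit the first minor comment is an exact McNemar test on the paired outcomes. The sketch below is an assumption about what such a test could look like; the manuscript does not state which test, if any, was used, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(agent: Sequence[bool], model_only: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired attack outcomes.

    agent[i] and model_only[i] say whether case i succeeded as an attack
    against the agent and against the bare model, respectively.
    """
    if len(agent) != len(model_only):
        raise ValueError("both conditions must cover the same cases")
    # Only discordant pairs carry information about a difference.
    agent_only = sum(a and not m for a, m in zip(agent, model_only))
    model_only_wins = sum(m and not a for a, m in zip(agent, model_only))
    n = agent_only + model_only_wins
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail at the smaller discordant count, doubled for two sides.
    tail = sum(comb(n, k) for k in range(min(agent_only, model_only_wins) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value here would support "significantly riskier" in the statistical sense; it would not address whether the cases themselves are representative, which is the separate concern raised in the major comment.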

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comment on benchmark construction is well-taken and highlights an area where greater transparency will strengthen the work. We address it point-by-point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: Benchmark construction (Section 3/4): The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.

    Authors: We agree that the current manuscript does not supply the requested methodological details. The 205 cases were developed by mapping MITRE ATLAS tactics to agent-specific execution stages (tool invocation, planning, persistent context, and output handling), but this process is only summarized at a high level in Section 3. In the revised version we will add an explicit subsection detailing: (1) the sampling methodology, which enumerated one or more concrete scenarios for each relevant ATLAS tactic and filtered for feasibility within the evaluated agent frameworks; (2) quantitative coverage metrics, including a table showing the number of cases per tactic (e.g., reconnaissance, credential leakage, lateral movement, privilege escalation, resource development); (3) the internal review process and inter-annotator agreement computed on a 20% sample of cases; and (4) references to publicly documented AI-agent incidents and security reports used to ground the scenarios. These additions will make clear that the benchmark is systematically derived from an established framework rather than constructed ad hoc, thereby supporting the direct comparison between agent and model-only failure rates. We view this as a necessary and feasible improvement.

    revision: yes
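
The coverage table and agreement figure promised in the response could be computed along these lines. This is a sketch under stated assumptions: the response does not name an agreement statistic, so Cohen's kappa is used here as one common choice for two annotators, and the function names `cases_per_tactic` and `cohens_kappa` are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def cases_per_tactic(tactics: Sequence[str]) -> Dict[str, int]:
    """Coverage table: number of benchmark cases assigned to each tactic."""
    return dict(Counter(tactics))


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same sampled cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both annotators must label the same non-empty sample")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, which matters when one label (for example "attack failed") dominates the sample.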

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper conducts a direct empirical security assessment by constructing a benchmark of 205 test cases and measuring failure rates across six agent frameworks and multiple backbone models. No mathematical derivations, parameter fitting, predictions derived from fitted inputs, or self-referential equations appear in the manuscript. All reported results (vulnerability rates, risk profiles by lifecycle stage) are presented as observations from executing the test cases, without any step that reduces by construction to the inputs or prior self-citations. The representativeness of the 205 cases is an external validity concern rather than a circularity issue in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation rests on the assumption that the constructed test cases are representative of real attack surfaces; no free parameters or invented entities are introduced.

pith-pipeline@v0.9.0 · 5572 in / 979 out tokens · 36870 ms · 2026-05-13T19:29:59.238372+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

    cs.CR 2026-05 unverdicted novelty 6.0

    DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...

  2. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

    cs.CR 2026-05 unverdicted novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
