arxiv: 2604.03131 · v1 · submitted 2026-04-03 · 💻 cs.CR · cs.AI

Recognition: no theorem link

A Systematic Security Evaluation of OpenClaw and Its Variants

Yuhang Wang , Haichang Gao , Zhenxing Niu , Zhaoxiang Liu , Wenjing Zhang , Xiang Wang , Shiguo Lian

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:29 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords AI agentssecurity evaluationLLM agentsvulnerability assessmenttool-augmented modelsattack benchmarkprivilege escalationlifecycle security

0 comments

The pith

Agent systems built on language models expose far more security risks than the models do when used alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six OpenClaw-based agent frameworks with 205 cases that track attack paths from the first step through to final actions. It finds every framework leaks credentials, allows lateral movement, or escalates privileges in ways the base models avoid. These failures arise because agents combine model output with tool calls, multi-step plans, and ongoing runtime memory. The work shows that prompt safety checks miss how early reconnaissance steps turn into concrete system breaches once execution rights are granted. The authors argue that security must address the full agent lifecycle rather than isolated model responses.

Core claim

Agentized systems are significantly riskier than their underlying models used in isolation because the coupling of model capability, tool use, multi-step planning, and runtime orchestration amplifies weaknesses; reconnaissance and discovery behaviors prove the most common entry points, while individual frameworks display distinct profiles such as credential leakage, lateral movement, privilege escalation, and resource development.

What carries the argument

A benchmark of 205 test cases that evaluate representative attack behaviors across the full agent execution lifecycle, applied uniformly to six OpenClaw variants under multiple backbone models.

If this is right

Reconnaissance steps become the dominant weakness that later stages amplify into system-level failures.
Each framework carries its own high-risk profile, such as credential leakage in one and privilege escalation in another.
Granting execution capability plus persistent runtime context turns early-stage gaps into concrete breaches.
Security depends on the joint behavior of model, tools, planning, and orchestration rather than model safety properties alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of new agent frameworks should simulate full execution traces rather than isolated prompt tests.
Risk levels may increase further as agents receive more tools or longer-running persistent contexts.
The same lifecycle-wide gaps are likely to appear in agent systems built on other frameworks beyond the OpenClaw series.

Load-bearing premise

The 205 test cases accurately capture representative real-world attack behaviors across the full agent execution lifecycle without selection bias or incomplete coverage.

What would settle it

An independent evaluation that runs the same six frameworks against a substantially larger or differently constructed set of attack cases and records no substantial vulnerabilities would disprove the central finding.

Figures

Figures reproduced from arXiv: 2604.03131 by Haichang Gao, Shiguo Lian, Wenjing Zhang, Xiang Wang, Yuhang Wang, Zhaoxiang Liu, Zhenxing Niu.

**Figure 2.** Figure 2: KimiClaw System Architecture and Workflow [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: ArkClaw System Architecture and Workflow [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: System Architecture of QClaw The encapsulation layer sits before the core layer, assuming the critical roles of security isolation and capability integration. This layer leverages Tencent’s Security Shield environment and the AI security sandbox technology from PC Manager 18.0 to achieve strict isolation and protection of the local runtime environment, effectively blocking external risks from penetrating t… view at source ↗

**Figure 5.** Figure 5: Workflow of QClaw The gateway distributes commands to three collaborative modules—Pi Agent, Skills, and Browser—where the inference engine parses semantics, the skills module executes business logic, and the browser module completes automated operations, thereby achieving a complete closed-loop command execution. These modules can further invoke underlying resources such as the local file system, terminal … view at source ↗

**Figure 6.** Figure 6: AutoClaw core system architecture AutoClaw extends large-model capabilities into practical automation through task orchestration, tool execution, and multi-model access. It follows a pipeline mechanism of input reception → task 8 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Core Architecture of the MaxClaw System The workflow of MaxClaw follows a cyclic process consisting of input reception, task parsing, model decision-making, tool execution, state updating, and result return. The system first accepts the user’s request, extracts objectives and constraints, and decomposes complex tasks. It then selects an appropriate model for reasoning and planning according to the task typ… view at source ↗

**Figure 8.** Figure 8: Comparison of Attack Success Rates Across OpenClaw Base Models [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Success and Failure Counts by Attack Type [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: KimiClaw Security Test: Risk Level Analysis [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: ArkClaw Attack-Type Success Rate Distribution [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗

**Figure 12.** Figure 12: Composition of successful attacks by attack type in AutoClaw [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗

**Figure 13.** Figure 13: Attack success rate by attack type in AutoClaw [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 14.** Figure 14: Attack Success Rates by Category The core issues currently exposed by MaxClaw are mainly reflected in its insufficient sensitivity to exploratory, preparatory, and environment-aware requests at the early stage of an attack. Among them, Reconnaissance has the highest attack success rate at 50.00%, followed by Discovery at 48.28%. Defense Evasion and Execution reached 17.86% and 16.67%, respectively. These … view at source ↗

read the original abstract

Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper benchmarks six OpenClaw variants and shows agents carry extra security risks beyond their base models, but the 205-case suite lacks clear validation against real attacks.

read the letter

This work runs a security benchmark on OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw across several backbones. It finds every variant fails on reconnaissance and discovery most often, with other frameworks showing distinct problems like credential leaks or privilege escalation. The central observation is that giving an agent execution rights and persistent context turns early weaknesses into bigger failures, which model-only tests miss. That part is useful because it separates framework-level exposure from raw model safety. The 205 cases span the full lifecycle and produce consistent patterns across models, which gives the comparison some weight. The main soft spot is the test construction itself. The paper lists the attack categories but gives no sampling method, no coverage numbers against MITRE ATLAS or incident logs, and no external check on whether the cases are representative. Without that, the high failure rates could partly reflect how the cases were written rather than how the agents behave in practice. The model-only baseline uses the same cases, so any bias there affects both sides. This is the kind of paper that belongs in a reading group for people who deploy tool-using agents. It is not yet tight enough for a strong citation, but the empirical angle is clear enough that a serious editor should send it out for review so referees can examine the case list and any replication details.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a systematic empirical security evaluation of six OpenClaw-series agent frameworks (OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, ArkClaw) under multiple backbone models. It constructs a benchmark of 205 test cases spanning reconnaissance, credential leakage, lateral movement, privilege escalation, and resource development across the full agent execution lifecycle. The central claims are that all evaluated agents exhibit substantial vulnerabilities, agentized systems are significantly riskier than their underlying models used in isolation, reconnaissance behaviors are the most common weakness, and different frameworks show distinct high-risk profiles; the work concludes that security depends on coupling among model capability, tool use, planning, and orchestration, calling for lifecycle-wide governance.

Significance. If the benchmark is representative, the results provide concrete evidence that agent frameworks amplify risks beyond model-only safety properties, particularly through multi-step execution and persistent context. This is a timely contribution to AI agent security literature, offering a unified evaluation framework that could guide development of more robust orchestration mechanisms and inform deployment policies.

major comments (1)

[Benchmark construction (Section 3/4)] Benchmark construction (Section 3/4): The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.

minor comments (2)

[Results] Clarify the exact definition and measurement of 'substantial security vulnerabilities' and 'significantly riskier' (e.g., specific failure rate thresholds or statistical tests used for the agent vs. model-only comparison).
[Methodology] Provide more detail on how the six frameworks were selected as 'representative' and whether any exclusion criteria were applied during case construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. The comment on benchmark construction is well-taken and highlights an area where greater transparency will strengthen the work. We address it point-by-point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: Benchmark construction (Section 3/4): The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.

Authors: We agree that the current manuscript does not supply the requested methodological details. The 205 cases were developed by mapping MITRE ATLAS tactics to agent-specific execution stages (tool invocation, planning, persistent context, and output handling), but this process is only summarized at a high level in Section 3. In the revised version we will add an explicit subsection detailing: (1) the sampling methodology, which enumerated one or more concrete scenarios for each relevant ATLAS tactic and filtered for feasibility within the evaluated agent frameworks; (2) quantitative coverage metrics, including a table showing the number of cases per tactic (e.g., reconnaissance, credential leakage, lateral movement, privilege escalation, resource development); (3) the internal review process and inter-annotator agreement computed on a 20 % sample of cases; and (4) references to publicly documented AI-agent incidents and security reports used to ground the scenarios. These additions will make clear that the benchmark is systematically derived from an established framework rather than constructed ad hoc, thereby supporting the direct comparison between agent and model-only failure rates. We view this as a necessary and feasible improvement. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper conducts a direct empirical security assessment by constructing a benchmark of 205 test cases and measuring failure rates across six agent frameworks and multiple backbone models. No mathematical derivations, parameter fitting, predictions derived from fitted inputs, or self-referential equations appear in the manuscript. All reported results (vulnerability rates, risk profiles by lifecycle stage) are presented as observations from executing the test cases, without any step that reduces by construction to the inputs or prior self-citations. The representativeness of the 205 cases is an external validity concern rather than a circularity issue in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation rests on the assumption that the constructed test cases are representative of real attack surfaces; no free parameters or invented entities are introduced.

pith-pipeline@v0.9.0 · 5572 in / 979 out tokens · 36870 ms · 2026-05-13T19:29:59.238372+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
cs.CR 2026-05 unverdicted novelty 6.0

DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
cs.CR 2026-05 unverdicted novelty 5.0

A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · cited by 2 Pith papers

[1]

DiraBook

Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, et al. Taming openclaw: Security analysis and mitigation of autonomous llm agent threats.arXiv preprint arXiv:2603.11619,

work page arXiv
[2]

Os-harm: A benchmark for measuring safety of computer use agents

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, J Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents. InICML 2025 Workshop on Computer Use Agents. Ruiqi Li, Zhiqiang Wang, Yunhao Yao, and Xiang-Yang Li. Mcp-itp: An automated framework for implicit tool poisoning in mcp.arXiv ...

work page arXiv 2025
[3]

Clawkeeper: Comprehensive safety protection for openclaw agents through skills, plugins, and watchers.arXiv preprint arXiv:2603.24414,

Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qi- wei Ye, Yiming Hei, Xi Zhang, et al. Clawkeeper: Comprehensive safety protection for openclaw agents through skills, plugins, and watchers.arXiv preprint arXiv:2603.24414,

work page arXiv
[4]

Don’t let the claw grip your hand: A security analysis and defense framework for openclaw.arXivpreprintarXiv:2603.10387, 2026

Zhengyang Shan, Jiayun Xin, Yue Zhang, and Minghui Xu. Don’t let the claw grip your hand: A security analysis and defense framework for openclaw.arXiv preprint arXiv:2603.10387,

work page arXiv
[5]

Memory poisoning attack and defense on memory based llm-agents,

Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Mallik, and Shuchi Mishra. Memory poisoning attack and defense on memory based llm-agents. arXiv preprint arXiv:2601.05504,

work page arXiv
[6]

https://doi.org/10

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, and David Wagner. Defending against prompt injection with datafilter.arXiv preprint arXiv:2510.19207, 2025a. Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, and Zhaoxiang Liu. From assistant to double agent: Formalizing and benchmarking attacks o...

work page arXiv
[7]

Mcptox: A benchmark for tool poisoning attack on real- world mcp servers.arXiv preprint arXiv:2508.14925, 2025b

38 Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning attack on real- world mcp servers.arXiv preprint arXiv:2508.14925, 2025b. Zonghao Ying, Xiao Yang, Siyang Wu, Yumeng Song, Yang Qu, Hainan Li, Tianlin Li, Jiakai Wang, Aishan Liu, and Xianglon...

work page arXiv