A Systematic Security Evaluation of OpenClaw and Its Variants
Pith reviewed 2026-05-13 19:29 UTC · model grok-4.3
The pith
Agent systems built on language models expose far more security risks than the models do when used alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentized systems are significantly riskier than their underlying models used in isolation, because coupling model capability, tool use, multi-step planning, and runtime orchestration amplifies weaknesses. Reconnaissance and discovery behaviors are the most common weaknesses, while individual frameworks show distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development.
What carries the argument
A benchmark of 205 test cases that evaluate representative attack behaviors across the full agent execution lifecycle, applied uniformly to six OpenClaw variants under multiple backbone models.
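A uniform evaluation of this kind can be pictured as a grid over frameworks, backbone models, and test cases. The sketch below is illustrative only and is not the paper's harness: the framework names come from the abstract, while `AttackCase`, `run_grid`, the `run_case` callback, and the backbone identifiers are hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Framework names are taken from the abstract; the backbones are placeholders.
FRAMEWORKS = ["OpenClaw", "AutoClaw", "QClaw", "KimiClaw", "MaxClaw", "ArkClaw"]
BACKBONES = ["backbone-a", "backbone-b"]  # hypothetical model identifiers


@dataclass(frozen=True)
class AttackCase:
    """One benchmark entry (hypothetical schema)."""
    case_id: str
    tactic: str  # e.g. "reconnaissance"
    prompt: str


def run_grid(
    cases: List[AttackCase],
    run_case: Callable[[str, str, AttackCase], bool],
) -> Dict[Tuple[str, str], float]:
    """Run every case on every (framework, backbone) pair.

    run_case must return True when the attack behavior succeeded.
    The result maps each pair to the fraction of cases that succeeded.
    """
    rates = {}
    for framework, backbone in product(FRAMEWORKS, BACKBONES):
        successes = sum(run_case(framework, backbone, case) for case in cases)
        rates[(framework, backbone)] = successes / len(cases)
    return rates
```

Keeping the case list fixed across every cell of the grid is what makes framework-level and model-level rates comparable; the same list would also be replayed against the bare model for the agent versus model-only comparison.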
If this is right
- Reconnaissance steps become the dominant weakness that later stages amplify into system-level failures.
- Each framework carries its own high-risk profile, such as credential leakage in one and privilege escalation in another.
- Granting execution capability plus persistent runtime context turns early-stage gaps into concrete breaches.
- Security depends on the joint behavior of model, tools, planning, and orchestration rather than model safety properties alone.
Where Pith is reading between the lines
- Developers of new agent frameworks should simulate full execution traces rather than isolated prompt tests.
- Risk levels may increase further as agents receive more tools or longer-running persistent contexts.
- The same lifecycle-wide gaps are likely to appear in agent systems built on other frameworks beyond the OpenClaw series.
Load-bearing premise
The 205 test cases accurately capture representative real-world attack behaviors across the full agent execution lifecycle without selection bias or incomplete coverage.
What would settle it
An independent evaluation that runs the same six frameworks against a substantially larger or differently constructed set of attack cases and records no substantial vulnerabilities would disprove the central finding.
Original abstract
Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic empirical security evaluation of six OpenClaw-series agent frameworks (OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, ArkClaw) under multiple backbone models. It constructs a benchmark of 205 test cases spanning reconnaissance, credential leakage, lateral movement, privilege escalation, and resource development across the full agent execution lifecycle. The central claims are that all evaluated agents exhibit substantial vulnerabilities, agentized systems are significantly riskier than their underlying models used in isolation, reconnaissance behaviors are the most common weakness, and different frameworks show distinct high-risk profiles; the work concludes that security depends on coupling among model capability, tool use, planning, and orchestration, calling for lifecycle-wide governance.
Significance. If the benchmark is representative, the results provide concrete evidence that agent frameworks amplify risks beyond model-only safety properties, particularly through multi-step execution and persistent context. This is a timely contribution to AI agent security literature, offering a unified evaluation framework that could guide development of more robust orchestration mechanisms and inform deployment policies.
major comments (1)
- [Benchmark construction (Section 3/4)] The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.
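The coverage metric requested here can be as simple as a per-tactic count with empty strata flagged. The following is a minimal sketch under assumptions: the tactic names are the five categories named in the report, and the `coverage_by_tactic` helper and the dict-based case schema are hypothetical.

```python
from collections import Counter

# Tactic names are the categories listed in the report; the schema is assumed.
TACTICS = [
    "reconnaissance",
    "credential leakage",
    "lateral movement",
    "privilege escalation",
    "resource development",
]


def coverage_by_tactic(cases, tactics=TACTICS):
    """Return (rows, uncovered).

    rows holds one (tactic, count, share) tuple per tactic;
    uncovered lists the tactics with no test case at all.
    """
    counts = Counter(case["tactic"] for case in cases)
    total = len(cases)
    rows = [
        (tactic, counts[tactic], counts[tactic] / total if total else 0.0)
        for tactic in tactics
    ]
    uncovered = [tactic for tactic, n, _ in rows if n == 0]
    return rows, uncovered
```

A table built this way would show at a glance whether a high failure rate for one tactic rests on many cases or on a handful.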
minor comments (2)
- [Results] Clarify the exact definition and measurement of 'substantial security vulnerabilities' and 'significantly riskier' (e.g., specific failure rate thresholds or statistical tests used for the agent vs. model-only comparison).
- [Methodology] Provide more detail on how the six frameworks were selected as 'representative' and whether any exclusion criteria were applied during case construction.
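Because the agent and model-only conditions run the identical cases, one statistical test that would fit the first minor comment is a paired test on per-case outcomes. The sketch below uses an exact McNemar test; this is a suggestion of this review page, not a method the manuscript is known to use, and the function name and input format are assumptions.

```python
from math import comb


def mcnemar_exact(agent_unsafe, model_unsafe):
    """Exact two-sided McNemar test on paired per-case outcomes.

    Each argument is a sequence of booleans, one per test case, where True
    means the unsafe behavior occurred. Returns (b, c, p_value), with
    b = cases unsafe only under the agent, c = cases unsafe only under the model.
    """
    if len(agent_unsafe) != len(model_unsafe):
        raise ValueError("outcome lists must cover the same cases")
    b = sum(a and not m for a, m in zip(agent_unsafe, model_unsafe))
    c = sum(m and not a for a, m in zip(agent_unsafe, model_unsafe))
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null, b follows Binomial(n, 0.5); double the smaller tail.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, so cases that fail (or pass) in both conditions do not inflate the apparent difference.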
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. The comment on benchmark construction is well-taken and highlights an area where greater transparency will strengthen the work. We address it point-by-point below and will revise the manuscript accordingly.
- Referee: Benchmark construction (Section 3/4): The 205 test cases are described as covering representative attack behaviors across the agent lifecycle, but the manuscript supplies no sampling methodology, quantitative coverage metrics (e.g., stratified by MITRE ATLAS tactics), inter-annotator agreement, or external validation against public incident logs. This is load-bearing for the central claim, as the comparison to model-only evaluation uses the identical cases; without these details, observed failure rates may reflect benchmark construction choices rather than inherent agent risk.
Authors: We agree that the current manuscript does not supply the requested methodological details. The 205 cases were developed by mapping MITRE ATLAS tactics to agent-specific execution stages (tool invocation, planning, persistent context, and output handling), but this process is only summarized at a high level in Section 3. In the revised version we will add an explicit subsection detailing: (1) the sampling methodology, which enumerated one or more concrete scenarios for each relevant ATLAS tactic and filtered for feasibility within the evaluated agent frameworks; (2) quantitative coverage metrics, including a table showing the number of cases per tactic (e.g., reconnaissance, credential leakage, lateral movement, privilege escalation, resource development); (3) the internal review process and inter-annotator agreement computed on a 20% sample of cases; and (4) references to publicly documented AI-agent incidents and security reports used to ground the scenarios. These additions will make clear that the benchmark is systematically derived from an established framework rather than constructed ad hoc, thereby supporting the direct comparison between agent and model-only failure rates. We view this as a necessary and feasible improvement.
Revision: yes
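The rebuttal does not say which agreement statistic will be reported. As one common choice, the sketch below draws a reproducible 20% sample and computes Cohen's kappa for two annotators; the function names, the seed, and the label values are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_cases(case_ids, fraction=0.2, seed=0):
    """Draw a reproducible random subset of case IDs for double annotation."""
    rng = random.Random(seed)
    k = max(1, round(len(case_ids) * fraction))
    return sorted(rng.sample(list(case_ids), k))


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For a benchmark of 205 cases, a 20% sample under this rounding rule is 41 cases; fixing the seed lets reviewers reconstruct exactly which cases were double-annotated.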
Circularity Check
No circularity: purely empirical benchmark evaluation
The paper conducts a direct empirical security assessment by constructing a benchmark of 205 test cases and measuring failure rates across six agent frameworks and multiple backbone models. No mathematical derivations, parameter fitting, predictions derived from fitted inputs, or self-referential equations appear in the manuscript. All reported results (vulnerability rates, risk profiles by lifecycle stage) are presented as observations from executing the test cases, without any step that reduces by construction to the inputs or prior self-citations. The representativeness of the 205 cases is an external validity concern rather than a circularity issue in the derivation chain.
Forward citations
Cited by 2 Pith papers
- Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
  DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.