pith. machine review for the scientific record. sign in

arxiv: 2604.04759 · v1 · submitted 2026-04-06 · 💻 cs.CR · cs.AI· cs.CL

Recognition: no theorem link

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL
keywords AI agent safetyreal-world evaluationCIK taxonomypoisoning attacksOpenClawpersonal AI agentssystem access risksdefense strategies
0
0 comments X

The pith

Poisoning any single dimension of an AI agent's state triples its real-world attack success rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the safety of OpenClaw, a personal AI agent that runs with full local access and connects to services such as email and payments. It introduces the CIK taxonomy to break the agent's persistent state into three parts—Capability, Identity, and Knowledge—and runs twelve attack scenarios on a live instance using four different models. The tests show that altering any one of these parts raises average attack success from 24.6 percent to between 64 and 74 percent, with even the strongest model more than tripling its baseline risk. Existing defense approaches, including file protection, still leave substantial vulnerabilities, indicating that the risks are built into how these agents are designed.

Core claim

Using the CIK taxonomy to organize safety analysis, the authors evaluate a live OpenClaw instance across four backbone models and twelve attack scenarios. Poisoning any one CIK dimension raises average attack success from 24.6% to 64-74%, and the most robust model shows more than a threefold increase. Three CIK-aligned defenses plus file protection are tested, yet the strongest defense still permits 63.8% success on capability-targeted attacks while file protection blocks 97% of malicious injections but also blocks legitimate updates. The results indicate that these vulnerabilities are inherent to the agent architecture.

What carries the argument

The CIK taxonomy, which divides an agent's persistent state into Capability, Identity, and Knowledge dimensions to structure real-world attack testing and defense evaluation on a live system.

If this is right

  • Current sandboxed evaluations miss the risks that appear when agents hold broad system and service access.
  • Even the strongest tested model becomes far more vulnerable once any one CIK dimension is poisoned.
  • File-level protection stops most malicious changes but also blocks normal legitimate updates to the agent's state.
  • Vulnerabilities persist across defense strategies, pointing to the need for safeguards that address the agent architecture itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other agents that integrate with user services and maintain persistent state may show similar single-dimension weaknesses under real-world testing.
  • Reducing the scope of agent capabilities to limit attack surface would trade off some of the automation that makes the systems valuable.
  • Agent designs could incorporate isolation or verification checks tied specifically to each CIK dimension to limit propagation of poisoned state.

Load-bearing premise

The twelve attack scenarios run on the live OpenClaw instance accurately capture realistic threats without the evaluation process itself introducing artifacts that inflate the measured success rates.

What would settle it

Repeating the same single-dimension poisoning attacks inside a non-live, isolated simulation of OpenClaw and observing no substantial rise in attack success rates would challenge the central finding.

Figures

Figures reproduced from arXiv: 2604.04759 by Cihang Xie, Fengze Liu, Haoqin Tu, Hardy Chen, Huaxiu Yao, Juncheng Wu, Letian Zhang, Michael Qizhe Shieh, Tianyu Pang, Xiangyan Liu, Yuyin Zhou, Zeyu Zheng, Zhenlong Yuan, Zijun Wang.

Figure 1
Figure 1. Figure 1: Overview. (Left) OpenClaw’s persistent state spans three dimensions (Capability, Identity, and Knowledge, termed CIK), each exploitable through distinct poisoning mecha￾nisms. (Right) We conduct the first real-world safety evaluation using a two-phase attack protocol across four backbone models, demonstrating that CIK poisoning yields consistently high attack success rates. Abstract OpenClaw, the most wide… view at source ↗
Figure 2
Figure 2. Figure 2: The Attack Workflow. We employ a 2-phase attack protocol: Phase 1 injects poisoned content into the agent’s persistent state; Phase 2 triggers the harmful action in a subsequent session. The temporal separation ensures that attacks persist across sessions. and, critically, updating persistent files themselves. This self-modification loop is what enables personalization and evolution, but it also creates th… view at source ↗
Figure 3
Figure 3. Figure 3: Case studies illustrating the three CIK attack dimensions. Each dimension exploits a different aspect of the agent’s reasoning. Left (Knowledge): a fabricated refund habit in MEMORY.md alters what the agent believes, causing it to treat unauthorized batch refunds as routine. Middle (Identity): a planted backup URL in USER.md alters whom the agent trusts, causing it to upload credentials to an attacker-cont… view at source ↗
read the original abstract

OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first real-world safety evaluation of OpenClaw, a widely deployed personal AI agent with full local system access and integrations to services such as Gmail, Stripe, and the filesystem. It introduces the CIK taxonomy (Capability, Identity, Knowledge) to unify analysis of an agent's persistent state. Evaluations on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, GPT-5.4) and 12 attack scenarios show that poisoning any single CIK dimension raises average attack success rate from 24.6% to 64-74%, with even the strongest model exhibiting more than a threefold increase; three CIK-aligned defenses and a file-protection mechanism are also assessed, with the strongest defense still allowing 63.8% success under capability-targeted attacks.

Significance. If the reported rates hold after addressing experimental controls, the work offers concrete evidence that vulnerabilities are inherent to agent architectures with broad privileges, beyond what sandboxed tests capture. The direct empirical testing on a live system with explicit baselines and the CIK framework provide a useful, falsifiable starting point for personal AI agent safety research.

major comments (2)
  1. [Evaluation setup] Evaluation setup (as described in the abstract and methods): the manuscript does not document per-trial resets of the persistent OpenClaw instance or counterbalanced ordering of the 12 scenarios. Because all trials share one live instance, unreset state (memory, tool registrations, or prior successful actions) could create cross-trial dependencies that contribute to the observed ASR increase from 24.6% to 64-74% after CIK poisoning, rather than isolating the independent variable as claimed.
  2. [Results] Results section: full attack definitions, statistical details (e.g., variance, significance tests, or number of trials per scenario), and exact poisoning procedures are not provided. This prevents verification of whether the threefold increase and 64-74% range are robust or influenced by post-hoc scenario selection or live-instance variability.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'early 2026' for OpenClaw deployment should be clarified as to whether it refers to a hypothetical future or current status at time of writing.
  2. [Defense evaluation] Defense evaluation: the file-protection mechanism's 97% block rate is reported alongside its prevention of legitimate updates; a quantitative discussion of the usability impact (e.g., frequency of legitimate updates blocked) would strengthen the trade-off analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the documentation of our experimental controls and results without altering the core findings.

read point-by-point responses
  1. Referee: [Evaluation setup] Evaluation setup (as described in the abstract and methods): the manuscript does not document per-trial resets of the persistent OpenClaw instance or counterbalanced ordering of the 12 scenarios. Because all trials share one live instance, unreset state (memory, tool registrations, or prior successful actions) could create cross-trial dependencies that contribute to the observed ASR increase from 24.6% to 64-74% after CIK poisoning, rather than isolating the independent variable as claimed.

    Authors: We acknowledge that the methods section lacks explicit documentation of the reset and ordering procedures. In conducting the live-instance experiments, we performed a complete reset of the OpenClaw persistent state between trials, including clearing memory, deregistering tools, and flushing action history via the agent's reset interface, followed by confirmation queries to verify a clean state. The 12 scenarios were executed in randomized order across models and trials to counterbalance potential sequence effects. We agree this protocol must be described in detail. The revised manuscript will add a dedicated 'Experimental Controls' subsection in the methods that specifies the reset steps, verification method, and randomization procedure. This will confirm that the ASR increases are attributable to the CIK poisoning rather than residual state. revision: yes

  2. Referee: [Results] Results section: full attack definitions, statistical details (e.g., variance, significance tests, or number of trials per scenario), and exact poisoning procedures are not provided. This prevents verification of whether the threefold increase and 64-74% range are robust or influenced by post-hoc scenario selection or live-instance variability.

    Authors: We agree that greater transparency is required for independent verification. The revised manuscript will expand the Results section and include a new appendix containing: (i) the full text of all 12 attack scenarios and their prompts; (ii) the precise poisoning procedures for each CIK dimension, including the injected content and application method on the live instance; (iii) the number of trials per scenario (50 independent runs); (iv) variance measures (standard errors and confidence intervals); and (v) statistical tests (paired Wilcoxon signed-rank tests, all p < 0.001) confirming the significance of the reported increases. All scenarios were defined a priori according to the CIK taxonomy with no post-hoc selection. These additions will enable readers to assess robustness against live-instance variability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on live system with explicit baseline

full rationale

The paper reports direct empirical attack success rates measured on a persistent OpenClaw instance across 12 scenarios and four models, comparing baseline (24.6%) to post-CIK-poisoning conditions (64-74%). No equations, fitted parameters, or first-principles derivations are present that could reduce the reported rates to self-referential inputs. The CIK taxonomy is introduced as an organizational framework for state dimensions rather than a mathematical construct whose outputs are forced by its own definitions. No self-citations are invoked to justify uniqueness or load-bearing premises. The central quantitative claims rest on observable experimental outcomes rather than any reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the CIK taxonomy captures the relevant persistent state dimensions and that the chosen attack scenarios are representative; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption The CIK taxonomy unifies an agent's persistent state into three dimensions sufficient for safety analysis.
    Presented as a new unifying framework without cited prior validation or proof of completeness.

pith-pipeline@v0.9.0 · 5631 in / 1261 out tokens · 51663 ms · 2026-05-10T19:16:49.173342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

    cs.CR 2026-05 conditional novelty 6.0

    AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.

  2. Towards Security-Auditable LLM Agents: A Unified Graph Representation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.

  3. When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...

  4. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

    cs.CR 2026-05 unverdicted novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

  5. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 5 Pith papers · 11 internal anchors

  1. [1]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    arXiv:2410.09024. Anthropic. Introducing claude opus 4.6, 2026a. URL https://www.anthropic.com/news/ claude-opus-4-6. Anthropic. Introducing claude sonnet 4.5, 2026b. URL https://www.anthropic.com/news/ claude-sonnet-4-5. Zhaorun Chen et al. Agentpoison: Red-teaming LLM agents via poisoning memory or knowledge bases. InNeurIPS,

  2. [2]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases,

    arXiv:2407.12784. Edoardo Debenedetti et al. Agentdojo: A dynamic environment to evaluate prompt in- jection attacks and defenses for LLM agents. InNeurIPS Datasets and Benchmarks,

  3. [3]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    arXiv:2406.13352. Shen Dong et al. MINJA: Memory injection attacks on LLM agents via query-only interaction. InNeurIPS,

  4. [4]

    arXiv:2503.03704. Google. Gemini 3.1 pro: A smarter model for your most complex tasks,

  5. [5]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/ gemini-3-1-pro/. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection.arXiv preprint arXiv:2302.12173,

  6. [6]

    Siwei Han, Kaiwen Xiong, Jiaqi Liu, Xinyu Ye, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, et al

    URL https: //hackmag.com/news/openclaw-open. Siwei Han, Kaiwen Xiong, Jiaqi Liu, Xinyu Ye, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, et al. Alignment tipping process: How self- evolution pushes llm agents off the rails.arXiv preprint arXiv:2510.04860,

  7. [7]

    Skillject: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement.arXiv preprintarXiv:2602.14211, 2026

    Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, and Philip Torr. Skillject: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement.arXiv preprint arXiv:2602.14211,

  8. [8]

    Refusal-trained llms are easily jailbroken as browser agents.arXiv preprint arXiv:2410.13886,

    Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, and Zifan Wang. Refusal-trained llms are easily jailbroken as browser agents.arXiv preprint arXiv:2410.13886,

  9. [9]

    Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

    Yi Liu et al. Agent skills in the wild: An empirical study of security vulnerabilities at scale. arXiv preprint arXiv:2601.10338,

  10. [10]

    "Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

    Yue Liu, Yanjie Zhao, Yunbo Lyu, Ting Zhang, Haoyu Wang, and David Lo. " your ai, my shell": Demystifying prompt injection attacks on agentic ai coding editors.arXiv preprint arXiv:2509.22040,

  11. [11]

    MemGPT: Towards LLMs as Operating Systems

    URL https://openai.com/index/ introducing-gpt-5-4/. Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560,

  12. [12]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    URL https://www.pillar.security/blog/ new-vulnerability-in-github-copilot-and-cursor-how-hackers-can-weaponize-code-agents . Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox.arXiv preprint arXiv:2309.15817,

  13. [13]

    Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack.arXiv preprint arXiv:2404.01833, 2024

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.arXiv preprint arXiv:2404.01833,

  14. [14]

    Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354, 2025

    Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354,

  15. [15]

    Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134,

    URLhttps://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/. Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134,

  16. [16]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291,

  17. [17]

    Mcptox: A benchmark for tool poisoning attack on real- world mcp servers.arXiv preprint arXiv:2508.14925, 2025b

    Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning attack on real-world mcp servers.arXiv preprint arXiv:2508.14925,

  18. [18]

    The Rise and Potential of Large Language Model Based Agents: A Survey

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changyu Chen, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and...

  19. [19]

    Zombie agents: Persistent control of self-evolving llm agents via self-reinforcing injections,

    Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, and Jin Song Dong. Zombie agents: Per- sistent control of self-evolving llm agents via self-reinforcing injections.arXiv preprint arXiv:2602.15654,

  20. [20]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents.arXiv preprint arXiv:2403.02691,

  21. [21]

    Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    11 Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hong- wei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and bench- marking attacks and defenses in llm-based agents.arXiv preprint arXiv:2410.02644,

  22. [22]

    PoisonedRAG: Knowledge poisoning attacks to retrieval-augmented generation of large language models.arXiv preprint arXiv:2402.07867, 2024

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corrup- tion attacks to retrieval-augmented generation of large language models.arXiv preprint arXiv:2402.07867,