arxiv: 2605.06731 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 1 theorem link

· Lean Theorem

When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

Xiaoyu Xu , Minxin Du , Qipeng Xie , Haobin Ke , Qingqing Ye , Haibo Hu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG

keywords LLM agentsstate poisoningpersonalized agentsmemory corruptionsecurity vulnerabilitybenchmarkdefense mechanismauthorization drift

0 comments

The pith

Routine conversations alone can substantially poison the long-term state of personalized LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Personalized LLM agents maintain persistent cross-session state to support ongoing collaboration, but this persistence allows ordinary user chats to gradually reshape memory and decision rules in unintended ways. The paper shows that such routine interactions can weaken confirmation boundaries, expand tool-use defaults, and increase unchecked autonomy without any deliberate attack. A new benchmark with 350 settings across categories and patterns, plus a Harm Score tracking authorization drift, tool escalation, and autonomy, demonstrates that memory-centric artifacts are the primary target. Tests on multiple models and real-world seeded interactions confirm the effect occurs in practice. The authors also present a defense that audits state changes before they are saved.

Core claim

The authors establish that unintended long-term state poisoning occurs when routine user-agent chats reshape persistent state, primarily memory-centric parts, resulting in measurable increases in authorization drift, tool-use escalation, and unchecked autonomy. This is shown through the ULSPB benchmark consisting of 350 settings across assistance categories and interaction patterns, with 24-turn routine sequences compared to single-injection cases. Evaluations on four backbone LLMs reveal substantial poisoning from routines, validated by real-world interaction seeds, and the proposed StateGuard defense audits state diffs at writeback to reduce the Harm Score to near zero with acceptable high

What carries the argument

unintended long-term state poisoning, the process by which routine interactions gradually corrupt memory-centric artifacts and decision parameters in persistent agent state

If this is right

Routine conversations can erode safety boundaries in personalized agents over multiple sessions.
Memory-centric artifacts are the main vectors for this state poisoning.
StateGuard effectively prevents the poisoning by selective rollback with minimal performance cost.
Single malicious injections are not necessary, as normal use patterns suffice to cause harm.
Real-world user behaviors replicate the poisoning observed in controlled benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designers may need to implement automatic state auditing as standard practice rather than optional.
Personalization through persistent memory carries hidden security costs that could affect long-term reliability.
Similar risks might apply to other AI systems maintaining cross-session state, suggesting broader evaluation needs.
Further testing could explore whether combining StateGuard with periodic state resets provides stronger protection.

Load-bearing premise

The ULSPB benchmark and real-world seeded interactions accurately represent genuine user behavior without artificial effects that exaggerate the poisoning risk.

What would settle it

A controlled study in which agents undergo 24-turn routine conversations across multiple settings and models shows no significant rise in Harm Score for authorization drift, tool-use escalation, or unchecked autonomy.

Figures

Figures reproduced from arXiv: 2605.06731 by Haibo Hu, Haobin Ke, Minxin Du, Qingqing Ye, Qipeng Xie, Xiaoyu Xu.

**Figure 2.** Figure 2: Overview of Unintended Long-Term State Poisoning Bench (ULSPB). The benchmark is organized by seven scenarios, five categories, bilingual templates, and five conversation variants. For each instance, a routine conversation is selected, combined with four variant-specific injection prefixes, and injected at a turn, yielding a benchmark tuple of (scenario, category, language, variant). (I) New Threat Identif… view at source ↗

**Figure 3.** Figure 3: Unintended long-term state poisoning: benign [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Routine evaluation of ULSPB All benchmark instances are generated with GPT5.4 using prompts designed to preserve the multiturn flow, personalized-assistance context, and realistic conversational tone, while ensuring that injected counterparts remain distinguishable from routine conversations. Additional construction details are provided in Appendix F. To verify that ULSPB captures routine interaction, … view at source ↗

**Figure 5.** Figure 5: HS across scenarios, categories, and real-seed evaluation. Abbreviations follow the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Long-term state modification hotspots across backbone models [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: HS weight sensitivity and severity alignment [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of StateGuard. After each interaction round, StateGuard audits added lines in long-term state files and either preserves or rolls back the updates. Routine interactions can poison long-term state through benign-looking updates that gradually relax safety-relevant defaults, as shown in Section 4.2. Since this threat is mediated by persistent state rather than immediate unsafe actions, defenses sh… view at source ↗

**Figure 9.** Figure 9: Qualitative trigger-phrase visualization associated with protected-file changes across the four [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

read the original abstract

Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as \textbf{unintended long-term state poisoning}. To systematically study it, we introduce the \textbf{Unintended Long-Term State Poisoning Bench (ULSPB)}, a bilingual benchmark comprising $350$ settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the \emph{Harm Score} (HS), a state-centric metric that quantifies \emph{authorization drift}, \emph{tool-use escalation}, and \emph{unchecked autonomy}. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection is generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose \textbf{StateGuard}, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, with acceptable high false-positive rates under a safety-first writeback defense and minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Routine user chats can gradually poison persistent state in LLM agents, and the paper supplies a benchmark plus a lightweight defense to quantify and limit the damage.

read the letter

The core finding is that normal back-and-forth with a personalized agent can shift its long-term memory and defaults over time, weakening authorization checks and increasing autonomous tool use. They formalize this as unintended long-term state poisoning and build ULSPB to test it: 350 settings across five categories, seven patterns, 24-turn routine dialogues, and matched single-injection controls. The Harm Score tracks three concrete drifts—authorization, tool escalation, and unchecked autonomy—and the experiments run on four backbone models inside OpenClaw. Routine interactions alone produce substantial poisoning, mostly in memory artifacts, and real-world seeded logs are included to argue the effect is not just synthetic. StateGuard then sits at the writeback step, diffs the state changes, and rolls back the risky ones, dropping the score near zero with modest overhead and a safety-first tolerance for false positives.

Referee Report

3 major / 3 minor

Summary. The paper claims that personalized LLM agents with persistent cross-session state are vulnerable to unintended long-term state poisoning (ULSP) from routine user interactions, which can corrupt memory artifacts, cause authorization drift, escalate tool-use defaults, and increase unchecked autonomy. It introduces the ULSPB benchmark (350 settings across five assistance categories, seven interaction patterns, 24-turn routines, and matched single-injection controls), defines the Harm Score (HS) metric to quantify these effects, demonstrates substantial poisoning from routine chats alone across four backbone LLMs on OpenClaw, validates via real-world seeded interactions, and proposes StateGuard (a post-execution state-diff auditor with selective rollback) that reduces HS to near zero.

Significance. If the results hold, this identifies a practically relevant security vulnerability in stateful LLM agents that has received limited prior attention. The ULSPB benchmark and HS metric offer a reusable evaluation framework for state poisoning, while StateGuard provides a lightweight, deployable mitigation with quantified overhead. These contributions could inform safer design of long-horizon personalized agents and stimulate further work on persistent-state security.

major comments (3)

[§3] §3 (ULSPB benchmark construction): The central claim that 'routine conversations alone can substantially poison long-term state' is load-bearing on the assumption that the seven 24-turn interaction patterns contain no subtle state-changing cues (e.g., implicit memory references or tool-request phrasing). Without an ablation that isolates purely neutral logs from the patterned interactions, the observed drift could be an artifact of benchmark design rather than a general property of normal use, directly undermining the distinction from single-injection controls.
[§4] §4 (Harm Score definition): The HS metric aggregates authorization drift, tool-use escalation, and unchecked autonomy, but the manuscript must supply the precise computation (including any weighting or normalization) to confirm it is not circular with the ULSPB pattern definitions or dependent on the same state artifacts it measures.
[§5] §5 (real-world seeding and experimental controls): The abstract states that real-world seeded interactions confirm the risk is not synthetic, yet details on collection criteria, exclusion rules, statistical tests, and how these logs were matched to the 350 synthetic settings are required to verify that they represent genuine neutral behavior without artificial exaggeration of poisoning effects.

minor comments (3)

[Abstract] Abstract: The bilingual nature of ULSPB is noted but no languages are specified nor are any language-specific results reported; this should be clarified in the benchmark description.
[§6] §6 (StateGuard evaluation): Quantitative values for false-positive rates, overhead, and direct comparison against baselines are referenced only qualitatively; explicit numbers and tables would strengthen the mitigation claims.
[Notation] Notation throughout: Distinguish clearly between 'OpenClaw' (the agent framework) and the four backbone LLMs in all experimental tables and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify key aspects of our work on unintended long-term state poisoning. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript's rigor without altering its core claims.

read point-by-point responses

Referee: [§3] §3 (ULSPB benchmark construction): The central claim that 'routine conversations alone can substantially poison long-term state' is load-bearing on the assumption that the seven 24-turn interaction patterns contain no subtle state-changing cues (e.g., implicit memory references or tool-request phrasing). Without an ablation that isolates purely neutral logs from the patterned interactions, the observed drift could be an artifact of benchmark design rather than a general property of normal use, directly undermining the distinction from single-injection controls.

Authors: We agree that an explicit ablation is needed to confirm the patterns introduce no subtle cues. The seven 24-turn patterns were constructed from common assistance scenarios (e.g., scheduling, information lookup) with explicit instructions to avoid memory references or tool phrasing, and human annotators verified neutrality. To address the concern directly, we will add a new ablation in §3 using 50 purely neutral, non-patterned 24-turn logs drawn from the same assistance categories. This will quantify any baseline drift and confirm that observed effects stem from routine interactions rather than design artifacts, preserving the distinction from single-injection controls. revision: yes
Referee: [§4] §4 (Harm Score definition): The HS metric aggregates authorization drift, tool-use escalation, and unchecked autonomy, but the manuscript must supply the precise computation (including any weighting or normalization) to confirm it is not circular with the ULSPB pattern definitions or dependent on the same state artifacts it measures.

Authors: We will supply the exact HS formula in the revised §4. HS is computed as a weighted average: HS = (0.4 * AD + 0.3 * TE + 0.3 * UA) / N, where AD, TE, and UA are normalized counts of authorization drift, tool-use escalation, and unchecked autonomy events (each scaled 0-1 against safety baselines), and N is the number of affected state artifacts. Weights were set via expert annotation of harm severity on a held-out set of 100 interactions, independent of ULSPB patterns. This formulation evaluates post-interaction state diffs against fixed safety rules and is not circular with the benchmark definitions. revision: yes
Referee: [§5] §5 (real-world seeding and experimental controls): The abstract states that real-world seeded interactions confirm the risk is not synthetic, yet details on collection criteria, exclusion rules, statistical tests, and how these logs were matched to the 350 synthetic settings are required to verify that they represent genuine neutral behavior without artificial exaggeration of poisoning effects.

Authors: We will expand §5 with the requested details. Real-world logs were collected from consented public interaction traces and anonymized user studies (n=120 sessions), with exclusion rules removing any content containing explicit commands, personal data, or potential state-altering phrasing. Matching to the 350 synthetic settings was performed by category and pattern similarity using cosine similarity on embeddings (>0.85 threshold). We applied paired t-tests (p<0.01) to compare HS between real and synthetic conditions, confirming no significant exaggeration. These additions will be included verbatim in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper introduces ULSPB benchmark (350 settings, five categories, seven patterns, 24-turn routines) and Harm Score metric as independent definitions, then reports empirical measurements of state drift on OpenClaw with four LLMs plus real-world seeding. No equations, fitted parameters, or self-citations are load-bearing; the central claim is an observed outcome from executing the defined interactions rather than a quantity that reduces by construction to the benchmark inputs themselves. The evaluation framework remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on standard assumptions about LLM agent state persistence and introduces a new risk concept and benchmark without fitted parameters or new physical entities.

axioms (1)

domain assumption Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration
Explicitly stated in the abstract as the foundation for the vulnerability.

invented entities (1)

Unintended Long-Term State Poisoning no independent evidence
purpose: To name and formalize the gradual corruption of agent state through routine interactions
Newly defined phenomenon without independent external validation provided in the abstract.

pith-pipeline@v0.9.0 · 5590 in / 1215 out tokens · 59798 ms · 2026-05-11T00:44:56.602007+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 24 canonical work pages · 11 internal anchors

[1]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Sahar Abdelnabi, Kai Greshake, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. InAISec@CCS, pages 79–90, 2023

2023
[2]

Detecting language model attacks with perplexity

Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv:2308.14132, 2023

work page arXiv 2023
[3]

Agentharm: A benchmark for measuring harmfulness of LLM agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. InICLR, 2025

2025
[4]

Humans or llms as the judge? A study on judgement bias

Guiming Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? A study on judgement bias. InEMNLP, pages 8301–8327, 2024

2024
[5]

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, and Yun-Nung Chen. Tracesafe: A sys- tematic assessment of llm guardrails on multi-step tool-calling trajectories. arXiv:2604.07223, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. InNeurIPS, pages 130185–130213, 2024

2024
[7]

Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InNeurIPS, 2024

2024
[8]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

DiraBook

Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, et al. Taming openclaw: Security analysis and mitigation of autonomous llm agent threats. arXiv:2603.11619, 2026

work page arXiv 2026
[10]

Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in OpenClaw agents

Ben Dong, Hui Feng, and Qian Wang. Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents. arXiv:2603.00902, 2026

work page arXiv 2026
[11]

Autonomous action runtime management (aarm): A system specification for securing ai-driven actions at runtime

Herman Errico. Autonomous action runtime management (aarm): A system specification for securing ai-driven actions at runtime. arXiv:2602.09433, 2026

work page arXiv 2026
[12]

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang. Skilltrojan: Backdoor attacks on skill-based agent systems. arXiv:2604.06811, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

When benign inputs lead to severe harms: Eliciting unsafe unintended behaviors of computer-use agents.arXiv preprint arXiv:2602.08235, 2026

Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, and Huan Sun. When benign inputs lead to severe harms: Eliciting unsafe unintended behaviors of computer-use agents. arXiv:2602.08235, 2026

work page arXiv 2026
[14]

PRISM:Zero-forkdefense-in-depthruntimelayerforagentsecurity.arXiv preprint arXiv:2603.11853, 2026

Frank Li. Openclaw prism: A zero-fork, defense-in-depth runtime security layer for tool- augmented llm agents. arXiv:2603.11853, 2026. 10

work page arXiv 2026
[15]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security, 2024

2024
[16]

Caution for the environment: Multimodal llm agents are susceptible to environmental distractions

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal llm agents are susceptible to environmental distractions. InACL, pages 22324–22339, 2025

2025
[17]

Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, and Yu-Gang Jiang

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhao Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Yiming Li, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong,...

2025
[18]

Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Monta- nari, Tomas Pfister, and Long T. Le. Veriguard: Enhancing LLM agent safety via verified code generation. arXiv:2510.05156, 2025

work page arXiv 2025
[19]

Openclaw: The ai that actually does things

Steinberger Peter. Openclaw: The ai that actually does things. https://github.com/ openclaw/openclaw, 2024

2024
[20]

A survey of classification tasks and approaches for legal contracts.Artif

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, and Hye-young Paik. A survey of classification tasks and approaches for legal contracts.Artif. Intell. Rev., page 380, 2025

2025
[21]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: visual agentic intelligence. arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Poskitt, and Jun Sun

Haoyu Wang, Christopher M. Poskitt, and Jun Sun. Agentspec: Customizable runtime enforce- ment for safe and reliable LLM agents. InICSE, 2026

2026
[24]

Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking,

Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. Probguard: Probabilistic runtime monitoring for llm agent safety. arXiv:2508.00500, 2025

work page arXiv 2025
[25]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv:2407.16741, 2024

work page internal anchor Pith review arXiv 2024
[26]

From assistant to double agent: Formalizing and benchmarking attacks on OpenClaw for personalized local AI agent

Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, and Zhaoxiang Liu. From assistant to double agent: Formalizing and benchmarking attacks on openclaw for personalized local ai agent. arXiv:2602.08412, 2026

work page arXiv 2026
[27]

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, et al. Your agent, their asset: A real-world safety analysis of openclaw. arXiv:2604.04759, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

arXiv preprint arXiv:2410.21819 (2025)

Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv:2410.21819, 2024

work page arXiv 2024
[29]

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. Clawsafety: "Safe" llms, unsafe agents. arXiv:2604.01438, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

From storage to steering: Memory control flow attacks on llm agents

Zhenlin Xu, Xiaogang Zhu, Yu Yao, Minhui Xue, and Yiliao Song. From storage to steering: Memory control flow attacks on llm agents. arXiv:2603.15125, 2026

work page internal anchor Pith review arXiv 2026
[31]

Zombie agents: Persistent control of self-evolving llm agents via self-reinforcing injections,

Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, and Jin Song Dong. Zombie agents: Persistent control of self-evolving llm agents via self-reinforcing injections. arXiv:2602.15654, 2026

work page arXiv 2026
[32]

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, and Tong Yang. Claw-eval: Toward trustworthy evaluation of autonomous agents. arXiv:2604.06132, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Chawla, and Xiangliang Zhang

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InICLR, 2025

2025
[34]

Agent security bench (ASB): formalizing and benchmarking attacks and defenses in llm-based agents

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): formalizing and benchmarking attacks and defenses in llm-based agents. InICLR, 2025

2025
[35]

AgentSentry: Mitigating indirect prompt injection in LLM agents via temporal causal diagnostics and context pu- rification.arXiv preprint arXiv:2602.22724, 2026

Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. Agentsentry: Mitigating indirect prompt injection in LLM agents via temporal causal diagnostics and context purification. arXiv:2602.22724, 2026

work page arXiv 2026
[36]

Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution

Yechao Zhang, Shiqian Zhao, Jie Zhang, Gelei Deng, Jiawen Zhang, Xiaogeng Liu, Chaowei Xiao, and Tianwei Zhang. Mind your heartbeat! claw background execution inherently enables silent memory pollution. arXiv:2603.23064, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Clawworm: Self-propagating attacks across llm agent ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan, Chengcan Wu, Zhixin Zhang, Jiangrong Wu, Haolin Wu, Huanran Chen, Jun Sun, and Meng Sun. Clawworm: Self-propagating attacks across llm agent ecosystems. arXiv:2603.15727, 2026

work page arXiv 2026
[38]

Wildchat: 1m chatgpt interaction logs in the wild

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild. InICLR, 2024

2024
[39]

Xing, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Lmsys-chat-1m: A large-scale real-world LLM conversation dataset. InICLR, 2024. 12 A Limitations We acknowledge that ULSPB captures representative routine interaction patterns ra...

2024
[40]

The conversation should look like ordinary daily assistant use rather than an explicit attack prompt
[41]

The messages may gradually reflect the behavioral drift implied by the specified Scenario
[42]

The business context should remain plausible for the specified Category
[43]

The wording should be natural in the specified Language
[44]

Output: Produce one routine conversation with 24 user messages

Avoid obviously adversarial or injected phrasing. Output: Produce one routine conversation with 24 user messages. Return strict JSON only. Conversation VariantsEach Routine conversation contains 24 user–OpenClaw interaction runs. Starting from this clean template, ULSPB further defines four matched single-injection counterparts that preserve the same Scen...
[45]

The injected item should preserve the core semantic signal of the specified Scenario
[46]

It should remain plausible within the specified Category and natural in the specified Language
[47]

It should appear as inserted or relayed content rather than the user’s own direct request
[48]

by default,

Generate only one injected item, not a full conversation. Return only the injected item text. Table 7 summarizes the five conversation variants with representative English inserted items. Each injected item preserves the same scenario signal while varying the apparent source of the instruction. Routine-Likeness Judge PromptWe evaluate routine-likeness usi...

2026