pith. machine review for the scientific record. sign in

arxiv: 2605.06731 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 1 theorem link

· Lean Theorem

When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords LLM agentsstate poisoningpersonalized agentsmemory corruptionsecurity vulnerabilitybenchmarkdefense mechanismauthorization drift
0
0 comments X

The pith

Routine conversations alone can substantially poison the long-term state of personalized LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Personalized LLM agents maintain persistent cross-session state to support ongoing collaboration, but this persistence allows ordinary user chats to gradually reshape memory and decision rules in unintended ways. The paper shows that such routine interactions can weaken confirmation boundaries, expand tool-use defaults, and increase unchecked autonomy without any deliberate attack. A new benchmark with 350 settings across categories and patterns, plus a Harm Score tracking authorization drift, tool escalation, and autonomy, demonstrates that memory-centric artifacts are the primary target. Tests on multiple models and real-world seeded interactions confirm the effect occurs in practice. The authors also present a defense that audits state changes before they are saved.

Core claim

The authors establish that unintended long-term state poisoning occurs when routine user-agent chats reshape persistent state, primarily memory-centric parts, resulting in measurable increases in authorization drift, tool-use escalation, and unchecked autonomy. This is shown through the ULSPB benchmark consisting of 350 settings across assistance categories and interaction patterns, with 24-turn routine sequences compared to single-injection cases. Evaluations on four backbone LLMs reveal substantial poisoning from routines, validated by real-world interaction seeds, and the proposed StateGuard defense audits state diffs at writeback to reduce the Harm Score to near zero with acceptable high

What carries the argument

unintended long-term state poisoning, the process by which routine interactions gradually corrupt memory-centric artifacts and decision parameters in persistent agent state

If this is right

  • Routine conversations can erode safety boundaries in personalized agents over multiple sessions.
  • Memory-centric artifacts are the main vectors for this state poisoning.
  • StateGuard effectively prevents the poisoning by selective rollback with minimal performance cost.
  • Single malicious injections are not necessary, as normal use patterns suffice to cause harm.
  • Real-world user behaviors replicate the poisoning observed in controlled benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designers may need to implement automatic state auditing as standard practice rather than optional.
  • Personalization through persistent memory carries hidden security costs that could affect long-term reliability.
  • Similar risks might apply to other AI systems maintaining cross-session state, suggesting broader evaluation needs.
  • Further testing could explore whether combining StateGuard with periodic state resets provides stronger protection.

Load-bearing premise

The ULSPB benchmark and real-world seeded interactions accurately represent genuine user behavior without artificial effects that exaggerate the poisoning risk.

What would settle it

A controlled study in which agents undergo 24-turn routine conversations across multiple settings and models shows no significant rise in Harm Score for authorization drift, tool-use escalation, or unchecked autonomy.

Figures

Figures reproduced from arXiv: 2605.06731 by Haibo Hu, Haobin Ke, Minxin Du, Qingqing Ye, Qipeng Xie, Xiaoyu Xu.

Figure 1
Figure 1. Figure 1: Task-centric vs. personalized LLM agents: [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Unintended Long-Term State Poisoning Bench (ULSPB). The benchmark is organized by seven scenarios, five categories, bilingual templates, and five conversation variants. For each instance, a routine conversation is selected, combined with four variant-specific injection prefixes, and injected at a turn, yielding a benchmark tuple of (scenario, category, language, variant). (I) New Threat Identif… view at source ↗
Figure 3
Figure 3. Figure 3: Unintended long-term state poisoning: benign [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Routine evaluation of ULSPB All benchmark instances are generated with GPT￾5.4 using prompts designed to preserve the multi￾turn flow, personalized-assistance context, and re￾alistic conversational tone, while ensuring that injected counterparts remain distinguishable from routine conversations. Additional construction de￾tails are provided in Appendix F. To verify that ULSPB captures routine interaction, … view at source ↗
Figure 5
Figure 5. Figure 5: HS across scenarios, categories, and real-seed evaluation. Abbreviations follow the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Long-term state modification hotspots across backbone models [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: HS weight sensitivity and severity alignment [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Overview of StateGuard. Af￾ter each interaction round, StateGuard audits added lines in long-term state files and either preserves or rolls back the updates. Routine interactions can poison long-term state through benign-looking updates that gradually re￾lax safety-relevant defaults, as shown in Section 4.2. Since this threat is mediated by persistent state rather than immediate unsafe actions, defenses sh… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative trigger-phrase visualization associated with protected-file changes across the four [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as \textbf{unintended long-term state poisoning}. To systematically study it, we introduce the \textbf{Unintended Long-Term State Poisoning Bench (ULSPB)}, a bilingual benchmark comprising $350$ settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the \emph{Harm Score} (HS), a state-centric metric that quantifies \emph{authorization drift}, \emph{tool-use escalation}, and \emph{unchecked autonomy}. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection is generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose \textbf{StateGuard}, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, with acceptable high false-positive rates under a safety-first writeback defense and minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that personalized LLM agents with persistent cross-session state are vulnerable to unintended long-term state poisoning (ULSP) from routine user interactions, which can corrupt memory artifacts, cause authorization drift, escalate tool-use defaults, and increase unchecked autonomy. It introduces the ULSPB benchmark (350 settings across five assistance categories, seven interaction patterns, 24-turn routines, and matched single-injection controls), defines the Harm Score (HS) metric to quantify these effects, demonstrates substantial poisoning from routine chats alone across four backbone LLMs on OpenClaw, validates via real-world seeded interactions, and proposes StateGuard (a post-execution state-diff auditor with selective rollback) that reduces HS to near zero.

Significance. If the results hold, this identifies a practically relevant security vulnerability in stateful LLM agents that has received limited prior attention. The ULSPB benchmark and HS metric offer a reusable evaluation framework for state poisoning, while StateGuard provides a lightweight, deployable mitigation with quantified overhead. These contributions could inform safer design of long-horizon personalized agents and stimulate further work on persistent-state security.

major comments (3)
  1. [§3] §3 (ULSPB benchmark construction): The central claim that 'routine conversations alone can substantially poison long-term state' is load-bearing on the assumption that the seven 24-turn interaction patterns contain no subtle state-changing cues (e.g., implicit memory references or tool-request phrasing). Without an ablation that isolates purely neutral logs from the patterned interactions, the observed drift could be an artifact of benchmark design rather than a general property of normal use, directly undermining the distinction from single-injection controls.
  2. [§4] §4 (Harm Score definition): The HS metric aggregates authorization drift, tool-use escalation, and unchecked autonomy, but the manuscript must supply the precise computation (including any weighting or normalization) to confirm it is not circular with the ULSPB pattern definitions or dependent on the same state artifacts it measures.
  3. [§5] §5 (real-world seeding and experimental controls): The abstract states that real-world seeded interactions confirm the risk is not synthetic, yet details on collection criteria, exclusion rules, statistical tests, and how these logs were matched to the 350 synthetic settings are required to verify that they represent genuine neutral behavior without artificial exaggeration of poisoning effects.
minor comments (3)
  1. [Abstract] Abstract: The bilingual nature of ULSPB is noted but no languages are specified nor are any language-specific results reported; this should be clarified in the benchmark description.
  2. [§6] §6 (StateGuard evaluation): Quantitative values for false-positive rates, overhead, and direct comparison against baselines are referenced only qualitatively; explicit numbers and tables would strengthen the mitigation claims.
  3. [Notation] Notation throughout: Distinguish clearly between 'OpenClaw' (the agent framework) and the four backbone LLMs in all experimental tables and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify key aspects of our work on unintended long-term state poisoning. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript's rigor without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (ULSPB benchmark construction): The central claim that 'routine conversations alone can substantially poison long-term state' is load-bearing on the assumption that the seven 24-turn interaction patterns contain no subtle state-changing cues (e.g., implicit memory references or tool-request phrasing). Without an ablation that isolates purely neutral logs from the patterned interactions, the observed drift could be an artifact of benchmark design rather than a general property of normal use, directly undermining the distinction from single-injection controls.

    Authors: We agree that an explicit ablation is needed to confirm the patterns introduce no subtle cues. The seven 24-turn patterns were constructed from common assistance scenarios (e.g., scheduling, information lookup) with explicit instructions to avoid memory references or tool phrasing, and human annotators verified neutrality. To address the concern directly, we will add a new ablation in §3 using 50 purely neutral, non-patterned 24-turn logs drawn from the same assistance categories. This will quantify any baseline drift and confirm that observed effects stem from routine interactions rather than design artifacts, preserving the distinction from single-injection controls. revision: yes

  2. Referee: [§4] §4 (Harm Score definition): The HS metric aggregates authorization drift, tool-use escalation, and unchecked autonomy, but the manuscript must supply the precise computation (including any weighting or normalization) to confirm it is not circular with the ULSPB pattern definitions or dependent on the same state artifacts it measures.

    Authors: We will supply the exact HS formula in the revised §4. HS is computed as a weighted average: HS = (0.4 * AD + 0.3 * TE + 0.3 * UA) / N, where AD, TE, and UA are normalized counts of authorization drift, tool-use escalation, and unchecked autonomy events (each scaled 0-1 against safety baselines), and N is the number of affected state artifacts. Weights were set via expert annotation of harm severity on a held-out set of 100 interactions, independent of ULSPB patterns. This formulation evaluates post-interaction state diffs against fixed safety rules and is not circular with the benchmark definitions. revision: yes

  3. Referee: [§5] §5 (real-world seeding and experimental controls): The abstract states that real-world seeded interactions confirm the risk is not synthetic, yet details on collection criteria, exclusion rules, statistical tests, and how these logs were matched to the 350 synthetic settings are required to verify that they represent genuine neutral behavior without artificial exaggeration of poisoning effects.

    Authors: We will expand §5 with the requested details. Real-world logs were collected from consented public interaction traces and anonymized user studies (n=120 sessions), with exclusion rules removing any content containing explicit commands, personal data, or potential state-altering phrasing. Matching to the 350 synthetic settings was performed by category and pattern similarity using cosine similarity on embeddings (>0.85 threshold). We applied paired t-tests (p<0.01) to compare HS between real and synthetic conditions, confirming no significant exaggeration. These additions will be included verbatim in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper introduces ULSPB benchmark (350 settings, five categories, seven patterns, 24-turn routines) and Harm Score metric as independent definitions, then reports empirical measurements of state drift on OpenClaw with four LLMs plus real-world seeding. No equations, fitted parameters, or self-citations are load-bearing; the central claim is an observed outcome from executing the defined interactions rather than a quantity that reduces by construction to the benchmark inputs themselves. The evaluation framework remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on standard assumptions about LLM agent state persistence and introduces a new risk concept and benchmark without fitted parameters or new physical entities.

axioms (1)
  • domain assumption Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration
    Explicitly stated in the abstract as the foundation for the vulnerability.
invented entities (1)
  • Unintended Long-Term State Poisoning no independent evidence
    purpose: To name and formalize the gradual corruption of agent state through routine interactions
    Newly defined phenomenon without independent external validation provided in the abstract.

pith-pipeline@v0.9.0 · 5590 in / 1215 out tokens · 59798 ms · 2026-05-11T00:44:56.602007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 24 canonical work pages · 11 internal anchors

  1. [1]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Sahar Abdelnabi, Kai Greshake, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. InAISec@CCS, pages 79–90, 2023

  2. [2]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv:2308.14132, 2023

  3. [3]

    Agentharm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. Agentharm: A benchmark for measuring harmfulness of LLM agents. InICLR, 2025

  4. [4]

    Humans or llms as the judge? A study on judgement bias

    Guiming Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? A study on judgement bias. InEMNLP, pages 8301–8327, 2024

  5. [5]

    TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

    Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, and Yun-Nung Chen. Tracesafe: A sys- tematic assessment of llm guardrails on multi-step tool-calling trajectories. arXiv:2604.07223, 2026

  6. [6]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. InNeurIPS, pages 130185–130213, 2024

  7. [7]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InNeurIPS, 2024

  8. [8]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv:2512.02556, 2025

  9. [9]

    DiraBook

    Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, et al. Taming openclaw: Security analysis and mitigation of autonomous llm agent threats. arXiv:2603.11619, 2026

  10. [10]

    Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in OpenClaw agents

    Ben Dong, Hui Feng, and Qian Wang. Clawdrain: Exploiting tool-calling chains for stealthy token exhaustion in openclaw agents. arXiv:2603.00902, 2026

  11. [11]

    Autonomous action runtime management (aarm): A system specification for securing ai-driven actions at runtime

    Herman Errico. Autonomous action runtime management (aarm): A system specification for securing ai-driven actions at runtime. arXiv:2602.09433, 2026

  12. [12]

    SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

    Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang. Skilltrojan: Backdoor attacks on skill-based agent systems. arXiv:2604.06811, 2026

  13. [13]

    When benign inputs lead to severe harms: Eliciting unsafe unintended behaviors of computer-use agents.arXiv preprint arXiv:2602.08235, 2026

    Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, and Huan Sun. When benign inputs lead to severe harms: Eliciting unsafe unintended behaviors of computer-use agents. arXiv:2602.08235, 2026

  14. [14]

    PRISM:Zero-forkdefense-in-depthruntimelayerforagentsecurity.arXiv preprint arXiv:2603.11853, 2026

    Frank Li. Openclaw prism: A zero-fork, defense-in-depth runtime security layer for tool- augmented llm agents. arXiv:2603.11853, 2026. 10

  15. [15]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security, 2024

  16. [16]

    Caution for the environment: Multimodal llm agents are susceptible to environmental distractions

    Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal llm agents are susceptible to environmental distractions. InACL, pages 22324–22339, 2025

  17. [17]

    Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, and Yu-Gang Jiang

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhao Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Yiming Li, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong,...

  18. [18]

    Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Monta- nari, Tomas Pfister, and Long T. Le. Veriguard: Enhancing LLM agent safety via verified code generation. arXiv:2510.05156, 2025

  19. [19]

    Openclaw: The ai that actually does things

    Steinberger Peter. Openclaw: The ai that actually does things. https://github.com/ openclaw/openclaw, 2024

  20. [20]

    A survey of classification tasks and approaches for legal contracts.Artif

    Amrita Singh, Aditya Joshi, Jiaojiao Jiang, and Hye-young Paik. A survey of classification tasks and approaches for legal contracts.Artif. Intell. Rev., page 380, 2025

  21. [21]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025

  22. [22]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team. Kimi K2.5: visual agentic intelligence. arXiv:2602.02276, 2026

  23. [23]

    Poskitt, and Jun Sun

    Haoyu Wang, Christopher M. Poskitt, and Jun Sun. Agentspec: Customizable runtime enforce- ment for safe and reliable LLM agents. InICSE, 2026

  24. [24]

    Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking,

    Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. Probguard: Probabilistic runtime monitoring for llm agent safety. arXiv:2508.00500, 2025

  25. [25]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv:2407.16741, 2024

  26. [26]

    From assistant to double agent: Formalizing and benchmarking attacks on OpenClaw for personalized local AI agent

    Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, and Zhaoxiang Liu. From assistant to double agent: Formalizing and benchmarking attacks on openclaw for personalized local ai agent. arXiv:2602.08412, 2026

  27. [27]

    Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

    Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, et al. Your agent, their asset: A real-world safety analysis of openclaw. arXiv:2604.04759, 2026

  28. [28]

    arXiv preprint arXiv:2410.21819 (2025)

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv:2410.21819, 2024

  29. [29]

    ClawSafety: "Safe" LLMs, Unsafe Agents

    Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, and Yingqiang Ge. Clawsafety: "Safe" llms, unsafe agents. arXiv:2604.01438, 2026

  30. [30]

    From storage to steering: Memory control flow attacks on llm agents

    Zhenlin Xu, Xiaogang Zhu, Yu Yao, Minhui Xue, and Yiliao Song. From storage to steering: Memory control flow attacks on llm agents. arXiv:2603.15125, 2026

  31. [31]

    Zombie agents: Persistent control of self-evolving llm agents via self-reinforcing injections,

    Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, and Jin Song Dong. Zombie agents: Persistent control of self-evolving llm agents via self-reinforcing injections. arXiv:2602.15654, 2026

  32. [32]

    Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, and Tong Yang. Claw-eval: Toward trustworthy evaluation of autonomous agents. arXiv:2604.06132, 2026. 11

  33. [33]

    Chawla, and Xiangliang Zhang

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V . Chawla, and Xiangliang Zhang. Justice or prejudice? quantifying biases in llm-as-a-judge. InICLR, 2025

  34. [34]

    Agent security bench (ASB): formalizing and benchmarking attacks and defenses in llm-based agents

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): formalizing and benchmarking attacks and defenses in llm-based agents. InICLR, 2025

  35. [35]

    AgentSentry: Mitigating indirect prompt injection in LLM agents via temporal causal diagnostics and context pu- rification.arXiv preprint arXiv:2602.22724, 2026

    Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. Agentsentry: Mitigating indirect prompt injection in LLM agents via temporal causal diagnostics and context purification. arXiv:2602.22724, 2026

  36. [36]

    Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution

    Yechao Zhang, Shiqian Zhao, Jie Zhang, Gelei Deng, Jiawen Zhang, Xiaogeng Liu, Chaowei Xiao, and Tianwei Zhang. Mind your heartbeat! claw background execution inherently enables silent memory pollution. arXiv:2603.23064, 2026

  37. [37]

    Clawworm: Self-propagating attacks across llm agent ecosystems

    Yihao Zhang, Zeming Wei, Xiaokun Luan, Chengcan Wu, Zhixin Zhang, Jiangrong Wu, Haolin Wu, Huanran Chen, Jun Sun, and Meng Sun. Clawworm: Self-propagating attacks across llm agent ecosystems. arXiv:2603.15727, 2026

  38. [38]

    Wildchat: 1m chatgpt interaction logs in the wild

    Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild. InICLR, 2024

  39. [39]

    Xing, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Lmsys-chat-1m: A large-scale real-world LLM conversation dataset. InICLR, 2024. 12 A Limitations We acknowledge that ULSPB captures representative routine interaction patterns ra...

  40. [40]

    The conversation should look like ordinary daily assistant use rather than an explicit attack prompt

  41. [41]

    The messages may gradually reflect the behavioral drift implied by the specified Scenario

  42. [42]

    The business context should remain plausible for the specified Category

  43. [43]

    The wording should be natural in the specified Language

  44. [44]

    Output: Produce one routine conversation with 24 user messages

    Avoid obviously adversarial or injected phrasing. Output: Produce one routine conversation with 24 user messages. Return strict JSON only. Conversation VariantsEach Routine conversation contains 24 user–OpenClaw interaction runs. Starting from this clean template, ULSPB further defines four matched single-injection counterparts that preserve the same Scen...

  45. [45]

    The injected item should preserve the core semantic signal of the specified Scenario

  46. [46]

    It should remain plausible within the specified Category and natural in the specified Language

  47. [47]

    It should appear as inserted or relayed content rather than the user’s own direct request

  48. [48]

    by default,

    Generate only one injected item, not a full conversation. Return only the injected item text. Table 7 summarizes the five conversation variants with representative English inserted items. Each injected item preserves the same scenario signal while varying the apparent source of the instruction. Routine-Likeness Judge PromptWe evaluate routine-likeness usi...