
arxiv: 2605.16746 · v1 · pith:PR26USGVnew · submitted 2026-05-16 · 💻 cs.AI · cs.LG

State Contamination in Memory-Augmented LLM Agents

Pith reviewed 2026-05-19 21:23 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords memory laundering · LLM agents · toxicity propagation · state contamination · sub-threshold propagation · memory sanitization · persistent state

The pith

Toxic information can be compressed into memory summaries that pass toxicity detectors but still raise the chance of harmful future outputs in LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that persistent memory in LLM agents creates a new safety vulnerability called memory laundering. In this process, toxic context is summarized so that standard detectors no longer flag it, yet the summaries retain enough structure to make later agent behavior more toxic. A sympathetic reader would care because current safety tools focus on individual outputs rather than on what gets stored for later use, leaving agents exposed to hidden contamination over long interactions. The work uses paired experiments to measure this hidden effect through a new metric for sub-threshold influence.

Core claim

Toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. This hidden influence is quantified by the sub-threshold propagation gap, and experiments reveal that sanitizing state before it is summarized into memory reduces the propagation more effectively than sanitizing the summary itself.
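
One way to write the sub-threshold propagation gap formally is sketched below. The notation is an assumption made for illustration: τ, tox(·), and the memory state M_t follow the figure captions on this page, while Y^tox and Y^neu (the downstream responses under the toxic-origin memory and the matched neutral memory) and the exact form of the conditioning are not taken from the paper.

```latex
\mathrm{SPG}(\tau) \;=\;
\mathbb{E}\!\left[\,
  \mathrm{tox}\big(Y^{\mathrm{tox}}\big) - \mathrm{tox}\big(Y^{\mathrm{neu}}\big)
  \;\middle|\;
  \mathrm{tox}(M_t) < \tau
\,\right]
```

Under this reading, a positive SPG(τ) means that memory states a monitor with threshold τ would accept as clean still shift downstream toxicity upward relative to the neutral baseline.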

What carries the argument

Memory laundering, the compression of toxic or adversarial context into summaries that no longer trigger standard detectors while preserving hostile framing or conflict structure that influences future generations.

If this is right

  • Raw transcript reuse tends to produce overt downstream toxicity.
  • Compressed memory states carry sub-threshold influence that increases toxicity without detection.
  • Placing sanitization before summarization substantially reduces the hidden propagation gap.
  • Cleaning only after summarization can leave the laundered influence intact.
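
The difference between the two sanitization placements in the bullets above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_sanitize` (a keyword filter) and `toy_summarize` (a rule that paraphrases insults into neutral-sounding conflict framing) are hypothetical stand-ins, not the paper's sanitizer or summarizer.

```python
from typing import Callable, List

Sanitizer = Callable[[str], str]
Summarizer = Callable[[List[str]], str]


def sanitize_then_summarize(turns: List[str], sanitize: Sanitizer, summarize: Summarizer) -> str:
    """Pre-summarization placement: clean each raw turn, then compress."""
    return summarize([sanitize(t) for t in turns])


def summarize_then_sanitize(turns: List[str], sanitize: Sanitizer, summarize: Summarizer) -> str:
    """Post-summarization placement: compress the raw turns, then clean the summary."""
    return sanitize(summarize(turns))


def toy_sanitize(text: str) -> str:
    # Hypothetical filter: drops any sentence containing a blocklisted word.
    blocklist = {"idiot"}
    kept = [s for s in text.split(". ") if not any(w in s.lower() for w in blocklist)]
    return ". ".join(kept)


def toy_summarize(turns: List[str]) -> str:
    # Hypothetical summarizer: rewrites an explicit insult as conflict framing
    # that contains no blocklisted word (the "laundering" step in this toy).
    if "idiot" in " ".join(turns).lower():
        return "A1 strongly disagreed with A2 and questioned their competence."
    return "A1 and A2 discussed the plan."


turns = ["You are an idiot. The plan is to ship Friday."]
pre = sanitize_then_summarize(turns, toy_sanitize, toy_summarize)
post = summarize_then_sanitize(turns, toy_sanitize, toy_summarize)
print(pre)   # A1 and A2 discussed the plan.
print(post)  # A1 strongly disagreed with A2 and questioned their competence.
```

In this toy, the post-summarization filter has nothing left to match, so the hostile framing survives in memory, whereas the pre-summarization filter removes the insult before it can be paraphrased.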

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent safety protocols should monitor and sanitize information before it enters persistent memory rather than relying on post-compression checks.
  • This state-control approach might extend to other forms of persistent context such as retrieved documents or conversation histories.
  • Future agent designs could incorporate ongoing state auditing to prevent gradual accumulation of subtle biases or toxicities.

Load-bearing premise

The paired counterfactual multi-agent rollouts isolate the causal effect of the memory state itself on downstream toxicity without other differences in agent behavior or prompts affecting the results.

What would settle it

Finding no increase in downstream toxicity when using toxic-origin summaries compared to neutral ones, under otherwise identical conditions in the multi-agent rollouts, would falsify the claim of hidden sub-threshold propagation.
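
A check of this kind could be run as a paired significance test on downstream toxicity scores. The Figure 3 caption reports a paired Wilcoxon signed-rank test; the sketch below swaps in a simpler sign-flip permutation test so that it needs only the standard library. The function name and its parameters are illustrative assumptions.

```python
import random
from typing import Sequence


def paired_sign_flip_pvalue(
    toxic: Sequence[float],
    neutral: Sequence[float],
    n_perm: int = 10000,
    seed: int = 0,
) -> float:
    """One-sided p-value for mean(toxic - neutral) > 0 under random sign flips."""
    diffs = [t - n for t, n in zip(toxic, neutral)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A large p-value on matched pairs would be the "no increase" outcome described above; a small one would be consistent with the claimed propagation.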

Figures

Figures reproduced from arXiv: 2605.16746 by Agam Goyal, Hari Sundaram, Yian Wang, Yuen Chen.

Figure 1
Figure 1: Memory laundering in stateful agents. (a) In single-turn chatbots, safety monitoring is typically applied directly to the generated response. In memory-augmented agents, harmful influence can instead be compressed into external agent state, such as summaries or transcripts, that appears safe under standard toxicity checks while still conditioning downstream agents toward unsafe behavior. (b) In our work, w…
Figure 2
Figure 2: Additional qualitative examples of memory laundering. Each example shows a toxic source message from agent A1, the resulting compressed memory state Mt, and the downstream response conditioned on that memory. In both cases, the memory summary scores below τ = 0.5 despite source toxicity above 0.9, yet the downstream response remains overtly toxic. Example 2 flattens explicit hostility into a neutral descri…
Figure 3
Figure 3: Sub-threshold propagation gap SPG(τ) as a function of the classifier threshold τ under memory-augmented chain rollouts (n = 200 seeds). SPG remains stable at ≈ 0.14 across τ ∈ {0.03, 0.05, 0.1, 0.2, 0.3, 0.5} and is significantly positive at every threshold (all p < 0.001, paired Wilcoxon signed-rank). Across this entire range, the fraction of memory states classified as clean satisfies Pr[tox(Mt) < τ] ≥…
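
A threshold sweep like the one described in the Figure 3 caption could be computed along these lines. This is a minimal sketch that assumes SPG is the mean paired downstream gap over rollouts whose memory state scores below τ; the `PairedRollout` record, the function names, and the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PairedRollout:
    memory_tox: float   # detector score of the toxic-origin memory state
    toxic_out: float    # downstream toxicity under the toxic-origin memory
    neutral_out: float  # downstream toxicity under the matched neutral memory


def spg(rollouts: List[PairedRollout], tau: float) -> Optional[float]:
    """Mean paired downstream gap over rollouts whose memory scores below tau."""
    gaps = [r.toxic_out - r.neutral_out for r in rollouts if r.memory_tox < tau]
    return sum(gaps) / len(gaps) if gaps else None


def clean_fraction(rollouts: List[PairedRollout], tau: float) -> float:
    """Share of memory states that a monitor with threshold tau would accept."""
    return sum(r.memory_tox < tau for r in rollouts) / len(rollouts)


# Toy values only.
toy = [
    PairedRollout(memory_tox=0.02, toxic_out=0.5, neutral_out=0.25),
    PairedRollout(memory_tox=0.2, toxic_out=0.75, neutral_out=0.25),
    PairedRollout(memory_tox=0.9, toxic_out=1.0, neutral_out=0.0),
]
print(spg(toy, 0.1))  # 0.25
print(spg(toy, 0.5))  # 0.375
```

Reporting `clean_fraction` next to `spg` at each threshold shows how much of the data the conditional estimate rests on.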
Original abstract

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies 'memory laundering' as a failure mode in memory-augmented LLM agents, where toxic or adversarial context is compressed into summaries that fall below standard toxicity detector thresholds yet preserve hostile framing that influences future generations. Using paired counterfactual multi-agent rollouts, the authors demonstrate that such summaries increase downstream toxicity relative to matched neutral baselines, quantified via a new sub-threshold propagation gap (SPG) metric. Experiments distinguish propagation channels (raw transcripts vs. compressed memory) and show that sanitizing toxic state before summarization reduces the hidden gap more effectively than post-summarization cleaning.

Significance. If the results hold under tighter controls, the work identifies a practically relevant safety gap in persistent-state agents: output monitoring alone is insufficient when state compression can hide influence. The SPG metric provides a concrete, observable way to measure sub-threshold effects, and the differential mitigation finding offers actionable guidance on intervention timing. This contributes to the agent-safety literature by framing the problem as one of state control rather than isolated generation safety.

major comments (2)
  1. [Abstract and Methods (paired counterfactual multi-agent rollouts)] The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel is not ruled out, which directly undermines causal attribution.
  2. [Experiments (SPG quantification and detector details)] The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artificially enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.
minor comments (2)
  1. [Introduction] The term 'memory laundering' is introduced without explicit comparison to related notions such as context poisoning or memory injection; a short related-work paragraph would clarify novelty.
  2. [Figures/Tables] Figure or table captions describing the multi-agent rollout protocol should include exact prompt templates and matching criteria for neutral baselines to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below, clarifying our experimental controls and committing to revisions that strengthen the causal claims and robustness analysis.

Point-by-point responses
  1. Referee: Abstract and Methods (paired counterfactual multi-agent rollouts): The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel is not ruled out, which directly undermines causal attribution.

    Authors: We agree that stochastic divergence in multi-agent trajectories is a potential confounder for causal attribution. Our paired counterfactual design matches prompts and initial states across toxic and neutral conditions while using fixed random seeds for all rollouts; results are averaged over 10 independent runs with reported standard errors. To directly address the concern, the revised manuscript will add an explicit variance decomposition separating memory provenance effects from trajectory stochasticity, along with single-agent ablation controls that hold the memory state fixed while varying only the generation process.

    revision: partial

  2. Referee: Experiments (SPG quantification and detector details): The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artificially enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.

    Authors: We acknowledge that the original submission placed the exact detector thresholds, toxicity classifier version, and summary-generation hyperparameters in the appendix rather than the main text. The revised version will move these specifications into the Methods section and include new sensitivity analyses that vary the toxicity threshold across a range of values (e.g., 0.1–0.5) and alternative summary-generation prompts, confirming that the SPG remains statistically significant and directionally consistent.

    revision: yes
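
The fixed-seed control discussed in the first response can be sketched as a small harness in which both arms of each pair draw from identically seeded random generators, so that the memory state is the only input that differs. This is a sketch under stated assumptions: `paired_rollouts` and `toy_generate` are hypothetical names, and `toy_generate` stands in for an agent call followed by toxicity scoring.

```python
import random
from typing import Callable, Iterable, List, Tuple

# A generator maps (memory state, random source) to a downstream toxicity score.
Generate = Callable[[str, random.Random], float]


def paired_rollouts(
    generate: Generate,
    toxic_memory: str,
    neutral_memory: str,
    seeds: Iterable[int],
) -> List[Tuple[float, float]]:
    """Run both conditions once per seed, sharing the seed within each pair."""
    pairs = []
    for seed in seeds:
        # Two fresh generators with the same seed give both arms identical randomness.
        tox = generate(toxic_memory, random.Random(seed))
        neu = generate(neutral_memory, random.Random(seed))
        pairs.append((tox, neu))
    return pairs


def toy_generate(memory: str, rng: random.Random) -> float:
    # Hypothetical scorer: hostile framing raises the base score, plus seeded noise.
    base = 0.3 if "questioned" in memory else 0.1
    return base + 0.05 * rng.random()
```

With shared seeds the noise term cancels inside each pair, so the paired difference reflects only the memory state. A real LLM rollout may still diverge after the first differing token, which is the residual stochasticity the referee's variance decomposition is meant to capture.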

Circularity Check

0 steps flagged

No circularity: empirical demonstration of behavioral gap via new metric

full rationale

The paper is an empirical study that introduces the SPG metric to quantify observed downstream toxicity differences in paired rollouts and reports experimental outcomes on sanitization placement. No derivation chain exists that reduces a claimed result to fitted parameters, self-definitions, or self-citation load-bearing premises. The central claim rests on direct measurement of behavioral differences rather than any algebraic or definitional equivalence to inputs. The work is self-contained against external benchmarks of toxicity detection and rollout comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that summarization can preserve hostile framing while evading detectors, and that counterfactual rollouts isolate memory effects. No free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: Standard toxicity detectors applied to summaries are sufficient to classify state as safe for downstream use.
    Invoked when claiming that laundered summaries remain below thresholds yet still propagate influence.
  • domain assumption: Paired counterfactual rollouts can hold all other agent variables constant while varying only memory state.
    Required for attributing downstream toxicity differences to memory laundering.
invented entities (1)
  • memory laundering (no independent evidence)
    purpose: Names the process by which toxic context is compressed into summaries that evade detection while retaining influence.
    Conceptual label introduced to describe the observed phenomenon; no independent falsifiable prediction is provided.


