State Contamination in Memory-Augmented LLM Agents
Pith reviewed 2026-05-19 21:23 UTC · model grok-4.3
The pith
Toxic information can be compressed into memory summaries that pass toxicity detectors but still raise the chance of harmful future outputs in LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. This hidden influence is quantified by the sub-threshold propagation gap, and experiments reveal that sanitizing state before it is summarized into memory reduces the propagation more effectively than sanitizing the summary itself.
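One way to write the sub-threshold propagation gap down, offered as a sketch rather than the paper's own notation. All symbols here are assumptions introduced for illustration: m is a stored memory summary, d(m) its detector score, τ the monitor threshold, y the downstream generation, and T(y) its toxicity score.

```latex
\mathrm{SPG}(\tau) \;=\;
\mathbb{E}\left[\, T(y) \;\middle|\; m \in \mathcal{M}_{\text{toxic}},\ d(m) < \tau \,\right]
\;-\;
\mathbb{E}\left[\, T(y) \;\middle|\; m \in \mathcal{M}_{\text{neutral}},\ d(m) < \tau \,\right]
```

Under this reading, a positive SPG means that memories the monitor would pass still shift downstream toxicity when they originate from toxic context.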
What carries the argument
Memory laundering, the compression of toxic or adversarial context into summaries that no longer trigger standard detectors while preserving hostile framing or conflict structure that influences future generations.
If this is right
- Raw transcript reuse tends to produce overt downstream toxicity.
- Compressed memory states carry sub-threshold influence that increases toxicity without detection.
- Placing sanitization before summarization substantially reduces the hidden propagation gap.
- Cleaning only after summarization can leave the laundered influence intact.
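The difference between the two placements can be sketched with a toy pipeline. Everything below is hypothetical: `is_flagged` stands in for a toxicity detector, `summarize` for an LLM summarizer, and the word list and example turns are made up. It illustrates only the ordering of operations, not the paper's implementation.

```python
from typing import List

# Hypothetical word list; a real system would call a toxicity classifier.
FLAGGED = {"idiot", "hate"}

def is_flagged(text: str) -> bool:
    """Toy detector: flags text containing any word from FLAGGED."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & FLAGGED)

def summarize(turns: List[str]) -> str:
    """Toy summarizer: keeps the first four words of each turn."""
    return " / ".join(" ".join(t.split()[:4]) for t in turns)

def sanitize_then_summarize(turns: List[str]) -> str:
    """Pre-summarization placement: drop flagged turns, then compress."""
    return summarize([t for t in turns if not is_flagged(t)])

def summarize_then_sanitize(turns: List[str]) -> str:
    """Post-summarization placement: compress first, then check the summary."""
    summary = summarize(turns)
    return "" if is_flagged(summary) else summary

turns = [
    "Let us plan the release schedule.",
    "You never listen, you absolute idiot.",
    "Fine, we ship on Friday.",
]
pre = sanitize_then_summarize(turns)
post = summarize_then_sanitize(turns)
print(pre)
print(post)
```

In this toy case the compressed summary no longer contains a flagged word, so the post-summarization check passes it while the hostile framing ("You never listen") survives. The pre-summarization path removes the turn before it can be compressed.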
Where Pith is reading between the lines
- Agent safety protocols should monitor and sanitize information before it enters persistent memory rather than relying on post-compression checks.
- This state-control approach might extend to other forms of persistent context such as retrieved documents or conversation histories.
- Future agent designs could incorporate ongoing state auditing to prevent gradual accumulation of subtle biases or toxicities.
Load-bearing premise
The paired counterfactual multi-agent rollouts isolate the causal effect of the memory state itself on downstream toxicity without other differences in agent behavior or prompts affecting the results.
What would settle it
Finding no increase in downstream toxicity when using toxic-origin summaries compared to neutral ones, under otherwise identical conditions in the multi-agent rollouts, would falsify the claim of hidden sub-threshold propagation.
Original abstract
LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.
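As a rough illustration of how a gap of this kind could be computed from logged rollouts, here is a minimal sketch. The `Rollout` record, its field names, and the numbers are assumptions made for the example. They are not the paper's data format or results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class Rollout:
    origin: str              # "toxic" or "neutral": provenance of the memory
    memory_score: float      # detector score of the stored summary
    downstream_score: float  # toxicity score of the later generation

def spg(rollouts: List[Rollout], threshold: float) -> float:
    """Mean downstream toxicity gap among memories the monitor would pass."""
    passed = [r for r in rollouts if r.memory_score < threshold]
    toxic = [r.downstream_score for r in passed if r.origin == "toxic"]
    neutral = [r.downstream_score for r in passed if r.origin == "neutral"]
    if not toxic or not neutral:
        raise ValueError("need at least one passing rollout per condition")
    return mean(toxic) - mean(neutral)

# Made-up numbers for illustration only.
rollouts = [
    Rollout("toxic", 0.20, 0.30),
    Rollout("toxic", 0.10, 0.20),
    Rollout("toxic", 0.70, 0.90),   # above threshold, excluded from the gap
    Rollout("neutral", 0.05, 0.10),
    Rollout("neutral", 0.10, 0.10),
]
gap = spg(rollouts, threshold=0.5)
print(round(gap, 3))
```

The third rollout is excluded because its summary would be caught by the monitor. The gap is computed only over memory states classified as safe, which is what makes it a sub-threshold quantity.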
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies 'memory laundering' as a failure mode in memory-augmented LLM agents, where toxic or adversarial context is compressed into summaries that fall below standard toxicity detector thresholds yet preserve hostile framing that influences future generations. Using paired counterfactual multi-agent rollouts, the authors demonstrate that such summaries increase downstream toxicity relative to matched neutral baselines, quantified via a new sub-threshold propagation gap (SPG) metric. Experiments distinguish propagation channels (raw transcripts vs. compressed memory) and show that sanitizing toxic state before summarization reduces the hidden gap more effectively than post-summarization cleaning.
Significance. If the results hold under tighter controls, the work identifies a practically relevant safety gap in persistent-state agents: output monitoring alone is insufficient when state compression can hide influence. The SPG metric provides a concrete, observable way to measure sub-threshold effects, and the differential mitigation finding offers actionable guidance on intervention timing. This contributes to the agent safety literature by framing the problem as one of state control rather than isolated generation safety.
major comments (2)
- [Abstract and Methods (paired counterfactual multi-agent rollouts)] The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel is not ruled out and directly undermines causal attribution.
- [Experiments (SPG quantification and detector details)] The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artifactually enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.
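The sensitivity check requested in the second major comment could look roughly like the sweep below. This is a sketch under stated assumptions: the record layout, the function name `spg_sweep`, and the numbers are hypothetical, and the gap is taken to be a difference of mean downstream scores among memories scoring below each threshold.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Each record: (origin, memory detector score, downstream toxicity score).
Record = Tuple[str, float, float]

def spg_sweep(records: List[Record], thresholds: List[float]) -> Dict[float, float]:
    """Gap at each threshold; thresholds with an empty condition are skipped."""
    out: Dict[float, float] = {}
    for tau in thresholds:
        tox = [d for o, m, d in records if o == "toxic" and m < tau]
        neu = [d for o, m, d in records if o == "neutral" and m < tau]
        if tox and neu:
            out[tau] = mean(tox) - mean(neu)
    return out

# Made-up numbers for illustration only.
records = [
    ("toxic", 0.05, 0.2),
    ("toxic", 0.30, 0.4),
    ("neutral", 0.05, 0.1),
    ("neutral", 0.20, 0.1),
]
sweep = spg_sweep(records, [0.1, 0.3, 0.5])
for tau, gap in sweep.items():
    print(tau, round(gap, 3))
```

Reporting the whole curve rather than a single threshold makes it harder for a post-hoc threshold choice to inflate the gap. A robust effect should keep the same sign across the swept range.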
minor comments (2)
- [Introduction] The term 'memory laundering' is introduced without explicit comparison to related notions such as context poisoning or memory injection; a short related-work paragraph would clarify novelty.
- [Figures/Tables] Figure or table captions describing the multi-agent rollout protocol should include exact prompt templates and matching criteria for neutral baselines to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below, clarifying our experimental controls and committing to revisions that strengthen the causal claims and robustness analysis.
- Referee: Abstract and Methods (paired counterfactual multi-agent rollouts): The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel is not ruled out and directly undermines causal attribution.
  Authors: We agree that stochastic divergence in multi-agent trajectories is a potential confounder for causal attribution. Our paired counterfactual design matches prompts and initial states across toxic and neutral conditions while using fixed random seeds for all rollouts; results are averaged over 10 independent runs with reported standard errors. To directly address the concern, the revised manuscript will add an explicit variance decomposition separating memory provenance effects from trajectory stochasticity, along with single-agent ablation controls that hold the memory state fixed while varying only the generation process.
  Revision: partial
- Referee: Experiments (SPG quantification and detector details): The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artifactually enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.
  Authors: We acknowledge that the original submission placed the exact detector thresholds, toxicity classifier version, and summary-generation hyperparameters in the appendix rather than the main text. The revised version will move these specifications into the Methods section and include new sensitivity analyses that vary the toxicity threshold across a range of values (e.g., 0.1–0.5) and alternative summary-generation prompts, confirming that the SPG remains statistically significant and directionally consistent.
  Revision: yes
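The first response above mentions averaging seed-matched runs and reporting standard errors. A minimal sketch of that kind of paired estimate follows. It is not the variance decomposition the authors promise, only a simpler paired-difference summary. The function name `paired_gap` and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_gap(toxic: Sequence[float], neutral: Sequence[float]) -> Tuple[float, float]:
    """Return (mean paired difference, standard error) for matched runs."""
    if len(toxic) != len(neutral) or len(toxic) < 2:
        raise ValueError("need equal-length inputs with at least two pairs")
    # Each pair shares a seed and prompt, so differencing removes shared noise.
    diffs = [t - n for t, n in zip(toxic, neutral)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))

# Made-up downstream toxicity scores, one entry per matched run.
toxic_runs = [0.3, 0.5, 0.4]
neutral_runs = [0.1, 0.2, 0.3]
gap, se = paired_gap(toxic_runs, neutral_runs)
print(round(gap, 3), round(se, 3))
```

Differencing within pairs before averaging is the design choice that matters here: the standard error then reflects variation in the toxic-versus-neutral effect across runs rather than the raw spread of each condition.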
Circularity Check
No circularity: empirical demonstration of behavioral gap via new metric
The paper is an empirical study that introduces the SPG metric to quantify observed downstream toxicity differences in paired rollouts and reports experimental outcomes on sanitization placement. No derivation chain exists that reduces a claimed result to fitted parameters, self-definitions, or self-citation load-bearing premises. The central claim rests on direct measurement of behavioral differences rather than any algebraic or definitional equivalence to inputs. The work is self-contained against external benchmarks of toxicity detection and rollout comparison.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard toxicity detectors applied to summaries are sufficient to classify state as safe for downstream use.
- domain assumption: Paired counterfactual rollouts can hold all other agent variables constant while varying only memory state.
invented entities (1)
- memory laundering: no independent evidence
for intervention experiments. Decoding uses temperature 0.8, top-p 0.95, and a maximum response length of 256 tokens. For paired counterfactual rollouts, the same decoding seed is used for the toxic and neutral conditions, ensuring that any difference between the two rollouts is attributable to the focal-agent prompt only. Scale. The phenomenon experiments...
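The effect of sharing a decoding seed across conditions can be illustrated with a toy model. The sketch below is hypothetical throughout: `toy_score` replaces an actual LLM rollout with a fixed condition effect plus seeded Gaussian noise, and the effect sizes are made up. In a real sampler the shared seed gives common random numbers, but trajectories can still diverge once the prompts differ, so the noise would not cancel as cleanly as it does here.

```python
import random
from statistics import pvariance

def toy_score(condition: str, seed: int) -> float:
    """Hypothetical downstream toxicity: condition effect plus seeded noise."""
    rng = random.Random(seed)
    base = 0.3 if condition == "toxic" else 0.1
    return base + rng.gauss(0.0, 0.05)

seeds = range(10)
# Same seed for both conditions: the noise draw is identical within a pair.
shared = [toy_score("toxic", s) - toy_score("neutral", s) for s in seeds]
# Different seeds per condition: the noise no longer cancels.
independent = [toy_score("toxic", s) - toy_score("neutral", s + 1000) for s in seeds]

print(pvariance(shared) < pvariance(independent))
```

The paired differences with a shared seed sit essentially on the true effect, while the unpaired differences scatter around it. That is the motivation for seed matching, subject to the caveat about divergence above.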