State Contamination in Memory-Augmented LLM Agents

Agam Goyal; Hari Sundaram; Yian Wang; Yuen Chen

Toxic information can be compressed into memory summaries that pass toxicity detectors but still raise the chance of harmful future outputs in LLM agents.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-19 21:23 UTC pith:PR26USGV

Load-bearing objection: The paper flags how toxic context can hide in compressed memory summaries below detector thresholds while still raising downstream toxicity, with better results from sanitizing before compression than after.

arxiv 2605.16746 v1 pith:PR26USGV submitted 2026-05-16 cs.AI cs.LG

keywords: memory laundering · LLM agents · toxicity propagation · state contamination · sub-threshold propagation · memory sanitization · persistent state

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that persistent memory in LLM agents creates a new safety vulnerability called memory laundering. In this process, toxic context is summarized so that standard detectors no longer flag it, yet the summaries retain enough structure to make later agent behavior more toxic. A sympathetic reader would care because current safety tools focus on individual outputs rather than on what gets stored for later use, leaving agents exposed to hidden contamination over long interactions. The work uses paired experiments to measure this hidden effect through a new metric for sub-threshold influence.

Core claim

Toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. This hidden influence is quantified by the sub-threshold propagation gap, and experiments reveal that sanitizing state before it is summarized into memory reduces the propagation more effectively than sanitizing the summary itself.
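
One way to read the sub-threshold propagation gap is as a mean paired difference in downstream toxicity, restricted to pairs whose toxic-origin memory a monitor would pass. The sketch below assumes that reading. The record layout, the name `spg`, and the numbers are illustrative, and the paper's exact estimator may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedRollout:
    """One counterfactual pair that differs only in memory provenance (assumed layout)."""
    memory_tox: float          # detector score of the toxic-origin memory summary
    downstream_toxic: float    # downstream toxicity under the toxic-origin memory
    downstream_neutral: float  # downstream toxicity under the matched neutral memory


def spg(pairs: list[PairedRollout], tau: float) -> float:
    """Mean paired downstream gap over pairs whose memory scores below tau."""
    passed = [p for p in pairs if p.memory_tox < tau]
    if not passed:
        raise ValueError("no memory state falls below the threshold")
    return mean(p.downstream_toxic - p.downstream_neutral for p in passed)


# Made-up scores, for illustration only.
pairs = [
    PairedRollout(0.10, 0.50, 0.25),
    PairedRollout(0.20, 0.75, 0.50),
    PairedRollout(0.90, 1.00, 0.25),  # overtly toxic memory, excluded at tau = 0.5
]
gap = spg(pairs, tau=0.5)
```

Conditioning on `memory_tox < tau` is what makes the gap "sub-threshold": the third pair has the largest downstream difference, but a monitor would already have flagged its memory, so it does not count toward the hidden effect.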

What carries the argument

Memory laundering, the compression of toxic or adversarial context into summaries that no longer trigger standard detectors while preserving hostile framing or conflict structure that influences future generations.

Load-bearing premise

The paired counterfactual multi-agent rollouts isolate the causal effect of the memory state itself on downstream toxicity without other differences in agent behavior or prompts affecting the results.
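
The premise can be illustrated with a toy paired rollout in which both conditions draw from identically seeded generators. This is a sketch under strong assumptions: `toy_agent_toxicity` is a hypothetical stand-in for an agent turn, and its noise is independent of the memory by construction.

```python
import random


def toy_agent_toxicity(memory_bias: float, rng: random.Random) -> float:
    """Hypothetical agent turn: sampling noise plus a memory-dependent shift."""
    noise = rng.random() * 0.5
    return min(1.0, noise + memory_bias)


def paired_rollout(seed: int, toxic_bias: float, neutral_bias: float) -> tuple[float, float]:
    """Run both conditions from separate generators built from the same seed."""
    toxic = toy_agent_toxicity(toxic_bias, random.Random(seed))
    neutral = toy_agent_toxicity(neutral_bias, random.Random(seed))
    return toxic, neutral


# Identical memory and seed give identical outputs, so the pairing itself adds no gap.
same_a, same_b = paired_rollout(seed=7, toxic_bias=0.1, neutral_bias=0.1)

# With different memory, the whole difference comes from the memory term.
toxic, neutral = paired_rollout(seed=7, toxic_bias=0.3, neutral_bias=0.1)
```

In this toy the shared seed cancels the noise exactly. In a real LLM the same seed applied to different contexts does not pin the trajectory, so the cancellation is only approximate, and that is where the premise could fail.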

What would settle it

Finding no increase in downstream toxicity when using toxic-origin summaries compared to neutral ones, under otherwise identical conditions in the multi-agent rollouts, would falsify the claim of hidden sub-threshold propagation.

If this is right

Raw transcript reuse tends to produce overt downstream toxicity.
Compressed memory states carry sub-threshold influence that increases toxicity without detection.
Placing sanitization before summarization substantially reduces the hidden propagation gap.
Cleaning only after summarization can leave the laundered influence intact.
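
The difference between the two intervention placements is a difference in composition order, which the sketch below makes explicit. It is a toy illustration, not the paper's pipeline: `toy_sanitize`, `toy_summarize`, and the blocklist are hypothetical stand-ins chosen so that the summarizer paraphrases hostility rather than quoting it.

```python
from typing import Callable

Sanitizer = Callable[[str], str]
Summarizer = Callable[[list[str]], str]

BLOCKLIST = {"idiot"}  # hypothetical word list for the toy sanitizer


def _is_blocked(word: str) -> bool:
    return word.strip(".,!?").lower() in BLOCKLIST


def toy_sanitize(text: str) -> str:
    """Drop blocklisted words; stands in for a real sanitizer."""
    return " ".join(w for w in text.split() if not _is_blocked(w))


def toy_summarize(messages: list[str]) -> str:
    """Paraphrasing stand-in: keeps the conflict framing, drops the wording."""
    hostile = any(_is_blocked(w) for m in messages for w in m.split())
    return "The exchange was hostile." if hostile else "The exchange was routine."


def sanitize_then_summarize(transcript: list[str], sanitize: Sanitizer,
                            summarize: Summarizer) -> str:
    """Clean each message before it is compressed into memory."""
    return summarize([sanitize(m) for m in transcript])


def summarize_then_sanitize(transcript: list[str], sanitize: Sanitizer,
                            summarize: Summarizer) -> str:
    """Compress first, then clean only the finished summary."""
    return sanitize(summarize(transcript))


transcript = ["You idiot, the build is broken.", "I will look into it."]
pre = sanitize_then_summarize(transcript, toy_sanitize, toy_summarize)
post = summarize_then_sanitize(transcript, toy_sanitize, toy_summarize)
```

Here `pre` is the neutral summary, while `post` keeps the hostile framing because the sanitizer finds nothing left to remove in the paraphrase. Whether real summarizers and sanitizers behave this way is the empirical question the paper addresses.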

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent safety protocols should monitor and sanitize information before it enters persistent memory rather than relying on post-compression checks.
This state-control approach might extend to other forms of persistent context such as retrieved documents or conversation histories.
Future agent designs could incorporate ongoing state auditing to prevent gradual accumulation of subtle biases or toxicities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The main thing to know is that toxic or adversarial context can be compressed into memory summaries that stay under standard toxicity thresholds but still push future agent outputs toward more toxicity. The authors call this memory laundering and quantify the hidden effect with a sub-threshold propagation gap measured through paired counterfactual multi-agent rollouts. The work also shows that cleaning state before summarization cuts the gap more effectively than cleaning the finished summary alone. The distinction the paper draws between raw transcripts and compressed memory is a useful framing for thinking about where safety checks belong in persistent-state systems. The experiments give a concrete empirical demonstration that is consistent with what one would expect from the long-running agents already appearing in practice.

One moderate soft spot is the isolation of the memory effect. The rollouts are prompt-matched, but agent turns are stochastic, so early differences in response style or topic can create their own toxicity increases separate from the summary content. The paper would be stronger with explicit fixed-seed controls or a variance decomposition to rule that channel out more cleanly. Without those details the causal claim is a bit less tight, though the overall pattern still looks plausible. The work stays empirical and avoids self-referential definitions, which keeps the circularity burden low.

This is for researchers and engineers working on memory-augmented agents or safety pipelines for them. Readers who care about state management in deployed systems will find the SPG metric and the before-versus-after intervention results directly usable. It deserves a serious referee because the failure mode is timely and the setup provides a starting point for tighter follow-up. I would send it to peer review with a request for more on rollout controls and statistical robustness.

Referee Report

2 major / 2 minor

Summary. The manuscript studies 'memory laundering' as a failure mode in memory-augmented LLM agents, where toxic or adversarial context is compressed into summaries that fall below standard toxicity detector thresholds yet preserve hostile framing that influences future generations. Using paired counterfactual multi-agent rollouts, the authors demonstrate that such summaries increase downstream toxicity relative to matched neutral baselines, quantified via a new sub-threshold propagation gap (SPG) metric. Experiments distinguish propagation channels (raw transcripts vs. compressed memory) and show that sanitizing toxic state before summarization reduces the hidden gap more effectively than post-summarization cleaning.

Significance. If the results hold under tighter controls, the work identifies a practically relevant safety gap in persistent-state agents: output monitoring alone is insufficient when state compression can hide influence. The SPG metric provides a concrete, observable way to measure sub-threshold effects, and the differential mitigation finding offers actionable guidance on intervention timing. This contributes to agent safety literature by framing the problem as state-control rather than isolated generation safety.

major comments (2)

[Abstract and Methods (paired counterfactual multi-agent rollouts)] The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel remains unruled-out and directly undermines causal attribution.
[Experiments (SPG quantification and detector details)] The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artifactually enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.
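
A sensitivity check of the kind requested in the second comment could report, for a threshold grid fixed before looking at results, both the share of memory states that pass the monitor and the paired downstream gap among them. The sketch below is a minimal illustration with made-up scores; the tuple layout and the name `sweep` are assumptions, not the paper's code.

```python
from statistics import mean
from typing import Optional

# Each pair: (memory_tox, downstream_toxic, downstream_neutral). Made-up values.
Pair = tuple[float, float, float]


def sweep(pairs: list[Pair], taus: list[float]) -> list[tuple[float, float, Optional[float]]]:
    """For each threshold, return (tau, fraction of memories passing, mean paired gap)."""
    rows = []
    for tau in taus:
        passed = [(toxic, neutral) for mem, toxic, neutral in pairs if mem < tau]
        fraction = len(passed) / len(pairs)
        gap = mean(t - n for t, n in passed) if passed else None
        rows.append((tau, fraction, gap))
    return rows


pairs = [
    (0.05, 0.50, 0.25),
    (0.20, 0.75, 0.50),
    (0.40, 0.50, 0.25),
    (0.90, 1.00, 0.25),
]
rows = sweep(pairs, taus=[0.1, 0.3, 0.5])
for tau, fraction, gap in rows:
    print(f"tau={tau:.2f}  passing={fraction:.2f}  gap={gap}")
```

Reporting the passing fraction next to the gap matters because a gap computed on a shrinking subset can look stable while describing very few memory states.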

minor comments (2)

[Introduction] The term 'memory laundering' is introduced without explicit comparison to related notions such as context poisoning or memory injection; a short related-work paragraph would clarify novelty.
[Figures/Tables] Figure or table captions describing the multi-agent rollout protocol should include exact prompt templates and matching criteria for neutral baselines to support reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below, clarifying our experimental controls and committing to revisions that strengthen the causal claims and robustness analysis.

Referee: Abstract and Methods (paired counterfactual multi-agent rollouts): The central claim that toxic-origin summaries causally increase downstream toxicity via SPG rests on the rollouts isolating memory provenance. Stochastic divergence in agent trajectories (even under prompt matching) can produce independent toxicity amplification; without reported variance decomposition, fixed-seed controls, or single-agent ablations, this channel remains unruled-out and directly undermines causal attribution.

Authors: We agree that stochastic divergence in multi-agent trajectories is a potential confounder for causal attribution. Our paired counterfactual design matches prompts and initial states across toxic and neutral conditions while using fixed random seeds for all rollouts; results are averaged over 10 independent runs with reported standard errors. To directly address the concern, the revised manuscript will add an explicit variance decomposition separating memory provenance effects from trajectory stochasticity, along with single-agent ablation controls that hold the memory state fixed while varying only the generation process.
Revision: partial

Referee: Experiments (SPG quantification and detector details): The reported propagation gap depends on specific detector thresholds and summary-generation procedures, yet these are not specified. Post-hoc selection of thresholds or generation methods could artifactually enlarge the gap; explicit reporting of these choices and sensitivity checks is required to establish that the sub-threshold effect is robust rather than method-dependent.

Authors: We acknowledge that the original submission placed the exact detector thresholds, toxicity classifier version, and summary-generation hyperparameters in the appendix rather than the main text. The revised version will move these specifications into the Methods section and include new sensitivity analyses that vary the toxicity threshold across a range of values (e.g., 0.1–0.5) and alternative summary-generation prompts, confirming that the SPG remains statistically significant and directionally consistent.
Revision: yes
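
The variance decomposition promised in the first response is not specified in the text. One simple form it could take is a one-way split of downstream toxicity into a between-condition part (memory provenance) and a within-condition part (seed-to-seed stochasticity). The sketch below assumes that form; the function name and the scores are made up for illustration.

```python
from statistics import mean


def variance_shares(groups: dict[str, list[float]]) -> tuple[float, float]:
    """Split the total sum of squares into between-condition and within-condition shares."""
    all_scores = [x for xs in groups.values() for x in xs]
    grand = mean(all_scores)
    # Between: how far each condition's mean sits from the grand mean.
    between = sum(len(xs) * (mean(xs) - grand) ** 2 for xs in groups.values())
    # Within: seed-to-seed spread around each condition's own mean.
    within = sum((x - mean(xs)) ** 2 for xs in groups.values() for x in xs)
    total = between + within
    if total == 0:
        raise ValueError("all scores are identical; shares are undefined")
    return between / total, within / total


# Downstream toxicity per seed under each memory condition (made-up values).
scores = {
    "toxic_origin": [0.50, 0.75],
    "neutral": [0.25, 0.50],
}
between_share, within_share = variance_shares(scores)
```

A large within-condition share would support the referee's concern that trajectory noise rivals the memory effect; a large between-condition share would support the authors' attribution.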

Circularity Check

0 steps flagged

No circularity: empirical demonstration of behavioral gap via new metric

The paper is an empirical study that introduces the SPG metric to quantify observed downstream toxicity differences in paired rollouts and reports experimental outcomes on sanitization placement. No derivation chain exists that reduces a claimed result to fitted parameters, self-definitions, or self-citation load-bearing premises. The central claim rests on direct measurement of behavioral differences rather than any algebraic or definitional equivalence to inputs. The work is self-contained against external benchmarks of toxicity detection and rollout comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that summarization can preserve hostile framing while evading detectors, and that counterfactual rollouts isolate memory effects. No free parameters or invented physical entities are introduced.

axioms (2)

Domain assumption: Standard toxicity detectors applied to summaries are sufficient to classify state as safe for downstream use. Invoked when claiming that laundered summaries remain below thresholds yet still propagate influence.
Domain assumption: Paired counterfactual rollouts can hold all other agent variables constant while varying only memory state. Required for attributing downstream toxicity differences to memory laundering.

invented entities (1)

memory laundering (no independent evidence)
Purpose: Names the process by which toxic context is compressed into summaries that evade detection while retaining influence.
Conceptual label introduced to describe the observed phenomenon; no independent falsifiable prediction is provided.

pith-pipeline@v0.9.0 · 5768 in / 1331 out tokens · 47031 ms · 2026-05-19T21:23:21.706686+00:00

Original abstract

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

Figures

Figures reproduced from arXiv: 2605.16746 by Agam Goyal, Hari Sundaram, Yian Wang, Yuen Chen.

**Figure 1.** Memory laundering in stateful agents. (a) In single-turn chatbots, safety monitoring is typically applied directly to the generated response. In memory-augmented agents, harmful influence can instead be compressed into external agent state, such as summaries or transcripts, that appears safe under standard toxicity checks while still conditioning downstream agents toward unsafe behavior. (b) In our work, w…

**Figure 2.** Additional qualitative examples of memory laundering. Each example shows a toxic source message from agent A1, the resulting compressed memory state Mt, and the downstream response conditioned on that memory. In both cases, the memory summary scores below τ = 0.5 despite source toxicity above 0.9, yet the downstream response remains overtly toxic. Example 2 flattens explicit hostility into a neutral descri…

**Figure 3.** Sub-threshold propagation gap SPG(τ) as a function of the classifier threshold τ under memory-augmented chain rollouts (n = 200 seeds). SPG remains stable at ≈ 0.14 across τ ∈ {0.03, 0.05, 0.1, 0.2, 0.3, 0.5} and is significantly positive at every threshold (all p < 0.001, paired Wilcoxon signed-rank). Across this entire range, the fraction of memory states classified as clean satisfies Pr[tox(Mt) < τ] ≥…

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Git Is the Memory Solution for the Agentic Development Lifecycle
cs.SE 2026-07 conditional novelty 6.0

Git-bound, routed memory for coding agents, with retrieval and answer-assembly evaluations, achieves ~0.31 MRR on seed retrieval and up to 0.83 answer-sufficiency on rationale questions, at token costs three orders of...

