Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents
Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3
The pith
Memory-layer tool gating stops persistent memory attacks on eight of nine LLM agents but inverts success to 100 percent on one reasoning model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persistent memory attacks achieve high success rates against open-source stateful LLM agents by storing malicious instructions retrieved through RAG for execution in later sessions. Defenses at the input and retrieval layers fail to lower attack success rates because they cannot observe RAG-injected content or are defeated by compliance-framed semantic masking. Prompt hardening produces only partial reduction. Memory Sandbox, which applies tool gating at the memory layer, removes the recall capability required by the attacks and reduces success to zero for eight of nine models. The remaining reasoning model inverts from zero to 100 percent success under the sandbox because the restriction on explicit recall forces it onto the RAG pathway, where its refusal mechanism does not activate.
What carries the argument
Tool-gating at the memory layer (Memory Sandbox), which blocks explicit recall of RAG-injected instructions while leaving other pathways intact.
If this is right
- Input and retrieval defenses cannot observe or filter RAG-injected content, leaving attack success rates at baseline levels.
- Retrieval classifiers are defeated by compliance-framed semantic masking that keeps malicious instructions below detection thresholds.
- Memory-layer tool gating eliminates the recall step required for delayed execution and drives attack success to zero on most models.
- One class of reasoning models loses its natural refusal behavior when explicit memory recall is blocked, routing attacks onto the RAG pathway where refusal does not activate.
- Memory Sandbox produces no utility cost on clean tasks as measured by BTCR across all tested conditions.
Where Pith is reading between the lines
- Defense design for stateful agents should focus on controlling memory recall rather than earlier filtering stages that the paper shows are ineffective.
- The inversion observed on the reasoning model indicates that memory restrictions can create unintended pathways that future agent designs may need to close separately.
- Extending the layer-by-layer evaluation to closed-source models or production agent logs would test whether the same failure patterns and inversion effect hold outside the open-source sample.
- New attack variants could be constructed to target RAG retrieval directly once memory recall is gated, bypassing the mechanism that Memory Sandbox exploits.
Load-bearing premise
The nine open-source models and the specific delayed-trigger attacks tested are representative of real-world stateful LLM agent deployments and threats, and the BTCR utility metric captures all relevant performance impacts.
What would settle it
Running the same attack set on a new collection of models that includes closed-source agents and checking whether Memory Sandbox still produces zero attack success on eight of nine or whether the inversion effect appears more widely.
read the original abstract
Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level refusal independent of the defense. The architectural explanation holds: input-level defenses cannot observe RAG-injected content, and retrieval-level classifiers are defeated by compliance-framed semantic masking. One defense, tool-gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the effectiveness of six defenses across four architectural layers against persistent memory attacks on stateful LLM agents. Through 5,040 experimental runs (N=40 per condition) on nine open-source models, it demonstrates that input-level and retrieval-level defenses fail to mitigate attack success rates (ASR), which remain at approximately 88-89%, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening shows partial reduction to 77.8% ASR. In contrast, the memory-layer defense 'Memory Sandbox' achieves 0% ASR for eight models by gating recall, but causes an inversion to 100% ASR in one reasoning model. The defense incurs no utility cost as measured by BTCR.
Significance. This study is significant as it provides the first systematic analysis of defense strategies against persistent memory attacks in LLM agents. The large-scale empirical evaluation (5040 runs across nine models with statistical comparisons) and the discovery of the defense inversion effect offer valuable insights into why certain defenses fail mechanistically and how memory-layer interventions can be both effective and counterintuitively risky. The zero utility cost finding for the successful defense further supports its practical applicability. These elements enable informed defense investment decisions.
minor comments (2)
- In the methods section, provide additional details on the construction and injection of the delayed-trigger attacks via RAG-retrieved documents to support full reproducibility of the 5040 runs.
- In the results section, explicitly state the statistical tests (e.g., specific hypothesis test and significance threshold) used to conclude that certain ASRs are 'statistically indistinguishable' from the 88.6% baseline.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The evaluation of our large-scale experiments and the identification of the defense inversion effect are appreciated. Since no major concerns were raised, we have no substantive points to rebut; in the revised version we will add the requested details on the construction and injection of the delayed-trigger attacks and explicitly state the statistical tests and significance threshold supporting the indistinguishability claims.
Circularity Check
No significant circularity in empirical evaluation
full rationale
This paper conducts an empirical evaluation of six defenses across architectural layers against delayed-trigger persistent memory attacks, using 5,040 experimental runs (N=40 per condition) on nine open-source models. The central claims (defense failure rates at the input and retrieval layers, Memory Sandbox reducing ASR to 0% for eight models with one inversion case, and zero utility cost) are directly measured from attack success rates and BTCR metrics in the reported conditions. No mathematical derivations, equations, fitted parameters, or self-citational chains exist that reduce any result to its inputs by construction. The architectural explanations are consistent with the experimental outcomes rather than presupposing them, and the study stands on its own measurements rather than depending on external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). ... Memory Sandbox ... reduces ASR to 0% for eight of nine models by removing the recall capability"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
  MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...