Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents
Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3
The pith
Memory-layer tool gating stops persistent memory attacks on eight of nine LLM agents but inverts success to 100 percent on one reasoning model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persistent memory attacks achieve high success rates against open-source stateful LLM agents by storing malicious instructions retrieved through RAG for execution in later sessions. Defenses at the input and retrieval layers fail to lower attack success rates because they cannot observe RAG-injected content or are defeated by compliance-framed semantic masking. Prompt hardening produces only partial reduction. Memory Sandbox, which applies tool gating at the memory layer, removes the recall capability required by the attacks and reduces success to zero for eight of nine models. The remaining reasoning model inverts from zero to 100 percent success under the sandbox because the restriction on explicit recall forces it onto the RAG pathway, where its refusal mechanism does not activate.
What carries the argument
Tool-gating at the memory layer (Memory Sandbox), which blocks explicit recall of RAG-injected instructions while leaving other pathways intact.
If this is right
- Input and retrieval defenses cannot observe or filter RAG-injected content, leaving attack success rates at baseline levels.
- Retrieval classifiers are defeated by compliance-framed semantic masking that keeps malicious instructions below detection thresholds.
- Memory-layer tool gating eliminates the recall step required for delayed execution and drives attack success to zero on most models.
- One class of reasoning models loses its natural refusal behavior when explicit memory recall is blocked, routing attacks onto the RAG pathway where refusal does not activate.
- Memory Sandbox produces no utility cost on clean tasks as measured by BTCR across all tested conditions.
Where Pith is reading between the lines
- Defense design for stateful agents should focus on controlling memory recall rather than earlier filtering stages that the paper shows are ineffective.
- The inversion observed on the reasoning model indicates that memory restrictions can create unintended pathways that future agent designs may need to close separately.
- Extending the layer-by-layer evaluation to closed-source models or production agent logs would test whether the same failure patterns and inversion effect hold outside the open-source sample.
- New attack variants could be constructed to target RAG retrieval directly once memory recall is gated, bypassing the mechanism that Memory Sandbox exploits.
Load-bearing premise
The nine open-source models and the specific delayed-trigger attacks tested are representative of real-world stateful LLM agent deployments and threats, and the BTCR utility metric captures all relevant performance impacts.
What would settle it
Running the same attack set on a new collection of models that includes closed-source agents and checking whether Memory Sandbox still produces zero attack success on eight of nine or whether the inversion effect appears more widely.
read the original abstract
Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level refusal independent of the defense. The architectural explanation holds: input-level defenses cannot observe RAG-injected content, and retrieval-level classifiers are defeated by compliance-framed semantic masking. One defense, tool-gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the effectiveness of six defenses across four architectural layers against persistent memory attacks on stateful LLM agents. Through 5,040 experimental runs (N=40 per condition) on nine open-source models, it demonstrates that input-level and retrieval-level defenses fail to mitigate attack success rates (ASR), which remain at approximately 88-89%, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening shows partial reduction to 77.8% ASR. In contrast, the memory-layer defense 'Memory Sandbox' achieves 0% ASR for eight models by gating recall, but causes an inversion to 100% ASR in one reasoning model. The defense incurs no utility cost as measured by BTCR.
Significance. This study is significant as it provides the first systematic analysis of defense strategies against persistent memory attacks in LLM agents. The large-scale empirical evaluation (5040 runs across nine models with statistical comparisons) and the discovery of the defense inversion effect offer valuable insights into why certain defenses fail mechanistically and how memory-layer interventions can be both effective and counterintuitively risky. The zero utility cost finding for the successful defense further supports its practical applicability. These elements enable informed defense investment decisions.
minor comments (2)
- In the methods section, provide additional details on the construction and injection of the delayed-trigger attacks via RAG-retrieved documents to support full reproducibility of the 5040 runs.
- In the results section, explicitly state the statistical tests (e.g., specific hypothesis test and significance threshold) used to conclude that certain ASRs are 'statistically indistinguishable' from the 88.6% baseline.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The evaluation of our large-scale experiments and the identification of the defense inversion effect are appreciated. Since no major concerns were raised, we have no substantive points to rebut; in the revised version we will add the requested details on the construction and injection of the delayed-trigger attacks and explicitly state the statistical tests and significance threshold supporting the indistinguishability claims.
Circularity Check
No significant circularity in empirical evaluation
full rationale
This paper conducts an empirical evaluation of six defenses across architectural layers against delayed-trigger persistent memory attacks, using 5,040 experimental runs (N=40 per condition) on nine open-source models. The central claims (defense failure rates at the input and retrieval layers, Memory Sandbox reducing ASR to 0% for eight models with one inversion case, and zero utility cost) are directly measured from attack success rates and BTCR metrics in the reported conditions. No mathematical derivations, equations, fitted parameters, or self-citational chains exist that reduce any result to its inputs by construction. The architectural explanations are consistent with the experimental outcomes rather than presupposing them, and the study stands on its own measurements rather than depending on external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage: "We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). ... Memory Sandbox ... reduces ASR to 0% for eight of nine models by removing the recall capability"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
  MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...