pith. machine review for the scientific record. sign in

arxiv: 2605.08442 · v1 · submitted 2026-05-08 · 💻 cs.CR · cs.AI· cs.LG

Recognition: 1 theorem link

· Lean Theorem

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG
keywords persistent memory attacksLLM agentsdefense evaluationarchitectural layerstool gatingattack success rateRAG retrievalstateful agents
0
0 comments X

The pith

Memory-layer tool gating stops persistent memory attacks on eight of nine LLM agents but inverts success to 100 percent on one reasoning model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six defenses placed at input, retrieval, prompt, and memory layers against delayed-trigger persistent memory attacks that store malicious instructions in agent memory via RAG documents for later execution. Input-level filters and retrieval-level classifiers leave attack success rates statistically unchanged from the 88.6 percent baseline because they cannot see or are masked from the injected content. Prompt hardening reduces success only modestly to 77.8 percent, with the drop coming from two models that refuse attacks independently of the defense. Tool gating at the memory layer via Memory Sandbox eliminates the recall step the attacks need and drives success to zero on eight models while imposing no measurable utility loss on normal tasks. The ninth model, a reasoning system that refuses attacks without any defense, reaches 100 percent success under the sandbox because blocking explicit memory recall routes the attack through the RAG path where the model's refusal does not activate.

Core claim

Persistent memory attacks achieve high success rates against open-source stateful LLM agents by storing malicious instructions retrieved through RAG for execution in later sessions. Defenses at the input and retrieval layers fail to lower attack success rates because they cannot observe RAG-injected content or are defeated by compliance-framed semantic masking. Prompt hardening produces only partial reduction. Memory Sandbox, which applies tool gating at the memory layer, removes the recall capability required by the attacks and reduces success to zero for eight of nine models. The remaining reasoning model inverts from zero to 100 percent success under the sandbox because the restriction on

What carries the argument

Tool-gating at the memory layer (Memory Sandbox), which blocks explicit recall of RAG-injected instructions while leaving other pathways intact.

If this is right

  • Input and retrieval defenses cannot observe or filter RAG-injected content, leaving attack success rates at baseline levels.
  • Retrieval classifiers are defeated by compliance-framed semantic masking that keeps malicious instructions below detection thresholds.
  • Memory-layer tool gating eliminates the recall step required for delayed execution and drives attack success to zero on most models.
  • One class of reasoning models loses its natural refusal behavior when explicit memory recall is blocked, routing attacks onto the RAG pathway where refusal does not activate.
  • Memory Sandbox produces no utility cost on clean tasks as measured by BTCR across all tested conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defense design for stateful agents should focus on controlling memory recall rather than earlier filtering stages that the paper shows are ineffective.
  • The inversion observed on the reasoning model indicates that memory restrictions can create unintended pathways that future agent designs may need to close separately.
  • Extending the layer-by-layer evaluation to closed-source models or production agent logs would test whether the same failure patterns and inversion effect hold outside the open-source sample.
  • New attack variants could be constructed to target RAG retrieval directly once memory recall is gated, bypassing the mechanism that Memory Sandbox exploits.

Load-bearing premise

The nine open-source models and the specific delayed-trigger attacks tested are representative of real-world stateful LLM agent deployments and threats, and that the BTCR utility metric captures all relevant performance impacts.

What would settle it

Running the same attack set on a new collection of models that includes closed-source agents and checking whether Memory Sandbox still produces zero attack success on eight of nine or whether the inversion effect appears more widely.

read the original abstract

Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level refusal independent of the defense. The architectural explanation holds: input-level defenses cannot observe RAG-injected content, and retrieval-level classifiers are defeated by compliance-framed semantic masking. One defense, tool-gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript evaluates the effectiveness of six defenses across four architectural layers against persistent memory attacks on stateful LLM agents. Through 5,040 experimental runs (N=40 per condition) on nine open-source models, it demonstrates that input-level and retrieval-level defenses fail to mitigate attack success rates (ASR), remaining at approximately 88-89% similar to the undefended baseline of 88.6%. Prompt Hardening shows partial reduction to 77.8% ASR. In contrast, the memory-layer defense 'Memory Sandbox' achieves 0% ASR for eight models by gating recall, but causes an inversion to 100% ASR in one reasoning model. The defense incurs no utility cost as measured by BTCR.

Significance. This study is significant as it provides the first systematic analysis of defense strategies against persistent memory attacks in LLM agents. The large-scale empirical evaluation (5040 runs across nine models with statistical comparisons) and the discovery of the defense inversion effect offer valuable insights into why certain defenses fail mechanistically and how memory-layer interventions can be both effective and counterintuitively risky. The zero utility cost finding for the successful defense further supports its practical applicability. These elements enable informed defense investment decisions.

minor comments (2)
  1. In the methods section, provide additional details on the construction and injection of the delayed-trigger attacks via RAG-retrieved documents to support full reproducibility of the 5040 runs.
  2. In the results section, explicitly state the statistical tests (e.g., specific hypothesis test and significance threshold) used to conclude that certain ASRs are 'statistically indistinguishable' from the 88.6% baseline.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The evaluation of our large-scale experiments and the identification of the defense inversion effect are appreciated. Since no specific major comments were raised, we have no points to rebut and will proceed with any minor editorial adjustments in the revised version.

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

full rationale

This paper conducts an empirical evaluation of six defenses across architectural layers against delayed-trigger persistent memory attacks, using 5,040 experimental runs (N=40 per condition) on nine open-source models. The central claims—defense failure rates at input/retrieval layers, Memory Sandbox reducing ASR to 0% for eight models with one inversion case, and zero utility cost—are directly measured from attack success rates and BTCR metrics in the reported conditions. No mathematical derivations, equations, fitted parameters, or self-citational chains exist that reduce any result to its inputs by construction. The architectural explanations are consistent with the experimental outcomes rather than presupposing them, and the study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical with no new mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5614 in / 972 out tokens · 46566 ms · 2026-05-12T01:14:59.085101+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

    cs.CR 2026-05 conditional novelty 6.0

    MemLineage enforces untrusted-path persistence in LLM agent memory through Merkle logs, per-principal signatures, and max-of-strong-edges lineage propagation, achieving zero ASR on three poisoning workloads with sub-m...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Zombie Agents: Persistent Control of Self-Evolving

    Yang, Xianglin and He, Yufei and Ji, Shuo and Hooi, Bryan and Dong, Jin Song , journal=. Zombie Agents: Persistent Control of Self-Evolving

  2. [2]

    Shao, Shuai and others , booktitle=

  3. [3]

    Srivastava, Saksham Sahai and He, Haoyu , journal=

  4. [4]

    Engineering Applications of Artificial Intelligence , volume=

    Memory Poisoning Attacks on Retrieval-Augmented Large Language Model Agents via Deceptive Semantic Reasoning , author=. Engineering Applications of Artificial Intelligence , volume=

  5. [5]

    Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , booktitle=

  6. [6]

    Debenedetti, Edoardo and others , booktitle=

  7. [7]

    Yang, Xianglin and others , journal=

  8. [8]

    33rd USENIX Security Symposium , pages=

    Formalizing and Benchmarking Prompt Injection Attacks and Defenses , author=. 33rd USENIX Security Symposium , pages=

  9. [9]

    Not What You've Signed Up For: Compromising Real-World

    Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle=. Not What You've Signed Up For: Compromising Real-World

  10. [10]

    2025 , howpublished=

    The Claude Model Spec , author=. 2025 , howpublished=

  11. [11]

    , journal=

    Conley, N. , journal=. How Safe Are. 2026 , note=

  12. [12]

    2025 , howpublished=

  13. [13]

    Zou, Wei and Geng, Runpeng and Wang, Binghui and Jia, Jinyuan , booktitle=

  14. [14]

    Packer, Charles and Fang, Vivian and Patil, Shishir G and Lin, Kevin and Wooders, Sarah and Gonzalez, Joseph E , booktitle=

  15. [15]

    Advances in Neural Information Processing Systems , volume=

    Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=

  16. [16]

    Zhong, Wanjun and Guo, Lianghong and Gao, Qiqi and Ye, He and Wang, Yanlin , booktitle=

  17. [17]

    Evo-Memory: Benchmarking

    Wei, Tianxin and Sachdeva, Noveen and Coleman, Benjamin and others , journal=. Evo-Memory: Benchmarking

  18. [18]

    Your Agent May Misevolve: Emergent Risks in Self-Evolving

    Shao, Shuai and Ren, Qihan and Qian, Chen and others , journal=. Your Agent May Misevolve: Emergent Risks in Self-Evolving

  19. [19]

    One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems

    One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems , author=. arXiv preprint arXiv:2505.11548 , year=

  20. [20]

    arXiv preprint arXiv:2410.14479 , year=

    Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models , author=. arXiv preprint arXiv:2410.14479 , year=

  21. [21]

    arXiv preprint arXiv:2502.16580 , year=

    Can Indirect Prompt Injection Attacks Be Detected and Removed? , author=. arXiv preprint arXiv:2502.16580 , year=

  22. [22]

    Reprompt: The Single-Click

    Taler, Dolev , journal=. Reprompt: The Single-Click. 2026 , note=

  23. [23]

    Zhang, Yedi and Wang, Haoyu and Yang, Xianglin and Dong, Jin Song and Sun, Jun , journal=

  24. [24]

    Skillject: Automating stealthy skill-based prompt injection for coding agents with trace-driven closed-loop refinement.arXiv preprintarXiv:2602.14211, 2026

    Skillject: Automating Stealthy Skill-Based Prompt Injection for Coding Agents , author=. arXiv preprint arXiv:2602.14211 , year=

  25. [25]

    Li, Chunyang and Zhang, Junwei and Cheng, Anda and Ma, Zhuo and Li, Xinghua and Ma, Jianfeng , journal=