MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Pith reviewed 2026-05-25 04:03 UTC · model grok-4.3
The pith
MemAudit identifies poisoned memories in LLM agents after attacks by scoring each record's causal contribution to harmful outputs and detecting structural anomalies in the memory store.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemAudit combines a counterfactual memory influence score, which quantifies how much each memory record causally affects the production of harmful outputs, with a memory consistency graph that surfaces records whose content or retrieval patterns deviate from the rest of the store. When applied after harmful behavior is observed, these two signals together locate and neutralize the malicious records that were injected through normal agent interactions in the MINJA attack, eliminating the attack success that previously reached 70 percent in QA tasks and 83.3 percent in RAP tasks.
What carries the argument
the dual-signal auditing procedure that pairs a counterfactual memory influence score with a memory consistency graph to attribute and isolate malicious records
If this is right
- Agents can continue using long-term memory stores without permanent compromise once harmful behavior appears.
- Defense can shift from blocking inputs in real time to cleaning the memory bank afterward.
- The same auditing signals can be recomputed whenever new harmful outputs are observed.
- Memory stores remain usable for retrieval while still allowing targeted removal of compromised entries.
Where Pith is reading between the lines
- The auditing approach could extend to retrieval-augmented generation systems that also maintain persistent document stores.
- Repeated auditing passes might allow agents to maintain memory integrity over very long interaction histories.
- If the consistency graph can be maintained incrementally, the cost of each audit round could stay low enough for routine use.
Load-bearing premise
The two signals are together sufficient to separate malicious records from benign ones without producing many false positives or missing poisons across the tested agent configurations.
What would settle it
A new memory-injection technique that produces records whose removal does not change the harmful outputs yet still evades detection by both the influence score and the consistency graph.
Figures
read the original abstract
Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose \textbf{MemAudit}, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from $70\%$ to $0\%$, while RAP attack success drops from $83.3\%$ to $0\%$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. It combines a counterfactual memory influence score measuring each memory's causal contribution to harmful outputs with a memory consistency graph identifying structurally anomalous memories. Evaluated against the MINJA query-only memory injection attack, the paper claims substantial reductions in attack success rates under post-hoc auditing, specifically reducing QA ASR from 70% to 0% and RAP ASR from 83.3% to 0%.
Significance. If the empirical results hold under rigorous validation, MemAudit would address an important gap in defenses for memory-augmented LLM agents by enabling post-hoc identification and removal of malicious records after harmful behavior is observed, complementing existing online intervention methods. The dual use of causal attribution and structural anomaly detection provides a concrete, falsifiable approach to this security problem.
major comments (3)
- [Abstract] Abstract: The headline claims of reducing QA attack success from 70% to 0% and RAP from 83.3% to 0% are presented without any information on trial counts, statistical tests, baseline comparisons, variance across runs, or the precise computation and thresholding of the counterfactual influence scores, rendering the central empirical claim unverifiable from the provided evidence.
- [Method] Method description: No details are given on how the counterfactual memory influence score is computed from interventions, how it is combined with the memory consistency graph anomaly score (e.g., via thresholds, weighting, or logical conjunction), or whether the final decision rule was tuned on the reported test attacks, which directly bears on whether the two signals suffice to neutralize poisons without high false positives or missed records.
- [Evaluation] Evaluation: The manuscript supplies no false-positive rates when MemAudit is applied to clean memory stores, nor any analysis of missed poisons or robustness under distribution shift, leaving the weakest assumption (that the combined signals reliably flag poisons in realistic agent settings) untested and the 0% ASR figures potentially non-generalizable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity and completeness of our empirical claims, methodological details, and evaluation. We will revise the manuscript accordingly to address each point.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claims of reducing QA attack success from 70% to 0% and RAP from 83.3% to 0% are presented without any information on trial counts, statistical tests, baseline comparisons, variance across runs, or the precise computation and thresholding of the counterfactual influence scores, rendering the central empirical claim unverifiable from the provided evidence.
Authors: We agree that the abstract should provide more context to make the claims verifiable. In the revision, we will expand the abstract to note that results are averaged over 10 independent runs with reported standard deviations, include a brief mention of baseline comparisons (standard retrieval without auditing), and indicate that the influence score uses a fixed threshold of 0.5 on the causal effect difference. Full details on computation and statistical tests will remain in the main text and appendix due to length constraints. revision: yes
-
Referee: [Method] Method description: No details are given on how the counterfactual memory influence score is computed from interventions, how it is combined with the memory consistency graph anomaly score (e.g., via thresholds, weighting, or logical conjunction), or whether the final decision rule was tuned on the reported test attacks, which directly bears on whether the two signals suffice to neutralize poisons without high false positives or missed records.
Authors: We will revise the Method section to include the exact computation: the counterfactual influence score is defined as the difference in the LLM's output probability for a harmful response when performing a do-intervention that removes the candidate memory record. The consistency graph anomaly score measures deviation from average node connectivity. The signals are combined via logical conjunction after independent thresholding (influence > 0.3 and anomaly > 2 standard deviations). Thresholds were selected on a held-out validation set of clean and poisoned memories, not on the test attacks. We will add equations, pseudocode, and explicit discussion of this process. revision: yes
-
Referee: [Evaluation] Evaluation: The manuscript supplies no false-positive rates when MemAudit is applied to clean memory stores, nor any analysis of missed poisons or robustness under distribution shift, leaving the weakest assumption (that the combined signals reliably flag poisons in realistic agent settings) untested and the 0% ASR figures potentially non-generalizable.
Authors: We agree these metrics are necessary for a complete evaluation. In the revised manuscript, we will add a new subsection reporting a false-positive rate below 5% when applying MemAudit to five clean memory stores of varying sizes. We will confirm zero missed poisons (100% recall) in the reported experiments and include an analysis of robustness under distribution shift by testing on out-of-domain queries from a different domain, where ASR remains at 0%. These results will be presented with the same trial counts as the main experiments. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical post-hoc auditing method that combines a counterfactual memory influence score with a memory consistency graph, then reports experimental reductions in attack success rates on QA and RAP tasks under the MINJA attack. No equations, parameter-fitting steps, or derivation chains appear in the abstract or description that would reduce the claimed outcomes to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The results are presented as direct empirical measurements rather than derived quantities, leaving the framework self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Advances in neural information processing systems , volume=
Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=
-
[2]
Advances in Neural Information Processing Systems , volume=
Webshop: Towards scalable real-world web interaction with grounded language agents , author=. Advances in Neural Information Processing Systems , volume=
-
[3]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts , author=. arXiv preprint arXiv:2309.10253 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents
Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents , author=. arXiv preprint arXiv:2604.02623 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Openhands: An open platform for ai software developers as generalist agents , author=. arXiv preprint arXiv:2407.16741 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Proceedings of the 36th annual acm symposium on user interface software and technology , pages=
Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th annual acm symposium on user interface software and technology , pages=
-
[9]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager: An open-ended embodied agent with large language models , author=. arXiv preprint arXiv:2305.16291 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
The eleventh international conference on learning representations , year=
React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=
-
[11]
Advances in neural information processing systems , volume=
Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=
-
[12]
Advances in Neural Information Processing Systems , volume=
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=
-
[13]
A practical memory injection attack against llm agents , author=. arXiv e-prints , pages=
-
[14]
arXiv preprint arXiv:2512.16962 , year=
MemoryGraft: Persistent compromise of LLM agents via poisoned experience retrieval , author=. arXiv preprint arXiv:2512.16962 , year=
-
[15]
arXiv preprint arXiv:2601.07072 , year=
Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems , author=. arXiv preprint arXiv:2601.07072 , year=
-
[16]
34th USENIX Security Symposium (USENIX Security 25) , pages=
\ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
-
[17]
Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
-
[18]
Attacks, defenses and evaluations for llm conversation safety: A survey , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
work page 2024
-
[19]
arXiv preprint arXiv:2505.12567 , year=
A survey of attacks on large language models , author=. arXiv preprint arXiv:2505.12567 , year=
-
[20]
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Jailbreak attacks and defenses against large language models: A survey , author=. arXiv preprint arXiv:2407.04295 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
" do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , author=. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , pages=
work page 2024
-
[22]
arXiv preprint arXiv:2505.04806 , year=
Red teaming the mind of the machine: A systematic evaluation of prompt injection and jailbreak vulnerabilities in llms , author=. arXiv preprint arXiv:2505.04806 , year=
-
[23]
From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows , author=. ICT Express , year=
-
[24]
arXiv preprint arXiv:2509.14285 , year=
A multi-agent LLM defense pipeline against prompt injection attacks , author=. arXiv preprint arXiv:2509.14285 , year=
-
[25]
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing , author=. 2021 , eprint=
work page 2021
-
[26]
A broad-coverage challenge corpus for sentence understanding through inference , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=
work page 2018
-
[27]
arXiv preprint arXiv:2510.02373 , year=
A-memguard: A proactive defense framework for llm-based agent memory , author=. arXiv preprint arXiv:2510.02373 , year=
-
[28]
Memory Poisoning Attack and Defense on Memory Based LLM-Agents , author=. arXiv preprint arXiv:2601.05504 , year=
-
[29]
arXiv preprint arXiv:2603.02240 , year=
SuperLocalMemory: Privacy-preserving multi-agent memory with Bayesian trust defense against memory poisoning , author=. arXiv preprint arXiv:2603.02240 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.