MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning
Pith reviewed 2026-06-30 00:23 UTC · model grok-4.3
The pith
Injecting three disguised records into an LLM agent's long-term memory lets attackers steer tool selection toward malicious choices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemMorph works by injecting a small number of crafted records disguised as technical facts, incident reports, and operational policies into the agent's long-term memory. These records do not explicitly command a tool call; instead they reshape the agent's accumulated experience so that it autonomously infers and selects the tool the attacker wants. Experiments on three benchmarks, ten agent backbones, and three memory-module implementations show success rates up to 85.9 percent with only three injected records, beating the strongest baseline by up to 25 percent and remaining effective against three representative defenses.
What carries the argument
Poisoned long-term memory records that bias the agent's contextual perception and future tool-selection inferences.
If this is right
- Tool-metadata auditing alone is insufficient to protect agents that accumulate experience in memory.
- A small number of memory entries can produce lasting changes in tool-use policy across repeated tasks.
- Existing defenses focused on prompt or metadata integrity leave memory-based attacks largely unaddressed.
- Attack success remains high even when the agent backbone or memory module is varied.
Where Pith is reading between the lines
- Memory integrity checks may need to become a standard layer in agent architectures, similar to input sanitization.
- The same poisoning approach could be tested on other memory-augmented systems such as retrieval-augmented generation pipelines.
- Persistent memory records create an attack surface that survives across sessions, suggesting the need for time-limited or verifiable memory entries.
Load-bearing premise
The injected records are accepted into the agent's long-term memory and then used to guide tool decisions without being filtered or detected.
What would settle it
Running the agent on the same task before and after the three crafted records are added to memory, and checking whether the agent now selects the attacker's chosen tool at a significantly higher rate.
Figures
read the original abstract
LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise such process, steering agents toward inappropriate/wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that bias tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemMorph as the first attack to hijack tool selection in LLM-based agents by poisoning long-term memory with a small number (as few as three) of crafted records disguised as technical facts, incident reports, and operational policies. These records are claimed to reshape the agent's contextual perception, leading to autonomous selection of attacker-preferred tools. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations report up to 85.9% attack success rate (ASR), outperforming the strongest baseline by up to 25%, with retained potency under 3 representative defenses.
Significance. If the empirical results hold after addressing methodological gaps, the work identifies long-term memory as a previously under-explored attack surface in tool-augmented agents. This is significant because modern agents increasingly rely on memory to refine tool-selection policies, and the multi-benchmark, multi-backbone evaluation provides breadth. The paper's emphasis on disguised records (rather than direct metadata manipulation) and defense resilience adds to its potential impact on agent security.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: The central effectiveness claim (85.9% ASR with 3 records) rests on the unvalidated assumption that poisoned records are ingested into long-term memory, retrieved in future contexts, and causally alter tool decisions without triggering filtering, deduplication, or provenance checks. No concrete details are provided on the injection channel (direct DB write vs. agent-mediated append), memory-module validation behavior, or retrieval mechanics (e.g., embedding similarity), making this the load-bearing weakest link.
- [Experiments] Experimental results section: Success rates are reported without error bars, statistical significance tests, or full methodology on controls for the ingestion assumption. This limits verification of whether the measured ASR would collapse under even lightweight sanitization, directly undermining the cross-setup and cross-defense claims.
minor comments (2)
- [Abstract] The abstract and results lack explicit description of the three representative defenses and their implementation details, which would aid assessment of the 'retaining potency' claim.
- [Evaluation] Notation for attack success rate (ASR) and baseline comparisons should be defined consistently with tables or figures for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where greater methodological detail will strengthen the paper. We have revised the manuscript to address both major comments by expanding the methodology and results sections with concrete implementation details, statistical analyses, and additional controls.
read point-by-point responses
-
Referee: [Abstract / Evaluation] The central effectiveness claim (85.9% ASR with 3 records) rests on the unvalidated assumption that poisoned records are ingested into long-term memory, retrieved in future contexts, and causally alter tool decisions without triggering filtering, deduplication, or provenance checks. No concrete details are provided on the injection channel, memory-module validation behavior, or retrieval mechanics.
Authors: We agree that the original submission lacked sufficient detail on these mechanisms. In the revised manuscript we have added Subsection 3.2, which specifies: (1) the injection channel as agent-mediated appends triggered by simulated user queries that cause the agent to update its long-term memory; (2) the validation and deduplication behavior of each of the three tested memory modules; and (3) the retrieval mechanics (embedding-based cosine similarity with explicit threshold). We also include example memory logs confirming successful ingestion and retrieval of the three crafted records. These additions make the ingestion assumption explicit and verifiable. revision: yes
-
Referee: [Experiments] Success rates are reported without error bars, statistical significance tests, or full methodology on controls for the ingestion assumption. This limits verification of whether the measured ASR would collapse under even lightweight sanitization.
Authors: We accept this criticism. The revised version reports mean ASR with standard-deviation error bars from five independent trials per configuration. We added paired t-tests with p-values comparing MemMorph against baselines. We also expanded the experimental methodology with explicit ingestion-control procedures (post-injection memory audits and retrieval logs) and evaluated the attack under two additional lightweight sanitization defenses (keyword-based filtering and provenance tagging). Updated results appear in a new Table 5 and Appendix C, showing that ASR remains above 70% under these controls. revision: yes
Circularity Check
No circularity: empirical attack evaluation with no derivations or fitted predictions
full rationale
The paper is an empirical demonstration of a memory-poisoning attack on LLM agents. It reports measured attack success rates across benchmarks, agent backbones, and memory modules, without any claimed mathematical derivations, parameter fitting presented as prediction, or self-referential uniqueness theorems. The central results (e.g., 85.9% ASR with three records) are direct experimental outcomes, not reductions to inputs by construction. No load-bearing steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
Presents TMA-NM, a non-malleable origin-bound authority system for LLM-agent memory with TLA+ machine-checked separation theorems and benchmarks showing 0% attack success against direct and laundering poisoning while ...
-
Agent-Native Immune System: Architecture, Taxonomy, and Engineering
Introduces ANIS as an endogenous, six-layer immune architecture for AI agents with taxonomy of viruses/vaccines and a meta-cognitive Harness Triad for continual adaptation.
Reference graph
Works this paper leans on
-
[1]
InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems
Memory injection attacks on llm agents via query-only interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems. Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2023. News summarization and evaluation in the era of gpt-3.Preprint, arXiv:2209.12356. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten...
-
[2]
InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90
Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90. Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith Lambert, Adam Amos-Binks, Zohreh Dannenhauer, and Dustin Dannenhauer. 2023. M...
-
[3]
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion: Language agents with verbal rein- forcement learning.Preprint, arXiv:2303.11366. Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2024. Cognitive architectures for language agents.Preprint, arXiv:2309.02427. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Universal and Transferable Adversarial Attacks on Aligned Language Models
From allies to adversaries: Manipulating llm tool-calling through adversarial injection. InPro- ceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Compu- tational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2009–2028. Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024....
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Each query must be a natural, first-person request as a real user would type it
-
[6]
Vary phrasing: mix direct requests, questions, complaints, and context-heavy descriptions
-
[7]
Vary specificity: some queries should be vague, others precise
-
[8]
Vary expertise level: some from novices, some from experienced operators
-
[9]
Last quarter
Do NOT mention any tool by name. Output each query on a separate line, numbered 1 through {batch_size}. No other text. Figure 6: Prompt for Potential Query Generation. Factual Seed Generation You are writing a technical knowledge-base entry for an AI operations assistant. Scenario:{scenario_description} Safe tool:{t_safe}: {safe_description} Risk tool:{t_...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.