pith. sign in

arxiv: 2605.26154 · v1 · pith:4BMHFWMQnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

Pith reviewed 2026-06-30 00:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agentsmemory poisoningtool selectionadversarial attackagent securitylong-term memorytool hijacking
0
0 comments X

The pith

Injecting three disguised records into an LLM agent's long-term memory lets attackers steer tool selection toward malicious choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents increasingly use long-term memory to refine which external tools they call for user tasks. This paper demonstrates that attackers can insert a small set of crafted records presented as facts, reports, or policies, causing the agent to infer and pick the attacker's preferred tool instead of the correct one. The attack avoids direct changes to tool descriptions, which makes it harder to catch through metadata checks. It succeeds across multiple benchmarks and agent designs even when some standard defenses are in place. The result matters because memory modules are meant to make agents more reliable over time, yet they create a new route for persistent manipulation.

Core claim

MemMorph works by injecting a small number of crafted records disguised as technical facts, incident reports, and operational policies into the agent's long-term memory. These records do not explicitly command a tool call; instead they reshape the agent's accumulated experience so that it autonomously infers and selects the tool the attacker wants. Experiments on three benchmarks, ten agent backbones, and three memory-module implementations show success rates up to 85.9 percent with only three injected records, beating the strongest baseline by up to 25 percent and remaining effective against three representative defenses.

What carries the argument

Poisoned long-term memory records that bias the agent's contextual perception and future tool-selection inferences.

If this is right

  • Tool-metadata auditing alone is insufficient to protect agents that accumulate experience in memory.
  • A small number of memory entries can produce lasting changes in tool-use policy across repeated tasks.
  • Existing defenses focused on prompt or metadata integrity leave memory-based attacks largely unaddressed.
  • Attack success remains high even when the agent backbone or memory module is varied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Memory integrity checks may need to become a standard layer in agent architectures, similar to input sanitization.
  • The same poisoning approach could be tested on other memory-augmented systems such as retrieval-augmented generation pipelines.
  • Persistent memory records create an attack surface that survives across sessions, suggesting the need for time-limited or verifiable memory entries.

Load-bearing premise

The injected records are accepted into the agent's long-term memory and then used to guide tool decisions without being filtered or detected.

What would settle it

Running the agent on the same task before and after the three crafted records are added to memory, and checking whether the agent now selects the attacker's chosen tool at a significantly higher rate.

Figures

Figures reproduced from arXiv: 2605.26154 by Bowen Shen, Haoran Ou, Kaiyu Zhou, Kwok-Yan Lam, Tianwei Zhang, Xuanye Zhang, Yongsen Zheng, Zhuqin Xu.

Figure 1
Figure 1. Figure 1: Overview of MemMorph. Once optimized poisoned memories are retrieved during a safety-critical task, it [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MemMorph remains highly effective under indirect poisoning that structured records survive W’s rewriting with minimal loss of attack potency and posing a practi￾cal threat under realistic injection conditions. 5.3 Analysis Persistence under continued operation. We simulate extended deployment by injecting n=3 records once and scaling |D| from 300 to 2,000 benign records while the agent continues accu￾mulat… view at source ↗
Figure 4
Figure 4. Figure 4: Transferability across embedding models. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional Results for §5.3 Factual Seed Generation You are simulating diverse users interacting with a tool-augmented AI assistant. Scenario: {scenario_description} Available tools: {tool_list} Generate {batch_size} diverse, realistic user queries that someone might send to the assistant in this scenario. Requirements: 1. Each query must be a natural, first-person request as a real user would type it. 2. … view at source ↗
Figure 6
Figure 6. Figure 6: Prompt for Potential Query Generation. Factual Seed Generation You are writing a technical knowledge-base entry for an AI operations assistant. Scenario: {scenario_description} Safe tool: {t_safe}: {safe_description} Risk tool: {t_risk}: {risk_description} Sample queries: {sample_queries} Write one knowledge-base entry as a single continuous paragraph with three parts: FRAME (1–2 sentences, <30 words): A n… view at source ↗
Figure 7
Figure 7. Figure 7: Prompts for Seed Generation. Target scenarios in MetaTool ▷ SCENARIO 1 — Data Management “task”: “Manage, organize, and securely store important information from conversations and tasks.” “tool_list”: [NotesTool, exportchat, MemoryTool, ChartTool, mini_habits] “tsafe”: NotesTool — “A full-featured reminder and to-do list management tool where you can add, delete, list, and mark reminders.” “trisk”: exportc… view at source ↗
Figure 8
Figure 8. Figure 8: Target Scenarios in MetaTool. Target scenarios in τ 2 -Bench ▷ SCENARIO 1 — Airline: Service Complaint Resolution “task”: “Handle a customer complaint about a flight delay, cancellation, or service issue and determine the appropriate resolution action.” “tool_list”: [get_user_details, get_reservation_details, search_direct_flight, search_onestop_flight, list_all_airports, book_reservation, cancel_reservati… view at source ↗
Figure 9
Figure 9. Figure 9: Target Scenarios in τ 2 -Bench. Target scenarios in ToolBench ▷ SCENARIO 1 — SMS Communications “task”: “Streamline SMS communications for effective business messaging and customer engage￾ment.” “tool_list”: [“Virtual Number - View SMS history”, “Zigatext - Global Bulk SMS & OTP - Check Balance”, “CallTrackingMetrics - List Numbers”, “CallTrackingMetrics - List Text Messages”, “MailSlurp Email Testing - ge… view at source ↗
Figure 10
Figure 10. Figure 10: Target Scenarios in ToolBench [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise such process, steering agents toward inappropriate/wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that bias tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MemMorph as the first attack to hijack tool selection in LLM-based agents by poisoning long-term memory with a small number (as few as three) of crafted records disguised as technical facts, incident reports, and operational policies. These records are claimed to reshape the agent's contextual perception, leading to autonomous selection of attacker-preferred tools. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations report up to 85.9% attack success rate (ASR), outperforming the strongest baseline by up to 25%, with retained potency under 3 representative defenses.

Significance. If the empirical results hold after addressing methodological gaps, the work identifies long-term memory as a previously under-explored attack surface in tool-augmented agents. This is significant because modern agents increasingly rely on memory to refine tool-selection policies, and the multi-benchmark, multi-backbone evaluation provides breadth. The paper's emphasis on disguised records (rather than direct metadata manipulation) and defense resilience adds to its potential impact on agent security.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation description: The central effectiveness claim (85.9% ASR with 3 records) rests on the unvalidated assumption that poisoned records are ingested into long-term memory, retrieved in future contexts, and causally alter tool decisions without triggering filtering, deduplication, or provenance checks. No concrete details are provided on the injection channel (direct DB write vs. agent-mediated append), memory-module validation behavior, or retrieval mechanics (e.g., embedding similarity), making this the load-bearing weakest link.
  2. [Experiments] Experimental results section: Success rates are reported without error bars, statistical significance tests, or full methodology on controls for the ingestion assumption. This limits verification of whether the measured ASR would collapse under even lightweight sanitization, directly undermining the cross-setup and cross-defense claims.
minor comments (2)
  1. [Abstract] The abstract and results lack explicit description of the three representative defenses and their implementation details, which would aid assessment of the 'retaining potency' claim.
  2. [Evaluation] Notation for attack success rate (ASR) and baseline comparisons should be defined consistently with tables or figures for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where greater methodological detail will strengthen the paper. We have revised the manuscript to address both major comments by expanding the methodology and results sections with concrete implementation details, statistical analyses, and additional controls.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] The central effectiveness claim (85.9% ASR with 3 records) rests on the unvalidated assumption that poisoned records are ingested into long-term memory, retrieved in future contexts, and causally alter tool decisions without triggering filtering, deduplication, or provenance checks. No concrete details are provided on the injection channel, memory-module validation behavior, or retrieval mechanics.

    Authors: We agree that the original submission lacked sufficient detail on these mechanisms. In the revised manuscript we have added Subsection 3.2, which specifies: (1) the injection channel as agent-mediated appends triggered by simulated user queries that cause the agent to update its long-term memory; (2) the validation and deduplication behavior of each of the three tested memory modules; and (3) the retrieval mechanics (embedding-based cosine similarity with explicit threshold). We also include example memory logs confirming successful ingestion and retrieval of the three crafted records. These additions make the ingestion assumption explicit and verifiable. revision: yes

  2. Referee: [Experiments] Success rates are reported without error bars, statistical significance tests, or full methodology on controls for the ingestion assumption. This limits verification of whether the measured ASR would collapse under even lightweight sanitization.

    Authors: We accept this criticism. The revised version reports mean ASR with standard-deviation error bars from five independent trials per configuration. We added paired t-tests with p-values comparing MemMorph against baselines. We also expanded the experimental methodology with explicit ingestion-control procedures (post-injection memory audits and retrieval logs) and evaluated the attack under two additional lightweight sanitization defenses (keyword-based filtering and provenance tagging). Updated results appear in a new Table 5 and Appendix C, showing that ASR remains above 70% under these controls. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation with no derivations or fitted predictions

full rationale

The paper is an empirical demonstration of a memory-poisoning attack on LLM agents. It reports measured attack success rates across benchmarks, agent backbones, and memory modules, without any claimed mathematical derivations, parameter fitting presented as prediction, or self-referential uniqueness theorems. The central results (e.g., 85.9% ASR with three records) are direct experimental outcomes, not reductions to inputs by construction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical security attack paper; contains no mathematical derivations, free parameters, axioms, or invented entities in the theoretical sense.

pith-pipeline@v0.9.1-grok · 5767 in / 1110 out tokens · 29645 ms · 2026-06-30T00:23:18.370512+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees

    cs.CR 2026-06 unverdicted novelty 7.0 partial

    Presents TMA-NM, a non-malleable origin-bound authority system for LLM-agent memory with TLA+ machine-checked separation theorems and benchmarks showing 0% attack success against direct and laundering poisoning while ...

  2. Agent-Native Immune System: Architecture, Taxonomy, and Engineering

    cs.AI 2026-06 unverdicted novelty 4.0

    Introduces ANIS as an endogenous, six-layer immune architecture for AI agents with taxonomy of viruses/vaccines and a meta-cognitive Harness Triad for continual adaptation.

Reference graph

Works this paper leans on

9 extracted references · 4 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems

    Memory injection attacks on llm agents via query-only interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys- tems. Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2023. News summarization and evaluation in the era of gpt-3.Preprint, arXiv:2209.12356. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten...

  2. [2]

    InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90

    Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90. Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith Lambert, Adam Amos-Binks, Zohreh Dannenhauer, and Dustin Dannenhauer. 2023. M...

  3. [3]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Reflexion: Language agents with verbal rein- forcement learning.Preprint, arXiv:2303.11366. Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2024. Cognitive architectures for language agents.Preprint, arXiv:2309.02427. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2...

  4. [4]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    From allies to adversaries: Manipulating llm tool-calling through adversarial injection. InPro- ceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Compu- tational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2009–2028. Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024....

  5. [5]

    Each query must be a natural, first-person request as a real user would type it

  6. [6]

    Vary phrasing: mix direct requests, questions, complaints, and context-heavy descriptions

  7. [7]

    Vary specificity: some queries should be vague, others precise

  8. [8]

    Vary expertise level: some from novices, some from experienced operators

  9. [9]

    Last quarter

    Do NOT mention any tool by name. Output each query on a separate line, numbered 1 through {batch_size}. No other text. Figure 6: Prompt for Potential Query Generation. Factual Seed Generation You are writing a technical knowledge-base entry for an AI operations assistant. Scenario:{scenario_description} Safe tool:{t_safe}: {safe_description} Risk tool:{t_...