Think Twice Before You Act: Protecting LLM Agents Against Tool Description Poisoning via Isolated Planning
Pith reviewed 2026-06-26 16:29 UTC · model grok-4.3
The pith
Isolating suspicious tool invocations on a quarantined list breaks persistent influence from poisoned descriptions in LLM agent planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tool-Guard defends against cross-tool description poisoning by isolating influence through a quarantined influenced list: when a tool invocation is detected as misaligned or suspicious, its poisoned description is prevented from steering subsequent planning steps, yet the tool remains available to support legitimate task execution.
What carries the argument
Isolated planning, the mechanism that places detected misaligned tool invocations onto an influenced list to break continuous influence from poisoned descriptions across planning steps.
If this is right
- Attack success rates drop substantially on the AgentDojo and ASB benchmarks.
- Task utility remains high because tools stay available after isolation.
- Poisoned descriptions lose their ability to exert continuous influence over later tool choices.
- The approach addresses the specific failure mode where existing defenses leave poisoned metadata active in planning context.
Where Pith is reading between the lines
- The same isolation pattern could limit metadata-based steering in multi-step agent workflows that share tool lists across agents.
- Detection rules for the influenced list might be tuned separately from utility goals to handle new poisoning variants without redesigning the full planner.
- Isolated planning could serve as a modular layer added to existing agent runtimes rather than requiring changes to the underlying LLM.
Load-bearing premise
Tool invocations can be reliably detected as misaligned or suspicious to trigger placement on the influenced list without harming legitimate utility.
What would settle it
An experiment in which a poisoned description still steers agent choices after the tool enters the influenced list, or in which legitimate tool use after isolation causes measurable task failure.
Figures
read the original abstract
The integration of external tools has substantially expanded the capabilities of large language model (LLM) agents, but it also introduces new attack surfaces beyond prompt injection. In particular, cross-tool description poisoning can manipulate planner-visible tool metadata to steer an agent's trajectory, even if the poisoned tool itself is never chosen. To understand the effectiveness of existing defenses against this emerging threat, we first evaluate several prompt-injection defenses and find that they transfer poorly to cross-tool description poisoning. A key observation is that poisoned descriptions persist in the planning context across steps, enabling continuous influence over subsequent tool choices. Building on this insight, we propose Tool-Guard, a novel system-level defense based on a new concept called isolated planning, in which tool invocations that are detected as misaligned or suspicious cause the corresponding tool to be placed in a quarantined list (the influenced list), breaking further influence from poisoned descriptions. With this influence isolated, the tool can continue to be used to support the task, enabling a robust defense that preserves legitimate tool utility. Experiments on the AgentDojo and ASB benchmarks show that Tool-Guard substantially reduces attack success while maintaining high task utility. Our code is available at https://github.com/shishishi123/Tool-Guard.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies cross-tool description poisoning as an attack on LLM agents where poisoned tool metadata influences planning even without selecting the poisoned tool. It evaluates that existing prompt-injection defenses transfer poorly, observes that poisoned descriptions persist across planning steps, and proposes Tool-Guard, which uses isolated planning: tool invocations detected as misaligned or suspicious are placed on an influenced (quarantined) list to break further influence while still allowing the tool to be used. Experiments on the AgentDojo and ASB benchmarks report that Tool-Guard substantially reduces attack success while preserving high task utility. Code is released at https://github.com/shishishi123/Tool-Guard.
Significance. If the results hold, the work is significant for addressing an emerging attack surface in LLM agent systems that extends beyond standard prompt injection. The system-level approach of isolating influence rather than blocking tools outright is a practical contribution that aims to balance security and utility. Explicit credit is due for the public code release, which supports reproducibility.
major comments (1)
- [Abstract] Abstract: the central mechanism of Tool-Guard relies on detecting 'misaligned or suspicious' tool invocations to populate the influenced list and thereby isolate poisoned descriptions. No algorithm, model, threshold, false-positive analysis, or accuracy metrics for this detection step are provided. This is load-bearing for the central claim because the reported reductions in attack success and maintained task utility on AgentDojo/ASB cannot be evaluated without knowing the reliability of the detection component; imperfect detection would either leave attacks active or unnecessarily quarantine legitimate tools.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying a key point regarding the detection component in Tool-Guard. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central mechanism of Tool-Guard relies on detecting 'misaligned or suspicious' tool invocations to populate the influenced list and thereby isolate poisoned descriptions. No algorithm, model, threshold, false-positive analysis, or accuracy metrics for this detection step are provided. This is load-bearing for the central claim because the reported reductions in attack success and maintained task utility on AgentDojo/ASB cannot be evaluated without knowing the reliability of the detection component; imperfect detection would either leave attacks active or unnecessarily quarantine legitimate tools.
Authors: We agree that the detection of misaligned or suspicious tool invocations is central to Tool-Guard and that its reliability directly affects the validity of the reported results. The current manuscript does not provide the algorithm, model, threshold, false-positive analysis, or accuracy metrics for this detection step. In the revised version we will add a dedicated subsection (in Section 4) that fully specifies the detection method, including any underlying model or heuristic, the exact decision rule and threshold(s), and an empirical evaluation of its accuracy, precision, recall, and false-positive rate on the AgentDojo and ASB benchmarks. This addition will allow readers to assess how detection errors would propagate to attack success and task utility. revision: yes
Circularity Check
No circularity: system architecture with no equations or self-referential derivations
full rationale
The paper describes a proposed defense system (Tool-Guard with isolated planning) without any mathematical derivation chain, equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation load-bearing steps or ansatz smuggling appear. The central contribution is an architectural description whose validity rests on empirical benchmarks rather than tautological definitions. The detection mechanism is underspecified, but that is an assumption gap, not circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introducing the Model Context Protocol (MCP) , year =
-
[2]
arXiv preprint arXiv:2508.14925 , year =
Wang, Zhiqiang and Gao, Yichao and Wang, Yanting and Liu, Suyuan and Sun, Haifeng and Cheng, Haoran and Shi, Guanquan and Du, Haohua and Li, Xiangyang , title =. arXiv preprint arXiv:2508.14925 , year =
-
[3]
Model Context Protocol -- OpenAI Agents Python Documentation , year =
-
[4]
2025 , note =
Aldridge, Nick and Ward, James , title =. 2025 , note =
2025
-
[5]
2025 , note =
Microsoft , title =. 2025 , note =
2025
-
[6]
arXiv preprint arXiv:2508.10991 , year=
Mcp-guard: A defense framework for model context protocol integrity in large language model applications , author=. arXiv preprint arXiv:2508.10991 , year=
-
[7]
, author =
Built to make you extraordinarily productive, Cursor is the best way to code with AI. , author =
-
[8]
Advances in Neural Information Processing Systems (NeurIPS) , year=
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
-
[9]
2025 , howpublished =
OpenAI Operator: Computer-Use Agent , author =. 2025 , howpublished =
2025
-
[10]
2025 , howpublished =
2025
-
[11]
Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , title =
Izzeddin Gur and Hiroki Furuta and Austin V. Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , title =. International Conference on Learning Representations (ICLR) , year =
-
[12]
2024 , howpublished =
GitHub Copilot: Your AI Pair Programmer , author =. 2024 , howpublished =
2024
-
[13]
Amazon introduces agentic AI across the seller experience, transforming how sellers manage their businesses , author =
-
[14]
VS Code AI Development Tools , author=
-
[15]
Function Calling and Other API Updates , author =
-
[16]
LangChain Tools Documentation , author =
-
[17]
arXiv preprint arXiv:2510.02554 , year=
ToolTweak: An Attack on Tool Selection in LLM-based Agents , author=. arXiv preprint arXiv:2510.02554 , year=
-
[18]
arXiv preprint arXiv:2508.02110 , year=
Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools , author=. arXiv preprint arXiv:2508.02110 , year=
-
[19]
arXiv preprint arXiv:2506.12104 , year=
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents , author=. arXiv preprint arXiv:2506.12104 , year=
-
[20]
arXiv preprint arXiv:2504.11703 , year=
Progent: Programmable privilege control for llm agents , author=. arXiv preprint arXiv:2504.11703 , year=
-
[21]
arXiv preprint arXiv:2503.18813 , year=
Defeating prompt injections by design , author=. arXiv preprint arXiv:2503.18813 , year=
-
[22]
arXiv preprint arXiv:2502.05174 , year=
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents , author=. arXiv preprint arXiv:2502.05174 , year=
-
[23]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
Ipiguard: A novel tool dependency graph-based defense against indirect prompt injection in llm agents , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[24]
Advances in Neural Information Processing Systems , volume=
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=
-
[25]
arXiv preprint arXiv:2410.02644 , year=
Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents , author=. arXiv preprint arXiv:2410.02644 , year=
-
[26]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
Mcip: Protecting mcp safety via model contextual integrity protocol , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[27]
2025 , month =
Introducing MCP-Scan: Protecting MCP with Invariant , author =. 2025 , month =
2025
-
[28]
arXiv preprint arXiv:2508.20412 , year=
Mindguard: Tracking, detecting, and attributing mcp tool poisoning attack via decision dependence graph , author=. arXiv preprint arXiv:2508.20412 , year=
-
[29]
arXiv preprint arXiv:2508.12538 , year=
Systematic analysis of mcp security , author=. arXiv preprint arXiv:2508.12538 , year=
-
[30]
2023 , howpublished =
The Dual LLM pattern for building AI assistants that can resist prompt injection , author =. 2023 , howpublished =
2023
-
[31]
2024 , month = oct, day =
Sander Schulhoff , title =. 2024 , month = oct, day =
2024
-
[32]
The eleventh international conference on learning representations , year=
React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=
-
[33]
arXiv preprint arXiv:2211.09527 , year=
Ignore previous prompt: Attack techniques for language models , author=. arXiv preprint arXiv:2211.09527 , year=
-
[34]
Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 , pages=
INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents , author=. Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 , pages=. 2024 , organization=
2024
-
[35]
33rd USENIX Security Symposium (USENIX Security 24) , pages=
Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=
-
[36]
arXiv preprint arXiv:2306.05499 , year=
Prompt injection attack against llm-integrated applications , author=. arXiv preprint arXiv:2306.05499 , year=
-
[37]
Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=
-
[38]
34th USENIX Security Symposium (USENIX Security 25) , pages=
\ StruQ \ : Defending against prompt injection with structured queries , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
-
[39]
Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , pages=
Secalign: Defending against prompt injection with preference optimization , author=. Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , pages=
2025
-
[40]
arXiv preprint arXiv:2312.06674 , year=
Llama guard: Llm-based input-output safeguard for human-ai conversations , author=. arXiv preprint arXiv:2312.06674 , year=
-
[41]
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
PIGuard: Prompt injection guardrail via mitigating overdefense for free , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[42]
2024 , month = jul, day =
2024
-
[43]
2024 , howpublished =
2024
-
[44]
2024 , month = oct, day =
2024
-
[45]
2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=
Jailbreaking black box large language models in twenty queries , author=. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=. 2025 , organization=
2025
-
[46]
Advances in Neural Information Processing Systems , volume=
Tree of attacks: Jailbreaking black-box llms automatically , author=. Advances in Neural Information Processing Systems , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.