When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents
Pith reviewed 2026-06-30 16:19 UTC · model grok-4.3
The pith
Tool description poisoning achieves nearly 100% attack success on leading LLM agents by manipulating metadata manuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Tool Description Poisoning (TDP), a semantic attack that covertly injects malicious instructions into a tool's descriptive metadata, exploits the planning layer of LLM agents that rely on the Model Context Protocol. The MCP-TDP benchmark demonstrates this threat through 32 real-world test cases, revealing severe vulnerabilities including nearly 100% attack success rates for leading models like GPT-4o in high-risk scenarios, the ineffectiveness of prompt-guardrail defenses, and the potential of Reactive Self-Correction as a defense mechanism.
What carries the argument
The MCP-TDP Security Benchmark, a high-fidelity sandbox environment with 32 realistic test cases spanning 6 risk categories, which measures how tool metadata poisoning affects LLM agent decision-making and execution.
If this is right
- LLM agents that base planning on tool metadata are exposed to semantic attacks that bypass code-level security.
- Prompt-guardrail defenses can increase rather than decrease susceptibility to tool description poisoning.
- Reactive Self-Correction enables agents to autonomously detect and revert malicious actions after they occur.
- The benchmark supplies a standardized method for testing and improving the cognitive security of agentic systems.
Where Pith is reading between the lines
- Agent developers may need to add independent verification steps for tool metadata sources separate from code execution.
- The results point to a need for agent architectures that maintain skepticism toward external tool descriptions even when they appear legitimate.
- Applying the benchmark in live deployments could identify additional risk categories not covered in the initial 32 cases.
Load-bearing premise
The 32 realistic test cases in the MCP-TDP benchmark accurately capture the threat model and decision-making behavior of LLM agents that rely on tool metadata.
What would settle it
Executing the 32 MCP-TDP test cases against GPT-4o and observing whether the attack success rate in the six high-risk scenarios remains near 100% or drops substantially.
Figures
read the original abstract
The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Tool Description Poisoning (TDP) is a novel semantic attack on LLM agents that injects malicious instructions into tool metadata rather than executable code under the Model Context Protocol (MCP). It introduces the MCP-TDP benchmark with 32 realistic real-world test cases spanning 6 risk categories, reports evaluation results on 8 mainstream LLMs with nearly 100% attack success rate (ASR) for GPT-4o in six high-risk scenarios, shows that common prompt-guardrail defenses are ineffective or counterproductive (termed the 'Firewall Fallacy'), and proposes Reactive Self-Correction as a post-execution defense mechanism.
Significance. If the benchmark cases prove representative of deployed agent behavior with real tool metadata, the work would be significant for highlighting an under-explored attack surface in the planning layer of tool-using LLM agents. Strengths include the empirical evaluation across eight models and the proposal of a concrete defense; the introduction of the first specialized TDP benchmark is a clear contribution to agent security research.
major comments (2)
- [Abstract and benchmark description] The central claim that TDP poses a practical threat with severe vulnerabilities (nearly 100% ASR) rests entirely on the 32 MCP-TDP test cases being representative of real agent decision-making from production tool metadata. However, no details are provided on sourcing, validation against actual MCP registries, or controls for prompt-engineering artifacts in case construction.
- [Evaluation section] The evaluation reports high ASR values and defense ineffectiveness without accompanying methodology for dataset construction, statistical controls, error analysis, or inter-rater validation of the 32 scenarios; this directly undermines confidence that the numbers support the broader conclusion of systemic weakness rather than benchmark-specific artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify gaps in methodological transparency that weaken support for the paper's claims. We address each point below and will make corresponding revisions.
read point-by-point responses
-
Referee: [Abstract and benchmark description] The central claim that TDP poses a practical threat with severe vulnerabilities (nearly 100% ASR) rests entirely on the 32 MCP-TDP test cases being representative of real agent decision-making from production tool metadata. However, no details are provided on sourcing, validation against actual MCP registries, or controls for prompt-engineering artifacts in case construction.
Authors: We agree that the manuscript does not supply explicit details on sourcing or validation. The 32 cases were developed by adapting real tool metadata patterns drawn from publicly documented MCP-compatible tools across the six risk categories, with poisoning elements modeled on plausible malicious instructions that could appear in production descriptions. To address the concern, the revised manuscript will add a dedicated 'Benchmark Construction' subsection that specifies the registries and documentation sources used, the validation steps taken to ensure fidelity to real metadata, and the controls applied to avoid prompt-engineering artifacts (e.g., restricting phrasing to observed natural-language variations). revision: yes
-
Referee: [Evaluation section] The evaluation reports high ASR values and defense ineffectiveness without accompanying methodology for dataset construction, statistical controls, error analysis, or inter-rater validation of the 32 scenarios; this directly undermines confidence that the numbers support the broader conclusion of systemic weakness rather than benchmark-specific artifacts.
Authors: We concur that the evaluation section lacks sufficient methodological detail. ASR figures were obtained by executing each scenario in the MCP-TDP sandbox, with multiple trials per model to reduce stochastic effects. The revision will expand this section to include: (1) the full dataset-construction methodology, (2) an error analysis of failure modes, (3) basic statistical controls such as trial counts and variability measures, and (4) a justification that inter-rater validation was not required because success/failure is determined by objective execution outcomes against predefined risk criteria rather than subjective interpretation. These additions will better substantiate the systemic conclusions. revision: yes
Circularity Check
Empirical benchmark study with no derivation chain or self-referential predictions
full rationale
The paper introduces the MCP-TDP benchmark consisting of 32 hand-specified test cases and reports direct empirical measurements (ASR values) of LLM behavior on those cases. No equations, fitted parameters, predictions derived from prior outputs, or load-bearing self-citations appear in the provided text. The central claims are observational results on the benchmark rather than quantities obtained by reducing a derivation to its own inputs. This matches the default expectation for an empirical benchmark paper and receives the lowest circularity score.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Integrating open-domain knowledge via large language model for multimodal fake news detection,
A. Xie, F. Zhuet al., “Integrating open-domain knowledge via large language model for multimodal fake news detection,” in2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2024, pp. 1917–1922
2024
-
[2]
The rise and potential of large language model based agents: A survey,
Z. Xi, W. Chenet al., “The rise and potential of large language model based agents: A survey,”Science China Information Sciences, vol. 68, no. 2, p. 121101, 2025
2025
-
[3]
A survey on large language model based autonomous agents,
L. Wang, C. Maet al., “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024
2024
-
[4]
Toolformer: Language models can teach themselves to use tools,
T. Schick, J. Dwivedi-Yuet al., “Toolformer: Language models can teach themselves to use tools,”Advances in Neural Information Processing Systems, vol. 36, pp. 68 539–68 551, 2023
2023
-
[5]
Model Context Protocol,
Anthropic, “Model Context Protocol,” https://modelcontextprotocol.io/ overview, 2024, accessed: 2025-07-15
2024
-
[6]
A survey of llm-driven ai agent communication: Protocols,
D. Kong, S. Linet al., “A survey of llm-driven ai agent communication: Protocols,”Security Risks, and Defense Countermeasures, 2025
2025
-
[7]
Tool learning with foundation models,
Y . Qin, S. Huet al., “Tool learning with foundation models,”ACM Computing Surveys, vol. 57, no. 4, pp. 1–40, 2024
2024
-
[8]
Not what you’ve signed up for: Com- promising real-world llm-integrated applications with indirect prompt injection,
K. Greshake, S. Abdelnabiet al., “Not what you’ve signed up for: Com- promising real-world llm-integrated applications with indirect prompt injection,” inProceedings of the 16th ACM workshop on artificial intelligence and security, 2023, pp. 79–90
2023
-
[9]
Commercial llm agents are already vulnerable to simple yet dangerous attacks,
A. Li, Y . Zhouet al., “Commercial llm agents are already vulnerable to simple yet dangerous attacks,”arXiv preprint arXiv:2502.08586, 2025
-
[10]
Agentdojo: A dynamic environment to evaluate attacks and defenses for llm agents,
E. Debenedetti, J. Zhang, M. Balunovi ´c, L. Beurer-Kellner, M. Fischer, and F. Tram`er, “Agentdojo: A dynamic environment to evaluate attacks and defenses for llm agents,”arXiv e-prints, pp. arXiv–2406, 2024
2024
-
[11]
AgentBench: Evaluating LLMs as Agents
X. Liu, H. Yuet al., “Agentbench: Evaluating llms as agents,”arXiv preprint arXiv:2308.03688, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Generating valid and natural adversarial examples with large language models,
Z. Wang, W. Wanget al., “Generating valid and natural adversarial examples with large language models,” in2024 27th International Con- ference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2024, pp. 1716–1721
2024
-
[13]
Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,
Z. Wang, Y . Gaoet al., “Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,”arXiv preprint arXiv:2508.14925, 2025
-
[14]
Mcp security bench (msb): Benchmarking attacks against model context protocol in llm agents,
D. Zhang, Z. Liet al., “Mcp security bench (msb): Benchmarking attacks against model context protocol in llm agents,”arXiv preprint arXiv:2510.15994, 2025
-
[15]
Y . Fu, X. Yuanet al., “Ras-eval: A comprehensive benchmark for security evaluation of llm agents in real-world environments,”arXiv preprint arXiv:2506.15253, 2025
-
[16]
Red-Teaming Coding Agents from a Tool-Invocation Perspective: An Empirical Security Assessment
Y . Xie, M. Luoet al., “On the security of tool-invocation prompts for llm-based agentic systems: An empirical risk assessment,”arXiv preprint arXiv:2509.05755, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Automatic red teaming llm-based agents with model context protocol tools,
P. He, C. Liet al., “Automatic red teaming llm-based agents with model context protocol tools,”arXiv preprint arXiv:2509.21011, 2025
-
[18]
Augmented Language Models: a Survey
G. Mialon, R. Dess `ıet al., “Augmented language models: a survey,” arXiv preprint arXiv:2302.07842, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Jailbroken: How does llm safety training fail?
A. Wei, N. Haghtalabet al., “Jailbroken: How does llm safety training fail?”Advances in Neural Information Processing Systems, vol. 36, pp. 80 079–80 110, 2023
2023
-
[20]
Tool learning with large language models: A survey,
C. Qu, S. Daiet al., “Tool learning with large language models: A survey,”Frontiers of Computer Science, vol. 19, no. 8, p. 198343, 2025
2025
-
[21]
The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise
M. Lupinacci, F. A. Pirontiet al., “The dark side of llms: Agent-based attacks for complete computer takeover,”arXiv preprint arXiv:2507.06850, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
OW ASP Top 10 for Large Language Model Applications,
OW ASP Foundation, “OW ASP Top 10 for Large Language Model Applications,” https://owasp.org/ www-project-top-10-for-large-language-model-applications/, 2023
2023
-
[23]
AI Agent Security Vulnerabilities: Tool Name Exploitation,
Enkrypt AI, “AI Agent Security Vulnerabilities: Tool Name Exploitation,” https://www.enkryptai.com/blog/ ai-agent-security-vulnerabilities-tool-name-exploitation, 2025
2025
-
[24]
React: Synergizing reasoning and acting in language models,
S. Yao, J. Zhaoet al., “React: Synergizing reasoning and acting in language models,” inThe eleventh international conference on learning representations, 2022
2022
-
[25]
Scenariofuzz-llm: Enhancing diversity in au- tonomous driving scenario fuzzing with llms,
S. Lin, F. Chenet al., “Scenariofuzz-llm: Enhancing diversity in au- tonomous driving scenario fuzzing with llms,” in2025 28th Interna- tional Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2025, pp. 1581–1586
2025
-
[26]
T. Gasmi, R. Guesmiet al., “Bridging ai and software security: A com- parative vulnerability assessment of llm agent deployment paradigms,” arXiv preprint arXiv:2507.06323, 2025
-
[27]
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M. Andriushchenko, A. Soulyet al., “Agentharm: A benchmark for measuring harmfulness of llm agents,”arXiv preprint arXiv:2410.09024, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Os-harm: A benchmark for measuring safety of computer use agents,
T. Kuntz, A. Duzanet al., “Os-harm: A benchmark for measuring safety of computer use agents,”arXiv preprint arXiv:2506.14866, 2025
-
[29]
Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control,
J. Lee, D. Hahmet al., “Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control,”arXiv preprint arXiv:2410.17520, 2024
-
[30]
Chain of preference optimization: Improving chain-of-thought reasoning in llms,
X. Zhang, C. Duet al., “Chain of preference optimization: Improving chain-of-thought reasoning in llms,”Advances in Neural Information Processing Systems, vol. 37, pp. 333–356, 2024
2024
-
[31]
Tree of thoughts: Deliberate problem solving with large language models,
S. Yao, D. Yuet al., “Tree of thoughts: Deliberate problem solving with large language models,”Advances in neural information processing systems, vol. 36, pp. 11 809–11 822, 2023
2023
-
[32]
Reflexion: Language agents with verbal reinforcement learning,
N. Shinn, F. Cassanoet al., “Reflexion: Language agents with verbal reinforcement learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023
2023
-
[33]
Unierase: Unlearning token as a universal erasure primitive for language models,
M. Yu, L. Linet al., “Unierase: Unlearning token as a universal erasure primitive for language models,”arXiv preprint arXiv:2505.15674, 2025
-
[34]
Memory enhanced context awareness for large language model based autonomous driving,
Y . Zheng, H. Zhanget al., “Memory enhanced context awareness for large language model based autonomous driving,” in2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2025, pp. 417–422
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.