Semantic Attacks on Tool-Augmented LLMs: Securing the Model Context Protocol Against Descriptor-Level Manipulation
Pith reviewed 2026-05-22 12:43 UTC · model grok-4.3
The pith
Tool descriptor manipulation in the Model Context Protocol can steer LLMs toward unsafe tool selections up to 36 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Descriptor manipulation in the Model Context Protocol creates a semantic attack surface that biases LLM tool selection. The work defines three attack classes: Tool Poisoning, Shadowing, and Rug Pull. A full-stack mitigation using descriptor integrity verification, auxiliary-LLM semantic vetting before context insertion, and lightweight runtime guardrails reduces unsafe invocations from as high as 36 percent to 15 percent and raises the block rate to 74 percent across GPT-5.3, DeepSeek-V3, and LLaMA-3.5 in controlled adversarial scenarios.
What carries the argument
Model Context Protocol tool descriptors as the attack vector, defended by a three-layer stack of integrity verification, pre-context auxiliary-LLM semantic vetting, and runtime guardrails.
If this is right
- Tool selection behavior in LLMs becomes more predictable once descriptor metadata receives integrity checks and semantic review.
- Robustness varies across model families and prompting styles, so defenses must be tuned per architecture.
- Secure tool use can be added to existing LLMs without retraining or architectural changes.
- Descriptor attacks form a threat category distinct from prompt injection that requires metadata-specific controls.
Where Pith is reading between the lines
- Descriptor manipulation risks likely appear in any agent framework that supplies capability or tool metadata to the model context.
- Chaining the auxiliary vetting model itself could create new attack paths worth testing in multi-stage setups.
- Future tool-calling standards may need built-in security fields in descriptors to reduce reliance on post-hoc vetting.
Load-bearing premise
Controlled lab scenarios that manually alter tool metadata accurately capture the capabilities and goals of realistic attackers.
What would settle it
Measure unsafe tool invocation rates when the same descriptor changes are applied inside a live production tool-calling system instead of simulated tests.
Figures
read the original abstract
The Model Context Protocol (MCP) enables Large Language Models (LLMs) to interact with external tools via tool descriptors, thereby extending their capabilities for task execution, autonomous decision-making, and multi-agent coordination. Existing MCP deployments treat tool descriptors as trusted metadata, despite their direct integration into the LLM reasoning context. This introduces a previously underexplored semantic attack surface. Current defenses primarily target prompt injection, neglecting descriptor-level manipulation that can bias tool selection and downstream reasoning. To address this gap, we formalize three descriptor-driven attack classes: Tool Poisoning, Shadowing, and Rug Pull. We propose a layered defense solution that integrates descriptor integrity verification, pre-context semantic vetting with an auxiliary LLM, and lightweight runtime guardrails, without requiring model retraining. We evaluate GPT-5.3, DeepSeek-V3, and LLaMA-3.5 across eight prompting strategies in controlled, adversarial MCP scenarios in which tool metadata is manipulated to simulate realistic attacks. Results demonstrate that descriptor manipulation can substantially alter tool-selection behavior, producing unsafe tool invocations in up to 36% of trials under baseline configurations. The proposed full-stack mitigation reduces unsafe invocations to 15% while increasing the block rate to 74%, demonstrating substantial improvement in resistance to descriptor-driven attacks. Cross-model analysis further reveals significant differences in robustness, latency, and sensitivity to descriptor-level manipulation across LLM architectures and prompting strategies. This study provides a controlled cross-model evaluation of descriptor-level threats and mitigation strategies in tool-calling LLM systems, establishing an empirical foundation for deploying secure and resilient tool-augmented LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that descriptor-level manipulations in the Model Context Protocol enable semantic attacks (Tool Poisoning, Shadowing, and Rug Pull) that can induce unsafe tool invocations in LLMs at rates up to 36% under baseline conditions. It proposes a full-stack mitigation combining descriptor integrity verification, auxiliary-LLM semantic vetting, and runtime guardrails that reduces unsafe invocations to 15% and raises the block rate to 74%. The evaluation covers GPT-5.3, DeepSeek-V3, and LLaMA-3.5 across eight prompting strategies in controlled adversarial scenarios with manually edited tool metadata, plus cross-model analysis of robustness and latency.
Significance. If the quantitative results hold after fuller experimental reporting and the attack scenarios are shown to be realizable under realistic deployment constraints, the work would be significant for LLM security. It identifies a previously neglected attack surface in tool-augmented systems and offers practical, training-free defenses. The cross-model comparison supplies useful data on architectural differences in susceptibility, which could guide secure MCP design.
major comments (3)
- [Evaluation] The abstract and evaluation report concrete figures (36% baseline unsafe invocations, reduction to 15%, 74% block rate) but supply no trial counts, statistical significance tests, confidence intervals, or explicit criteria for labeling an invocation unsafe. This information is required to evaluate reproducibility and reliability of the central empirical claims.
- [Threat Model and Attack Classes] The threat model and attack implementation grant the adversary complete, direct control over every field in the tool descriptor. The manuscript does not discuss or validate how an attacker would obtain this level of access in typical MCP deployments (authenticated registration, versioned registries, or external fetches), which is load-bearing for interpreting the reported deltas as realistic attack success rates rather than simulation artifacts.
- [Proposed Defense] The semantic-vetting layer relies on an auxiliary LLM whose own robustness to descriptor manipulation or prompt injection is not evaluated. If this auxiliary model can be compromised or biased, the mitigation stack's effectiveness would be undermined; this assumption is central to the defense claims.
minor comments (1)
- [Abstract] The eight prompting strategies are referenced but not enumerated or described in the abstract or early sections; a short list or pointer to the relevant subsection would aid readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments help clarify the presentation of our empirical results, threat model assumptions, and defense assumptions. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Evaluation] The abstract and evaluation report concrete figures (36% baseline unsafe invocations, reduction to 15%, 74% block rate) but supply no trial counts, statistical significance tests, confidence intervals, or explicit criteria for labeling an invocation unsafe. This information is required to evaluate reproducibility and reliability of the central empirical claims.
Authors: We agree that greater statistical transparency is needed. The full evaluation section describes experiments across GPT-5.3, DeepSeek-V3, and LLaMA-3.5 with eight prompting strategies, but we will revise to explicitly state the trial count per configuration (100 trials), report 95% confidence intervals, apply appropriate significance tests (e.g., McNemar’s test for paired comparisons), and provide a precise definition of “unsafe invocation” based on policy violation or match to adversarial tool signatures. These additions will be placed in a new subsection on experimental methodology. revision: yes
-
Referee: [Threat Model and Attack Classes] The threat model and attack implementation grant the adversary complete, direct control over every field in the tool descriptor. The manuscript does not discuss or validate how an attacker would obtain this level of access in typical MCP deployments (authenticated registration, versioned registries, or external fetches), which is load-bearing for interpreting the reported deltas as realistic attack success rates rather than simulation artifacts.
Authors: The threat model is intentionally scoped to descriptor-level manipulation once an attacker has write access to the tool metadata presented to the LLM. We will expand the threat-model section with a dedicated paragraph on realizability, citing realistic vectors such as compromised tool registries, malicious third-party tool providers in open ecosystems, and supply-chain attacks on external descriptor fetches. This will clarify that the reported success rates apply to deployments lacking strong authentication or integrity enforcement on tool registration, while noting that stronger registry controls would raise the bar for the attacker. revision: yes
-
Referee: [Proposed Defense] The semantic-vetting layer relies on an auxiliary LLM whose own robustness to descriptor manipulation or prompt injection is not evaluated. If this auxiliary model can be compromised or biased, the mitigation stack's effectiveness would be undermined; this assumption is central to the defense claims.
Authors: We acknowledge that the auxiliary LLM’s own susceptibility was not directly tested. In revision we will add a short analysis subsection that (a) evaluates the auxiliary model on the same descriptor-manipulation corpus and (b) discusses the layered design: even partial compromise of the auxiliary layer is mitigated by the preceding descriptor-integrity check and the subsequent runtime guardrails. We will also note that the auxiliary model can itself be hardened or replaced with a smaller, fine-tuned classifier if desired. revision: partial
Circularity Check
No circularity: empirical attack/mitigation rates are direct measurements
full rationale
The paper formalizes three attack classes (Tool Poisoning, Shadowing, Rug Pull) via controlled metadata edits, implements a layered defense (integrity verification + auxiliary LLM vetting + guardrails), and reports measured outcomes (unsafe invocations 36% baseline to 15%, block rate 74%) across GPT-5.3, DeepSeek-V3, and LLaMA-3.5 under eight prompting strategies. These percentages are obtained from explicit trial runs on manipulated descriptors; they are not obtained by fitting parameters to a subset and relabeling the fit as a prediction, nor by any self-referential definition or self-citation chain that would make the result equivalent to its inputs by construction. The evaluation setup is self-contained against the stated benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Tool descriptors are directly incorporated into the LLM reasoning context and treated as trusted metadata.
- domain assumption An auxiliary LLM can reliably detect semantic manipulation in tool descriptors without itself being compromised.
Forward citations
Cited by 5 Pith papers
-
Sealing the Audit-Runtime Gap for LLM Skills
SIGIL cryptographically seals the audit-runtime gap for LLM skills via an on-chain registry with four publication types, DAO vetting, and a runtime verification loader that enforces integrity and permissions.
-
MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
-
A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms
MCPSHIELD offers a threat taxonomy of 23 attack vectors, a labeled transition system verification model, and a defense-in-depth architecture claiming 91% coverage for MCP-based AI agents.
-
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP
The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.
-
CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems
CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...
Reference graph
Works this paper leans on
-
[1]
Server Tools — Model Context Protocol (MCP) Specification (Draft)
2024. Server Tools — Model Context Protocol (MCP) Specification (Draft). Online documentation. https: //modelcontextprotocol.info/specification/draft/server/tools/ Accessed on 2025-09-05
work page 2024
-
[2]
Samuel Aidoo and AML Int Dip. 2025. Cryptocurrency and Financial Crime: Emerging Risks and Regulatory Responses. (2025)
work page 2025
- [3]
-
[4]
S Akheel. 2025. Guardrails for large language models: A review of techniques and challenges.J Artif Intell Mach Learn & Data Sci3, 1 (2025), 2504–2512. J. ACM, Vol. 37, No. 4, Article 111. Publication date: November 2025. Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks 111:31
work page 2025
-
[5]
Anthropic. 2025. Our Framework for Developing Safe and Trustworthy Agents. Online article. https://www.anthropic. com/news/our-framework-for-developing-safe-and-trustworthy-agents
work page 2025
-
[6]
Luca Beurer-Kellner, Beat Buesser, Ana-Maria Creţu, Edoardo Debenedetti, Daniel Dobos, Daniel Fabian, Marc Fischer, David Froelicher, Kathrin Grosse, Daniel Naeff, et al. 2025. Design patterns for securing llm agents against prompt injections.arXiv preprint arXiv:2506.08837(2025)
- [7]
-
[8]
Gordon Owusu Boateng, Hani Sami, Ahmed Alagha, Hanae Elmekki, Ahmad Hammoud, Rabeb Mizouni, Azzam Mourad, Hadi Otrok, Jamal Bentahar, Sami Muhaidat, et al. 2025. A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions.IEEE Communications Surveys & Tutorials(2025)
work page 2025
-
[9]
Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al . 2024. When large language models meet personalization: Perspectives of challenges and opportunities.World Wide Web27, 4 (2024), 42
work page 2024
- [10]
- [11]
- [12]
-
[13]
Florencio Cano Gabarda. 2025. Model Context Protocol (MCP): Understanding Security Risks and Controls. https: //www.redhat.com/en/blog/model-context-protocol-mcp-understanding-security-risks-and-controls. Accessed: 2025-08-04
work page 2025
- [14]
-
[15]
Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E Hassan
-
[16]
Model context protocol (mcp) at first glance: Studying the security and maintainability of mcp servers.arXiv preprint arXiv:2506.13538(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Mahbub Hassan, Md Emtiaz Kabir, Muzammil Jusoh, Hong Ki An, Michael Negnevitsky, and Chengjiang Li. 2025. Large Language Models in Transportation: A Comprehensive Bibliometric Analysis of Emerging Trends, Challenges and Future Research.IEEE Access(2025)
work page 2025
-
[19]
Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model context protocol (mcp): Landscape, security threats, and future research directions.arXiv preprint arXiv:2503.23278(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [20]
-
[21]
2024.Automotive security solution using hardware security module (HSM)
Arvind Kumar, Ashish Gholve, and Kedar Kotalwar. 2024.Automotive security solution using hardware security module (HSM). Technical Report. SAE Technical Paper
work page 2024
- [22]
- [23]
- [24]
-
[25]
Anne Lott and Jerome P Reiter. 2020. Wilson confidence intervals for binomial proportions with multiple imputation for missing data.The American Statistician74, 2 (2020), 109–115
work page 2020
-
[26]
Weiqin Ma, Pu Duan, Sanmin Liu, Guofei Gu, and Jyh-Charn Liu. 2012. Shadow attacks: automatically evading system-call-behavior based malware detection.Journal in Computer Virology8, 1 (2012), 1–13
work page 2012
-
[27]
Shreekant Mandvikar. 2023. Augmenting intelligent document processing (IDP) workflows with contemporary large language models (LLMs).International Journal of Computer Trends and Technology71, 10 (2023), 80–91
work page 2023
-
[28]
Jeremy McHugh, Kristina Šekrst, and Jon Cefalu. 2025. Prompt Injection 2.0: Hybrid AI Threats.arXiv preprint arXiv:2507.13169(2025). J. ACM, Vol. 37, No. 4, Article 111. Publication date: November 2025. 111:32 Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh, and Naser Ezzati-Jivan
- [29]
-
[30]
Thanh Toan Nguyen, Nguyen Quoc Viet Hung, Thanh Tam Nguyen, Thanh Trung Huynh, Thanh Thi Nguyen, Matthias Weidlich, and Hongzhi Yin. 2024. Manipulating recommender systems: A survey of poisoning attacks and countermeasures.Comput. Surveys57, 1 (2024), 1–39
work page 2024
-
[31]
Esezi Isaac Obilor and Eric Chikweru Amadi. 2018. Test for significance of Pearson’s correlation coefficient.International Journal of Innovative Mathematics, Statistics & Energy Policies6, 1 (2018), 11–23
work page 2018
-
[32]
János Pintz. 2007. Cramér vs. Cramér. On Cramér’s probabilistic model for primes.Functiones et Approximatio Commentarii Mathematici37, 2 (2007), 361–376
work page 2007
- [33]
-
[34]
Partha Pratim Ray. 2025. A survey on model context protocol: Architecture, state-of-the-art, challenges and future directions.Authorea Preprints(2025)
work page 2025
- [35]
-
[36]
Oleksii I Sheremet, Oleksandr V Sadovoi, Kateryna S Sheremet, and Yuliia V Sokhina. 2024. Effective documentation practices for enhancing user interaction through GPT-powered conversational interfaces.Applied Aspects of Information Technology7, 2 (2024), 135–150
work page 2024
-
[37]
Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. 2025. A survey of the model context protocol (mcp): Standardizing context to enhance large language models (llms). (2025)
work page 2025
-
[38]
Lars St, Svante Wold, et al. 1989. Analysis of variance (ANOVA).Chemometrics and intelligent laboratory systems6, 4 (1989), 259–272
work page 1989
-
[39]
Tal Shapira / Reco.ai. 2025. MCP Security: Key Risks, Controls & Best Practices Explained. Online article. https: //www.reco.ai/learn/mcp-security Updated August 7, 2025; accessed September 5, 2025
work page 2025
-
[40]
Zhibo Wang, Jingjing Ma, Xue Wang, Jiahui Hu, Zhan Qin, and Kui Ren. 2022. Threats to training: A survey of poisoning attacks and defenses on machine learning systems.Comput. Surveys55, 7 (2022), 1–36
work page 2022
- [41]
-
[42]
2012.Experimentation in software engineering
Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, and Anders Wesslén. 2012.Experimentation in software engineering. Springer Science & Business Media
work page 2012
-
[43]
Andreas Wortmann. 2016.An extensible component & connector architecture description infrastructure for multi-platform modeling. Vol. 25. Shaker Verlag GmbH. J. ACM, Vol. 37, No. 4, Article 111. Publication date: November 2025
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.