Machine Learning-Based Detection of MCP Attacks
Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3
The pith
Machine learning models classify malicious MCP tool descriptions with up to 100 percent F1-score and outperform rule-based detectors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised machine learning models, including support vector classifiers and BERT, can distinguish malicious from benign MCP tool descriptions with F1-scores reaching 100 percent in binary classification and around 90 percent in multiclass settings that also label attack type, while a developed middleware blocks unsafe tools before execution and the learned detectors surpass the performance of current rule-based solutions.
What carries the argument
Supervised classifiers trained on labeled collections of MCP tool descriptions to perform binary malicious-versus-benign detection and multiclass attack-type identification.
If this is right
- A middleware layer can now filter MCP tools in real time by classifying their descriptions before execution.
- Text-based machine learning detection provides a stronger practical alternative to the rule-based methods already deployed in the field.
- Confusion matrices reveal error patterns that help choose the right model for different deployment tolerances.
- The same modeling approach can be applied to other emerging LLM tool protocols that rely on descriptive text.
Where Pith is reading between the lines
- If the textual features remain stable, the same models could be retrained periodically on fresh attack examples without major redesign.
- The success on description text alone suggests that deeper execution monitoring may not be required for initial filtering.
- Deployment across multiple LLM platforms would require only a shared classifier service rather than per-tool rules.
Load-bearing premise
The collected set of malicious and benign MCP tool descriptions is representative of real attacks and the trained models will correctly classify new descriptions not seen in training.
What would settle it
A collection of previously unseen malicious MCP tool descriptions on which the models produce low detection rates or high false-positive rates on benign ones.
Figures
read the original abstract
The Model Context Protocol (MCP) is a new and emerging technology that extends the functionality of large language models, improving workflows but also exposing users to a new attack surface. Several studies have highlighted related security flaws, but MCP attack detection remains underexplored. To address this research gap, this study develops and evaluates a range of supervised machine learning approaches, including both traditional and deep-learning models. We evaluated the systems on the detection of malicious MCP tool descriptions in two scenarios: (1) a binary classification task distinguishing malicious from benign tools, and (2) a multiclass classification task identifying the attack type while separating benign from malicious tools. In addition to the machine learning models, we compared a rule-based approach that serves as a baseline. The results indicate that several of the developed models achieved 100\% F1-score on the binary classification task. In the multiclass scenario, the SVC and BERT models performed best, achieving F1 scores of 90.56\% and 88.33\%, respectively. Confusion matrices were also used to visualize the full distribution of predictions often missed by traditional metrics, providing additional insight for selecting the best-fitting solution in real-world scenarios. This study presents an addition to the MCP defence area, showing that machine learning models can perform exceptionally well in separating malicious and benign data points. To apply the solution in a live environment, a middleware was developed to classify which MCP tools are safe to use before execution, and block the ones that are not safe. Furthermore, the study shows that these models can outperform traditional rule-based solutions currently in use in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops supervised ML models (including BERT and SVC) for binary classification of malicious vs. benign MCP tool descriptions and multiclass identification of attack types. It reports 100% F1 on binary classification for several models, 90.56% and 88.33% F1 for SVC and BERT on multiclass, compares against a rule-based baseline, visualizes results with confusion matrices, and implements a middleware for real-time blocking of unsafe tools.
Significance. If the high performance generalizes beyond the evaluated data, the work provides a practical addition to defenses for the emerging MCP attack surface in LLM tool use, with the middleware offering deployable value and the empirical comparison to rule-based methods highlighting potential advantages of ML approaches.
major comments (3)
- [Abstract] Abstract and evaluation methodology: the claim of 100% F1-score on binary classification provides no dataset cardinality, class balance, sourcing of malicious tool descriptions (author-generated, scraped, or from incidents), cross-validation procedure, or train/test split details. This absence directly undermines verification of the performance numbers and the generalization assertion to novel attacks.
- [Results] Results and baseline comparison: while ML F1 scores are highlighted, the specific performance of the rule-based baseline is not quantified, preventing assessment of the claim that the models 'outperform traditional rule-based solutions currently in use in the field.'
- [Evaluation] Generalization and representativeness: the multiclass results and middleware deployment rest on the assumption that the collected dataset is representative of real-world MCP attacks, but no description of attack generation or external validation set is provided, leaving the robustness to unseen descriptions unestablished.
minor comments (2)
- [Abstract] The abstract mentions confusion matrices for additional insight but does not discuss specific patterns of misclassifications that would guide model selection in practice.
- [Introduction] A brief expansion on the MCP protocol and typical tool description format would improve accessibility for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments have highlighted areas where the manuscript can be strengthened for clarity and rigor. We have revised the paper to incorporate additional details on the dataset, methodology, baseline performance, and limitations as outlined below.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation methodology: the claim of 100% F1-score on binary classification provides no dataset cardinality, class balance, sourcing of malicious tool descriptions (author-generated, scraped, or from incidents), cross-validation procedure, or train/test split details. This absence directly undermines verification of the performance numbers and the generalization assertion to novel attacks.
Authors: We agree that the original abstract and methodology sections were insufficiently detailed on these points, which limits independent verification. In the revised manuscript we have added a dedicated Dataset section that specifies the total number of samples, class balance, sourcing and generation process for the malicious tool descriptions, the cross-validation procedure, and the train/test split ratios. These additions directly support verification of the reported F1 scores and provide context for the generalization claims based on the held-out evaluation data. revision: yes
-
Referee: [Results] Results and baseline comparison: while ML F1 scores are highlighted, the specific performance of the rule-based baseline is not quantified, preventing assessment of the claim that the models 'outperform traditional rule-based solutions currently in use in the field.'
Authors: We acknowledge that the rule-based baseline results were described only qualitatively. The revised results section now includes the explicit F1-score achieved by the rule-based baseline, along with updated tables and discussion that allow direct quantitative comparison with the ML models and substantiate the outperformance claim. revision: yes
-
Referee: [Evaluation] Generalization and representativeness: the multiclass results and middleware deployment rest on the assumption that the collected dataset is representative of real-world MCP attacks, but no description of attack generation or external validation set is provided, leaving the robustness to unseen descriptions unestablished.
Authors: We agree that explicit description of attack generation and discussion of generalization are needed. The revised manuscript adds a subsection detailing the attack generation process used to construct the malicious samples. While no separate external real-world validation set was available, the evaluation used a held-out test set; we have added a limitations paragraph acknowledging that robustness to entirely novel attack descriptions outside the dataset distribution remains an open question for future work. revision: partial
Circularity Check
No circularity: purely empirical ML evaluation on held-out classification tasks
full rationale
The paper describes training and evaluating standard supervised classifiers (BERT, SVC, etc.) plus a rule-based baseline on a binary and multiclass task involving MCP tool descriptions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. Performance numbers (F1 scores) are reported as measured results on the evaluation scenarios rather than being algebraically forced by the inputs or prior self-work. The work is self-contained empirical ML research with no load-bearing steps that reduce to self-definition or construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Labeled training data for malicious and benign MCP tool descriptions is available and representative.
Reference graph
Works this paper leans on
-
[1]
Introducing the model context proto- col,
Anthropic, “Introducing the model context proto- col,” 2024. https://www.anthropic.com/news/model- context-protocol
work page 2024
-
[2]
Systematic analysis of mcp security,
Y . Guo, P. Liu, W. Ma, Z. Deng, X. Zhu, P. Di, X. Xiao, and S. Wen, “System- atic analysis of mcp security.” arXiv, 2025. https://arxiv.org/abs/2508.12538
-
[3]
G. Alkhodhairy and K. Saleem, “Machine learning algorithm for detecting suspicious email messages using natural language processing nlp,”Alexandria Engineering Journal, vol. 128, pp. 153–165, 2025
work page 2025
-
[4]
I. Androutsopoulos, J. Koutsias, K. V . Chandrinos, and C. D. Spyropoulos, “An experimental compari- son of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages,” 2000
work page 2000
-
[5]
A hybrid machine learning ap- proach for securing emails: Phishing detection and prevention,
A. S and S. P. G, “A hybrid machine learning ap- proach for securing emails: Phishing detection and prevention,” in2025 3rd International Conference on Advancements in Electrical, Electronics, Com- munication, Computing and Automation (ICAECA), pp. 1–6, 2025
work page 2025
-
[6]
Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,
Z. Wang, Y . Gao, Y . Wang, S. Liu, H. Sun, H. Cheng, G. Shi, H. Du, and X. Li, “Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,” 2025
work page 2025
-
[7]
Email spam detection using deep learning approach,
K. Debnath and N. Kar, “Email spam detection using deep learning approach,” in2022 International Con- ference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON), vol. 1, pp. 37– 41, 2022
work page 2022
-
[8]
Outline of a design science research process,
P. Offermann, O. Levina, M. Schönherr, and U. Bub, “Outline of a design science research process,” 01 2009
work page 2009
-
[9]
C. Wohlin and P. Runeson, “Guiding the selection of research methodology in industry–academia col- laboration in software engineering,”Information and Software Technology, vol. 140, p. 106678, 2021
work page 2021
-
[10]
Creswell,Research Design: Qualitative, Quanti- 12 tative, and Mixed Methods Approaches
J. Creswell,Research Design: Qualitative, Quanti- 12 tative, and Mixed Methods Approaches. SAGE Pub- lications, 2009
work page 2009
-
[11]
M. Berndtsson, J. Hansson, B. Olsson, and B. Lun- dell,Thesis Projects: A Guide for Students in Com- puter Science and Information Systems. 01 2008
work page 2008
-
[12]
Application of a case study methodol- ogy,
W. Tellis, “Application of a case study methodol- ogy,”The Qualitative Report, 09 1997
work page 1997
-
[13]
L. Herckis, “Implementation science for software engineering: bridging the gap between research and practice (keynote),” pp. 4–4, 09 2018
work page 2018
-
[14]
Exploring experimental research: Method- ologies, designs, and applications across disci- plines,
S. Em, “Exploring experimental research: Method- ologies, designs, and applications across disci- plines,”SSRN Electronic Journal, pp. 1–9, 03 2024
work page 2024
-
[15]
Middleware for llms: Tools are instrumental for language agents in complex environments,
Y . Gu, Y . Shu, H. Yu, X. Liu, Y . Dong, J. Tang, J. Srinivasa, H. Latapie, and Y . Su, “Middleware for llms: Tools are instrumental for language agents in complex environments,” pp. 7646–7663, 01 2024
work page 2024
-
[16]
Supervised machine learning: A review of classification techniques.,
S. Kotsiantis, “Supervised machine learning: A review of classification techniques.,”Informatica (Slovenia), vol. 31, pp. 249–268, 01 2007
work page 2007
- [17]
-
[18]
Anthropic, “Fastmcp webpage.” https://gofastmcp.com/getting-started/welcome
-
[19]
Semantics at an angle: When cosine simi- larity works until it doesn’t,
K. You, “Semantics at an angle: When cosine simi- larity works until it doesn’t,” 2025
work page 2025
-
[20]
Un- derstanding back-translation at scale,
S. Edunov, M. Ott, M. Auli, and D. Grangier, “Un- derstanding back-translation at scale,” 2018
work page 2018
-
[21]
The impact of linguistic distance from english on economic growth: A cross-country analysis,
Z. Ozkok, B. Malloy, and A. Rowe, “The impact of linguistic distance from english on economic growth: A cross-country analysis,”International Journal of Economics and Financial Issues, vol. 12, p. 1–15, Mar. 2022
work page 2022
-
[22]
A. Swanda, A. Chang, A. Chen, F. Burch, P. Kas- sianik, and K. Berlin, “A framework for rapidly de- veloping and deploying protection against large lan- guage model attacks,” 2025
work page 2025
-
[23]
Bert: Pre-training of deep bidirectional transform- ers for language understanding,
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transform- ers for language understanding,” 2019
work page 2019
-
[24]
A fine-tuned bert-based transfer learning approach for text classification,
R. Qasim, W. Bangyal, M. Alqarni, and A. Almazroi, “A fine-tuned bert-based transfer learning approach for text classification,”Journal of Healthcare Engi- neering, vol. 2022, pp. 1–17, 01 2022
work page 2022
-
[25]
Effec- tive few-shot classification with transfer learning,
A. Gupta, K. Thadani, and N. O’Hare, “Effec- tive few-shot classification with transfer learning,” pp. 1061–1066, 01 2020
work page 2020
-
[26]
M. Awad and R. Khanna,Support Vector Machines for Classification, pp. 39–66. 04 2015
work page 2015
-
[27]
Svm kernel func- tions for classification,
A. Patle and D. S. Chouhan, “Svm kernel func- tions for classification,” in2013 International Con- ference on Advances in Technology and Engineering (ICATE), pp. 1–9, 2013
work page 2013
-
[28]
R. Kohavi and F. Provost, “Glossary of terms. spe- cial issue of applications of machine learning and the knowledge discovery process,”Mach. Learn., vol. 30, 01 1998
work page 1998
-
[29]
F. J. Valverde-Albacete and C. Peláez-Moreno, “100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox,”PloS one, vol. 9, no. 1, p. e84217, 2014
work page 2014
-
[30]
Calculating precision and recall for multiclass classification using confusion matrix,
kiwkandmd, “Calculating precision and recall for multiclass classification using confusion matrix,”
-
[31]
https://www.geeksforgeeks.org/machine- learning/calculating-precision-and-recall-for- multiclass-classification-using-confusion-matrix/
-
[32]
Metrics for multi-class classification: an overview,
M. Grandini, E. Bagli, and G. Visani, “Metrics for multi-class classification: an overview,” 2020
work page 2020
-
[33]
Axiomatic attribution for deep networks,
M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” 2017
work page 2017
-
[34]
Malware detection using automated gener- ation of yara rules on dynamic features,
Q. Si, H. Xu, Y . Tong, Y . Zhou, J. Liang, L. Cui, and Z. Hao, “Malware detection using automated gener- ation of yara rules on dynamic features,” inScience of Cyber Security(C. Su, K. Sakurai, and F. Liu, eds.), (Cham), pp. 315–330, Springer International Publishing, 2022
work page 2022
-
[35]
T. L. Foundation, “Example clients,” 2026. https://modelcontextprotocol.io/clients#feature- support-matrix
work page 2026
-
[36]
Claude skills: Teach claude your way of working,
Anthropic, “Claude skills: Teach claude your way of working,” 2026. https://claude.com/skills. 13 A Appendix Figure 8: Visualization of the tokens that influence the BERT model’s decisions. 14
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.