pith. sign in

arxiv: 2604.10534 · v1 · submitted 2026-04-12 · 💻 cs.CR · cs.AI· cs.SE

Machine Learning-Based Detection of MCP Attacks

Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.SE
keywords Machine LearningMCPAttack DetectionModel Context ProtocolSupervised ClassificationSecurityTool DescriptionsBinary Classification
0
0 comments X

The pith

Machine learning models classify malicious MCP tool descriptions with up to 100 percent F1-score and outperform rule-based detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and tests a range of supervised machine learning models to identify attacks on the Model Context Protocol, which lets large language models call external tools. It evaluates the models on two tasks: separating malicious tool descriptions from benign ones, and further identifying the specific attack type. Several models reach perfect scores on the binary task while the best multiclass performers exceed a rule-based baseline. A working middleware is built to apply the classifier before any tool runs. The approach treats the text of tool descriptions as the key signal for defense.

Core claim

Supervised machine learning models, including support vector classifiers and BERT, can distinguish malicious from benign MCP tool descriptions with F1-scores reaching 100 percent in binary classification and around 90 percent in multiclass settings that also label attack type, while a developed middleware blocks unsafe tools before execution and the learned detectors surpass the performance of current rule-based solutions.

What carries the argument

Supervised classifiers trained on labeled collections of MCP tool descriptions to perform binary malicious-versus-benign detection and multiclass attack-type identification.

If this is right

  • A middleware layer can now filter MCP tools in real time by classifying their descriptions before execution.
  • Text-based machine learning detection provides a stronger practical alternative to the rule-based methods already deployed in the field.
  • Confusion matrices reveal error patterns that help choose the right model for different deployment tolerances.
  • The same modeling approach can be applied to other emerging LLM tool protocols that rely on descriptive text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the textual features remain stable, the same models could be retrained periodically on fresh attack examples without major redesign.
  • The success on description text alone suggests that deeper execution monitoring may not be required for initial filtering.
  • Deployment across multiple LLM platforms would require only a shared classifier service rather than per-tool rules.

Load-bearing premise

The collected set of malicious and benign MCP tool descriptions is representative of real attacks and the trained models will correctly classify new descriptions not seen in training.

What would settle it

A collection of previously unseen malicious MCP tool descriptions on which the models produce low detection rates or high false-positive rates on benign ones.

Figures

Figures reproduced from arXiv: 2604.10534 by Anton Borg, Ricardo Britto, Samuel Nyberg, Tobias Mattsson.

Figure 1
Figure 1. Figure 1: Design flow of project structure 3.2 Problem Identification The initial phase of a research project involves acquiring a comprehensive understanding of the relevant field, grasp￾ing existing knowledge, and identifying knowledge gaps 3 [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Class distribution of malicious MCP-tox dataset [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Flowchart for middleware when intercepting MCP-tool call [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cosine Similarity matrix after class concatena [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cosine similarity distance between classes [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Confusion matrices for binary classification. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Confusion matrices for multiclass classification. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of the tokens that influence the BERT model’s decisions. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
read the original abstract

The Model Context Protocol (MCP) is a new and emerging technology that extends the functionality of large language models, improving workflows but also exposing users to a new attack surface. Several studies have highlighted related security flaws, but MCP attack detection remains underexplored. To address this research gap, this study develops and evaluates a range of supervised machine learning approaches, including both traditional and deep-learning models. We evaluated the systems on the detection of malicious MCP tool descriptions in two scenarios: (1) a binary classification task distinguishing malicious from benign tools, and (2) a multiclass classification task identifying the attack type while separating benign from malicious tools. In addition to the machine learning models, we compared a rule-based approach that serves as a baseline. The results indicate that several of the developed models achieved 100\% F1-score on the binary classification task. In the multiclass scenario, the SVC and BERT models performed best, achieving F1 scores of 90.56\% and 88.33\%, respectively. Confusion matrices were also used to visualize the full distribution of predictions often missed by traditional metrics, providing additional insight for selecting the best-fitting solution in real-world scenarios. This study presents an addition to the MCP defence area, showing that machine learning models can perform exceptionally well in separating malicious and benign data points. To apply the solution in a live environment, a middleware was developed to classify which MCP tools are safe to use before execution, and block the ones that are not safe. Furthermore, the study shows that these models can outperform traditional rule-based solutions currently in use in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops supervised ML models (including BERT and SVC) for binary classification of malicious vs. benign MCP tool descriptions and multiclass identification of attack types. It reports 100% F1 on binary classification for several models, 90.56% and 88.33% F1 for SVC and BERT on multiclass, compares against a rule-based baseline, visualizes results with confusion matrices, and implements a middleware for real-time blocking of unsafe tools.

Significance. If the high performance generalizes beyond the evaluated data, the work provides a practical addition to defenses for the emerging MCP attack surface in LLM tool use, with the middleware offering deployable value and the empirical comparison to rule-based methods highlighting potential advantages of ML approaches.

major comments (3)
  1. [Abstract] Abstract and evaluation methodology: the claim of 100% F1-score on binary classification provides no dataset cardinality, class balance, sourcing of malicious tool descriptions (author-generated, scraped, or from incidents), cross-validation procedure, or train/test split details. This absence directly undermines verification of the performance numbers and the generalization assertion to novel attacks.
  2. [Results] Results and baseline comparison: while ML F1 scores are highlighted, the specific performance of the rule-based baseline is not quantified, preventing assessment of the claim that the models 'outperform traditional rule-based solutions currently in use in the field.'
  3. [Evaluation] Generalization and representativeness: the multiclass results and middleware deployment rest on the assumption that the collected dataset is representative of real-world MCP attacks, but no description of attack generation or external validation set is provided, leaving the robustness to unseen descriptions unestablished.
minor comments (2)
  1. [Abstract] The abstract mentions confusion matrices for additional insight but does not discuss specific patterns of misclassifications that would guide model selection in practice.
  2. [Introduction] A brief expansion on the MCP protocol and typical tool description format would improve accessibility for readers outside the immediate subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments have highlighted areas where the manuscript can be strengthened for clarity and rigor. We have revised the paper to incorporate additional details on the dataset, methodology, baseline performance, and limitations as outlined below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation methodology: the claim of 100% F1-score on binary classification provides no dataset cardinality, class balance, sourcing of malicious tool descriptions (author-generated, scraped, or from incidents), cross-validation procedure, or train/test split details. This absence directly undermines verification of the performance numbers and the generalization assertion to novel attacks.

    Authors: We agree that the original abstract and methodology sections were insufficiently detailed on these points, which limits independent verification. In the revised manuscript we have added a dedicated Dataset section that specifies the total number of samples, class balance, sourcing and generation process for the malicious tool descriptions, the cross-validation procedure, and the train/test split ratios. These additions directly support verification of the reported F1 scores and provide context for the generalization claims based on the held-out evaluation data. revision: yes

  2. Referee: [Results] Results and baseline comparison: while ML F1 scores are highlighted, the specific performance of the rule-based baseline is not quantified, preventing assessment of the claim that the models 'outperform traditional rule-based solutions currently in use in the field.'

    Authors: We acknowledge that the rule-based baseline results were described only qualitatively. The revised results section now includes the explicit F1-score achieved by the rule-based baseline, along with updated tables and discussion that allow direct quantitative comparison with the ML models and substantiate the outperformance claim. revision: yes

  3. Referee: [Evaluation] Generalization and representativeness: the multiclass results and middleware deployment rest on the assumption that the collected dataset is representative of real-world MCP attacks, but no description of attack generation or external validation set is provided, leaving the robustness to unseen descriptions unestablished.

    Authors: We agree that explicit description of attack generation and discussion of generalization are needed. The revised manuscript adds a subsection detailing the attack generation process used to construct the malicious samples. While no separate external real-world validation set was available, the evaluation used a held-out test set; we have added a limitations paragraph acknowledging that robustness to entirely novel attack descriptions outside the dataset distribution remains an open question for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical ML evaluation on held-out classification tasks

full rationale

The paper describes training and evaluating standard supervised classifiers (BERT, SVC, etc.) plus a rule-based baseline on a binary and multiclass task involving MCP tool descriptions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. Performance numbers (F1 scores) are reported as measured results on the evaluation scenarios rather than being algebraically forced by the inputs or prior self-work. The work is self-contained empirical ML research with no load-bearing steps that reduce to self-definition or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields no concrete free parameters, axioms, or invented entities beyond the generic assumption that labeled examples of malicious/benign MCP descriptions exist and can be used for supervised training.

axioms (1)
  • domain assumption Labeled training data for malicious and benign MCP tool descriptions is available and representative.
    Required for any supervised classification result reported in the abstract.

pith-pipeline@v0.9.0 · 5590 in / 1047 out tokens · 32491 ms · 2026-05-10T16:00:31.840784+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1]

    Introducing the model context proto- col,

    Anthropic, “Introducing the model context proto- col,” 2024. https://www.anthropic.com/news/model- context-protocol

  2. [2]

    Systematic analysis of mcp security,

    Y . Guo, P. Liu, W. Ma, Z. Deng, X. Zhu, P. Di, X. Xiao, and S. Wen, “System- atic analysis of mcp security.” arXiv, 2025. https://arxiv.org/abs/2508.12538

  3. [3]

    Machine learning algorithm for detecting suspicious email messages using natural language processing nlp,

    G. Alkhodhairy and K. Saleem, “Machine learning algorithm for detecting suspicious email messages using natural language processing nlp,”Alexandria Engineering Journal, vol. 128, pp. 153–165, 2025

  4. [4]

    An experimental compari- son of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages,

    I. Androutsopoulos, J. Koutsias, K. V . Chandrinos, and C. D. Spyropoulos, “An experimental compari- son of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages,” 2000

  5. [5]

    A hybrid machine learning ap- proach for securing emails: Phishing detection and prevention,

    A. S and S. P. G, “A hybrid machine learning ap- proach for securing emails: Phishing detection and prevention,” in2025 3rd International Conference on Advancements in Electrical, Electronics, Com- munication, Computing and Automation (ICAECA), pp. 1–6, 2025

  6. [6]

    Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,

    Z. Wang, Y . Gao, Y . Wang, S. Liu, H. Sun, H. Cheng, G. Shi, H. Du, and X. Li, “Mcptox: A benchmark for tool poisoning attack on real-world mcp servers,” 2025

  7. [7]

    Email spam detection using deep learning approach,

    K. Debnath and N. Kar, “Email spam detection using deep learning approach,” in2022 International Con- ference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON), vol. 1, pp. 37– 41, 2022

  8. [8]

    Outline of a design science research process,

    P. Offermann, O. Levina, M. Schönherr, and U. Bub, “Outline of a design science research process,” 01 2009

  9. [9]

    Guiding the selection of research methodology in industry–academia col- laboration in software engineering,

    C. Wohlin and P. Runeson, “Guiding the selection of research methodology in industry–academia col- laboration in software engineering,”Information and Software Technology, vol. 140, p. 106678, 2021

  10. [10]

    Creswell,Research Design: Qualitative, Quanti- 12 tative, and Mixed Methods Approaches

    J. Creswell,Research Design: Qualitative, Quanti- 12 tative, and Mixed Methods Approaches. SAGE Pub- lications, 2009

  11. [11]

    Berndtsson, J

    M. Berndtsson, J. Hansson, B. Olsson, and B. Lun- dell,Thesis Projects: A Guide for Students in Com- puter Science and Information Systems. 01 2008

  12. [12]

    Application of a case study methodol- ogy,

    W. Tellis, “Application of a case study methodol- ogy,”The Qualitative Report, 09 1997

  13. [13]

    Implementation science for software engineering: bridging the gap between research and practice (keynote),

    L. Herckis, “Implementation science for software engineering: bridging the gap between research and practice (keynote),” pp. 4–4, 09 2018

  14. [14]

    Exploring experimental research: Method- ologies, designs, and applications across disci- plines,

    S. Em, “Exploring experimental research: Method- ologies, designs, and applications across disci- plines,”SSRN Electronic Journal, pp. 1–9, 03 2024

  15. [15]

    Middleware for llms: Tools are instrumental for language agents in complex environments,

    Y . Gu, Y . Shu, H. Yu, X. Liu, Y . Dong, J. Tang, J. Srinivasa, H. Latapie, and Y . Su, “Middleware for llms: Tools are instrumental for language agents in complex environments,” pp. 7646–7663, 01 2024

  16. [16]

    Supervised machine learning: A review of classification techniques.,

    S. Kotsiantis, “Supervised machine learning: A review of classification techniques.,”Informatica (Slovenia), vol. 31, pp. 249–268, 01 2007

  17. [17]

    Github mcp server registry

    Github, “Github mcp server registry.” https://github.com/mcp

  18. [18]

    Fastmcp webpage

    Anthropic, “Fastmcp webpage.” https://gofastmcp.com/getting-started/welcome

  19. [19]

    Semantics at an angle: When cosine simi- larity works until it doesn’t,

    K. You, “Semantics at an angle: When cosine simi- larity works until it doesn’t,” 2025

  20. [20]

    Un- derstanding back-translation at scale,

    S. Edunov, M. Ott, M. Auli, and D. Grangier, “Un- derstanding back-translation at scale,” 2018

  21. [21]

    The impact of linguistic distance from english on economic growth: A cross-country analysis,

    Z. Ozkok, B. Malloy, and A. Rowe, “The impact of linguistic distance from english on economic growth: A cross-country analysis,”International Journal of Economics and Financial Issues, vol. 12, p. 1–15, Mar. 2022

  22. [22]

    A framework for rapidly de- veloping and deploying protection against large lan- guage model attacks,

    A. Swanda, A. Chang, A. Chen, F. Burch, P. Kas- sianik, and K. Berlin, “A framework for rapidly de- veloping and deploying protection against large lan- guage model attacks,” 2025

  23. [23]

    Bert: Pre-training of deep bidirectional transform- ers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transform- ers for language understanding,” 2019

  24. [24]

    A fine-tuned bert-based transfer learning approach for text classification,

    R. Qasim, W. Bangyal, M. Alqarni, and A. Almazroi, “A fine-tuned bert-based transfer learning approach for text classification,”Journal of Healthcare Engi- neering, vol. 2022, pp. 1–17, 01 2022

  25. [25]

    Effec- tive few-shot classification with transfer learning,

    A. Gupta, K. Thadani, and N. O’Hare, “Effec- tive few-shot classification with transfer learning,” pp. 1061–1066, 01 2020

  26. [26]

    Awad and R

    M. Awad and R. Khanna,Support Vector Machines for Classification, pp. 39–66. 04 2015

  27. [27]

    Svm kernel func- tions for classification,

    A. Patle and D. S. Chouhan, “Svm kernel func- tions for classification,” in2013 International Con- ference on Advances in Technology and Engineering (ICATE), pp. 1–9, 2013

  28. [28]

    Glossary of terms. spe- cial issue of applications of machine learning and the knowledge discovery process,

    R. Kohavi and F. Provost, “Glossary of terms. spe- cial issue of applications of machine learning and the knowledge discovery process,”Mach. Learn., vol. 30, 01 1998

  29. [29]

    100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox,

    F. J. Valverde-Albacete and C. Peláez-Moreno, “100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox,”PloS one, vol. 9, no. 1, p. e84217, 2014

  30. [30]

    Calculating precision and recall for multiclass classification using confusion matrix,

    kiwkandmd, “Calculating precision and recall for multiclass classification using confusion matrix,”

  31. [31]

    https://www.geeksforgeeks.org/machine- learning/calculating-precision-and-recall-for- multiclass-classification-using-confusion-matrix/

  32. [32]

    Metrics for multi-class classification: an overview,

    M. Grandini, E. Bagli, and G. Visani, “Metrics for multi-class classification: an overview,” 2020

  33. [33]

    Axiomatic attribution for deep networks,

    M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” 2017

  34. [34]

    Malware detection using automated gener- ation of yara rules on dynamic features,

    Q. Si, H. Xu, Y . Tong, Y . Zhou, J. Liang, L. Cui, and Z. Hao, “Malware detection using automated gener- ation of yara rules on dynamic features,” inScience of Cyber Security(C. Su, K. Sakurai, and F. Liu, eds.), (Cham), pp. 315–330, Springer International Publishing, 2022

  35. [35]

    Example clients,

    T. L. Foundation, “Example clients,” 2026. https://modelcontextprotocol.io/clients#feature- support-matrix

  36. [36]

    Claude skills: Teach claude your way of working,

    Anthropic, “Claude skills: Teach claude your way of working,” 2026. https://claude.com/skills. 13 A Appendix Figure 8: Visualization of the tokens that influence the BERT model’s decisions. 14