pith. sign in

arxiv: 2506.13538 · v5 · submitted 2025-06-16 · 💻 cs.SE · cs.ET

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

Pith reviewed 2026-05-19 09:28 UTC · model grok-4.3

classification 💻 cs.SE cs.ET
keywords Model Context ProtocolMCP serverssecurity vulnerabilitiestool poisoningmaintainabilitystatic analysisAI control flowcode smells
0
0 comments X

The pith

MCP's AI-driven control flow creates eight new vulnerability types that traditional checks miss in open servers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the security and maintainability of servers built for the Model Context Protocol, a recent standard that lets foundation models interact with external tools through a unified interface. It reports that these servers perform well on general health metrics yet still expose risks tied to their non-deterministic, AI-controlled execution paths. The study finds eight distinct vulnerability categories, only three of which match conventional software flaws, plus specific rates of general vulnerabilities and MCP-unique tool poisoning. These results suggest that the protocol's design requires dedicated detection methods in addition to established analysis practices to support long-term safety and upkeep.

Core claim

An analysis of 1,899 open-source MCP servers shows strong overall health metrics but identifies eight distinct vulnerabilities, with only three overlapping traditional software vulnerabilities. The study further reports that 7.2 percent of servers contain general vulnerabilities and 5.5 percent exhibit MCP-specific tool poisoning, while 66 percent display code smells and 14.4 percent contain ten bug patterns documented in prior work. The authors conclude that MCP's AI-driven non-deterministic control flow introduces risks that call for MCP-specific vulnerability detection alongside continued use of traditional refactoring and scanning practices.

What carries the argument

The hybrid analysis pipeline that pairs a general-purpose static analysis tool with a custom MCP-specific scanner to classify vulnerabilities as traditional or protocol-unique.

If this is right

  • MCP-specific vulnerability detection techniques should supplement traditional analysis methods.
  • MCP vulnerabilities need to be added to standardized vulnerability databases.
  • Automated security scanning should be built into MCP registries.
  • Responsible development practices can help maintain the safety and sustainability of the MCP ecosystem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other emerging AI tool-interface standards may encounter similar non-deterministic risks that general scanners overlook.
  • Teams adopting MCP could add targeted checks for tool poisoning during code review or deployment.
  • Longer-term monitoring of MCP server growth might reveal whether the observed vulnerability rates change as the ecosystem matures.

Load-bearing premise

The hybrid analysis pipeline accurately separates MCP-specific vulnerabilities from traditional ones without large numbers of misclassifications, and the 1,899 servers represent the wider open MCP ecosystem.

What would settle it

A manual review of a random subset of the servers that finds substantially higher or lower counts of tool poisoning or different vulnerability classifications than the automated pipeline produced.

Figures

Figures reproduced from arXiv: 2506.13538 by Ahmed E. Hassan, Bram Adams, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Hao Li, Mohammed Mehedi Hasan.

Figure 1
Figure 1. Figure 1: A motivating example of developing FM-based AI applications. In (a), Alex developed an AI application [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-level overview of MCP client-server architecture [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the study design. 5.1 Data Collection 5.1.1 Extracting from Anthropic’s official repository. Anthropic has published a list of MCP servers in their official repository and maintains the list actively. We start with this list of MCP servers maintained by Anthropic in their official Model Context Protocol repository4 . In this repository, Anthropic classifies MCP servers into two major categories… view at source ↗
Figure 4
Figure 4. Figure 4: Vulnerability count distribution per MCP server grouped by Integration Type. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Examples of credential exposure across different code and configuration formats. As these are sensitive [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of code smells by integration type and programming language. [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of bugs by integration type and programming language across MCP servers. [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
read the original abstract

Although Foundation Models (FMs), such as GPT-4, are increasingly used in domains like finance and software engineering, reliance on textual interfaces limits these models' real-world interaction. To address this, FM providers introduced a tool called -- triggering a proliferation of frameworks with distinct tool interfaces. In late 2024, Anthropic introduced the Model Context Protocol (MCP) to standardize this tool ecosystem. MCP is rapidly emerging as a de facto industry standard. Despite its adoption, MCP's AI-driven, non-deterministic control flow introduces new risks to sustainability, security, and maintainability, warranting closer examination. Towards this end, we present the first large-scale empirical study of MCP. Using state-of-the-art health metrics and a hybrid analysis pipeline that combines a general-purpose static analysis tool with an MCP-specific scanner, we evaluate 1,899 open-source MCP servers to assess their health, security, and maintainability. Despite MCP servers demonstrating strong health metrics, we identify eight distinct vulnerabilities -- only three of which overlap with traditional software vulnerabilities. Additionally, 7.2% of servers contain general vulnerabilities, and 5.5% exhibit MCP-specific tool poisoning. Regarding maintainability, while 66% exhibit code smells, 14.4% contain ten bug patterns overlapping prior research. These findings highlight the need for MCP-specific vulnerability detection techniques while reaffirming the value of traditional analysis and refactoring practices. Furthermore, we advocate for stronger governance across the MCP ecosystem by incorporating MCP-specific vulnerabilities into standardized vulnerability databases, enabling automated security scanning within MCP registries, and promoting responsible development practices to ensure the long-term safety and sustainability of the MCP ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the first large-scale empirical study of 1,899 open-source MCP servers, employing state-of-the-art health metrics and a hybrid analysis pipeline that combines general-purpose static analysis with an MCP-specific scanner. It reports strong overall health metrics but identifies eight distinct vulnerabilities (only three overlapping traditional software vulnerabilities), with 7.2% of servers containing general vulnerabilities and 5.5% exhibiting MCP-specific tool poisoning; additionally, 66% of servers show code smells and 14.4% contain ten bug patterns overlapping prior research. The work concludes by advocating MCP-specific vulnerability detection, integration into standardized databases, and improved governance for the ecosystem.

Significance. If the hybrid pipeline's classification of MCP-specific vulnerabilities holds, the study offers a timely empirical baseline for an emerging protocol that standardizes tool interfaces for foundation models. The scale of the analysis (1,899 servers) and the explicit separation of novel versus traditional risks provide a useful foundation for future work on AI-driven control flow security. The call for incorporating MCP-specific issues into vulnerability databases is a concrete, actionable contribution.

major comments (1)
  1. [Abstract / hybrid analysis pipeline description] Abstract and the description of the hybrid analysis pipeline: the central claim that eight distinct vulnerabilities exist with only three overlaps, and that 5.5% of servers exhibit MCP-specific tool poisoning, depends on the MCP-specific scanner correctly partitioning findings from the general static analyzer. No validation details (false-positive rates, held-out test set, or inter-rater audit) are reported for the scanner's rules on tool poisoning, unsafe parameter handling, or prompt-like injection vectors in tool schemas. This directly affects the headline distinction between new protocol risks and conventional issues such as command injection or deserialization flaws.
minor comments (2)
  1. The representativeness of the 1,899 collected servers should be discussed more explicitly, including any sampling biases or coverage of the open MCP ecosystem.
  2. Consider adding a limitations section that addresses potential false negatives in the general static analysis tools when applied to MCP-specific code patterns.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the timeliness and scale of our empirical study on MCP servers. We address the major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: Abstract and the description of the hybrid analysis pipeline: the central claim that eight distinct vulnerabilities exist with only three overlaps, and that 5.5% of servers exhibit MCP-specific tool poisoning, depends on the MCP-specific scanner correctly partitioning findings from the general static analyzer. No validation details (false-positive rates, held-out test set, or inter-rater audit) are reported for the scanner's rules on tool poisoning, unsafe parameter handling, or prompt-like injection vectors in tool schemas. This directly affects the headline distinction between new protocol risks and conventional issues such as command injection or deserialization flaws.

    Authors: We agree that explicit validation of the MCP-specific scanner is necessary to substantiate the distinction between the eight reported vulnerabilities and traditional issues. The current manuscript describes the hybrid pipeline and the rule-based scanner but omits quantitative validation metrics. In the revised version we will add a new subsection (Section 3.3) that details the scanner's rule development process, reports false-positive rates obtained from manual inspection of a random sample of 100 servers, and describes the inter-rater audit performed by two authors on a held-out set of 50 servers for tool-poisoning and unsafe-parameter classifications. These additions will directly support the headline claims regarding MCP-specific risks versus conventional vulnerabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical data collection and tool application

full rationale

The paper reports a large-scale empirical study that collects 1,899 open-source MCP servers and applies a hybrid pipeline of general static analysis plus an MCP-specific scanner. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. Vulnerability counts and classifications are presented as direct outputs of the analysis on external data rather than reductions to author-defined quantities. The work is self-contained against the collected corpus and external tools, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The claims rest on the representativeness of the open-source server sample and the accuracy of the custom scanner in distinguishing MCP-specific issues; these are domain assumptions rather than derived results.

free parameters (1)
  • Vulnerability classification thresholds in MCP scanner
    Parameters used to label issues as tool poisoning or other MCP-specific problems.
axioms (2)
  • domain assumption The 1,899 open-source MCP servers are representative of the broader ecosystem
    Selection criteria and potential sampling bias are not detailed in the abstract.
  • domain assumption Static analysis combined with the MCP-specific scanner reliably identifies the reported vulnerabilities
    No validation or ground-truth comparison is described.

pith-pipeline@v0.9.0 · 5859 in / 1280 out tokens · 34560 ms · 2026-05-19T09:28:33.148149+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A First Measurement Study on Authentication Security in Real-World Remote MCP Servers

    cs.CR 2026-05 conditional novelty 8.0

    First measurement study of 7,973 remote MCP servers finds 40.55% lack authentication and all 119 tested OAuth servers have flaws that risk data leaks or account takeover.

  2. Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

    cs.CR 2026-01 unverdicted novelty 8.0

    26.1% of analyzed AI agent skills contain vulnerabilities across 14 patterns, with executable scripts raising risk 2.12x, based on static and LLM analysis of 31k skills.

  3. Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

    cs.CR 2025-09 unverdicted novelty 8.0

    This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers r...

  4. DADL: A Declarative Description Language for Enterprise Tool Libraries in LLM Agent Systems

    cs.SE 2026-05 unverdicted novelty 7.0

    DADL is a declarative YAML format that lets a single runtime handle many REST API tools for LLM agents, cutting tool advertisement context cost by 142x from 142,000 to 1,000 tokens on a catalog of 1,833 definitions.

  5. MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security

    cs.CR 2026-04 conditional novelty 7.0

    MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.

  6. From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

    cs.CR 2026-04 unverdicted novelty 7.0

    Presents a component-centric PoC dataset of malicious MCP servers and a two-stage behavioral deviation detector Connor achieving 94.6% F1-score.

  7. AgentBound: Securing Execution Boundaries of AI Agents

    cs.CR 2025-10 conditional novelty 7.0

    AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.

  8. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

    cs.SE 2025-09 conditional novelty 7.0

    Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.

  9. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    cs.CR 2025-03 unverdicted novelty 7.0

    MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.

  10. OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

    cond-mat.mtrl-sci 2026-05 unverdicted novelty 6.0

    OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.

  11. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.

  12. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.

  13. Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

    cs.SE 2026-05 unverdicted novelty 6.0

    MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.

  14. Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol

    cs.SE 2026-03 unverdicted novelty 6.0

    The paper proposes Context-Aware Broker Protocol, Adaptive Timeout Budget Allocation, and Structured Error Recovery Framework to address gaps in identity, budgeting, and error handling for production AI agent deployme...

  15. Semantic Attacks on Tool-Augmented LLMs: Securing the Model Context Protocol Against Descriptor-Level Manipulation

    cs.CR 2025-12 unverdicted novelty 6.0

    Descriptor-level manipulation in the Model Context Protocol can drive LLMs to unsafe tool selections in up to 36% of cases; a layered defense of integrity checks, auxiliary-LLM vetting, and runtime guardrails reduces ...

  16. VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers

    cs.CR 2026-05 unverdicted novelty 5.0

    VIPER-MCP detects and exploits taint-style vulnerabilities in Model Context Protocol servers via anchor-query static analysis and feedback-driven prompt evolution, uncovering 106 zero-day vulnerabilities across 39,884...

  17. Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments

    cs.HC 2026-03 unverdicted novelty 5.0

    The paper introduces an agentic AI platform to train and support recovered soldiers as peer facilitators providing mental health triage and interventions in austere battlefield environments.

  18. Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP

    cs.CR 2026-02 unverdicted novelty 5.0

    The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.

  19. CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems

    cs.CR 2026-04 unverdicted novelty 4.0

    CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...

  20. Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

    cs.AI 2026-04 unverdicted novelty 3.0

    Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · cited by 19 Pith papers · 6 internal anchors

  1. [1]

    Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel. 2025. Can LLMs Replace Manual Annotation of Software Engineering Artifacts?. InIEEE/ACM International Conference on Mining Software Repositories

  2. [2]

    Glama AI. 2025. Glama: Your #1 Platform for Discovering Every MCP Server. https://glama.ai/mcp, last visited: May 15

  3. [3]

    Pydantic AI. 2025. Pydantic-AI: Agent Framework / shim to use Pydantic with LLMs. https://ai.pydantic.dev/, last visited: May 22

  4. [4]

    Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2022. An empirical study on the survival rate of GitHub projects. In Proceedings of the 19th International Conference on Mining Software Repositories . 365–375

  5. [5]

    Jehad Al Dallal and Anas Abdin. 2017. Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review. IEEE Transactions on Software Engineering 44, 1 (2017), 44–69

  6. [6]

    Mahmoud Alfadel, Diego Elias Costa, and Emad Shihab. 2023. Empirical analysis of security vulnerabilities in python packages. Empirical Software Engineering 28, 3 (2023), 59

  7. [7]

    Malak Aljabri, Maryam Aldossary, Noor Al-Homeed, Bushra Alhetelah, Malek Althubiany, Ohoud Alotaibi, and Sara Alsaqer. 2022. Testing and exploiting tools to improve owasp top ten security vulnerabilities detection. In 2022 14th International Conference on Computational Intelligence and Communication Networks (CICN) . IEEE, 797–803

  8. [8]

    Eman Abdullah AlOmar, Anushkrishna Venkatakrishnan, Mohamed Wiem Mkaouer, Christian Newman, and Ali Ouni. 2024. How to refactor this code? An exploratory study on developer-ChatGPT refactoring conversations. In Proceedings of the 21st International Conference on Mining Software Repositories . 202–206

  9. [9]

    Idan Amit and Dror G Feitelson. 2021. Corrective commit probability: a measure of the effort invested in bug fixing. Software Quality Journal 29, 4 (2021), 817–861

  10. [10]

    Anthropic. 2025. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, last visited: Apr 23

  11. [11]

    Anthropic. 2025. Model Context Protocol: NPM package. https://www.npmjs.com/package/%40modelcontextprotocol/ sdk, last visited: May 18

  12. [12]

    Anthropic. 2025. Model Context Protocol: PyPi package. https://pypistats.org/packages/mcp, last visited: May 18

  13. [13]

    Anthropic. 2025. Tool Calling: Tool Usage with Claude. https://docs.anthropic.com/en/docs/agents-and-tools/tool- use/overview, last visited: May 15

  14. [14]

    Apple. 2025. App Review Guidelines. https://developer.apple.com/app-store/review/guidelines/, last visited: June 03

  15. [15]

    Ellen Arteca, Max Schäfer, and Frank Tip. 2023. A statistical approach for finding property-access errors. arXiv preprint arXiv:2306.08741 (2023)

  16. [16]

    Microsoft Autogen. 2025. AutoGen: A framework for building AI agents and applications. https://microsoft.github.io/ autogen/stable/, last visited: Apr 23

  17. [17]

    Guilherme Avelino, Eleni Constantinou, Marco Tulio Valente, and Alexander Serebrenik. 2019. On the abandonment and survival of open source projects: An empirical investigation. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) . IEEE, 1–12

  18. [18]

    Nathaniel Ayewah, William Pugh, J David Morgenthaler, John Penix, and YuQian Zhou. 2007. Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering . 1–8

  19. [19]

    Sebastian Baltes, Jascha Knack, Daniel Anastasiou, Ralf Tymann, and Stephan Diehl. 2018. (No) influence of continuous integration on the commit activity in GitHub projects. In Proceedings of the 4th ACM SIGSOFT International Workshop on Software Analytics. 1–7. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Se...

  20. [20]

    Lingfeng Bao, Xin Xia, David Lo, and Gail C Murphy. 2019. A large scale study of long-time contributor prediction for github projects. IEEE Transactions on Software Engineering 47, 6 (2019), 1277–1298

  21. [21]

    Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2022. What are the practices for secret management in software artifacts?. In 2022 IEEE Secure Development Conference (SecDev) . IEEE, 69–76

  22. [22]

    João Helis Bernardo, Daniel Alencar Da Costa, Sérgio Queiroz de Medeiros, and Uirá Kulesza. 2024. How do machine learning projects use continuous integration practices? an empirical study on GitHub actions. In Proceedings of the 21st International Conference on Mining Software Repositories . 665–676

  23. [23]

    Ethan Bommarito and Michael Bommarito. 2019. An empirical analysis of the python package index (pypi). arXiv preprint arXiv:1907.11073 (2019)

  24. [24]

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  25. [25]

    Hudson Borges, Andre Hora, and Marco Tulio Valente. 2016. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 334–344

  26. [26]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  27. [27]

    Simon Butler, Jonas Gamalielsson, Björn Lundell, Christoffer Brax, Anders Mattsson, Tomas Gustavsson, Jonas Feist, Bengt Kvarnström, and Erik Lönroth. 2022. Considerations and challenges for the adoption of open source components in software-intensive businesses. Journal of Systems and Software 186 (2022), 111152

  28. [28]

    Paolo Calciati and Alessandra Gorla. 2017. How do apps evolve in their permission requests? a preliminary study. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) . IEEE, 37–41

  29. [29]

    G Ann Campbell and Patroklos P Papapetrou. 2013. SonarQube in action. Manning Publications Co

  30. [30]

    Giuseppe Castagna and Victor Lanvin. 2017. Gradual typing with union and intersection types. Proceedings of the ACM on Programming Languages 1, ICFP (2017), 1–28

  31. [31]

    CHAOSS Project. 2025. Community Health Analytics in Open Source Software: Topic - All Metrics. https://chaoss. community/kbtopic/all-metrics/. Accessed: Jun 10, 2025

  32. [32]

    CHAOSS Project. 2025. Practitioner Guide: Responsiveness. https://chaoss.community/practitioner-guide- responsiveness/. Accessed: May 15, 2025

  33. [33]

    Bihuan Chen, Linlin Chen, Chen Zhang, and Xin Peng. 2020. Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 42–53

  34. [34]

    Celia Chen, Shi Lin, Michael Shoga, Qing Wang, and Barry Boehm. 2018. How do defects hurt qualities? an empirical study on characterizing a software maintainability ontology in open source software. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) . IEEE, 226–237

  35. [35]

    Zhifei Chen, Lin Chen, Wanwangying Ma, and Baowen Xu. 2016. Detecting code smells in python programs. In 2016 international conference on Software Analysis, Testing and Evolution (SATE) . IEEE, 18–23

  36. [36]

    Henry Chesbrough. 2023. Measuring the economic value of open source. San Francisco: Linux Foundation (2023)

  37. [37]

    Steve Christey and Robert A Martin. 2007. Vulnerability type distributions in CVE. Mitre report, May (2007)

  38. [38]

    Cloudflare. 2025. Cloudflare Agents Docs: Model Context Protocol (MCP). https://developers.cloudflare.com/agents/ model-context-protocol, last visited: Apr 23

  39. [39]

    Jailton Coelho and Marco Tulio Valente. 2017. Why modern open source projects fail. In Proceedings of the 2017 11th Joint meeting on foundations of software engineering . 186–196

  40. [40]

    CrewAI. 2025. CrewAI: The leading multi-agent platform. https://www.crewai.com/, last visited: May 27

  41. [41]

    Dinis Barroqueiro Cruz, João Rafael Almeida, and José Luís Oliveira. 2023. Open source solutions for vulnerability assessment: A comparative analysis. IEEE Access 11 (2023), 100234–100255

  42. [42]

    Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in github for MSR studies. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) . IEEE, 560–564

  43. [43]

    Andrey Loskutov Keith Lea David Hovemeyer, Bill Pugh. 2025. An extensible multilanguage static code analyzer. https://findbugs.sourceforge.net/, last visited: May 18

  44. [44]

    DI De Silva, RD New Kandy, BLO Sachethana, SMDTH Dias, PYC Perera, ME Katipearachchi, and TDDH Jayasuriya

  45. [45]

    Journal of Software Engineering Research and Development 11, 1 (2023), 1

    The Relationship between Code Complexity and Software Quality: An Empirical Study. Journal of Software Engineering Research and Development 11, 1 (2023), 1

  46. [46]

    Alexandre Decan, Tom Mens, and Eleni Constantinou. 2018. On the impact of security vulnerabilities in the npm package dependency network. In Proceedings of the 15th international conference on mining software repositories . 181–191. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. 34 M. Mehedi Hasan et al

  47. [47]

    Kerstin Denecke, Richard May, LLMHealthGroup, and Octavio Rivera Romero. 2024. Potential of large language models in health care: Delphi study. Journal of Medical Internet Research 26 (2024), e52399

  48. [48]

    Dify. 2025. Dify: Build Production Ready Agentic Solution. https://dify.ai/, last visited: May 27

  49. [49]

    Inc Docker et al. 2020. Docker. lınea].[Junio de 2017]. Disponible en: https://www. docker. com/what-docker (2020)

  50. [50]

    Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755

  51. [51]

    Filipe Falcão, Caio Barbosa, Baldoino Fonseca, Alessandro Garcia, Márcio Ribeiro, and Rohit Gheyi. 2020. On relating technical, social factors, and the introduction of bugs. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 378–388

  52. [52]

    Rosa Falotico and Piero Quatto. 2015. Fleiss’ kappa statistic without paradoxes.Quality & Quantity 49 (2015), 463–470

  53. [53]

    Amir Hossein Ghapanchi. 2015. Predicting software future sustainability: A longitudinal perspective. Information Systems 49 (2015), 40–51

  54. [54]

    Sean Goggins, Kevin Lumbard, and Matt Germonprez. 2021. Open source community health: Analytical metrics and their corresponding narratives. In 2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal) . IEEE, 25–33

  55. [55]

    Software Improvement Group. 2025. State of Software 2025: A Global Report on the Hidden Costs and Risks of Software. https://www.softwareimprovementgroup.com/wp-content/uploads/State-of-software-2025.pdf, last visited: May 08

  56. [56]

    Aakanshi Gupta, Rashmi Gandhi, Nishtha Jatana, Divya Jatain, Sandeep Kumar Panda, and Janjhyam Venkata Naga Ramesh. 2023. A severity assessment of python code smells. IEEE Access 11 (2023), 119146–119160

  57. [57]

    Md Shariful Haque, Jeff Carver, and Travis Atkison. 2018. Causes, impacts, and detection approaches of code smell: a survey. In Proceedings of the 2018 ACM Southeast Conference . 1–8

  58. [58]

    E Hassan

    Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and A. E Hassan

  59. [59]

    https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

    The replication package of our study on MCP Servers. https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

  60. [60]

    Ahmed E Hassan, Gustavo A Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, et al. 2024. Rethinking software engineering in the foundation model era: From task-driven ai copilots to goal-driven ai pair programmers. arXiv preprint arXiv:2404.10225 (2024)

  61. [61]

    Runzhi He, Hao He, Yuxia Zhang, and Minghui Zhou. 2023. Automating dependency updates in practice: An exploratory study on github dependabot. IEEE Transactions on Software Engineering 49, 8 (2023), 4004–4022

  62. [62]

    Israel Herraiz, Jesus M Gonzalez-Barahona, and Gregorio Robles. 2008. Determinism and evolution. In Proceedings of the 2008 international working conference on Mining software repositories . 1–10

  63. [63]

    Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. 2016. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM international conference on automated software engineering. 426–437

  64. [64]

    Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv preprint arXiv:2503.23278 (2025)

  65. [65]

    IBM. 2025. Cost of a Data Breach Report 2024. https://www.ibm.com/reports/data-breach, last visited: May 27

  66. [66]

    Samuel Idowu, Yorick Sens, Thorsten Berger, Jacob Krüger, and Michael Vierhauser. 2024. A large-scale study of ml-related python projects. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing . 1272–1281

  67. [67]

    Alphabet Inc. 2025. Security checklist. https://developer.android.com/privacy-and-security/security-tips, last visited: June 03

  68. [68]

    Nenad Jovanovic, Christopher Kruegel, and Engin Kirda. 2006. Pixy: A static analysis tool for detecting web application vulnerabilities. In 2006 IEEE Symposium on Security and Privacy (S&P’06) . IEEE, 6–pp

  69. [69]

    Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv preprint arXiv:2407.18370 (2024)

  70. [70]

    Arvinder Kaur and Ruchikaa Nayyar. 2020. A comparative study of static code analysis tools for vulnerability detection in c/c++ and java source code. Procedia Computer Science 171 (2020), 2023–2029

  71. [71]

    Noureddine Kerzazi, Foutse Khomh, and Bram Adams. 2014. Why do automated builds break? an empirical study. In 2014 IEEE international conference on software maintenance and evolution . IEEE, 41–50

  72. [72]

    Sonu Kumar, Anubhav Girdhar, Ritesh Patil, and Divyansh Tripathi. 2025. MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System. arXiv preprint arXiv:2504.12757 (2025)

  73. [73]

    Invariant Lab. 2025. Introducing MCP-Scan: Protecting MCP with Invariant. https://invariantlabs.ai/blog/introducing- mcp-scan, last visited: May 29

  74. [74]

    Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, and Rajesh Vasa. 2024. Comparative analysis of real issues in open-source machine learning projects. Empirical Software Engineering 29, 3 (2024), 60. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Security and Maintainability of MCP Servers 35

  75. [75]

    LangChain. 2025. LangChain: composable framework to build with LLMs . https://www.langchain.com/, last visited: Apr 23

  76. [76]

    Jasmine Latendresse, Suhaib Mujahid, Diego Elias Costa, and Emad Shihab. 2022. Not all dependencies are equal: An empirical study on production dependencies in npm. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12

  77. [77]

    Luigi Lavazza, Sandro Morasca, and Davide Tosi. 2021. Comparing static analysis and code smells as defect predictors: an empirical study. In IFIP international conference on open source systems . Springer, 1–15

  78. [78]

    Valentina Lenarduzzi, Francesco Lomio, Heikki Huttunen, and Davide Taibi. 2020. Are sonarqube rules inducing bugs?. In 2020 IEEE 27th international conference on software analysis, evolution and reengineering (SANER) . IEEE, 501–511

  79. [79]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

  80. [80]

    Hao Li and Cor-Paul Bezemer. 2025. Bridging the language gap: an empirical study of bindings for open source machine learning libraries across software package ecosystems. Empirical Software Engineering 30, 1 (2025), 6

Showing first 80 references.