Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

Ahmed E. Hassan; Bram Adams; Emad Fallahzadeh; Gopi Krishnan Rajbahadur; Hao Li; Mohammed Mehedi Hasan

arxiv: 2506.13538 · v5 · submitted 2025-06-16 · 💻 cs.SE · cs.ET

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

Mohammed Mehedi Hasan , Hao Li , Emad Fallahzadeh , Gopi Krishnan Rajbahadur , Bram Adams , Ahmed E. Hassan This is my paper

Pith reviewed 2026-05-19 09:28 UTC · model grok-4.3

classification 💻 cs.SE cs.ET

keywords Model Context ProtocolMCP serverssecurity vulnerabilitiestool poisoningmaintainabilitystatic analysisAI control flowcode smells

0 comments

The pith

MCP's AI-driven control flow creates eight new vulnerability types that traditional checks miss in open servers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the security and maintainability of servers built for the Model Context Protocol, a recent standard that lets foundation models interact with external tools through a unified interface. It reports that these servers perform well on general health metrics yet still expose risks tied to their non-deterministic, AI-controlled execution paths. The study finds eight distinct vulnerability categories, only three of which match conventional software flaws, plus specific rates of general vulnerabilities and MCP-unique tool poisoning. These results suggest that the protocol's design requires dedicated detection methods in addition to established analysis practices to support long-term safety and upkeep.

Core claim

An analysis of 1,899 open-source MCP servers shows strong overall health metrics but identifies eight distinct vulnerabilities, with only three overlapping traditional software vulnerabilities. The study further reports that 7.2 percent of servers contain general vulnerabilities and 5.5 percent exhibit MCP-specific tool poisoning, while 66 percent display code smells and 14.4 percent contain ten bug patterns documented in prior work. The authors conclude that MCP's AI-driven non-deterministic control flow introduces risks that call for MCP-specific vulnerability detection alongside continued use of traditional refactoring and scanning practices.

What carries the argument

The hybrid analysis pipeline that pairs a general-purpose static analysis tool with a custom MCP-specific scanner to classify vulnerabilities as traditional or protocol-unique.

If this is right

MCP-specific vulnerability detection techniques should supplement traditional analysis methods.
MCP vulnerabilities need to be added to standardized vulnerability databases.
Automated security scanning should be built into MCP registries.
Responsible development practices can help maintain the safety and sustainability of the MCP ecosystem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other emerging AI tool-interface standards may encounter similar non-deterministic risks that general scanners overlook.
Teams adopting MCP could add targeted checks for tool poisoning during code review or deployment.
Longer-term monitoring of MCP server growth might reveal whether the observed vulnerability rates change as the ecosystem matures.

Load-bearing premise

The hybrid analysis pipeline accurately separates MCP-specific vulnerabilities from traditional ones without large numbers of misclassifications, and the 1,899 servers represent the wider open MCP ecosystem.

What would settle it

A manual review of a random subset of the servers that finds substantially higher or lower counts of tool poisoning or different vulnerability classifications than the automated pipeline produced.

Figures

Figures reproduced from arXiv: 2506.13538 by Ahmed E. Hassan, Bram Adams, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Hao Li, Mohammed Mehedi Hasan.

**Figure 2.** Figure 2: High-level overview of MCP client-server architecture [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the study design. 5.1 Data Collection 5.1.1 Extracting from Anthropic’s official repository. Anthropic has published a list of MCP servers in their official repository and maintains the list actively. We start with this list of MCP servers maintained by Anthropic in their official Model Context Protocol repository4 . In this repository, Anthropic classifies MCP servers into two major categories… view at source ↗

**Figure 4.** Figure 4: Vulnerability count distribution per MCP server grouped by Integration Type. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of credential exposure across different code and configuration formats. As these are sensitive [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of code smells by integration type and programming language. [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of bugs by integration type and programming language across MCP servers. [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

read the original abstract

Although Foundation Models (FMs), such as GPT-4, are increasingly used in domains like finance and software engineering, reliance on textual interfaces limits these models' real-world interaction. To address this, FM providers introduced a tool called -- triggering a proliferation of frameworks with distinct tool interfaces. In late 2024, Anthropic introduced the Model Context Protocol (MCP) to standardize this tool ecosystem. MCP is rapidly emerging as a de facto industry standard. Despite its adoption, MCP's AI-driven, non-deterministic control flow introduces new risks to sustainability, security, and maintainability, warranting closer examination. Towards this end, we present the first large-scale empirical study of MCP. Using state-of-the-art health metrics and a hybrid analysis pipeline that combines a general-purpose static analysis tool with an MCP-specific scanner, we evaluate 1,899 open-source MCP servers to assess their health, security, and maintainability. Despite MCP servers demonstrating strong health metrics, we identify eight distinct vulnerabilities -- only three of which overlap with traditional software vulnerabilities. Additionally, 7.2% of servers contain general vulnerabilities, and 5.5% exhibit MCP-specific tool poisoning. Regarding maintainability, while 66% exhibit code smells, 14.4% contain ten bug patterns overlapping prior research. These findings highlight the need for MCP-specific vulnerability detection techniques while reaffirming the value of traditional analysis and refactoring practices. Furthermore, we advocate for stronger governance across the MCP ecosystem by incorporating MCP-specific vulnerabilities into standardized vulnerability databases, enabling automated security scanning within MCP registries, and promoting responsible development practices to ensure the long-term safety and sustainability of the MCP ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is the first large-scale scan of MCP servers and it surfaces some real security flags, but the split between novel MCP risks and standard ones rests on an unvalidated custom scanner.

read the letter

The main thing to know about this paper is that it delivers the first large-scale empirical data on MCP server security and maintainability, but the claims for eight distinct new vulnerabilities hinge on a custom scanner without clear validation. They pulled 1,899 open MCP servers and combined general static analysis with an MCP-specific scanner. The results show decent overall health but flag 7.2% with general vulnerabilities and 5.5% with tool poisoning. They also note eight vulnerability types where only three match traditional software issues. On maintainability, 66% have code smells and 14.4% hit ten bug patterns from prior work. This is new ground because MCP is only a few months old as a standard, and nothing like this scale existed before. The work does well by using established health metrics and actually measuring an emerging ecosystem instead of just theorizing risks. The push to add MCP issues to vulnerability databases and registries is a practical suggestion based on the data. The soft spot is the hybrid pipeline. Without details on how the MCP scanner was tested for false positives or how they made sure tool poisoning isn't just a variant of command injection or similar, the split between new and old risks could be overstated. The abstract mentions no error rates or audit for the classification, which is the load-bearing part for their novelty argument. Sample construction also needs more explanation to confirm it's representative. This paper is for people in AI security, tool integration, or anyone maintaining code for foundation model extensions. A reader who wants early numbers on a new protocol will get something concrete to work with. It deserves a serious referee because the data collection is real and the topic matters for how AI connects to external systems. I'd recommend peer review, asking specifically for more on the scanner validation and the distinction process.

Referee Report

1 major / 2 minor

Summary. The paper presents the first large-scale empirical study of 1,899 open-source MCP servers, employing state-of-the-art health metrics and a hybrid analysis pipeline that combines general-purpose static analysis with an MCP-specific scanner. It reports strong overall health metrics but identifies eight distinct vulnerabilities (only three overlapping traditional software vulnerabilities), with 7.2% of servers containing general vulnerabilities and 5.5% exhibiting MCP-specific tool poisoning; additionally, 66% of servers show code smells and 14.4% contain ten bug patterns overlapping prior research. The work concludes by advocating MCP-specific vulnerability detection, integration into standardized databases, and improved governance for the ecosystem.

Significance. If the hybrid pipeline's classification of MCP-specific vulnerabilities holds, the study offers a timely empirical baseline for an emerging protocol that standardizes tool interfaces for foundation models. The scale of the analysis (1,899 servers) and the explicit separation of novel versus traditional risks provide a useful foundation for future work on AI-driven control flow security. The call for incorporating MCP-specific issues into vulnerability databases is a concrete, actionable contribution.

major comments (1)

[Abstract / hybrid analysis pipeline description] Abstract and the description of the hybrid analysis pipeline: the central claim that eight distinct vulnerabilities exist with only three overlaps, and that 5.5% of servers exhibit MCP-specific tool poisoning, depends on the MCP-specific scanner correctly partitioning findings from the general static analyzer. No validation details (false-positive rates, held-out test set, or inter-rater audit) are reported for the scanner's rules on tool poisoning, unsafe parameter handling, or prompt-like injection vectors in tool schemas. This directly affects the headline distinction between new protocol risks and conventional issues such as command injection or deserialization flaws.

minor comments (2)

The representativeness of the 1,899 collected servers should be discussed more explicitly, including any sampling biases or coverage of the open MCP ecosystem.
Consider adding a limitations section that addresses potential false negatives in the general static analysis tools when applied to MCP-specific code patterns.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the timeliness and scale of our empirical study on MCP servers. We address the major comment below and outline the revisions we will make.

read point-by-point responses

Referee: Abstract and the description of the hybrid analysis pipeline: the central claim that eight distinct vulnerabilities exist with only three overlaps, and that 5.5% of servers exhibit MCP-specific tool poisoning, depends on the MCP-specific scanner correctly partitioning findings from the general static analyzer. No validation details (false-positive rates, held-out test set, or inter-rater audit) are reported for the scanner's rules on tool poisoning, unsafe parameter handling, or prompt-like injection vectors in tool schemas. This directly affects the headline distinction between new protocol risks and conventional issues such as command injection or deserialization flaws.

Authors: We agree that explicit validation of the MCP-specific scanner is necessary to substantiate the distinction between the eight reported vulnerabilities and traditional issues. The current manuscript describes the hybrid pipeline and the rule-based scanner but omits quantitative validation metrics. In the revised version we will add a new subsection (Section 3.3) that details the scanner's rule development process, reports false-positive rates obtained from manual inspection of a random sample of 100 servers, and describes the inter-rater audit performed by two authors on a held-out set of 50 servers for tool-poisoning and unsafe-parameter classifications. These additions will directly support the headline claims regarding MCP-specific risks versus conventional vulnerabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical data collection and tool application

full rationale

The paper reports a large-scale empirical study that collects 1,899 open-source MCP servers and applies a hybrid pipeline of general static analysis plus an MCP-specific scanner. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. Vulnerability counts and classifications are presented as direct outputs of the analysis on external data rather than reductions to author-defined quantities. The work is self-contained against the collected corpus and external tools, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The claims rest on the representativeness of the open-source server sample and the accuracy of the custom scanner in distinguishing MCP-specific issues; these are domain assumptions rather than derived results.

free parameters (1)

Vulnerability classification thresholds in MCP scanner
Parameters used to label issues as tool poisoning or other MCP-specific problems.

axioms (2)

domain assumption The 1,899 open-source MCP servers are representative of the broader ecosystem
Selection criteria and potential sampling bias are not detailed in the abstract.
domain assumption Static analysis combined with the MCP-specific scanner reliably identifies the reported vulnerabilities
No validation or ground-truth comparison is described.

pith-pipeline@v0.9.0 · 5859 in / 1280 out tokens · 34560 ms · 2026-05-19T09:28:33.148149+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

hybrid analysis pipeline that combines a general-purpose static analysis tool with an MCP-specific scanner... 7.2% of servers contain general vulnerabilities, and 5.5% exhibit MCP-specific tool poisoning
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MCP servers demonstrate strong health metrics... median commit frequency (5.5 commits/week)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A First Measurement Study on Authentication Security in Real-World Remote MCP Servers
cs.CR 2026-05 conditional novelty 8.0

First measurement study of 7,973 remote MCP servers finds 40.55% lack authentication and all 119 tested OAuth servers have flaws that risk data leaks or account takeover.
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale
cs.CR 2026-01 unverdicted novelty 8.0

26.1% of analyzed AI agent skills contain vulnerabilities across 14 patterns, with executable scripts raising risk 2.12x, based on static and LLM analysis of 31k skills.
Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem
cs.CR 2025-09 unverdicted novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers r...
DADL: A Declarative Description Language for Enterprise Tool Libraries in LLM Agent Systems
cs.SE 2026-05 unverdicted novelty 7.0

DADL is a declarative YAML format that lets a single runtime handle many REST API tools for LLM agents, cutting tool advertisement context cost by 142x from 142,000 to 1,000 tokens on a catalog of 1,833 definitions.
MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
cs.CR 2026-04 conditional novelty 7.0

MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
cs.CR 2026-04 unverdicted novelty 7.0

Presents a component-centric PoC dataset of malicious MCP servers and a two-stage behavioral deviation detector Connor achieving 94.6% F1-score.
AgentBound: Securing Execution Boundaries of AI Agents
cs.CR 2025-10 conditional novelty 7.0

AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
cs.SE 2025-09 conditional novelty 7.0

Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
cs.CR 2025-03 unverdicted novelty 7.0

MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
cond-mat.mtrl-sci 2026-05 unverdicted novelty 6.0

OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
cs.SE 2026-05 unverdicted novelty 6.0

MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol
cs.SE 2026-03 unverdicted novelty 6.0

The paper proposes Context-Aware Broker Protocol, Adaptive Timeout Budget Allocation, and Structured Error Recovery Framework to address gaps in identity, budgeting, and error handling for production AI agent deployme...
Semantic Attacks on Tool-Augmented LLMs: Securing the Model Context Protocol Against Descriptor-Level Manipulation
cs.CR 2025-12 unverdicted novelty 6.0

Descriptor-level manipulation in the Model Context Protocol can drive LLMs to unsafe tool selections in up to 36% of cases; a layered defense of integrity checks, auxiliary-LLM vetting, and runtime guardrails reduces ...
VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers
cs.CR 2026-05 unverdicted novelty 5.0

VIPER-MCP detects and exploits taint-style vulnerabilities in Model Context Protocol servers via anchor-query static analysis and feedback-driven prompt evolution, uncovering 106 zero-day vulnerabilities across 39,884...
Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments
cs.HC 2026-03 unverdicted novelty 5.0

The paper introduces an agentic AI platform to train and support recovered soldiers as peer facilitators providing mental health triage and interventions in austere battlefield environments.
Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP
cs.CR 2026-02 unverdicted novelty 5.0

The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.
CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems
cs.CR 2026-04 unverdicted novelty 4.0

CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...
Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
cs.AI 2026-04 unverdicted novelty 3.0

Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · cited by 19 Pith papers · 6 internal anchors

[1]

Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel. 2025. Can LLMs Replace Manual Annotation of Software Engineering Artifacts?. InIEEE/ACM International Conference on Mining Software Repositories

work page 2025
[2]

Glama AI. 2025. Glama: Your #1 Platform for Discovering Every MCP Server. https://glama.ai/mcp, last visited: May 15

work page 2025
[3]

Pydantic AI. 2025. Pydantic-AI: Agent Framework / shim to use Pydantic with LLMs. https://ai.pydantic.dev/, last visited: May 22

work page 2025
[4]

Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2022. An empirical study on the survival rate of GitHub projects. In Proceedings of the 19th International Conference on Mining Software Repositories . 365–375

work page 2022
[5]

Jehad Al Dallal and Anas Abdin. 2017. Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review. IEEE Transactions on Software Engineering 44, 1 (2017), 44–69

work page 2017
[6]

Mahmoud Alfadel, Diego Elias Costa, and Emad Shihab. 2023. Empirical analysis of security vulnerabilities in python packages. Empirical Software Engineering 28, 3 (2023), 59

work page 2023
[7]

Malak Aljabri, Maryam Aldossary, Noor Al-Homeed, Bushra Alhetelah, Malek Althubiany, Ohoud Alotaibi, and Sara Alsaqer. 2022. Testing and exploiting tools to improve owasp top ten security vulnerabilities detection. In 2022 14th International Conference on Computational Intelligence and Communication Networks (CICN) . IEEE, 797–803

work page 2022
[8]

Eman Abdullah AlOmar, Anushkrishna Venkatakrishnan, Mohamed Wiem Mkaouer, Christian Newman, and Ali Ouni. 2024. How to refactor this code? An exploratory study on developer-ChatGPT refactoring conversations. In Proceedings of the 21st International Conference on Mining Software Repositories . 202–206

work page 2024
[9]

Idan Amit and Dror G Feitelson. 2021. Corrective commit probability: a measure of the effort invested in bug fixing. Software Quality Journal 29, 4 (2021), 817–861

work page 2021
[10]

Anthropic. 2025. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, last visited: Apr 23

work page 2025
[11]

Anthropic. 2025. Model Context Protocol: NPM package. https://www.npmjs.com/package/%40modelcontextprotocol/ sdk, last visited: May 18

work page 2025
[12]

Anthropic. 2025. Model Context Protocol: PyPi package. https://pypistats.org/packages/mcp, last visited: May 18

work page 2025
[13]

Anthropic. 2025. Tool Calling: Tool Usage with Claude. https://docs.anthropic.com/en/docs/agents-and-tools/tool- use/overview, last visited: May 15

work page 2025
[14]

Apple. 2025. App Review Guidelines. https://developer.apple.com/app-store/review/guidelines/, last visited: June 03

work page 2025
[15]

Ellen Arteca, Max Schäfer, and Frank Tip. 2023. A statistical approach for finding property-access errors. arXiv preprint arXiv:2306.08741 (2023)

work page arXiv 2023
[16]

Microsoft Autogen. 2025. AutoGen: A framework for building AI agents and applications. https://microsoft.github.io/ autogen/stable/, last visited: Apr 23

work page 2025
[17]

Guilherme Avelino, Eleni Constantinou, Marco Tulio Valente, and Alexander Serebrenik. 2019. On the abandonment and survival of open source projects: An empirical investigation. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) . IEEE, 1–12

work page 2019
[18]

Nathaniel Ayewah, William Pugh, J David Morgenthaler, John Penix, and YuQian Zhou. 2007. Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering . 1–8

work page 2007
[19]

Sebastian Baltes, Jascha Knack, Daniel Anastasiou, Ralf Tymann, and Stephan Diehl. 2018. (No) influence of continuous integration on the commit activity in GitHub projects. In Proceedings of the 4th ACM SIGSOFT International Workshop on Software Analytics. 1–7. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Se...

work page 2018
[20]

Lingfeng Bao, Xin Xia, David Lo, and Gail C Murphy. 2019. A large scale study of long-time contributor prediction for github projects. IEEE Transactions on Software Engineering 47, 6 (2019), 1277–1298

work page 2019
[21]

Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2022. What are the practices for secret management in software artifacts?. In 2022 IEEE Secure Development Conference (SecDev) . IEEE, 69–76

work page 2022
[22]

João Helis Bernardo, Daniel Alencar Da Costa, Sérgio Queiroz de Medeiros, and Uirá Kulesza. 2024. How do machine learning projects use continuous integration practices? an empirical study on GitHub actions. In Proceedings of the 21st International Conference on Mining Software Repositories . 665–676

work page 2024
[23]

Ethan Bommarito and Michael Bommarito. 2019. An empirical analysis of the python package index (pypi). arXiv preprint arXiv:1907.11073 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[24]

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Hudson Borges, Andre Hora, and Marco Tulio Valente. 2016. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 334–344

work page 2016
[26]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020
[27]

Simon Butler, Jonas Gamalielsson, Björn Lundell, Christoffer Brax, Anders Mattsson, Tomas Gustavsson, Jonas Feist, Bengt Kvarnström, and Erik Lönroth. 2022. Considerations and challenges for the adoption of open source components in software-intensive businesses. Journal of Systems and Software 186 (2022), 111152

work page 2022
[28]

Paolo Calciati and Alessandra Gorla. 2017. How do apps evolve in their permission requests? a preliminary study. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) . IEEE, 37–41

work page 2017
[29]

G Ann Campbell and Patroklos P Papapetrou. 2013. SonarQube in action. Manning Publications Co

work page 2013
[30]

Giuseppe Castagna and Victor Lanvin. 2017. Gradual typing with union and intersection types. Proceedings of the ACM on Programming Languages 1, ICFP (2017), 1–28

work page 2017
[31]

CHAOSS Project. 2025. Community Health Analytics in Open Source Software: Topic - All Metrics. https://chaoss. community/kbtopic/all-metrics/. Accessed: Jun 10, 2025

work page 2025
[32]

CHAOSS Project. 2025. Practitioner Guide: Responsiveness. https://chaoss.community/practitioner-guide- responsiveness/. Accessed: May 15, 2025

work page 2025
[33]

Bihuan Chen, Linlin Chen, Chen Zhang, and Xin Peng. 2020. Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 42–53

work page 2020
[34]

Celia Chen, Shi Lin, Michael Shoga, Qing Wang, and Barry Boehm. 2018. How do defects hurt qualities? an empirical study on characterizing a software maintainability ontology in open source software. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) . IEEE, 226–237

work page 2018
[35]

Zhifei Chen, Lin Chen, Wanwangying Ma, and Baowen Xu. 2016. Detecting code smells in python programs. In 2016 international conference on Software Analysis, Testing and Evolution (SATE) . IEEE, 18–23

work page 2016
[36]

Henry Chesbrough. 2023. Measuring the economic value of open source. San Francisco: Linux Foundation (2023)

work page 2023
[37]

Steve Christey and Robert A Martin. 2007. Vulnerability type distributions in CVE. Mitre report, May (2007)

work page 2007
[38]

Cloudflare. 2025. Cloudflare Agents Docs: Model Context Protocol (MCP). https://developers.cloudflare.com/agents/ model-context-protocol, last visited: Apr 23

work page 2025
[39]

Jailton Coelho and Marco Tulio Valente. 2017. Why modern open source projects fail. In Proceedings of the 2017 11th Joint meeting on foundations of software engineering . 186–196

work page 2017
[40]

CrewAI. 2025. CrewAI: The leading multi-agent platform. https://www.crewai.com/, last visited: May 27

work page 2025
[41]

Dinis Barroqueiro Cruz, João Rafael Almeida, and José Luís Oliveira. 2023. Open source solutions for vulnerability assessment: A comparative analysis. IEEE Access 11 (2023), 100234–100255

work page 2023
[42]

Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in github for MSR studies. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) . IEEE, 560–564

work page 2021
[43]

Andrey Loskutov Keith Lea David Hovemeyer, Bill Pugh. 2025. An extensible multilanguage static code analyzer. https://findbugs.sourceforge.net/, last visited: May 18

work page 2025
[44]

DI De Silva, RD New Kandy, BLO Sachethana, SMDTH Dias, PYC Perera, ME Katipearachchi, and TDDH Jayasuriya

work page
[45]

Journal of Software Engineering Research and Development 11, 1 (2023), 1

The Relationship between Code Complexity and Software Quality: An Empirical Study. Journal of Software Engineering Research and Development 11, 1 (2023), 1

work page 2023
[46]

Alexandre Decan, Tom Mens, and Eleni Constantinou. 2018. On the impact of security vulnerabilities in the npm package dependency network. In Proceedings of the 15th international conference on mining software repositories . 181–191. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. 34 M. Mehedi Hasan et al

work page 2018
[47]

Kerstin Denecke, Richard May, LLMHealthGroup, and Octavio Rivera Romero. 2024. Potential of large language models in health care: Delphi study. Journal of Medical Internet Research 26 (2024), e52399

work page 2024
[48]

Dify. 2025. Dify: Build Production Ready Agentic Solution. https://dify.ai/, last visited: May 27

work page 2025
[49]

Inc Docker et al. 2020. Docker. lınea].[Junio de 2017]. Disponible en: https://www. docker. com/what-docker (2020)

work page 2020
[50]

Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755

work page 2006
[51]

Filipe Falcão, Caio Barbosa, Baldoino Fonseca, Alessandro Garcia, Márcio Ribeiro, and Rohit Gheyi. 2020. On relating technical, social factors, and the introduction of bugs. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 378–388

work page 2020
[52]

Rosa Falotico and Piero Quatto. 2015. Fleiss’ kappa statistic without paradoxes.Quality & Quantity 49 (2015), 463–470

work page 2015
[53]

Amir Hossein Ghapanchi. 2015. Predicting software future sustainability: A longitudinal perspective. Information Systems 49 (2015), 40–51

work page 2015
[54]

Sean Goggins, Kevin Lumbard, and Matt Germonprez. 2021. Open source community health: Analytical metrics and their corresponding narratives. In 2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal) . IEEE, 25–33

work page 2021
[55]

Software Improvement Group. 2025. State of Software 2025: A Global Report on the Hidden Costs and Risks of Software. https://www.softwareimprovementgroup.com/wp-content/uploads/State-of-software-2025.pdf, last visited: May 08

work page 2025
[56]

Aakanshi Gupta, Rashmi Gandhi, Nishtha Jatana, Divya Jatain, Sandeep Kumar Panda, and Janjhyam Venkata Naga Ramesh. 2023. A severity assessment of python code smells. IEEE Access 11 (2023), 119146–119160

work page 2023
[57]

Md Shariful Haque, Jeff Carver, and Travis Atkison. 2018. Causes, impacts, and detection approaches of code smell: a survey. In Proceedings of the 2018 ACM Southeast Conference . 1–8

work page 2018
[58]

E Hassan

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and A. E Hassan

work page
[59]

https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

The replication package of our study on MCP Servers. https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

work page
[60]

Ahmed E Hassan, Gustavo A Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, et al. 2024. Rethinking software engineering in the foundation model era: From task-driven ai copilots to goal-driven ai pair programmers. arXiv preprint arXiv:2404.10225 (2024)

work page arXiv 2024
[61]

Runzhi He, Hao He, Yuxia Zhang, and Minghui Zhou. 2023. Automating dependency updates in practice: An exploratory study on github dependabot. IEEE Transactions on Software Engineering 49, 8 (2023), 4004–4022

work page 2023
[62]

Israel Herraiz, Jesus M Gonzalez-Barahona, and Gregorio Robles. 2008. Determinism and evolution. In Proceedings of the 2008 international working conference on Mining software repositories . 1–10

work page 2008
[63]

Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. 2016. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM international conference on automated software engineering. 426–437

work page 2016
[64]

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv preprint arXiv:2503.23278 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

IBM. 2025. Cost of a Data Breach Report 2024. https://www.ibm.com/reports/data-breach, last visited: May 27

work page 2025
[66]

Samuel Idowu, Yorick Sens, Thorsten Berger, Jacob Krüger, and Michael Vierhauser. 2024. A large-scale study of ml-related python projects. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing . 1272–1281

work page 2024
[67]

Alphabet Inc. 2025. Security checklist. https://developer.android.com/privacy-and-security/security-tips, last visited: June 03

work page 2025
[68]

Nenad Jovanovic, Christopher Kruegel, and Engin Kirda. 2006. Pixy: A static analysis tool for detecting web application vulnerabilities. In 2006 IEEE Symposium on Security and Privacy (S&P’06) . IEEE, 6–pp

work page 2006
[69]

Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv preprint arXiv:2407.18370 (2024)

work page arXiv 2024
[70]

Arvinder Kaur and Ruchikaa Nayyar. 2020. A comparative study of static code analysis tools for vulnerability detection in c/c++ and java source code. Procedia Computer Science 171 (2020), 2023–2029

work page 2020
[71]

Noureddine Kerzazi, Foutse Khomh, and Bram Adams. 2014. Why do automated builds break? an empirical study. In 2014 IEEE international conference on software maintenance and evolution . IEEE, 41–50

work page 2014
[72]

Sonu Kumar, Anubhav Girdhar, Ritesh Patil, and Divyansh Tripathi. 2025. MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System. arXiv preprint arXiv:2504.12757 (2025)

work page arXiv 2025
[73]

Invariant Lab. 2025. Introducing MCP-Scan: Protecting MCP with Invariant. https://invariantlabs.ai/blog/introducing- mcp-scan, last visited: May 29

work page 2025
[74]

Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, and Rajesh Vasa. 2024. Comparative analysis of real issues in open-source machine learning projects. Empirical Software Engineering 29, 3 (2024), 60. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Security and Maintainability of MCP Servers 35

work page 2024
[75]

LangChain. 2025. LangChain: composable framework to build with LLMs . https://www.langchain.com/, last visited: Apr 23

work page 2025
[76]

Jasmine Latendresse, Suhaib Mujahid, Diego Elias Costa, and Emad Shihab. 2022. Not all dependencies are equal: An empirical study on production dependencies in npm. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12

work page 2022
[77]

Luigi Lavazza, Sandro Morasca, and Davide Tosi. 2021. Comparing static analysis and code smells as defect predictors: an empirical study. In IFIP international conference on open source systems . Springer, 1–15

work page 2021
[78]

Valentina Lenarduzzi, Francesco Lomio, Heikki Huttunen, and Davide Taibi. 2020. Are sonarqube rules inducing bugs?. In 2020 IEEE 27th international conference on software analysis, evolution and reengineering (SANER) . IEEE, 501–511

work page 2020
[79]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

work page 2020
[80]

Hao Li and Cor-Paul Bezemer. 2025. Bridging the language gap: an empirical study of bindings for open source machine learning libraries across software package ecosystems. Empirical Software Engineering 30, 1 (2025), 6

work page 2025

Showing first 80 references.

[1] [1]

Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel. 2025. Can LLMs Replace Manual Annotation of Software Engineering Artifacts?. InIEEE/ACM International Conference on Mining Software Repositories

work page 2025

[2] [2]

Glama AI. 2025. Glama: Your #1 Platform for Discovering Every MCP Server. https://glama.ai/mcp, last visited: May 15

work page 2025

[3] [3]

Pydantic AI. 2025. Pydantic-AI: Agent Framework / shim to use Pydantic with LLMs. https://ai.pydantic.dev/, last visited: May 22

work page 2025

[4] [4]

Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2022. An empirical study on the survival rate of GitHub projects. In Proceedings of the 19th International Conference on Mining Software Repositories . 365–375

work page 2022

[5] [5]

Jehad Al Dallal and Anas Abdin. 2017. Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review. IEEE Transactions on Software Engineering 44, 1 (2017), 44–69

work page 2017

[6] [6]

Mahmoud Alfadel, Diego Elias Costa, and Emad Shihab. 2023. Empirical analysis of security vulnerabilities in python packages. Empirical Software Engineering 28, 3 (2023), 59

work page 2023

[7] [7]

Malak Aljabri, Maryam Aldossary, Noor Al-Homeed, Bushra Alhetelah, Malek Althubiany, Ohoud Alotaibi, and Sara Alsaqer. 2022. Testing and exploiting tools to improve owasp top ten security vulnerabilities detection. In 2022 14th International Conference on Computational Intelligence and Communication Networks (CICN) . IEEE, 797–803

work page 2022

[8] [8]

Eman Abdullah AlOmar, Anushkrishna Venkatakrishnan, Mohamed Wiem Mkaouer, Christian Newman, and Ali Ouni. 2024. How to refactor this code? An exploratory study on developer-ChatGPT refactoring conversations. In Proceedings of the 21st International Conference on Mining Software Repositories . 202–206

work page 2024

[9] [9]

Idan Amit and Dror G Feitelson. 2021. Corrective commit probability: a measure of the effort invested in bug fixing. Software Quality Journal 29, 4 (2021), 817–861

work page 2021

[10] [10]

Anthropic. 2025. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, last visited: Apr 23

work page 2025

[11] [11]

Anthropic. 2025. Model Context Protocol: NPM package. https://www.npmjs.com/package/%40modelcontextprotocol/ sdk, last visited: May 18

work page 2025

[12] [12]

Anthropic. 2025. Model Context Protocol: PyPi package. https://pypistats.org/packages/mcp, last visited: May 18

work page 2025

[13] [13]

Anthropic. 2025. Tool Calling: Tool Usage with Claude. https://docs.anthropic.com/en/docs/agents-and-tools/tool- use/overview, last visited: May 15

work page 2025

[14] [14]

Apple. 2025. App Review Guidelines. https://developer.apple.com/app-store/review/guidelines/, last visited: June 03

work page 2025

[15] [15]

Ellen Arteca, Max Schäfer, and Frank Tip. 2023. A statistical approach for finding property-access errors. arXiv preprint arXiv:2306.08741 (2023)

work page arXiv 2023

[16] [16]

Microsoft Autogen. 2025. AutoGen: A framework for building AI agents and applications. https://microsoft.github.io/ autogen/stable/, last visited: Apr 23

work page 2025

[17] [17]

Guilherme Avelino, Eleni Constantinou, Marco Tulio Valente, and Alexander Serebrenik. 2019. On the abandonment and survival of open source projects: An empirical investigation. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) . IEEE, 1–12

work page 2019

[18] [18]

Nathaniel Ayewah, William Pugh, J David Morgenthaler, John Penix, and YuQian Zhou. 2007. Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering . 1–8

work page 2007

[19] [19]

Sebastian Baltes, Jascha Knack, Daniel Anastasiou, Ralf Tymann, and Stephan Diehl. 2018. (No) influence of continuous integration on the commit activity in GitHub projects. In Proceedings of the 4th ACM SIGSOFT International Workshop on Software Analytics. 1–7. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Se...

work page 2018

[20] [20]

Lingfeng Bao, Xin Xia, David Lo, and Gail C Murphy. 2019. A large scale study of long-time contributor prediction for github projects. IEEE Transactions on Software Engineering 47, 6 (2019), 1277–1298

work page 2019

[21] [21]

Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2022. What are the practices for secret management in software artifacts?. In 2022 IEEE Secure Development Conference (SecDev) . IEEE, 69–76

work page 2022

[22] [22]

João Helis Bernardo, Daniel Alencar Da Costa, Sérgio Queiroz de Medeiros, and Uirá Kulesza. 2024. How do machine learning projects use continuous integration practices? an empirical study on GitHub actions. In Proceedings of the 21st International Conference on Mining Software Repositories . 665–676

work page 2024

[23] [23]

Ethan Bommarito and Michael Bommarito. 2019. An empirical analysis of the python package index (pypi). arXiv preprint arXiv:1907.11073 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[24] [24]

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

Hudson Borges, Andre Hora, and Marco Tulio Valente. 2016. Understanding the factors that impact the popularity of GitHub repositories. In 2016 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 334–344

work page 2016

[26] [26]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020

[27] [27]

Simon Butler, Jonas Gamalielsson, Björn Lundell, Christoffer Brax, Anders Mattsson, Tomas Gustavsson, Jonas Feist, Bengt Kvarnström, and Erik Lönroth. 2022. Considerations and challenges for the adoption of open source components in software-intensive businesses. Journal of Systems and Software 186 (2022), 111152

work page 2022

[28] [28]

Paolo Calciati and Alessandra Gorla. 2017. How do apps evolve in their permission requests? a preliminary study. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) . IEEE, 37–41

work page 2017

[29] [29]

G Ann Campbell and Patroklos P Papapetrou. 2013. SonarQube in action. Manning Publications Co

work page 2013

[30] [30]

Giuseppe Castagna and Victor Lanvin. 2017. Gradual typing with union and intersection types. Proceedings of the ACM on Programming Languages 1, ICFP (2017), 1–28

work page 2017

[31] [31]

CHAOSS Project. 2025. Community Health Analytics in Open Source Software: Topic - All Metrics. https://chaoss. community/kbtopic/all-metrics/. Accessed: Jun 10, 2025

work page 2025

[32] [32]

CHAOSS Project. 2025. Practitioner Guide: Responsiveness. https://chaoss.community/practitioner-guide- responsiveness/. Accessed: May 15, 2025

work page 2025

[33] [33]

Bihuan Chen, Linlin Chen, Chen Zhang, and Xin Peng. 2020. Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 42–53

work page 2020

[34] [34]

Celia Chen, Shi Lin, Michael Shoga, Qing Wang, and Barry Boehm. 2018. How do defects hurt qualities? an empirical study on characterizing a software maintainability ontology in open source software. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) . IEEE, 226–237

work page 2018

[35] [35]

Zhifei Chen, Lin Chen, Wanwangying Ma, and Baowen Xu. 2016. Detecting code smells in python programs. In 2016 international conference on Software Analysis, Testing and Evolution (SATE) . IEEE, 18–23

work page 2016

[36] [36]

Henry Chesbrough. 2023. Measuring the economic value of open source. San Francisco: Linux Foundation (2023)

work page 2023

[37] [37]

Steve Christey and Robert A Martin. 2007. Vulnerability type distributions in CVE. Mitre report, May (2007)

work page 2007

[38] [38]

Cloudflare. 2025. Cloudflare Agents Docs: Model Context Protocol (MCP). https://developers.cloudflare.com/agents/ model-context-protocol, last visited: Apr 23

work page 2025

[39] [39]

Jailton Coelho and Marco Tulio Valente. 2017. Why modern open source projects fail. In Proceedings of the 2017 11th Joint meeting on foundations of software engineering . 186–196

work page 2017

[40] [40]

CrewAI. 2025. CrewAI: The leading multi-agent platform. https://www.crewai.com/, last visited: May 27

work page 2025

[41] [41]

Dinis Barroqueiro Cruz, João Rafael Almeida, and José Luís Oliveira. 2023. Open source solutions for vulnerability assessment: A comparative analysis. IEEE Access 11 (2023), 100234–100255

work page 2023

[42] [42]

Ozren Dabic, Emad Aghajani, and Gabriele Bavota. 2021. Sampling projects in github for MSR studies. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) . IEEE, 560–564

work page 2021

[43] [43]

Andrey Loskutov Keith Lea David Hovemeyer, Bill Pugh. 2025. An extensible multilanguage static code analyzer. https://findbugs.sourceforge.net/, last visited: May 18

work page 2025

[44] [44]

DI De Silva, RD New Kandy, BLO Sachethana, SMDTH Dias, PYC Perera, ME Katipearachchi, and TDDH Jayasuriya

work page

[45] [45]

Journal of Software Engineering Research and Development 11, 1 (2023), 1

The Relationship between Code Complexity and Software Quality: An Empirical Study. Journal of Software Engineering Research and Development 11, 1 (2023), 1

work page 2023

[46] [46]

Alexandre Decan, Tom Mens, and Eleni Constantinou. 2018. On the impact of security vulnerabilities in the npm package dependency network. In Proceedings of the 15th international conference on mining software repositories . 181–191. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. 34 M. Mehedi Hasan et al

work page 2018

[47] [47]

Kerstin Denecke, Richard May, LLMHealthGroup, and Octavio Rivera Romero. 2024. Potential of large language models in health care: Delphi study. Journal of Medical Internet Research 26 (2024), e52399

work page 2024

[48] [48]

Dify. 2025. Dify: Build Production Ready Agentic Solution. https://dify.ai/, last visited: May 27

work page 2025

[49] [49]

Inc Docker et al. 2020. Docker. lınea].[Junio de 2017]. Disponible en: https://www. docker. com/what-docker (2020)

work page 2020

[50] [50]

Tore Dybå, Vigdis By Kampenes, and Dag IK Sjøberg. 2006. A systematic review of statistical power in software engineering experiments. Information and Software Technology 48, 8 (2006), 745–755

work page 2006

[51] [51]

Filipe Falcão, Caio Barbosa, Baldoino Fonseca, Alessandro Garcia, Márcio Ribeiro, and Rohit Gheyi. 2020. On relating technical, social factors, and the introduction of bugs. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 378–388

work page 2020

[52] [52]

Rosa Falotico and Piero Quatto. 2015. Fleiss’ kappa statistic without paradoxes.Quality & Quantity 49 (2015), 463–470

work page 2015

[53] [53]

Amir Hossein Ghapanchi. 2015. Predicting software future sustainability: A longitudinal perspective. Information Systems 49 (2015), 40–51

work page 2015

[54] [54]

Sean Goggins, Kevin Lumbard, and Matt Germonprez. 2021. Open source community health: Analytical metrics and their corresponding narratives. In 2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal) . IEEE, 25–33

work page 2021

[55] [55]

Software Improvement Group. 2025. State of Software 2025: A Global Report on the Hidden Costs and Risks of Software. https://www.softwareimprovementgroup.com/wp-content/uploads/State-of-software-2025.pdf, last visited: May 08

work page 2025

[56] [56]

Aakanshi Gupta, Rashmi Gandhi, Nishtha Jatana, Divya Jatain, Sandeep Kumar Panda, and Janjhyam Venkata Naga Ramesh. 2023. A severity assessment of python code smells. IEEE Access 11 (2023), 119146–119160

work page 2023

[57] [57]

Md Shariful Haque, Jeff Carver, and Travis Atkison. 2018. Causes, impacts, and detection approaches of code smell: a survey. In Proceedings of the 2018 ACM Southeast Conference . 1–8

work page 2018

[58] [58]

E Hassan

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and A. E Hassan

work page

[59] [59]

https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

The replication package of our study on MCP Servers. https://github.com/SAILResearch/replication-25-mcp- server-empirical-study, last visited: Jun 11

work page

[60] [60]

Ahmed E Hassan, Gustavo A Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, et al. 2024. Rethinking software engineering in the foundation model era: From task-driven ai copilots to goal-driven ai pair programmers. arXiv preprint arXiv:2404.10225 (2024)

work page arXiv 2024

[61] [61]

Runzhi He, Hao He, Yuxia Zhang, and Minghui Zhou. 2023. Automating dependency updates in practice: An exploratory study on github dependabot. IEEE Transactions on Software Engineering 49, 8 (2023), 4004–4022

work page 2023

[62] [62]

Israel Herraiz, Jesus M Gonzalez-Barahona, and Gregorio Robles. 2008. Determinism and evolution. In Proceedings of the 2008 international working conference on Mining software repositories . 1–10

work page 2008

[63] [63]

Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. 2016. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM international conference on automated software engineering. 426–437

work page 2016

[64] [64]

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv preprint arXiv:2503.23278 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [65]

IBM. 2025. Cost of a Data Breach Report 2024. https://www.ibm.com/reports/data-breach, last visited: May 27

work page 2025

[66] [66]

Samuel Idowu, Yorick Sens, Thorsten Berger, Jacob Krüger, and Michael Vierhauser. 2024. A large-scale study of ml-related python projects. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing . 1272–1281

work page 2024

[67] [67]

Alphabet Inc. 2025. Security checklist. https://developer.android.com/privacy-and-security/security-tips, last visited: June 03

work page 2025

[68] [68]

Nenad Jovanovic, Christopher Kruegel, and Engin Kirda. 2006. Pixy: A static analysis tool for detecting web application vulnerabilities. In 2006 IEEE Symposium on Security and Privacy (S&P’06) . IEEE, 6–pp

work page 2006

[69] [69]

Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv preprint arXiv:2407.18370 (2024)

work page arXiv 2024

[70] [70]

Arvinder Kaur and Ruchikaa Nayyar. 2020. A comparative study of static code analysis tools for vulnerability detection in c/c++ and java source code. Procedia Computer Science 171 (2020), 2023–2029

work page 2020

[71] [71]

Noureddine Kerzazi, Foutse Khomh, and Bram Adams. 2014. Why do automated builds break? an empirical study. In 2014 IEEE international conference on software maintenance and evolution . IEEE, 41–50

work page 2014

[72] [72]

Sonu Kumar, Anubhav Girdhar, Ritesh Patil, and Divyansh Tripathi. 2025. MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System. arXiv preprint arXiv:2504.12757 (2025)

work page arXiv 2025

[73] [73]

Invariant Lab. 2025. Introducing MCP-Scan: Protecting MCP with Invariant. https://invariantlabs.ai/blog/introducing- mcp-scan, last visited: May 29

work page 2025

[74] [74]

Tuan Dung Lai, Anj Simmons, Scott Barnett, Jean-Guy Schneider, and Rajesh Vasa. 2024. Comparative analysis of real issues in open-source machine learning projects. Empirical Software Engineering 29, 3 (2024), 60. ACM Trans. Softw. Eng. Methodol., Vol. , No. , Article . Publication date: TBD. Studying the Security and Maintainability of MCP Servers 35

work page 2024

[75] [75]

LangChain. 2025. LangChain: composable framework to build with LLMs . https://www.langchain.com/, last visited: Apr 23

work page 2025

[76] [76]

Jasmine Latendresse, Suhaib Mujahid, Diego Elias Costa, and Emad Shihab. 2022. Not all dependencies are equal: An empirical study on production dependencies in npm. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12

work page 2022

[77] [77]

Luigi Lavazza, Sandro Morasca, and Davide Tosi. 2021. Comparing static analysis and code smells as defect predictors: an empirical study. In IFIP international conference on open source systems . Springer, 1–15

work page 2021

[78] [78]

Valentina Lenarduzzi, Francesco Lomio, Heikki Huttunen, and Davide Taibi. 2020. Are sonarqube rules inducing bugs?. In 2020 IEEE 27th international conference on software analysis, evolution and reengineering (SANER) . IEEE, 501–511

work page 2020

[79] [79]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

work page 2020

[80] [80]

Hao Li and Cor-Paul Bezemer. 2025. Bridging the language gap: an empirical study of bindings for open source machine learning libraries across software package ecosystems. Empirical Software Engineering 30, 1 (2025), 6

work page 2025