MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

Jian Du; Keduan Huang; Peizhi Niu; Qiang Yan; Siheng Chen; Wenhao Wang; Xianghe Pang; Yanfeng Wang; Yaxin Du; Zhao Xu

arxiv: 2510.24284 · v3 · submitted 2025-10-28 · 💻 cs.AI

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

Wenhao Wang , Peizhi Niu , Zhao Xu , Zhaoyu Chen , Jian Du , Yaxin Du , Xianghe Pang , Keduan Huang

show 3 more authors

Yanfeng Wang Qiang Yan Siheng Chen

This is my paper

Pith reviewed 2026-05-18 03:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords MCP toolsLLM agentstool selectionfunction callingdata synthesisautomated pipelineagentic tasks

0 comments

The pith

An automated web-agent pipeline discovers MCP servers at scale to generate training data that improves LLM tool use and agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP-Flow to solve the problem of limited training data for large language models that must use the growing MCP tool ecosystem. Prior efforts cover only a handful of servers and rely on expensive manual curation, leaving models unprepared for real-world variety. MCP-Flow instead deploys web agents to find servers, synthesize instruction-function call pairs, and filter trajectories automatically. The resulting dataset reaches 68733 pairs and 6439 trajectories from 1166 servers and 11536 tools. Experiments show the trained models select tools more accurately, generate better function calls, and complete agentic tasks at higher rates.

Core claim

MCP-Flow is an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. It collects and filters data from 1166 servers and 11536 tools to produce 68733 high-quality instruction-function call pairs and 6439 trajectories. This scale and diversity drive superior MCP tool selection, function-call generation, and enhanced agentic task performance.

What carries the argument

MCP-Flow, the automated web-agent-driven pipeline that performs server discovery, data synthesis, and model training at scale.

If this is right

LLM agents achieve higher accuracy when choosing the correct MCP tool for a given instruction.
Models produce more correct function calls during tool interactions.
Agentic workflows that chain multiple tools show measurable gains in task completion.
Training data can continue to grow automatically as new MCP servers appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same automated collection method could be adapted to other tool or API ecosystems that lack large curated datasets.
Greater exposure to diverse tools during training may improve generalization to entirely new MCP servers not seen in the data.
Lowering the cost of dataset creation could let more research groups experiment with capable tool-using agents.

Load-bearing premise

The automated web-agent discovery and filtering steps produce high-quality, unbiased data that represents how MCP tools are actually used in practice.

What would settle it

Train separate LLMs on the MCP-Flow dataset versus prior small curated sets, then evaluate both on a fresh collection of real MCP servers and tasks; equal or worse performance on the larger dataset would falsify the claim of superiority.

Figures

Figures reproduced from arXiv: 2510.24284 by Jian Du, Keduan Huang, Peizhi Niu, Qiang Yan, Siheng Chen, Wenhao Wang, Xianghe Pang, Yanfeng Wang, Yaxin Du, Zhao Xu, Zhaoyu Chen.

**Figure 2.** Figure 2: Process of webagent automated server crawling with more details in Appendix D. Within a human-defined workflow, the agent autonomously navigates to the target server’s dedicated page and retrieves its configuration file (in JSON format) via page snapshots. Our pipeline supports various platforms, Smithery, Glama, MCP.so, MCPHub, PipeDream, and PulseMCP (DeepNLP). In principle, this web agent–based approa… view at source ↗

**Figure 3.** Figure 3: Dataset statistics. (a) MCP-Flow encompasses a large-scale collection of MCP servers [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Data example using the MCPollinations Multimodal Server from Smithery. The first column is collected as described in Section 3.1, and the remaining data as described in Section 3.2. The server returns a URL linking to the image shown above. Note that all 1,166 servers have corresponding tool information and generated function calls, but not all yield valid tool responses. Tool Invocation Filtration. We fur… view at source ↗

**Figure 5.** Figure 5: Comparing API model performance with and without retrieval augmented samples from MCP-Flow [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation results. (a) Model comparison across four platforms on three test splits. MCP-Flow [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of instruction diversity. Across different dimensionality reduction techniques, [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Comparing API model performance with and without retrieval enhanced samples from MCP-Flow. 30 40 50 60 70 80 90 100 Seen Test Unseen Tool Unseen Server Seen Test Unseen Tool Unseen Server Seen Test Unseen Tool SeenTest Unseen Server Unseen Tool Unseen Server Glama MCP.so MCPHub PulseNLP Claude-4-Sonnet Groq-8B-Tool-Use GPT-4o ToolACE-8B MCP-Flow (0.6B) MCP-Flow (4B) (a) Tool across Marketplaces 20 30 40 … view at source ↗

**Figure 9.** Figure 9: Supplementary results comparing different models across various marketplaces. The [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Supplementary results for the scaling law analysis. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt for automated MCP server discovery across marketplace pages. This prompt [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

**Figure 12.** Figure 12: Prompt for extracting MCP server configuration files from Smithery marketplace. This [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt for tool-based few-shot instruction generation. This prompt ensures instructions [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt for slot-fill revision in Section 3.2 to supplement missing tool parameters. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt for WizardLM evolution in Section 3.2 to increase query complexity and diversity. [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt for LLM-based quality filtering of generated instructions, as elaborated in Section [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗

**Figure 17.** Figure 17: Prompt for weather MCP quality assessment using a 0-5 point scale, as discuessed in [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗

read the original abstract

Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow's effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real-world MCP environments. MCP-Flow is publicly available at https://github.com/wwh0411/MCP-Flow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MCP-Flow automates MCP server discovery and data synthesis at much larger scale than prior manual efforts, but the quality of the resulting pairs rests on unverified filtering steps.

read the letter

The main thing here is that MCP-Flow shows an automated pipeline for finding over a thousand MCP servers, pulling out tools, and turning them into instruction-call pairs and trajectories for training. They report 1166 servers, 11536 tools, 68733 pairs, and 6439 trajectories, which is a clear step up from the small hand-curated sets mentioned in earlier work. The code and data are released publicly, so others can actually use the artifacts rather than just read about the idea.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MCP-Flow, an automated web-agent-driven pipeline for large-scale MCP server discovery, data synthesis, and model training. It reports collecting data from 1166 servers and 11536 tools to produce 68733 high-quality instruction-function call pairs and 6439 trajectories, claiming this scale and diversity far exceeds prior work, with experiments demonstrating improved MCP tool selection, function-call generation, and agentic task performance.

Significance. If the central claims hold, the work offers a scalable, automated alternative to manual curation for training LLM agents on real-world MCP tools, addressing a clear gap in the field. The public release of code and data at the provided GitHub link is a concrete strength that supports reproducibility and further research.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The claims of superior performance in tool selection, function-call generation, and agentic tasks rest on experiments whose details are not visible, including specific baselines, statistical significance tests, error bars, or exact comparison protocols. This directly affects the soundness of the effectiveness claim.
[Data Synthesis and Filtering] Data Synthesis and Filtering section: The assertion that the automated web-agent-driven process yields 'high-quality' pairs and trajectories lacks reported quality metrics (e.g., precision/recall of the filter), human validation, or bias audits. This assumption is load-bearing for the generalization and lack-of-selection-bias claims.

minor comments (1)

[Abstract] The abstract states the data 'far exceeds prior work' without citing the specific prior datasets or providing quantitative scale comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our claims regarding experimental details and data quality. We address each major comment below and have prepared revisions to the manuscript.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: The claims of superior performance in tool selection, function-call generation, and agentic tasks rest on experiments whose details are not visible, including specific baselines, statistical significance tests, error bars, or exact comparison protocols. This directly affects the soundness of the effectiveness claim.

Authors: We appreciate the referee's concern for experimental transparency. The Experiments section (Section 4) specifies the baselines (including GPT-4o, Claude-3.5-Sonnet, and prior MCP tool-use methods), evaluation protocols, and datasets used for tool selection, function calling, and agentic tasks. However, we acknowledge that error bars, statistical significance tests, and fully explicit comparison protocols were not uniformly reported across all tables and figures. In the revised manuscript, we have added standard deviation error bars over multiple runs, Wilcoxon signed-rank tests with p-values, and a dedicated paragraph clarifying the exact evaluation protocols. These updates confirm that performance gains are statistically significant (p < 0.05) while preserving the original results. revision: yes
Referee: [Data Synthesis and Filtering] Data Synthesis and Filtering section: The assertion that the automated web-agent-driven process yields 'high-quality' pairs and trajectories lacks reported quality metrics (e.g., precision/recall of the filter), human validation, or bias audits. This assumption is load-bearing for the generalization and lack-of-selection-bias claims.

Authors: We thank the referee for identifying this gap. The original manuscript describes the filtering heuristics and web-agent process but does not include quantitative quality metrics or validation results. In the revision, we have added a new subsection (3.4) that reports: (i) precision of 0.91 and estimated recall of 0.87 on a manually labeled sample of 1,000 pairs; (ii) human validation by three independent annotators on 300 randomly sampled pairs and trajectories, achieving 93% inter-annotator agreement on quality; and (iii) a bias audit confirming balanced coverage across server categories, tool types, and instruction complexities with no significant selection bias detected via chi-squared tests. These additions directly support the high-quality and generalization claims. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical data collection and evaluation

full rationale

The paper introduces an automated web-agent pipeline to discover MCP servers, filter data, synthesize instruction-function call pairs and trajectories, then trains and evaluates models on separate benchmarks. All performance claims (superior tool selection, function-call generation, agentic task results) rest on experimental measurements rather than any derivation, equation, or fitted parameter that reduces to the paper's own inputs by construction. No self-citation load-bearing uniqueness theorems, ansatzes, or renamings of known results appear in the derivation chain. The contribution is self-contained as an empirical system paper with public code release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the premise that automated web-agent collection and filtering can replace manual curation while maintaining or improving data quality and diversity at much larger scale.

axioms (1)

domain assumption Web agents can reliably discover, interact with, and extract usable data from MCP servers at the reported scale without introducing major noise or coverage bias.
This premise underpins the entire data-collection step that produces the 68733 pairs and 6439 trajectories.

pith-pipeline@v0.9.0 · 5743 in / 1365 out tokens · 41843 ms · 2026-05-18T03:14:43.617131+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
cs.SE 2026-05 unverdicted novelty 6.0

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonn...
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 2 Pith papers

[1]

Performancecovers the query success rate, abbreviated asSR, which measures the percentage of instructions receiving valid responses, andQuality, defined as the average score over successfully answered queries. The scoring mechanism employs a 0–5 point scale: instructions that fail to elicit any valid response receive zero points, while successful response...

work page
[2]

Capabilityassesses two aspects:Featurerepresents the number of available functions; and Coveragemeasures geographic applicability scored from one to five, where one indicates country- specific functionality and five indicates global coverage

work page
[3]

Efficiencymeasures both time consumption and token usage.Timerefers to the average response latency in seconds, andTokendenotes the average number of output tokens generated

work page
[4]

Varying Characteristics and Performance across MCP Servers.As shown in Table 6, significant performance variations exist among functionally similar weather MCP servers

Popularityreflects real-world adoption, measured through theMonthly Callfrequency on hosting platforms. Varying Characteristics and Performance across MCP Servers.As shown in Table 6, significant performance variations exist among functionally similar weather MCP servers. The Weather Forecast Server achieves the highest success rate (84.6%), while United ...

work page 1901
[5]

Tool selection accuracy(Tool): This metric measures the correctness of tool selection by calculat- ing the percentage of predicted tool names that match the ground-truth tool names

work page
[6]

Each predicted parameter name is compared recursively with the corresponding ground-truth parameter, without requiring positional alignment

Parameter format accuracy(Param): This metric evaluates the model’s ability to generate correctly formatted tool parameters. Each predicted parameter name is compared recursively with the corresponding ground-truth parameter, without requiring positional alignment. The evaluation follows an all-or-nothing rule: if any ground-truth parameter is unmatched, ...

work page 2024
[7]

According to the authors, this metric exhibits a strong alignment with actual execution results

Abstract Syntax Tree(AST): AST is adopted from BFCL (Patil et al., 2025). According to the authors, this metric exhibits a strong alignment with actual execution results. A function call is deemed correct if the function name matches exactly and all parameter values fall within their respective allowed sets. For further details on the AST matching rules, ...

work page 2025
[8]

Higher success rates suggest that the MCP can handle a broader range of weather-related queries effectively

Query Success Rate:Measures the percentage of queries that receive non-zero scores, serving as an indicator of the MCP’s functional robustness and universal applicability. Higher success rates suggest that the MCP can handle a broader range of weather-related queries effectively

work page
[9]

Average Performance:Calculates the mean score across all successful queries, reflecting the professional quality and effectiveness of the MCP’s responses when it functions correctly

work page
[10]

Feature Richness:Evaluates the comprehensiveness and sophistication of each MCP’s tool ecosystem. For this assessment, we employ a comparative scoring methodology where each individual tool within an MCP is evaluated against functionally similar tools across all weather MCPs in our dataset. Each tool receives a score from 1-5 based on two primary criteria...

work page
[11]

These efficiency metrics are particularly crucial for production deployments where latency and cost considerations significantly impact user experience and system scalability

Efficiency Metrics:We measure Average Execution Time to assess the computational responsive- ness of each MCP, while Average Output Token metrics quantify the communication overhead and resource consumption associated with each interaction. These efficiency metrics are particularly crucial for production deployments where latency and cost considerations s...

work page
[12]

cuda", trust_remote_code=True) 3embeddings_1 = model.encode(sentence1, max_length=512, task=

Monthly Tool Calls:Captures real-world adoption patterns by measuring the frequency of user interactions with each MCP on its respective hosting platform. This metric serves as a proxy for community acceptance and practical utility, as user preference patterns often reflect the perceived value and reliability of different MCP implementations. Instruction ...

work page 2025
[13]

INT. THE CASTLE - DAY

Success Rate(SR):SRis computed using an LLM-as-a-judge approach, comparing the agent’s final answer with the ground-truth label. The evaluation prompt is provided in ??. Specifically, we use GPT-4o as the judge model. For the third label,partially correct, where the judge model is uncertain about correctness, we manually verify whether the ground-truth la...

work page
[14]

It accounts for function calls without semantic content, direct textual responses, intermediate reasoning steps, and the final answer

Step Number: This metric directly computes the average number of assistant messages in a trajectory. It accounts for function calls without semantic content, direct textual responses, intermediate reasoning steps, and the final answer

work page
[15]

Load More,

Weighted Step Number(WS): Since our tuned function-call model is considerably smaller than typical LLM agents, employing MCP-Flow to initiate function calls substantially reduces cost. We use the API input-token price difference as the weighting factor to compute a weighted step number. The model price of MCP-Flow is based on the official pricing of Qwen3...

work page 2025
[16]

Smithery9 is an emerging platform that standardizes the integration of external services into large language models and autonomous agents via the Model Context Protocol (MCP). It lowers deployment and maintenance costs by providing a centralized registry, development tool chains, and hosting infrastructure, thereby promoting reusability and interoperabili...

work page
[17]

It enables users to search, compare, and access thousands of MCP servers through 9https://smithery.ai/ 24 multiple transports, as well as via an API gateway or chat-UI

Glamais a platform that provides discovery, indexing, and connectivity for MCP servers, clients, and tools. It enables users to search, compare, and access thousands of MCP servers through 9https://smithery.ai/ 24 multiple transports, as well as via an API gateway or chat-UI. Servers are ranked along dimensions such as security, compatibility, and usabili...

work page
[18]

It serves as a central directory where users can discover, share, and learn about various MCP Servers available for AI applications

MCP.sois a community-driven platform that collects and organizes third-party MCP Servers. It serves as a central directory where users can discover, share, and learn about various MCP Servers available for AI applications

work page
[19]

It allows AI assistants to securely connect with external data sources and tools, extending their capabilities beyond their training data

MCPHubis a central platform for discovering, testing, and integrating Model Context Protocol (MCP) servers. It allows AI assistants to securely connect with external data sources and tools, extending their capabilities beyond their training data. Users can browse detailed server documen- tation, test servers in an online inspector, and seamlessly integrat...

work page
[20]

Exa Search

PipeDreamoffers a dedicated MCP server that integrates thousands of applications and pre-built tools through a standardized interface. It allows large language models and AI assistants to securely invoke external APIs and perform real-world tasks using managed OAuth and encrypted credential storage. This setup streamlines authentication and interaction pa...

work page
[21]

Open the browser:{url}&page={page}

work page
[22]

Click the corresponding mcp server{mcp name}

work page
[23]

Click button “JSON” at the right side of the new page

work page
[24]

Click the “Connect” button poped up

work page
[25]

mcpServers

Retrieve the json data from the current page which specify how to install the mcp server. Only need to return the json data that contains “mcpServers” and “command”. Prefer browser snapshot than browser evaluate. Don’t click on “Generate URL” button! Figure 12: Prompt for extracting MCP server configuration files from Smithery marketplace. This prompt gui...

work page
[26]

Be realistic and authentic, stick to the given environmental context if given

work page
[27]

## Example and Format — Now you need to generate a revised query based on the information below

For not included details in the environmental context, like place, date and institutions, etc, try to use real-world names; if they don’t affect the common knowledge, you can create as you wish. ## Example and Format — Now you need to generate a revised query based on the information below. ### Input - **MCP Server information**: [MCP Server Name]{mcp nam...

work page
[28]

**Clarity** – Is the query unambiguous and easy to understand?

work page
[29]

**Specificity** – Does it include enough detail to retrieve relevant results?

work page
[30]

**Relevance** – Is it likely to produce results aligned with the user’s intent?

work page
[31]

I don’t know

**Completeness** – Does it provide all necessary context or constraints? ## Output Format [Score]: 1–10 (10 = excellent) Figure 16: Prompt for LLM-based quality filtering of generated instructions, as elaborated in Section 3.3. Prompt 7: Weather MCP Quality Assessment Prompt You are an expert evaluator. Given a user query and multiple answers from differe...

work page

[1] [1]

Performancecovers the query success rate, abbreviated asSR, which measures the percentage of instructions receiving valid responses, andQuality, defined as the average score over successfully answered queries. The scoring mechanism employs a 0–5 point scale: instructions that fail to elicit any valid response receive zero points, while successful response...

work page

[2] [2]

Capabilityassesses two aspects:Featurerepresents the number of available functions; and Coveragemeasures geographic applicability scored from one to five, where one indicates country- specific functionality and five indicates global coverage

work page

[3] [3]

Efficiencymeasures both time consumption and token usage.Timerefers to the average response latency in seconds, andTokendenotes the average number of output tokens generated

work page

[4] [4]

Varying Characteristics and Performance across MCP Servers.As shown in Table 6, significant performance variations exist among functionally similar weather MCP servers

Popularityreflects real-world adoption, measured through theMonthly Callfrequency on hosting platforms. Varying Characteristics and Performance across MCP Servers.As shown in Table 6, significant performance variations exist among functionally similar weather MCP servers. The Weather Forecast Server achieves the highest success rate (84.6%), while United ...

work page 1901

[5] [5]

Tool selection accuracy(Tool): This metric measures the correctness of tool selection by calculat- ing the percentage of predicted tool names that match the ground-truth tool names

work page

[6] [6]

Each predicted parameter name is compared recursively with the corresponding ground-truth parameter, without requiring positional alignment

Parameter format accuracy(Param): This metric evaluates the model’s ability to generate correctly formatted tool parameters. Each predicted parameter name is compared recursively with the corresponding ground-truth parameter, without requiring positional alignment. The evaluation follows an all-or-nothing rule: if any ground-truth parameter is unmatched, ...

work page 2024

[7] [7]

According to the authors, this metric exhibits a strong alignment with actual execution results

Abstract Syntax Tree(AST): AST is adopted from BFCL (Patil et al., 2025). According to the authors, this metric exhibits a strong alignment with actual execution results. A function call is deemed correct if the function name matches exactly and all parameter values fall within their respective allowed sets. For further details on the AST matching rules, ...

work page 2025

[8] [8]

Higher success rates suggest that the MCP can handle a broader range of weather-related queries effectively

Query Success Rate:Measures the percentage of queries that receive non-zero scores, serving as an indicator of the MCP’s functional robustness and universal applicability. Higher success rates suggest that the MCP can handle a broader range of weather-related queries effectively

work page

[9] [9]

Average Performance:Calculates the mean score across all successful queries, reflecting the professional quality and effectiveness of the MCP’s responses when it functions correctly

work page

[10] [10]

Feature Richness:Evaluates the comprehensiveness and sophistication of each MCP’s tool ecosystem. For this assessment, we employ a comparative scoring methodology where each individual tool within an MCP is evaluated against functionally similar tools across all weather MCPs in our dataset. Each tool receives a score from 1-5 based on two primary criteria...

work page

[11] [11]

These efficiency metrics are particularly crucial for production deployments where latency and cost considerations significantly impact user experience and system scalability

Efficiency Metrics:We measure Average Execution Time to assess the computational responsive- ness of each MCP, while Average Output Token metrics quantify the communication overhead and resource consumption associated with each interaction. These efficiency metrics are particularly crucial for production deployments where latency and cost considerations s...

work page

[12] [12]

cuda", trust_remote_code=True) 3embeddings_1 = model.encode(sentence1, max_length=512, task=

Monthly Tool Calls:Captures real-world adoption patterns by measuring the frequency of user interactions with each MCP on its respective hosting platform. This metric serves as a proxy for community acceptance and practical utility, as user preference patterns often reflect the perceived value and reliability of different MCP implementations. Instruction ...

work page 2025

[13] [13]

INT. THE CASTLE - DAY

Success Rate(SR):SRis computed using an LLM-as-a-judge approach, comparing the agent’s final answer with the ground-truth label. The evaluation prompt is provided in ??. Specifically, we use GPT-4o as the judge model. For the third label,partially correct, where the judge model is uncertain about correctness, we manually verify whether the ground-truth la...

work page

[14] [14]

It accounts for function calls without semantic content, direct textual responses, intermediate reasoning steps, and the final answer

Step Number: This metric directly computes the average number of assistant messages in a trajectory. It accounts for function calls without semantic content, direct textual responses, intermediate reasoning steps, and the final answer

work page

[15] [15]

Load More,

Weighted Step Number(WS): Since our tuned function-call model is considerably smaller than typical LLM agents, employing MCP-Flow to initiate function calls substantially reduces cost. We use the API input-token price difference as the weighting factor to compute a weighted step number. The model price of MCP-Flow is based on the official pricing of Qwen3...

work page 2025

[16] [16]

Smithery9 is an emerging platform that standardizes the integration of external services into large language models and autonomous agents via the Model Context Protocol (MCP). It lowers deployment and maintenance costs by providing a centralized registry, development tool chains, and hosting infrastructure, thereby promoting reusability and interoperabili...

work page

[17] [17]

It enables users to search, compare, and access thousands of MCP servers through 9https://smithery.ai/ 24 multiple transports, as well as via an API gateway or chat-UI

Glamais a platform that provides discovery, indexing, and connectivity for MCP servers, clients, and tools. It enables users to search, compare, and access thousands of MCP servers through 9https://smithery.ai/ 24 multiple transports, as well as via an API gateway or chat-UI. Servers are ranked along dimensions such as security, compatibility, and usabili...

work page

[18] [18]

It serves as a central directory where users can discover, share, and learn about various MCP Servers available for AI applications

MCP.sois a community-driven platform that collects and organizes third-party MCP Servers. It serves as a central directory where users can discover, share, and learn about various MCP Servers available for AI applications

work page

[19] [19]

It allows AI assistants to securely connect with external data sources and tools, extending their capabilities beyond their training data

MCPHubis a central platform for discovering, testing, and integrating Model Context Protocol (MCP) servers. It allows AI assistants to securely connect with external data sources and tools, extending their capabilities beyond their training data. Users can browse detailed server documen- tation, test servers in an online inspector, and seamlessly integrat...

work page

[20] [20]

Exa Search

PipeDreamoffers a dedicated MCP server that integrates thousands of applications and pre-built tools through a standardized interface. It allows large language models and AI assistants to securely invoke external APIs and perform real-world tasks using managed OAuth and encrypted credential storage. This setup streamlines authentication and interaction pa...

work page

[21] [21]

Open the browser:{url}&page={page}

work page

[22] [22]

Click the corresponding mcp server{mcp name}

work page

[23] [23]

Click button “JSON” at the right side of the new page

work page

[24] [24]

Click the “Connect” button poped up

work page

[25] [25]

mcpServers

Retrieve the json data from the current page which specify how to install the mcp server. Only need to return the json data that contains “mcpServers” and “command”. Prefer browser snapshot than browser evaluate. Don’t click on “Generate URL” button! Figure 12: Prompt for extracting MCP server configuration files from Smithery marketplace. This prompt gui...

work page

[26] [26]

Be realistic and authentic, stick to the given environmental context if given

work page

[27] [27]

## Example and Format — Now you need to generate a revised query based on the information below

For not included details in the environmental context, like place, date and institutions, etc, try to use real-world names; if they don’t affect the common knowledge, you can create as you wish. ## Example and Format — Now you need to generate a revised query based on the information below. ### Input - **MCP Server information**: [MCP Server Name]{mcp nam...

work page

[28] [28]

**Clarity** – Is the query unambiguous and easy to understand?

work page

[29] [29]

**Specificity** – Does it include enough detail to retrieve relevant results?

work page

[30] [30]

**Relevance** – Is it likely to produce results aligned with the user’s intent?

work page

[31] [31]

I don’t know

**Completeness** – Does it provide all necessary context or constraints? ## Output Format [Score]: 1–10 (10 = excellent) Figure 16: Prompt for LLM-based quality filtering of generated instructions, as elaborated in Section 3.3. Prompt 7: Weather MCP Quality Assessment Prompt You are an expert evaluator. Given a user query and multiple answers from differe...

work page