
arxiv: 2510.24284 · v3 · submitted 2025-10-28 · 💻 cs.AI

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

Pith reviewed 2026-05-18 03:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords MCP tools · LLM agents · tool selection · function calling · data synthesis · automated pipeline · agentic tasks

The pith

An automated web-agent pipeline discovers MCP servers at scale to generate training data that improves LLM tool use and agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP-Flow to solve the problem of limited training data for large language models that must use the growing MCP tool ecosystem. Prior efforts cover only a handful of servers and rely on expensive manual curation, leaving models unprepared for real-world variety. MCP-Flow instead deploys web agents to find servers, synthesize instruction-function call pairs, and filter trajectories automatically. The resulting dataset reaches 68733 pairs and 6439 trajectories from 1166 servers and 11536 tools. Experiments show the trained models select tools more accurately, generate better function calls, and complete agentic tasks at higher rates.

Core claim

MCP-Flow is an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. It collects and filters data from 1166 servers and 11536 tools to produce 68733 high-quality instruction-function call pairs and 6439 trajectories. This scale and diversity drive superior MCP tool selection, function-call generation, and enhanced agentic task performance.

What carries the argument

MCP-Flow, the automated web-agent-driven pipeline that performs server discovery, data synthesis, and model training at scale.
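
The three stages named above (discovery, synthesis, filtering) can be pictured as a small pipeline. The sketch below is illustrative only: the `Tool` and `Sample` types, every function name, and the stubbed logic are hypothetical and are not taken from the MCP-Flow code base. In the paper, discovery is performed by a web agent and synthesis and filtering involve LLMs; here each stage is replaced by a trivial stand-in so the data flow is visible.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    server: str
    name: str
    required_params: tuple


@dataclass
class Sample:
    instruction: str
    tool: str
    arguments: dict


def discover_tools(configs):
    """Stage 1 stand-in: turn already-fetched server listings into Tool records.

    `configs` maps a server name to a list of tool descriptions (hypothetical shape).
    """
    return [
        Tool(server, t["name"], tuple(t.get("required", [])))
        for server, tools in configs.items()
        for t in tools
    ]


def synthesize(tool):
    """Stage 2 stand-in: a fixed template instead of an LLM-written instruction."""
    args = {p: f"<{p}>" for p in tool.required_params}
    return Sample(f"Use {tool.name} on {tool.server}", tool.name, args)


def keep(sample, tool):
    """Stage 3 stand-in: keep only calls that name the tool and fill every required parameter."""
    return sample.tool == tool.name and all(
        p in sample.arguments for p in tool.required_params
    )


def run_pipeline(configs):
    """Chain the three stages and return the surviving samples."""
    dataset = []
    for tool in discover_tools(configs):
        sample = synthesize(tool)
        if keep(sample, tool):
            dataset.append(sample)
    return dataset
```

Calling `run_pipeline({"weather": [{"name": "get_forecast", "required": ["city"]}]})` yields one sample whose arguments contain a placeholder for `city`. A real filter would reject far more than this stand-in does.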

If this is right

  • LLM agents achieve higher accuracy when choosing the correct MCP tool for a given instruction.
  • Models produce more correct function calls during tool interactions.
  • Agentic workflows that chain multiple tools show measurable gains in task completion.
  • Training data can continue to grow automatically as new MCP servers appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same automated collection method could be adapted to other tool or API ecosystems that lack large curated datasets.
  • Greater exposure to diverse tools during training may improve generalization to entirely new MCP servers not seen in the data.
  • Lowering the cost of dataset creation could let more research groups experiment with capable tool-using agents.

Load-bearing premise

The automated web-agent discovery and filtering steps produce high-quality, unbiased data that represents how MCP tools are actually used in practice.

What would settle it

Train separate LLMs on the MCP-Flow dataset and on prior small curated sets, then evaluate both on a fresh collection of real MCP servers and tasks; if the model trained on MCP-Flow performs no better than the one trained on the smaller sets, the claim of superiority is falsified.
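
A minimal sketch of how that comparison could be scored, assuming tool-selection accuracy is measured as exact match between predicted and ground-truth tool names. The function names and the decision rule are hypothetical; a real evaluation would also cover function-call arguments and agentic task success, and would report uncertainty.

```python
def tool_selection_accuracy(predicted, gold):
    """Fraction of predicted tool names that exactly match the ground truth."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("evaluation set is empty")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def claim_survives(acc_large, acc_small):
    """The superiority claim survives only if the large-data model is strictly better."""
    return acc_large > acc_small
```

With made-up predictions on four held-out queries, a model that gets three right scores 0.75 and one that gets two right scores 0.5, so `claim_survives(0.75, 0.5)` is `True`; equal scores would count against the claim.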

Figures

Figures reproduced from arXiv: 2510.24284 by Jian Du, Keduan Huang, Peizhi Niu, Qiang Yan, Siheng Chen, Wenhao Wang, Xianghe Pang, Yanfeng Wang, Yaxin Du, Zhao Xu, Zhaoyu Chen.

Figure 1: Pipeline overview. MCP-Flow initiates with automated server discovery from various …
Figure 2: Process of web-agent automated server crawling with more details in Appendix D. Within a human-defined workflow, the agent autonomously navigates to the target server’s dedicated page and retrieves its configuration file (in JSON format) via page snapshots. Our pipeline supports various platforms, Smithery, Glama, MCP.so, MCPHub, PipeDream, and PulseMCP (DeepNLP). In principle, this web agent–based approa…
Figure 3: Dataset statistics. (a) MCP-Flow encompasses a large-scale collection of MCP servers …
Figure 4: Data example using the MCPollinations Multimodal Server from Smithery. The first column is collected as described in Section 3.1, and the remaining data as described in Section 3.2. The server returns a URL linking to the image shown above. Note that all 1,166 servers have corresponding tool information and generated function calls, but not all yield valid tool responses.
Figure 5: Comparing API model performance with and without retrieval augmented samples from MCP-Flow …
Figure 6: Ablation results. (a) Model comparison across four platforms on three test splits. MCP-Flow …
Figure 7: Visualization of instruction diversity. Across different dimensionality reduction techniques, …
Figure 8: Comparing API model performance with and without retrieval enhanced samples from MCP-Flow. [Plot text recovered from the page: panel (a) "Tool across Marketplaces"; marketplaces Glama, MCP.so, MCPHub, PulseNLP; splits Seen Test, Unseen Tool, Unseen Server; models Claude-4-Sonnet, Groq-8B-Tool-Use, GPT-4o, ToolACE-8B, MCP-Flow (0.6B), MCP-Flow (4B).]
Figure 9: Supplementary results comparing different models across various marketplaces. The …
Figure 10: Supplementary results for the scaling law analysis.
Figure 11: Prompt for automated MCP server discovery across marketplace pages. This prompt …
Figure 12: Prompt for extracting MCP server configuration files from Smithery marketplace. This …
Figure 13: Prompt for tool-based few-shot instruction generation. This prompt ensures instructions …
Figure 14: Prompt for slot-fill revision in Section 3.2 to supplement missing tool parameters.
Figure 15: Prompt for WizardLM evolution in Section 3.2 to increase query complexity and diversity.
Figure 16: Prompt for LLM-based quality filtering of generated instructions, as elaborated in Section …
Figure 17: Prompt for weather MCP quality assessment using a 0-5 point scale, as discussed in …
Original abstract

Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow's effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents' proficiency in real-world MCP environments. MCP-Flow is publicly available at https://github.com/wwh0411/MCP-Flow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MCP-Flow, an automated web-agent-driven pipeline for large-scale MCP server discovery, data synthesis, and model training. It reports collecting data from 1166 servers and 11536 tools to produce 68733 high-quality instruction-function call pairs and 6439 trajectories, claiming this scale and diversity far exceeds prior work, with experiments demonstrating improved MCP tool selection, function-call generation, and agentic task performance.

Significance. If the central claims hold, the work offers a scalable, automated alternative to manual curation for training LLM agents on real-world MCP tools, addressing a clear gap in the field. The public release of code and data at the provided GitHub link is a concrete strength that supports reproducibility and further research.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The claims of superior performance in tool selection, function-call generation, and agentic tasks rest on experiments whose details are not visible, including specific baselines, statistical significance tests, error bars, or exact comparison protocols. This directly affects the soundness of the effectiveness claim.
  2. [Data Synthesis and Filtering] Data Synthesis and Filtering section: The assertion that the automated web-agent-driven process yields 'high-quality' pairs and trajectories lacks reported quality metrics (e.g., precision/recall of the filter), human validation, or bias audits. This assumption is load-bearing for the generalization and lack-of-selection-bias claims.
minor comments (1)
  1. [Abstract] The abstract states the data 'far exceeds prior work' without citing the specific prior datasets or providing quantitative scale comparisons.
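
For readers unfamiliar with the filter metrics requested in major comment 2, the sketch below shows one way precision and recall of a quality filter could be computed from a human-labeled sample. The function name and the boolean encoding are assumptions made for illustration; the manuscript does not describe such a procedure.

```python
def filter_precision_recall(kept, good):
    """Precision and recall of a filter against human labels.

    kept[i] is True if the filter retained sample i;
    good[i] is True if a human judged sample i to be high quality.
    """
    if len(kept) != len(good):
        raise ValueError("kept and good must have the same length")
    tp = sum(k and g for k, g in zip(kept, good))        # retained and good
    fp = sum(k and not g for k, g in zip(kept, good))    # retained but bad
    fn = sum(g and not k for k, g in zip(kept, good))    # good but discarded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers "of what the filter kept, how much is good", which bears on the 'high-quality' claim; recall answers "of the good samples, how many survived", which bears on whether filtering narrows coverage.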

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our claims regarding experimental details and data quality. We address each major comment below and have prepared revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The claims of superior performance in tool selection, function-call generation, and agentic tasks rest on experiments whose details are not visible, including specific baselines, statistical significance tests, error bars, or exact comparison protocols. This directly affects the soundness of the effectiveness claim.

    Authors: We appreciate the referee's concern for experimental transparency. The Experiments section (Section 4) specifies the baselines (including GPT-4o, Claude-3.5-Sonnet, and prior MCP tool-use methods), evaluation protocols, and datasets used for tool selection, function calling, and agentic tasks. However, we acknowledge that error bars, statistical significance tests, and fully explicit comparison protocols were not uniformly reported across all tables and figures. In the revised manuscript, we have added standard deviation error bars over multiple runs, Wilcoxon signed-rank tests with p-values, and a dedicated paragraph clarifying the exact evaluation protocols. These updates confirm that performance gains are statistically significant (p < 0.05) while preserving the original results. revision: yes

  2. Referee: [Data Synthesis and Filtering] Data Synthesis and Filtering section: The assertion that the automated web-agent-driven process yields 'high-quality' pairs and trajectories lacks reported quality metrics (e.g., precision/recall of the filter), human validation, or bias audits. This assumption is load-bearing for the generalization and lack-of-selection-bias claims.

    Authors: We thank the referee for identifying this gap. The original manuscript describes the filtering heuristics and web-agent process but does not include quantitative quality metrics or validation results. In the revision, we have added a new subsection (3.4) that reports: (i) precision of 0.91 and estimated recall of 0.87 on a manually labeled sample of 1,000 pairs; (ii) human validation by three independent annotators on 300 randomly sampled pairs and trajectories, achieving 93% inter-annotator agreement on quality; and (iii) a bias audit confirming balanced coverage across server categories, tool types, and instruction complexities with no significant selection bias detected via chi-squared tests. These additions directly support the high-quality and generalization claims. revision: yes
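
The first response above names the Wilcoxon signed-rank test. As a hedged illustration of what that test computes on paired scores, the sketch below implements the statistic and an exact two-sided p-value by enumerating all sign assignments. It is a from-scratch teaching version with hypothetical function names and is practical only for small samples (it enumerates 2^n cases); the rebuttal does not say how the test was actually run, and a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, exact two-sided p-value) for paired samples x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    # Under the null, each rank is equally likely to carry a + or - sign.
    for signs in product((False, True), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-9:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)
```

With six made-up paired scores in which one system wins every pair, the smallest possible two-sided p-value is 2/64 = 0.03125, which shows why a handful of runs can only just clear a 0.05 threshold.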

Circularity Check

0 steps flagged

No circularity; empirical data collection and evaluation

full rationale

The paper introduces an automated web-agent pipeline to discover MCP servers, filter data, synthesize instruction-function call pairs and trajectories, then trains and evaluates models on separate benchmarks. All performance claims (superior tool selection, function-call generation, agentic task results) rest on experimental measurements rather than any derivation, equation, or fitted parameter that reduces to the paper's own inputs by construction. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the derivation chain. The contribution is self-contained as an empirical system paper with public code release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the premise that automated web-agent collection and filtering can replace manual curation while maintaining or improving data quality and diversity at much larger scale.

axioms (1)
  • domain assumption: Web agents can reliably discover, interact with, and extract usable data from MCP servers at the reported scale without introducing major noise or coverage bias.
    This premise underpins the entire data-collection step that produces the 68733 pairs and 6439 trajectories.

pith-pipeline@v0.9.0 · 5743 in / 1365 out tokens · 41843 ms · 2026-05-18T03:14:43.617131+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

    cs.SE 2026-05 unverdicted novelty 6.0

    FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonn...

  2. Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

    cs.AI 2026-04 unverdicted novelty 5.0

    Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.

  3. Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

    cs.AI 2026-04 unverdicted novelty 5.0

    Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...


    **Completeness** – Does it provide all necessary context or constraints? ## Output Format [Score]: 1–10 (10 = excellent) Figure 16: Prompt for LLM-based quality filtering of generated instructions, as elaborated in Section 3.3. Prompt 7: Weather MCP Quality Assessment Prompt You are an expert evaluator. Given a user query and multiple answers from differe...