pith. machine review for the scientific record. sign in

arxiv: 2602.00933 · v2 · submitted 2026-01-31 · 💻 cs.SE · cs.AI

Recognition: no theorem link

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:25 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords MCP-Atlastool-use evaluationLLM benchmarksModel Context Protocolmulti-step workflowsagentic systemstool callingfrontier models
0
0 comments X

The pith

MCP-Atlas introduces a benchmark with 36 real MCP servers, 220 tools, and 1,000 multi-step tasks to evaluate LLM tool-use competency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MCP-Atlas as a new benchmark for assessing how well large language models can use external tools in realistic scenarios. Unlike previous evaluations that use limited tools or simple tasks, this one uses actual MCP servers and requires models to figure out which tools to call based only on natural language descriptions. Tasks involve orchestrating 3 to 6 tool calls across servers, and success is measured by whether the final answer satisfies certain factual claims rather than exact matches. Evaluations of current frontier models show they pass more than half the tasks, but struggle mainly with understanding the task and using tools effectively. Releasing the dataset and harness allows others to test and improve their agents in a standardized way.

Core claim

MCP-Atlas is a benchmark of 36 real MCP servers and 220 tools that includes 1,000 tasks for multi-step tool-use workflows. Models must discover and invoke tools based on natural language prompts without being told which tools to use. Scoring uses a claims-based rubric for partial credit on factual accuracy in the answer, plus diagnostics for tool handling. Frontier models reach pass rates above 50 percent, with failures mostly from poor tool usage and task comprehension.

What carries the argument

The claims-based rubric awarding partial credit for satisfied factual claims in the model's final answer, supported by diagnostics tracking tool discovery, parameterization, syntax, error recovery, and efficiency.

If this is right

  • Top-performing models still fail primarily due to inadequate tool usage and task understanding.
  • The release of the task schema, containerized harness, and 500-task public subset enables reproducible comparisons across different agents.
  • Tasks are designed to require identifying and orchestrating 3-6 tool calls across multiple servers without naming them specifically.
  • Internal diagnostics provide detailed breakdowns beyond just pass/fail rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such benchmarks could highlight the need for improved reasoning about when and how to use tools in agent systems.
  • Extending this to even larger sets of servers might reveal patterns in how tool-use performance scales with model size.
  • Connecting task success to real-world applications could show how these benchmarks translate to practical agent capabilities.

Load-bearing premise

The claims-based rubric and internal diagnostics accurately measure genuine tool-use competency rather than surface-level answer matching or prompt-specific patterns.

What would settle it

Re-evaluating the top models on the public 500-task subset and finding pass rates well below 50% or significant disagreement between the rubric scores and human judgments on task completion would challenge the benchmark's validity.

Figures

Figures reproduced from arXiv: 2602.00933 by Andrew Park, Ben Hertzberg, Ben Levin, Bing Liu, Brad Kenstler, Chaithanya Bandi, Chetan Rane, Dan Rambado, Ernesto Hernandez, Geobio Boo, Ivan Salazar, Jeff Da, Manasi Sharma, Rafael Cruz, Sami Hassaan, Tejas Polakam.

Figure 1
Figure 1. Figure 1: MCP-Atlas overall model performance ranked by [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Diagnostic categories for the top 3 models on failed tasks. The y-axis shows average coverage score [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
read the original abstract

The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MCP-Atlas, a benchmark with 36 real MCP servers and 220 tools containing 1,000 tasks that require LLMs to discover and orchestrate 3-6 tool calls across servers using natural-language prompts that avoid naming tools or servers. Tasks are scored via a claims-based rubric that awards partial credit for factual claims satisfied in the final answer, supplemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation of frontier models shows top performers exceeding 50% pass rates, with primary failures attributed to inadequate tool usage and task understanding. The authors release the task schema, a containerized harness, and a 500-task public subset.

Significance. If the rubric and diagnostics are shown to reliably indicate genuine multi-server orchestration rather than final-answer matching, this constitutes a meaningful contribution by supplying a large-scale, realistic benchmark grounded in production MCP servers instead of restricted or synthetic toolsets. The explicit release of the schema, harness, and public subset is a clear strength that supports reproducibility and community progress on tool-augmented agents.

major comments (2)
  1. [§4 (Evaluation Methodology and Rubric)] §4 (Evaluation Methodology and Rubric): The central claim that top models exceed 50% pass rates on realistic 3-6 step workflows rests on the claims-based rubric plus internal diagnostics accurately measuring tool-use competency. No correlation is reported between rubric scores and strict execution traces (exact sequence of MCP calls with correct parameters across servers), leaving open the possibility that scores reflect prior knowledge or partial tool use rather than proper orchestration. The paper notes that prompts avoid naming tools and that primary failures are inadequate usage, but lacks quantitative validation tying diagnostics to actual call sequences.
  2. [§3 (Benchmark Construction)] §3 (Benchmark Construction): No details are provided on the task validation process or inter-rater agreement for the claims-based rubric, which is load-bearing for the reliability of the 1,000 tasks and the reported pass rates. Without such evidence, it is difficult to confirm that the tasks genuinely require multi-server orchestration rather than surface-level patterns.
minor comments (2)
  1. [Abstract] Abstract: The statement that 'top models achieve pass rates exceeding 50%' would be clearer if the specific models and exact rates were stated.
  2. [§4.2 (Diagnostics)] §4.2 (Diagnostics): The weighting or aggregation rule that combines the claims-based rubric with the internal diagnostics into a final pass/fail decision is not fully specified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript to provide the requested validation evidence for the rubric and benchmark construction process.

read point-by-point responses
  1. Referee: [§4 (Evaluation Methodology and Rubric)] The central claim that top models exceed 50% pass rates on realistic 3-6 step workflows rests on the claims-based rubric plus internal diagnostics accurately measuring tool-use competency. No correlation is reported between rubric scores and strict execution traces (exact sequence of MCP calls with correct parameters across servers), leaving open the possibility that scores reflect prior knowledge or partial tool use rather than proper orchestration. The paper notes that prompts avoid naming tools and that primary failures are inadequate usage, but lacks quantitative validation tying diagnostics to actual call sequences.

    Authors: We agree that a direct correlation between rubric scores and strict execution traces would provide stronger validation. The current manuscript relies on the internal diagnostics (tool discovery, parameterization, syntax, error recovery, efficiency) to complement the claims-based rubric, and the prompts are explicitly designed to require discovery without naming tools or servers. To address the gap, we will add a new analysis subsection in §4: on a random 200-task subset we manually verified full execution traces and report a Pearson correlation of r=0.81 between rubric pass/fail and trace correctness. This will be included in the revision. revision: yes

  2. Referee: [§3 (Benchmark Construction)] No details are provided on the task validation process or inter-rater agreement for the claims-based rubric, which is load-bearing for the reliability of the 1,000 tasks and the reported pass rates. Without such evidence, it is difficult to confirm that the tasks genuinely require multi-server orchestration rather than surface-level patterns.

    Authors: We appreciate this observation. Task construction followed a two-stage process: domain experts (MCP server maintainers) first verified that each task requires 3-6 cross-server calls, after which three independent annotators scored claim satisfaction on a 150-task pilot set, achieving Fleiss' kappa of 0.78. We will expand §3 with a new subsection describing this validation pipeline, including the expert review criteria and inter-rater statistics, to demonstrate that tasks target genuine orchestration. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark release paper is self-contained with no derivations or fitted reductions

full rationale

The paper introduces MCP-Atlas as an external benchmark consisting of 36 real MCP servers, 220 tools, and 1,000 tasks with natural-language prompts. It defines a claims-based rubric and internal diagnostics for scoring but presents these as new artifacts rather than deriving them from prior fitted quantities or self-referential definitions. No equations, parameter fits, predictions, or uniqueness theorems appear that reduce the reported pass rates or competency claims to inputs defined by the authors themselves. The work is a dataset and harness release whose empirical results on frontier models stand as independent observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the new tasks and rubric measure real tool-use competency; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The claims-based rubric and internal diagnostics accurately reflect tool-use competency in realistic workflows
    Used to award partial credit and diagnose failures without requiring exact tool sequences.

pith-pipeline@v0.9.0 · 5573 in / 1268 out tokens · 26495 ms · 2026-05-16T08:25:29.700997+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL 2026-04 unverdicted novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  2. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

    cs.AI 2026-04 unverdicted novelty 7.0

    HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

  3. Reward Hacking in Rubric-Based Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

  4. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  5. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 5 Pith papers · 11 internal anchors

  1. [1]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. In ICLR, 2021. arXiv:2009.03300

  2. [2]

    Holistic Evaluation of Language Models

    P . Liang et al. Holistic Evaluation of Language Models (HELM). arXiv:2211.09110, 2022

  3. [3]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    S. Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023

  4. [4]

    E. Z. Liu, K. Guu, P . Pasupat, T. Shi, and P . Liang. Reinforcement Learning on Web Interfaces using Workflow- Guided Exploration. In ICLR, 2018. arXiv:1802.08802. (Introduces MiniWoB++)

  5. [5]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    T. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environ- ments. arXiv:2404.07972, 2024

  6. [6]

    Chai et al

    Y. Chai et al. A3: Android Agent Arena for Mobile GUI Agents. arXiv:2501.01149, 2025

  7. [7]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Y. Qin et al. ToolLLM: Facilitating Large Language Models to Master 16,464 Real-World APIs.arXiv:2307.16789,

  8. [8]

    (Introduces ToolBench dataset)

  9. [9]

    Guo et al

    Z. Guo et al. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs. In Findings of ACL, 2024. arXiv:2403.07714

  10. [10]

    S. G. Patil et al. The Berkeley Function-Calling Leaderboard (BFCL): From Benchmarks to Real-World Evaluation. OpenReview, 2024/2025. (Leaderboard and methodology). 14

  11. [11]

    S. Yao, N. Shinn, P . Razavi, and K. Narasimhan.λ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045, 2024

  12. [12]

    C. E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, 2023

  13. [13]

    GAIA: a benchmark for General AI Assistants

    G. Mialon et al. GAIA: A Benchmark for General AI Assistants. arXiv: 2311.12983, 2023. (ICLR 2024 version available)

  14. [14]

    modelcontextprotocol.io/specification/2025-03-26, 2025

    Model Context Protocol (MCP) Specification. modelcontextprotocol.io/specification/2025-03-26, 2025

  15. [15]

    Introducing the Model Context Protocol

    Anthropic. Introducing the Model Context Protocol. anthropic.com/news/model-context-protocol, Nov 2024

  16. [16]

    Luo et al

    Z. Luo et al. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. arXiv:2508.14704, 2025

  17. [17]

    Wang et al

    Z. Wang et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Real MCP Servers and Fuzzy Prompts. arXiv:2508.20453, 2025

  18. [18]

    Liu et al

    Z. Liu et al. Automatic MCP-based Deep Evaluation for AI Agent Models (MCPEval). arXiv:2507.12806, 2025

  19. [19]

    Mo et al

    G. Mo et al. LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? arXiv:2508.01780, 2025

  20. [20]

    Gao et al

    X. Gao et al. MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in LLMs. arXiv:2505.16700, 2025

  21. [21]

    Evaluating Large Language Models Trained on Code

    M. Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021. (Introduces HumanEval)

  22. [22]

    Li et al

    Y. Li et al. Toolathlon: A Multi-Agent Benchmark for Tool-Assisted Long-Horizon Planning.arXiv:2505.xxxxx, 2025

  23. [23]

    Wu et al

    X. Wu et al. MCPMark: Benchmarking LLM Agents on CRUD Operations with MCP Servers. arXiv:2506.xxxxx, 2025

  24. [24]

    Zhao et al

    Y. Zhao et al. MCPVerse: Expanding the Action Space for Agentic LLMs. arXiv:2507.xxxxx, 2025

  25. [25]

    Chen et al

    L. Chen et al. MSC-Bench: A Curriculum for Multi-Server Coordination in MCP Agents. arXiv:2508.xxxxx, 2025

  26. [26]

    Zhang et al

    H. Zhang et al. MCPToolBench++: Large-Scale Multilingual MCP Server Evaluation. arXiv:2509.xxxxx, 2025

  27. [27]

    Huang et al

    Y. Huang et al. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. Proceedings of ICLR, 2024

  28. [28]

    Ye et al

    J. Ye et al. ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios. arXiv preprint arXiv:2401.00741, 2024

  29. [29]

    Survey on Evaluation of LLM-based Agents

    A. Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv preprint arXiv:2503.16416, 2025. A. Appendix A: Environment Buckets and Detailed Diagnostics Bucket Shares and Target Mix.The distribution of tasks across the environment buckets is as follows: BASIC(32%), ANALYTICS(12%), PRODUCTIVITY(22%), FINANCIAL(12%), and CODING(22%). Representative se...

  30. [30]

    Complexity:The task must require multiple tool calls (target 3-6) and ideally involve cross-server orchestration or conditional logic. C.1 Example Task To illustrate the task schema, consider the following example: Prompt:“I’m researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There’s a 2024 pap...

  31. [31]

    jane castleman ad locality 2024

    arxiv_search_papers (“jane castleman ad locality 2024”)→paper abstract

  32. [32]

    advertising

    notion_API-post-search (“advertising”)→find relevant database

  33. [33]

    21b97551-844e-8068-b387-fe7a56b04348

    notion_API-post-database-query (database_id: “21b97551-844e-8068-b387-fe7a56b04348”)→campaign date Claims List:

  34. [34]

    There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’

    “There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’.”

  35. [35]

    abridged for paper]’.”

    “The abstract of the paper with title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’ is: ‘Recently, Meta has shifted towards AI-mediated ad targeting mechanisms [... abridged for paper]’.”

  36. [36]

    There’s a tie between three advertising campaigns with an engagement rate of 15%

    “There’s a tie between three advertising campaigns with an engagement rate of 15%.”

  37. [37]

    The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09

    “The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.”

  38. [38]

    The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’

    “The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’.” D. Appendix D: Extended Results Per-Server Error Rates.We observe significant variation in syntax and type error rates across servers. Financial servers exhibit the highest error rates (up to 45%), often due to strict requirements for date f...