arxiv: 2602.00933 · v3 · pith:NZL3RTPYnew · submitted 2026-01-31 · 💻 cs.SE · cs.AI

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Pith reviewed 2026-05-21 15:03 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords MCP-Atlas · tool-use benchmark · LLM agents · Model Context Protocol · cognitive failures · multi-step workflows · claim-level scoring · agent diagnostics

The pith

A benchmark of 1,000 expert-written tasks on 36 real MCP servers shows the best frontier model passing 82.2 percent of multi-step tool-use tasks, with 63.3 percent of diagnosed failures traced to cognitive issues rather than tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP-Atlas to evaluate how effectively language model agents can discover and chain tools from actual production servers in open-ended workflows. Tasks are written so that agents must select relevant tools from plausible alternatives and combine them across servers without explicit guidance on which tools or parameters to choose. Scoring uses atomic factual claims drawn from tool outputs to judge the final answer, which credits any valid sequence of tool calls that produces the right information. Tests of 20 models from six providers produce a top pass rate of 82.2 percent at a 0.75 claim coverage level and expose a consistent three-tier performance pattern. Diagnostics across failures indicate that 63.3 percent stem from problems in task understanding, information synthesis, parsing, or deciding when to stop rather than from incorrect tool invocations.

Core claim

MCP-Atlas contains 1,000 natural-language tasks spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold.

What carries the argument

The claim-level rubric that scores final answers against atomic factual claims from tool outputs, paired with an 11-category diagnostic taxonomy separating tool-call errors from cognitive errors in understanding, synthesis, parsing, and stopping.
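
To make the claim-level rubric concrete, here is a minimal Python sketch of answer-centric scoring. It is not the released evaluator: the `ClaimVerdict` type, the binary per-claim verdict, and the "meets or exceeds" pass rule are assumptions made for illustration. Only the 0.75 threshold comes from the paper.

```python
from dataclasses import dataclass

# 0.75 is the claim coverage threshold reported in the paper.
PASS_THRESHOLD = 0.75


@dataclass(frozen=True)
class ClaimVerdict:
    """Hypothetical judge output for one atomic claim of a task."""
    claim: str
    covered: bool  # assumption: the judge returns a binary verdict per claim


def coverage(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of a task's claims that the final answer covers."""
    if not verdicts:
        raise ValueError("a task needs at least one claim")
    return sum(v.covered for v in verdicts) / len(verdicts)


def passes(verdicts: list[ClaimVerdict], threshold: float = PASS_THRESHOLD) -> bool:
    """Assumption: a task passes when coverage meets or exceeds the threshold."""
    return coverage(verdicts) >= threshold
```

Because the functions see only the final answer's verdicts, two agents that reach the same claims through different tool-call sequences receive the same score, which is the property the rubric is designed to provide.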

If this is right

  • Different valid sequences of tool calls for the same task receive equal credit when they produce answers that cover the required factual claims.
  • Automated diagnostics can separate failures caused by incorrect tool selection or formatting from those caused by misunderstanding the task or mishandling results.
  • A three-tier performance structure appears consistently across providers when models are tested under identical task conditions.
  • Several high-performing models still lose points by stopping before they have synthesized the necessary information even after successful tool executions.
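
The split between cognitive and tool-call failures can be illustrated with a short Python sketch. The label strings below are hypothetical: the page names four cognitive families (task understanding, synthesis, parsing, stopping) but does not list the 11 category names, so any other label is simply treated as non-cognitive here.

```python
from collections import Counter

# Hypothetical label names for the four cognitive families named in the text.
COGNITIVE = {"task_understanding", "synthesis", "parsing", "stopping"}


def cognitive_share(labels: list[str]) -> float:
    """Share of diagnosed failures whose label falls in a cognitive family."""
    if not labels:
        raise ValueError("no diagnosed failures to summarize")
    counts = Counter(labels)
    cognitive = sum(n for label, n in counts.items() if label in COGNITIVE)
    return cognitive / len(labels)
```

A share computed this way over all diagnosed failures is the kind of figure the paper reports as 63.3 percent.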

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training methods that emphasize deciding when to stop and combining outputs from multiple sources could raise overall success rates more than further improvements in tool-calling accuracy alone.
  • Claim-based scoring may reduce bias in other agent evaluations where style or length currently influences scores.
  • Testing on live production servers rather than mocks surfaces variability that future agent work should address directly.
  • The released harness and evaluator make it possible to add new servers or tasks while keeping the same scoring standard.

Load-bearing premise

The 1,000 tasks written and verified by human experts accurately represent realistic multi-step, cross-server tool-use scenarios and the claim-level rubric measures competency independently of agent verbosity or answer style.

What would settle it

Re-evaluating the same models on tasks rephrased to alter verbosity or stylistic cues and finding that pass rates or the share of cognitive failures shift by more than 10 percentage points.
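
The check described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and per-task outcomes are assumed to be available as booleans for both the original and the rephrased task sets.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Pass rate in percent over per-task pass/fail outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    return 100.0 * sum(outcomes) / len(outcomes)


def shift_exceeds(
    original: list[bool],
    rephrased: list[bool],
    max_shift_pp: float = 10.0,  # the 10-percentage-point bound stated above
) -> bool:
    """True when the pass rate moves by more than max_shift_pp points."""
    return abs(pass_rate(original) - pass_rate(rephrased)) > max_shift_pp
```

The same comparison applies to the share of cognitive failures by passing that share, in percent, in place of the pass rate.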

Figures

Figures reproduced from arXiv: 2602.00933 by Andrew Park, Ben Hertzberg, Ben Levin, Bing Liu, Brad Kenstler, Chaithanya Bandi, Chetan Rane, Daniel Yue Zhang, Dan Rambado, Divyansh Agarwal, Ernesto Gabriel Hernandez Montoya, Geobio Boo, HiJae Kim, Ivan Salazar, Jeff Da, Manasi Sharma, Martin Dimakis, MohammadHossein Rezaei, Rafael Cruz, Razvan-Gabriel Dumitru, Sami Hassaan, Tejas Polakam, Vipul Gupta.

Figure 1: MCP-Atlas overall model performance ranked by …
Figure 2: Diagnostic categories for the top 3 models on failed tasks. The y-axis shows average coverage score …

Original abstract

The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MCP-Atlas, a benchmark for LLM agent tool-use competency consisting of 1,000 natural-language tasks written and verified by human experts across 36 real production MCP servers and 220 tools. Prompts require agents to discover relevant tools among distractors and compose multi-step cross-server workflows without explicit server or parameter hints. Scoring uses an answer-centric claim-level rubric that credits valid alternative trajectories, paired with an 11-category diagnostic taxonomy separating cognitive failures from tool-call errors. Evaluation of 20 frontier models from six providers under matched conditions reports pass rates up to 82.2% at a 0.75 claim-coverage threshold, a three-tier performance structure, and that 63.3% of diagnosed failures are cognitive rather than tool-call related. The authors release the task schema, containerized harness, claim evaluator, and a 500-task public split.

Significance. If the tasks and rubric are shown to be representative and style-independent, MCP-Atlas would fill a clear gap by providing the first large-scale benchmark grounded in authentic MCP servers rather than mocks, with reproducible claim-level scoring and failure diagnostics. The public release of code and partial data strengthens its potential utility for the community.

major comments (3)
  1. [Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.
  2. [Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.
  3. [Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.
minor comments (2)
  1. [Data Release] Clarify how the 500-task public split was sampled to ensure representativeness across servers and task complexity.
  2. [Scoring Rubric] Define the exact procedure for extracting atomic claims from tool outputs and final answers in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We have carefully considered each major comment and provide point-by-point responses below. We agree with the need for additional validation and will make revisions to address the concerns about task construction, scoring sensitivity, and diagnostic reliability. These changes will enhance the robustness of our claims regarding MCP-Atlas as a benchmark for tool-use competency.

point-by-point responses
  1. Referee: [Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.

    Authors: We acknowledge this limitation in our current manuscript. The 1,000 tasks were developed through a rigorous process involving multiple human experts with experience in MCP server development and usage. Each task was authored by one expert and verified by at least one other for realism, correctness, and multi-step nature. However, we did not report formal inter-rater reliability metrics such as Fleiss' kappa. We will revise the manuscript to include a detailed description of the task creation and verification protocol, along with agreement statistics computed on a held-out subset of tasks. Regarding comparison to real MCP usage logs, such logs are proprietary and not accessible for benchmarking purposes. We will add a discussion explaining why expert curation was used as a proxy for realism and note this as a limitation. This addresses the core concern while maintaining the benchmark's value. revision: partial

  2. Referee: [Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.

    Authors: We agree that providing sensitivity analysis would improve interpretability. The 0.75 threshold was selected after initial pilot studies to allow for reasonable variation in answer completeness while maintaining high standards. We will add an ablation study in the revised manuscript, varying the threshold across 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, and report how the pass rates, performance tiers, and failure distributions shift. Additionally, we will describe the claim-extraction process in greater detail, including how claims are automatically extracted from tool responses and manually spot-checked. This will demonstrate that the scoring is robust and not overly sensitive to specific choices or agent verbosity. revision: yes

  3. Referee: [Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.

    Authors: We concur that validation of the diagnostic taxonomy is important for the reliability of the 63.3% statistic. The taxonomy was iteratively refined by the research team based on manual inspection of model outputs. The automated diagnostics combine heuristic rules for tool-call errors with LLM-based classification for cognitive categories. To address this, we will conduct a human validation study on a sample of 200 failure cases, involving two independent annotators, and report inter-annotator agreement (e.g., Cohen's kappa) as well as agreement with the automated system. We will update the results and discussion accordingly in the revision. If the agreement is substantial, it will support our conclusions; we will qualify any findings as needed. revision: yes
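
For readers unfamiliar with the agreement statistic mentioned in the last response, the following Python sketch computes Cohen's kappa for two annotators from its standard definition. It is a generic illustration, not the authors' validation code, and the label values in the usage note are hypothetical.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, annotator labels `["cognitive", "cognitive", "tool_call", "tool_call"]` and `["cognitive", "cognitive", "tool_call", "cognitive"]` agree on three of four items, with chance agreement of 0.5, giving a kappa of 0.5. The same function can compare either annotator against the automated diagnostics.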

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and model evaluation

The paper introduces MCP-Atlas as an empirical benchmark consisting of 1,000 human-expert-authored tasks across real MCP servers, evaluated via direct model runs under matched conditions to produce pass rates, tiered performance, and diagnostic failure breakdowns. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Task construction, claim-level rubrics, and automated diagnostics are defined externally to the reported results; outcomes (e.g., 82.2% pass rate, 63.3% cognitive failures) are measured against production servers rather than fitted or self-referentially defined. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would trigger circularity patterns. This is a standard self-contained benchmark study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on standard assumptions about expert task creation and the validity of claim-based evaluation rather than new mathematical derivations or fitted parameters.

free parameters (1)
  • claim coverage threshold
    0.75 threshold used to compute pass rates; chosen to define success level.
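
Since every headline number depends on this threshold, a sensitivity sweep is a natural check. The Python sketch below is a hedged illustration: it assumes per-task coverage scores in the range 0 to 1 are available, and the threshold grid is an assumption that combines the paper's 0.75 with the values proposed in the rebuttal.

```python
def pass_rate_at(coverages: list[float], threshold: float) -> float:
    """Fraction of tasks whose claim coverage meets or exceeds the threshold."""
    if not coverages:
        raise ValueError("no coverage scores to summarize")
    return sum(c >= threshold for c in coverages) / len(coverages)


def sweep(
    coverages: list[float],
    thresholds: tuple[float, ...] = (0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0),
) -> dict[float, float]:
    """Pass rate at each threshold, keyed by the threshold value."""
    return {t: pass_rate_at(coverages, t) for t in thresholds}
```

Running `sweep` per model shows whether a ranking or tier boundary survives a stricter or looser definition of success.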
axioms (1)
  • domain assumption: Human experts can create and verify natural-language tasks that require genuine tool discovery and multi-step orchestration among semantically plausible distractors.
    Invoked in the description of task construction and verification process.

pith-pipeline@v0.9.0 · 5952 in / 1609 out tokens · 66028 ms · 2026-05-21T15:03:13.690263+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL · 2026-04 · unverdicted · novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  2. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

  3. Reward Hacking in Rubric-Based Reinforcement Learning

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

  4. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  5. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG · 2026-02 · unverdicted · novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

A. Appendix A: Environment Buckets and Detailed Diagnostics

Bucket Shares and Target Mix. The distribution of tasks across the environment buckets is as follows: BASIC (32%), ANALYTICS (12%), PRODUCTIVITY (22%), FINANCIAL (12%), and CODING (22%). Representative se...

Complexity: The task must require multiple tool calls (target 3-6) and ideally involve cross-server orchestration or conditional logic.

C.1 Example Task

To illustrate the task schema, consider the following example:

Prompt: “I’m researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There’s a 2024 pap...

  1. arxiv_search_papers (“jane castleman ad locality 2024”) → paper abstract
  2. notion_API-post-search (“advertising”) → find relevant database
  3. notion_API-post-database-query (database_id: “21b97551-844e-8068-b387-fe7a56b04348”) → campaign date

Claims List:

  1. “There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’.”
  2. “The abstract of the paper with title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’ is: ‘Recently, Meta has shifted towards AI-mediated ad targeting mechanisms [... abridged for paper]’.”
  3. “There’s a tie between three advertising campaigns with an engagement rate of 15%.”
  4. “The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.”
  5. “The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’.”

D. Appendix D: Extended Results

Per-Server Error Rates. We observe significant variation in syntax and type error rates across servers. Financial servers exhibit the highest error rates (up to 45%), often due to strict requirements for date f...