MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Andrew Park; Ben Hertzberg; Ben Levin; Bing Liu; Brad Kenstler; Chaithanya Bandi; Chetan Rane; Daniel Yue Zhang; Dan Rambado; Divyansh Agarwal

arxiv: 2602.00933 · v3 · pith:NZL3RTPYnew · submitted 2026-01-31 · 💻 cs.SE · cs.AI

Chaithanya Bandi, Razvan-Gabriel Dumitru, Ben Hertzberg, Divyansh Agarwal, Geobio Boo, Tejas Polakam, Sami Hassaan, Jeff Da, HiJae Kim, Vipul Gupta, Manasi Sharma, Andrew Park, Martin Dimakis, Ernesto Gabriel Hernandez Montoya, Dan Rambado, Ivan Salazar, Rafael Cruz, MohammadHossein Rezaei, Chetan Rane, Ben Levin, Daniel Yue Zhang, Brad Kenstler, Bing Liu

Pith reviewed 2026-05-21 15:03 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: MCP-Atlas, tool-use benchmark, LLM agents, Model Context Protocol, cognitive failures, multi-step workflows, claim-level scoring, agent diagnostics

The pith

A benchmark of 1,000 expert-written tasks on 36 real MCP servers shows that the best frontier model reaches an 82.2 percent pass rate on multi-step tool use, and that most diagnosed failures (63.3 percent) stem from cognitive issues rather than faulty tool calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP-Atlas to evaluate how effectively language model agents can discover and chain tools from actual production servers in open-ended workflows. Tasks are written so that agents must select relevant tools from plausible alternatives and combine them across servers without explicit guidance on which tools or parameters to choose. Scoring uses atomic factual claims drawn from tool outputs to judge the final answer, which credits any valid sequence of tool calls that produces the right information. Tests of 20 models from six providers produce a top pass rate of 82.2 percent at a 0.75 claim coverage level and expose a consistent three-tier performance pattern. Diagnostics across failures indicate that 63.3 percent stem from problems in task understanding, information synthesis, parsing, or deciding when to stop rather than from incorrect tool invocations.

Core claim

MCP-Atlas contains 1,000 natural-language tasks spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure.

What carries the argument

The claim-level rubric that scores final answers against atomic factual claims from tool outputs, paired with an 11-category diagnostic taxonomy separating tool-call errors from cognitive errors in understanding, synthesis, parsing, and stopping.
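
A minimal sketch of how a claim-level pass decision could be computed, assuming each claim receives a binary covered/not-covered verdict and coverage is the plain fraction of covered claims. The names `ClaimJudgment`, `coverage`, and `passes` are hypothetical, and the released evaluator may weight claims or award partial credit differently; only the 0.75 threshold comes from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimJudgment:
    """One atomic claim plus a verdict on the agent's final answer (hypothetical type)."""
    claim: str
    covered: bool  # assumption: binary verdict; the real rubric may allow partial credit


def coverage(judgments: List[ClaimJudgment]) -> float:
    """Fraction of claims that the final answer covers."""
    if not judgments:
        raise ValueError("a task needs at least one claim")
    return sum(j.covered for j in judgments) / len(judgments)


def passes(judgments: List[ClaimJudgment], threshold: float = 0.75) -> bool:
    """A task passes when coverage reaches the threshold (0.75 in the paper)."""
    return coverage(judgments) >= threshold


# Toy example with invented claims: three of four claims are covered.
example = [
    ClaimJudgment("paper title is stated", True),
    ClaimJudgment("abstract is quoted", True),
    ClaimJudgment("winning campaigns are identified", True),
    ClaimJudgment("campaign start dates are listed", False),
]
print(coverage(example), passes(example))
```

Because only the final answer is judged against the claims, two agents that reach the same facts through different tool sequences receive the same score under this sketch.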

If this is right

Different valid sequences of tool calls for the same task receive equal credit when they produce answers that cover the required factual claims.
Automated diagnostics can separate failures caused by incorrect tool selection or formatting from those caused by misunderstanding the task or mishandling results.
A three-tier performance structure appears consistently across providers when models are tested under identical task conditions.
Several high-performing models still lose points by stopping before they have synthesized the necessary information even after successful tool executions.
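
The split between tool-call and cognitive failures can be illustrated with a small tally. This is a sketch under stated assumptions: the label strings are hypothetical stand-ins, since the paper's 11 category names are not listed here, and only the four cognitive areas (task understanding, synthesis, parsing, stopping) are taken from the text.

```python
from collections import Counter
from typing import Iterable

# Assumption: hypothetical label names for the four cognitive areas named in the paper.
COGNITIVE_LABELS = {"task_understanding", "synthesis", "parsing", "stopping"}


def cognitive_share(failure_labels: Iterable[str]) -> float:
    """Share of diagnosed failures whose label falls in the cognitive group."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no diagnosed failures to summarize")
    cognitive = sum(n for label, n in counts.items() if label in COGNITIVE_LABELS)
    return cognitive / total


# Invented labels for four failed tasks; "wrong_tool" stands in for a tool-call category.
labels = ["synthesis", "stopping", "wrong_tool", "parsing"]
print(cognitive_share(labels))
```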

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training methods that emphasize deciding when to stop and combining outputs from multiple sources could raise overall success rates more than further improvements in tool-calling accuracy alone.
Claim-based scoring may reduce bias in other agent evaluations where style or length currently influences scores.
Testing on live production servers rather than mocks surfaces variability that future agent work should address directly.
The released harness and evaluator make it possible to add new servers or tasks while keeping the same scoring standard.

Load-bearing premise

The 1,000 tasks written and verified by human experts accurately represent realistic multi-step, cross-server tool-use scenarios and the claim-level rubric measures competency independently of agent verbosity or answer style.

What would settle it

Re-evaluating the same models on tasks rephrased to alter verbosity or stylistic cues and finding that pass rates or the share of cognitive failures shift by more than 10 percentage points.

Figures

Figures reproduced from arXiv: 2602.00933 by Andrew Park, Ben Hertzberg, Ben Levin, Bing Liu, Brad Kenstler, Chaithanya Bandi, Chetan Rane, Daniel Yue Zhang, Dan Rambado, Divyansh Agarwal, Ernesto Gabriel Hernandez Montoya, Geobio Boo, HiJae Kim, Ivan Salazar, Jeff Da, Manasi Sharma, Martin Dimakis, MohammadHossein Rezaei, Rafael Cruz, Razvan-Gabriel Dumitru, Sami Hassaan, Tejas Polakam, Vipul Gupta.

Figure 2: Diagnostic categories for the top 3 models on failed tasks. The y-axis shows average coverage score.

Original abstract

The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MCP-Atlas brings real servers and claim-coverage scoring to tool-use benchmarks, but the tasks and rubric still need independent checks on realism and robustness.

The main takeaway is that MCP-Atlas uses real production MCP servers and scores final answers by claim coverage rather than by tool-call style. This addresses clear shortcomings in how tool-use competency has been measured so far. The authors assembled 1,000 tasks across 36 servers and 220 tools. Prompts do not name the servers or tools, so agents have to find the right ones among distractors and build multi-step workflows. They ran 20 models and used an 11-category taxonomy to diagnose failures, finding that cognitive problems account for 63.3 percent of diagnosed failures. Releasing the containerized harness and a 500-task public split is helpful for follow-up work.

What stands out as solid is the answer-centric scoring, which gives credit for valid alternative paths. The three-tier performance split they observed is easy to interpret. The weaker part is the validation of the tasks and the rubric. The abstract says human experts wrote and verified the tasks, but it reports no agreement rates and no grounding in real usage logs. The choice of 0.75 as the claim coverage threshold also comes without a sensitivity check. These details matter because the pass rates and failure breakdowns rest on them.

This work is for researchers and developers focused on making LLM agents reliable with external tools and workflows. Anyone looking for a dataset to test tool discovery and orchestration will get something concrete from it. The paper shows honest engagement with the evaluation challenges in this area. It is worth sending to peer review so reviewers can look at the task construction and scoring details more closely. I would recommend accepting it for review.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MCP-Atlas, a benchmark for LLM agent tool-use competency consisting of 1,000 natural-language tasks written and verified by human experts across 36 real production MCP servers and 220 tools. Prompts require agents to discover relevant tools among distractors and compose multi-step cross-server workflows without explicit server or parameter hints. Scoring uses an answer-centric claim-level rubric that credits valid alternative trajectories, paired with an 11-category diagnostic taxonomy separating cognitive failures from tool-call errors. Evaluation of 20 frontier models from six providers under matched conditions reports pass rates up to 82.2% at a 0.75 claim-coverage threshold, a three-tier performance structure, and that 63.3% of diagnosed failures are cognitive rather than tool-call related. The authors release the task schema, containerized harness, claim evaluator, and a 500-task public split.

Significance. If the tasks and rubric are shown to be representative and style-independent, MCP-Atlas would fill a clear gap by providing the first large-scale benchmark grounded in authentic MCP servers rather than mocks, with reproducible claim-level scoring and failure diagnostics. The public release of code and partial data strengthens its potential utility for the community.

major comments (3)

[Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.
[Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.
[Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.

minor comments (2)

[Data Release] Clarify how the 500-task public split was sampled to ensure representativeness across servers and task complexity.
[Scoring Rubric] Define the exact procedure for extracting atomic claims from tool outputs and final answers in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We have carefully considered each major comment and provide point-by-point responses below. We agree with the need for additional validation and will make revisions to address the concerns about task construction, scoring sensitivity, and diagnostic reliability. These changes will enhance the robustness of our claims regarding MCP-Atlas as a benchmark for tool-use competency.

Referee: [Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.

Authors: We acknowledge this limitation in our current manuscript. The 1,000 tasks were developed through a rigorous process involving multiple human experts with experience in MCP server development and usage. Each task was authored by one expert and verified by at least one other for realism, correctness, and multi-step nature. However, we did not report formal inter-rater reliability metrics such as Fleiss' kappa. We will revise the manuscript to include a detailed description of the task creation and verification protocol, along with agreement statistics computed on a held-out subset of tasks. Regarding comparison to real MCP usage logs, such logs are proprietary and not accessible for benchmarking purposes. We will add a discussion explaining why expert curation was used as a proxy for realism and note this as a limitation. This addresses the core concern while maintaining the benchmark's value.

revision: partial

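
For readers unfamiliar with the agreement statistic mentioned in this response, here is a minimal sketch of Fleiss' kappa for several raters labelling the same tasks. The function name, the label strings, and the toy ratings are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa; `ratings` holds one list of labels per item, same rater count each."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(ratings)

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that chose the same label.
        same_pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += same_pairs / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items
    # Chance agreement from the overall label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return 1.0  # every rating used the same single label
    return (p_bar - p_e) / (1.0 - p_e)


# Invented ratings: three verifiers judge whether three tasks are realistic.
toy: List[List[str]] = [
    ["ok", "ok", "ok"],
    ["ok", "ok", "bad"],
    ["bad", "bad", "bad"],
]
print(round(fleiss_kappa(toy), 2))
```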
Referee: [Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.

Authors: We agree that providing sensitivity analysis would improve interpretability. The 0.75 threshold was selected after initial pilot studies to allow for reasonable variation in answer completeness while maintaining high standards. We will add an ablation study in the revised manuscript, varying the threshold across 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, and report how the pass rates, performance tiers, and failure distributions shift. Additionally, we will describe the claim-extraction process in greater detail, including how claims are automatically extracted from tool responses and manually spot-checked. This will demonstrate that the scoring is robust and not overly sensitive to specific choices or agent verbosity.

revision: yes

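
The proposed ablation amounts to recomputing pass rates from per-task coverage scores at each threshold. A minimal sketch, assuming coverage scores are already available as floats in [0, 1]; the function names and the toy scores are hypothetical, and 0.75 is added to the rebuttal's threshold list only as the reference point.

```python
from typing import Dict, Sequence


def pass_rate(scores: Sequence[float], threshold: float) -> float:
    """Share of tasks whose claim coverage reaches the threshold."""
    if not scores:
        raise ValueError("need at least one task score")
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_sweep(
    scores: Sequence[float],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0),
) -> Dict[float, float]:
    """Pass rate at each threshold, to show how sensitive results are to the cutoff."""
    return {t: pass_rate(scores, t) for t in thresholds}


# Invented per-task coverage scores for one model.
toy_scores = [1.0, 0.8, 0.75, 0.5, 0.25]
for t, rate in threshold_sweep(toy_scores).items():
    print(f"threshold={t:.2f} pass_rate={rate:.2f}")
```

Running the same sweep per model would also show whether the tier ordering is stable across cutoffs, which is the question the referee raises.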
Referee: [Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.

Authors: We concur that validation of the diagnostic taxonomy is important for the reliability of the 63.3% statistic. The taxonomy was iteratively refined by the research team based on manual inspection of model outputs. The automated diagnostics combine heuristic rules for tool-call errors with LLM-based classification for cognitive categories. To address this, we will conduct a human validation study on a sample of 200 failure cases, involving two independent annotators, and report inter-annotator agreement (e.g., Cohen's kappa) as well as agreement with the automated system. We will update the results and discussion accordingly in the revision. If the agreement is substantial, it will support our conclusions; we will qualify any findings as needed.

revision: yes
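
As an illustration of the agreement check described in this response, the following is a minimal sketch of Cohen's kappa for two annotators labelling the same failure cases. The function name and the toy labels are assumptions made for the example; the same function could compare one annotator against the automated diagnostics.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of cases")
    n = len(labels_a)
    # Observed agreement: share of cases with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Invented labels for four failure cases ("cog" = cognitive, "tool" = tool-call).
annotator_1 = ["tool", "cog", "cog", "tool"]
annotator_2 = ["tool", "cog", "tool", "tool"]
print(cohens_kappa(annotator_1, annotator_2))
```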

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and model evaluation

The paper introduces MCP-Atlas as an empirical benchmark consisting of 1,000 human-expert-authored tasks across real MCP servers, evaluated via direct model runs under matched conditions to produce pass rates, tiered performance, and diagnostic failure breakdowns. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Task construction, claim-level rubrics, and automated diagnostics are defined externally to the reported results; outcomes (e.g., 82.2% pass rate, 63.3% cognitive failures) are measured against production servers rather than fitted or self-referentially defined. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would trigger circularity patterns. This is a standard self-contained benchmark study.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central contribution rests on standard assumptions about expert task creation and the validity of claim-based evaluation rather than new mathematical derivations or fitted parameters.

free parameters (1)

claim coverage threshold
0.75 threshold used to compute pass rates; chosen to define success level.

axioms (1)

domain assumption: Human experts can create and verify natural-language tasks that require genuine tool discovery and multi-step orchestration among semantically plausible distractors.
Invoked in the description of task construction and verification process.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
cs.CL 2026-04 unverdicted novelty 8.0
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
cs.AI 2026-04 unverdicted novelty 7.0
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

Reward Hacking in Rubric-Based Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

GLM-5: from Vibe Coding to Agentic Engineering
cs.LG 2026-02 unverdicted novelty 5.0
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 5 Pith papers · 11 internal anchors

[1] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. In ICLR, 2021. arXiv:2009.03300
[2] P. Liang et al. Holistic Evaluation of Language Models (HELM). arXiv:2211.09110, 2022
[3] S. Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023
[4] E. Z. Liu, K. Guu, P. Pasupat, T. Shi, and P. Liang. Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration. In ICLR, 2018. arXiv:1802.08802. (Introduces MiniWoB++)
[5] T. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972, 2024
[6] Y. Chai et al. A3: Android Agent Arena for Mobile GUI Agents. arXiv:2501.01149, 2025
[7, 8] Y. Qin et al. ToolLLM: Facilitating Large Language Models to Master 16,464 Real-World APIs. arXiv:2307.16789. (Introduces ToolBench dataset)
[9] Z. Guo et al. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs. In Findings of ACL, 2024. arXiv:2403.07714
[10] S. G. Patil et al. The Berkeley Function-Calling Leaderboard (BFCL): From Benchmarks to Real-World Evaluation. OpenReview, 2024/2025. (Leaderboard and methodology)
[11] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045, 2024
[12] C. E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, 2023
[13] G. Mialon et al. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983, 2023. (ICLR 2024 version available)
[14] Model Context Protocol (MCP) Specification. modelcontextprotocol.io/specification/2025-03-26, 2025
[15] Anthropic. Introducing the Model Context Protocol. anthropic.com/news/model-context-protocol, Nov 2024
[16] Z. Luo et al. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. arXiv:2508.14704, 2025
[17] Z. Wang et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Real MCP Servers and Fuzzy Prompts. arXiv:2508.20453, 2025
[18] Z. Liu et al. Automatic MCP-based Deep Evaluation for AI Agent Models (MCPEval). arXiv:2507.12806, 2025
[19] G. Mo et al. LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? arXiv:2508.01780, 2025
[20] X. Gao et al. MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in LLMs. arXiv:2505.16700, 2025
[21] M. Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021. (Introduces HumanEval)
[22] Y. Li et al. Toolathlon: A Multi-Agent Benchmark for Tool-Assisted Long-Horizon Planning. arXiv:2505.xxxxx, 2025
[23] X. Wu et al. MCPMark: Benchmarking LLM Agents on CRUD Operations with MCP Servers. arXiv:2506.xxxxx, 2025
[24] Y. Zhao et al. MCPVerse: Expanding the Action Space for Agentic LLMs. arXiv:2507.xxxxx, 2025
[25] L. Chen et al. MSC-Bench: A Curriculum for Multi-Server Coordination in MCP Agents. arXiv:2508.xxxxx, 2025
[26] H. Zhang et al. MCPToolBench++: Large-Scale Multilingual MCP Server Evaluation. arXiv:2509.xxxxx, 2025
[27] Y. Huang et al. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. Proceedings of ICLR, 2024
[28] J. Ye et al. ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios. arXiv preprint arXiv:2401.00741, 2024
[29] A. Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv preprint arXiv:2503.16416, 2025

Appendix A: Environment Buckets and Detailed Diagnostics

Bucket Shares and Target Mix. The distribution of tasks across the environment buckets is as follows: BASIC (32%), ANALYTICS (12%), PRODUCTIVITY (22%), FINANCIAL (12%), and CODING (22%). Representative se...

Complexity: The task must require multiple tool calls (target 3-6) and ideally involve cross-server orchestration or conditional logic.

C.1 Example Task

To illustrate the task schema, consider the following example:

Prompt: “I’m researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There’s a 2024 pap...

arxiv_search_papers (“jane castleman ad locality 2024”) → paper abstract
notion_API-post-search (“advertising”) → find relevant database
notion_API-post-database-query (database_id: “21b97551-844e-8068-b387-fe7a56b04348”) → campaign date

Claims List:

“There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’.”
“The abstract of the paper with title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’ is: ‘Recently, Meta has shifted towards AI-mediated ad targeting mechanisms [... abridged for paper]’.”
“There’s a tie between three advertising campaigns with an engagement rate of 15%.”
“The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.”
“The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’.”

Appendix D: Extended Results

Per-Server Error Rates. We observe significant variation in syntax and type error rates across servers. Financial servers exhibit the highest error rates (up to 45%), often due to strict requirements for date f...

[1] [1]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. In ICLR, 2021. arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Holistic Evaluation of Language Models

P . Liang et al. Holistic Evaluation of Language Models (HELM). arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

WebArena: A Realistic Web Environment for Building Autonomous Agents

S. Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

E. Z. Liu, K. Guu, P . Pasupat, T. Shi, and P . Liang. Reinforcement Learning on Web Interfaces using Workflow- Guided Exploration. In ICLR, 2018. arXiv:1802.08802. (Introduces MiniWoB++)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

T. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environ- ments. arXiv:2404.07972, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Chai et al

Y. Chai et al. A3: Android Agent Arena for Mobile GUI Agents. arXiv:2501.01149, 2025

work page arXiv 2025

[7] [7]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Y. Qin et al. ToolLLM: Facilitating Large Language Models to Master 16,464 Real-World APIs.arXiv:2307.16789,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

(Introduces ToolBench dataset)

work page

[9] [9]

Guo et al

Z. Guo et al. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs. In Findings of ACL, 2024. arXiv:2403.07714

work page arXiv 2024

[10] [10]

S. G. Patil et al. The Berkeley Function-Calling Leaderboard (BFCL): From Benchmarks to Real-World Evaluation. OpenReview, 2024/2025. (Leaderboard and methodology). 14

work page 2024

[11] [11]

S. Yao, N. Shinn, P . Razavi, and K. Narasimhan.λ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

C. E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

GAIA: a benchmark for General AI Assistants

G. Mialon et al. GAIA: A Benchmark for General AI Assistants. arXiv: 2311.12983, 2023. (ICLR 2024 version available)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

modelcontextprotocol.io/specification/2025-03-26, 2025

Model Context Protocol (MCP) Specification. modelcontextprotocol.io/specification/2025-03-26, 2025

work page 2025

[15] [15]

Introducing the Model Context Protocol

Anthropic. Introducing the Model Context Protocol. anthropic.com/news/model-context-protocol, Nov 2024

work page 2024

[16] [16]

Luo et al

Z. Luo et al. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. arXiv:2508.14704, 2025

work page arXiv 2025

[17] [17]

Wang et al

Z. Wang et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Real MCP Servers and Fuzzy Prompts. arXiv:2508.20453, 2025

work page arXiv 2025

[18] [18]

Liu et al

Z. Liu et al. Automatic MCP-based Deep Evaluation for AI Agent Models (MCPEval). arXiv:2507.12806, 2025

work page arXiv 2025

[19] G. Mo et al. LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools? arXiv:2508.01780, 2025

[20] X. Gao et al. MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in LLMs. arXiv:2505.16700, 2025

[21] M. Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021. (Introduces HumanEval)

[22] Y. Li et al. Toolathlon: A Multi-Agent Benchmark for Tool-Assisted Long-Horizon Planning. arXiv:2505.xxxxx, 2025

[23] X. Wu et al. MCPMark: Benchmarking LLM Agents on CRUD Operations with MCP Servers. arXiv:2506.xxxxx, 2025

[24] Y. Zhao et al. MCPVerse: Expanding the Action Space for Agentic LLMs. arXiv:2507.xxxxx, 2025

[25] L. Chen et al. MSC-Bench: A Curriculum for Multi-Server Coordination in MCP Agents. arXiv:2508.xxxxx, 2025

[26] H. Zhang et al. MCPToolBench++: Large-Scale Multilingual MCP Server Evaluation. arXiv:2509.xxxxx, 2025

[27] Y. Huang et al. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. Proceedings of ICLR, 2024

[28] J. Ye et al. ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios. arXiv preprint arXiv:2401.00741, 2024

[29] A. Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv preprint arXiv:2503.16416, 2025

A. Appendix A: Environment Buckets and Detailed Diagnostics

Bucket Shares and Target Mix. The distribution of tasks across the environment buckets is as follows: BASIC (32%), ANALYTICS (12%), PRODUCTIVITY (22%), FINANCIAL (12%), and CODING (22%). Representative se...

Complexity: The task must require multiple tool calls (target 3-6) and ideally involve cross-server orchestration or conditional logic.

C.1 Example Task

To illustrate the task schema, consider the following example:

Prompt: “I’m researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There’s a 2024 pap...

1. arxiv_search_papers (“jane castleman ad locality 2024”) → paper abstract
2. notion_API-post-search (“advertising”) → find relevant database
3. notion_API-post-database-query (database_id: “21b97551-844e-8068-b387-fe7a56b04348”) → campaign date

Claims List:

1. “There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’.”
2. “The abstract of the paper with title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’ is: ‘Recently, Meta has shifted towards AI-mediated ad targeting mechanisms [... abridged for paper]’.”
3. “There’s a tie between three advertising campaigns with an engagement rate of 15%.”
4. “The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.”
5. “The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’.”

D. Appendix D: Extended Results

Per-Server Error Rates. We observe significant variation in syntax and type error rates across servers. Financial servers exhibit the highest error rates (up to 45%), often due to strict requirements for date f...
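Returning to the C.1 example, the following is a minimal sketch of how its five-item claims list could be turned into a pass/fail outcome at the 0.75 claim-coverage level reported for the benchmark. It assumes each claim receives a binary supported/unsupported verdict from a judge; the benchmark's actual judging procedure may differ (for instance, by allowing partial credit). The names `ClaimVerdict`, `claim_coverage`, `passes`, and `judge_all` are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass

# Pass threshold on claim coverage, as stated for the reported pass rates.
COVERAGE_THRESHOLD = 0.75


@dataclass(frozen=True)
class ClaimVerdict:
    """One atomic claim plus a judge's verdict (hypothetical structure)."""
    claim: str
    supported: bool  # assumption: binary verdict, no partial credit


def claim_coverage(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims that the final answer supports."""
    if not verdicts:
        raise ValueError("a task needs at least one claim")
    return sum(v.supported for v in verdicts) / len(verdicts)


def passes(verdicts: list[ClaimVerdict],
           threshold: float = COVERAGE_THRESHOLD) -> bool:
    """A task passes when coverage meets or exceeds the threshold."""
    return claim_coverage(verdicts) >= threshold


def judge_all(claims: list[str], missed: set[str]) -> list[ClaimVerdict]:
    """Stand-in for a judge: every claim not in `missed` counts as supported."""
    return [ClaimVerdict(c, c not in missed) for c in claims]


# Shortened labels for the five claims in the C.1 example.
EXAMPLE_CLAIMS = [
    "paper title and author",
    "paper abstract",
    "three-way tie at 15% engagement",
    "campaign starting dates",
    "campaign localities",
]

four_of_five = judge_all(EXAMPLE_CLAIMS, {"campaign localities"})
three_of_five = judge_all(
    EXAMPLE_CLAIMS, {"campaign localities", "paper abstract"}
)

print(claim_coverage(four_of_five), passes(four_of_five))
print(claim_coverage(three_of_five), passes(three_of_five))
```

Under these assumptions, an answer that misses one of the five claims has coverage 4/5 = 0.8 and passes, while an answer that misses two has coverage 3/5 = 0.6 and fails. Because the score depends only on the claims, any tool sequence that surfaces the same facts would be credited equally.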