{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:NZL3RTPYQXENOQKH7VSSR442QP","short_pith_number":"pith:NZL3RTPY","schema_version":"1.0","canonical_sha256":"6e57b8cdf885c8d74147fd6528f39a83ccfb206273747ece7b146614becba0c8","source":{"kind":"arxiv","id":"2602.00933","version":3},"attestation_state":"computed","paper":{"title":"MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers","license":"http://creativecommons.org/licenses/by/4.0/","headline":"MCP-Atlas introduces a benchmark with 36 real MCP servers, 220 tools, and 1,000 multi-step tasks to evaluate LLM tool-use competency.","cross_cats":["cs.AI"],"primary_cat":"cs.SE","authors_text":"Andrew Park, Ben Hertzberg, Ben Levin, Bing Liu, Brad Kenstler, Chaithanya Bandi, Chetan Rane, Daniel Yue Zhang, Dan Rambado, Divyansh Agarwal, Ernesto Gabriel Hernandez Montoya, Geobio Boo, HiJae Kim, Ivan Salazar, Jeff Da, Manasi Sharma, Martin Dimakis, MohammadHossein Rezaei, Rafael Cruz, Razvan-Gabriel Dumitru, Sami Hassaan, Tejas Polakam, Vipul Gupta","submitted_at":"2026-01-31T23:19:39Z","abstract_excerpt":"The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verif"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"2602.00933","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.SE","submitted_at":"2026-01-31T23:19:39Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"79eaa429ece9a131715aed626d552315f06f368e15ca4d251cc9efa9c7fa6686","abstract_canon_sha256":"cdd0676ecc194513dd6db5d09f33907f3323583685677fa107dfc90e827e5ab9"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-21T01:04:21.965910Z","signature_b64":"pMgiNycUDXrGNdanVMZF3rvoyKLc7heH3KZT3fOPYsDgZJztkG6LPDggnyXf6H/saABh/ahXuTz2drcsTw3oDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"6e57b8cdf885c8d74147fd6528f39a83ccfb206273747ece7b146614becba0c8","last_reissued_at":"2026-05-21T01:04:21.964990Z","signature_status":"signed_v1","first_computed_at":"2026-05-21T01:04:21.964990Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers","license":"http://creativecommons.org/licenses/by/4.0/","headline":"MCP-Atlas introduces a benchmark with 36 real MCP servers, 220 tools, and 1,000 multi-step tasks to evaluate LLM tool-use competency.","cross_cats":["cs.AI"],"primary_cat":"cs.SE","authors_text":"Andrew Park, Ben Hertzberg, Ben Levin, Bing Liu, Brad Kenstler, Chaithanya Bandi, Chetan Rane, Daniel Yue Zhang, Dan Rambado, Divyansh Agarwal, Ernesto Gabriel Hernandez Montoya, Geobio Boo, HiJae Kim, Ivan Salazar, Jeff Da, Manasi Sharma, Martin Dimakis, MohammadHossein Rezaei, Rafael Cruz, Razvan-Gabriel Dumitru, Sami Hassaan, Tejas Polakam, Vipul Gupta","submitted_at":"2026-01-31T23:19:39Z","abstract_excerpt":"The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verif"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The claims-based rubric and internal diagnostics accurately measure genuine tool-use competency rather than surface-level answer matching or prompt-specific patterns.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MCP-Atlas introduces a benchmark with 36 real MCP servers, 220 tools, and 1,000 multi-step tasks to evaluate LLM tool-use competency.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4078255599a41d8e5e4515c27b01134da9dcd10982147e0f9b65736f093fcda5"},"source":{"id":"2602.00933","kind":"arxiv","version":3},"verdict":{"id":"41868396-43eb-4d7a-adb3-55475e3bea62","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:25:29.700997Z","strongest_claim":"We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%.","one_line_summary":"MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The claims-based rubric and internal diagnostics accurately measure genuine tool-use competency rather than surface-level answer matching or prompt-specific patterns.","pith_extraction_headline":"MCP-Atlas introduces a benchmark with 36 real MCP servers, 220 tools, and 1,000 multi-step tasks to evaluate LLM tool-use competency."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2602.00933/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2602.00933","created_at":"2026-05-21T01:04:21.965096+00:00"},{"alias_kind":"arxiv_version","alias_value":"2602.00933v3","created_at":"2026-05-21T01:04:21.965096+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.00933","created_at":"2026-05-21T01:04:21.965096+00:00"},{"alias_kind":"pith_short_12","alias_value":"NZL3RTPYQXEN","created_at":"2026-05-21T01:04:21.965096+00:00"},{"alias_kind":"pith_short_16","alias_value":"NZL3RTPYQXENOQKH","created_at":"2026-05-21T01:04:21.965096+00:00"},{"alias_kind":"pith_short_8","alias_value":"NZL3RTPY","created_at":"2026-05-21T01:04:21.965096+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":5,"internal_anchor_count":5,"sample":[{"citing_arxiv_id":"2605.12474","citing_title":"Reward Hacking in Rubric-Based Reinforcement Learning","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10866","citing_title":"OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15763","citing_title":"GLM-5: from Vibe Coding to Agentic Engineering","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09408","citing_title":"HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18292","citing_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","ref_index":6,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP","json":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP.json","graph_json":"https://pith.science/api/pith-number/NZL3RTPYQXENOQKH7VSSR442QP/graph.json","events_json":"https://pith.science/api/pith-number/NZL3RTPYQXENOQKH7VSSR442QP/events.json","paper":"https://pith.science/paper/NZL3RTPY"},"agent_actions":{"view_html":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP","download_json":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP.json","view_paper":"https://pith.science/paper/NZL3RTPY","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2602.00933&json=true","fetch_graph":"https://pith.science/api/pith-number/NZL3RTPYQXENOQKH7VSSR442QP/graph.json","fetch_events":"https://pith.science/api/pith-number/NZL3RTPYQXENOQKH7VSSR442QP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP/action/storage_attestation","attest_author":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP/action/author_attestation","sign_citation":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP/action/citation_signature","submit_replication":"https://pith.science/pith/NZL3RTPYQXENOQKH7VSSR442QP/action/replication_record"}},"created_at":"2026-05-21T01:04:21.965096+00:00","updated_at":"2026-05-21T01:04:21.965096+00:00"}