MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
ToolOmni combines supervised fine-tuning on a cold-start multi-turn dataset with Decoupled Multi-Objective GRPO to enable proactive retrieval and grounded execution, yielding +10.8% higher end-to-end tool-use success and better generalization to unseen tools.
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
citing papers explorer
-
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
-
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
-
ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution
ToolOmni combines supervised fine-tuning on a cold-start multi-turn dataset with Decoupled Multi-Objective GRPO to enable proactive retrieval and grounded execution, yielding +10.8% higher end-to-end tool-use success and better generalization to unseen tools.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.