Holistic agent leaderboard: The missing infrastructure for AI agent evaluation

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, et al · 2025 · arXiv 2510.11977

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

support: 1 · use method: 1

representative citing papers

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

cs.AI · 2026-05-07 · conditional · novelty 6.0

BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

AuditRepairBench supplies a large trace corpus and four screening methods that reduce evaluator-channel ranking instability in agent repair leaderboards by a mean of 62%.

MarketBench: Evaluating AI Agents as Market Participants

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from added context.

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

cs.CY · 2026-02-19 · accept · novelty 6.0

The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.

ClinQueryAgent: A Conversational Agent for Population Health Management

cs.IR · 2026-04-13 · unverdicted · novelty 4.0

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

cs.AI · 2026-05-19

citing papers explorer

Showing 9 of 9 citing papers.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · none · ref 14

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · none · ref 12

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

cs.LG · 2026-05-11 · unverdicted · none · ref 8

Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

cs.AI · 2026-05-07 · conditional · none · ref 20

BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.

AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair

cs.AI · 2026-05-06 · unverdicted · none · ref 48

AuditRepairBench supplies a large trace corpus and four screening methods that reduce evaluator-channel ranking instability in agent repair leaderboards by a mean of 62%.

MarketBench: Evaluating AI Agents as Market Participants

cs.AI · 2026-04-26 · unverdicted · none · ref 3

LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from added context.

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

cs.CY · 2026-02-19 · accept · none · ref 67

The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.

ClinQueryAgent: A Conversational Agent for Population Health Management

cs.IR · 2026-04-13 · unverdicted · none · ref 111

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

cs.AI · 2026-05-19 · unreviewed · ref 6
