The Leaderboard Illusion

Ahmet Üstün; Alex Wang; Beyza Ermis; Daniel D'Souza; Marzieh Fadaee; Noah A. Smith; Sanmi Koyejo; Sara Hooker; Sayash Kapoor; Shayne Longpre

arXiv: 2504.20879 · v2 · pith:HLO3K46Anew · submitted 2025-04-29 · 💻 cs.AI · cs.CL · cs.LG · stat.ME


Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker


classification: cs.AI · cs.CL · cs.LG · stat.ME

keywords: arena · data · chatbot · field · models · providers · access · dynamics


Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
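The selective-disclosure claim in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's method: it assumes every private variant has the same true score and that measured scores differ only by Gaussian noise. The function name `simulate_best_of_n`, the true score of 1200, and the noise level of 15 points are hypothetical choices for illustration; only the count of 27 variants comes from the abstract.

```python
import random
import statistics


def simulate_best_of_n(true_score, noise_sd, n_variants, n_trials, seed=0):
    """Mean published score when only the best of n_variants noisy scores is disclosed.

    Assumption (hypothetical): all variants share the same true score, so any
    gap between the published score and true_score is pure selection bias.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_trials):
        # Each privately tested variant gets a noisy measurement of the same skill.
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        # The provider discloses only the highest-scoring variant.
        published.append(max(scores))
    return statistics.mean(published)


# Hypothetical numbers: true score 1200, measurement noise of 15 points.
single = simulate_best_of_n(1200.0, 15.0, n_variants=1, n_trials=5000)
best_of_27 = simulate_best_of_n(1200.0, 15.0, n_variants=27, n_trials=5000)

print(f"one variant disclosed: {single:.1f}")
print(f"best of 27 disclosed:  {best_of_27:.1f}")
```

Under these assumptions the single-variant average stays close to the true score, while the best-of-27 average sits visibly above it even though no variant is actually better. The size of the gap depends entirely on the assumed noise level.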



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency
stat.ME · 2026-04 · unverdicted · novelty 8.0
LLM pairwise evaluation is recast as low-rank tensor completion, yielding semiparametric efficient estimators and asymptotic normality for ability functionals via a score-whitening correction for anisotropic operators.

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
cs.CL · 2026-06 · unverdicted · novelty 7.0
LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
cs.CL · 2026-05 · unverdicted · novelty 7.0
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness
cs.LG · 2026-05 · unverdicted · novelty 7.0
Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.

Validity Threats for Foundation Model Research
cs.LG · 2026-06 · accept · novelty 6.0
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

VERA-MH Concept Paper
cs.CY · 2025-10 · unverdicted · novelty 5.0
VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
cs.CL · 2026-04 · unverdicted · novelty 4.0
A scoping review organizes decades of NLP evaluation debates into a taxonomy of recurring concerns and trade-offs with a structured checklist for better evaluation design.

Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
cs.CR · 2025-10 · unverdicted · novelty 4.0
LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.

Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
cs.CL · 2026-06 · unverdicted · novelty 2.0
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.