Do Large Language Model Benchmarks Test Reliability?

Aleksander Madry; Edward Vendrow; Joshua Vendrow; Sara Beery

arxiv: 2502.03461 · v1 · pith:RNLXS2ABnew · submitted 2025-02-05 · 💻 cs.LG · cs.CL

keywords: benchmarks · models · reliability · failures · llms · model · been · errors

When deploying large language models (LLMs), it is important to ensure that these models are not only capable but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flaws in the LLM Automation Narrative
stat.OT · 2026-06 · unverdicted · novelty 7.0
A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

Auditing LLM Benchmarks with Item Response Theory
cs.CL · 2026-05 · unverdicted · novelty 6.0
An IRT-based detector identifies mislabeled examples in LLM benchmarks at 95% precision in the top 200 cases, outperforming supervised classifiers and revealing reward-model specialization on style over facts.

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
cs.CL · 2026-04 · unverdicted · novelty 6.0
Item-level Reliable Change Index analysis shows that LLM version upgrades result in bidirectional performance shifts on individual questions, making aggregate accuracy gains the net residual of improvements and deteri...

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI · 2025-07 · unverdicted · novelty 6.0
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

Benchmarking Misuse Mitigation Against Covert Adversaries
cs.CR · 2025-06 · unverdicted · novelty 6.0
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
cs.AI · 2026-04 · unverdicted · novelty 5.0
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.

LLM Reasoning Is Latent, Not the Chain of Thought
cs.AI · 2026-04 · unverdicted · novelty 5.0
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

Position: AI Evaluations Should be Grounded on a Theory of Capability
cs.AI · 2025-09 · conditional · novelty 5.0
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

Kimi K2: Open Agentic Intelligence
cs.LG · 2025-07 · unverdicted · novelty 5.0
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

Measuring AI Reasoning: A Guide for Researchers
cs.AI · 2026-05 · unverdicted · novelty 4.0
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.