When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Anka Reuel; Arjun Subramonian; Avijit Ghosh; Chenxi Whitehouse; Christina Knight; Dayeon Ki; Eliya Habba; Hossein A. Rahmani; Irene Solaiman; Jan Batzner

arxiv: 2602.16763 · v2 · pith:EA7TJTSPnew · submitted 2026-02-18 · 💻 cs.AI

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav

Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

classification 💻 cs.AI

keywords benchmarks, saturation, benchmark, find, model, across, analyze, approaches

Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, benchmarks quickly "saturate", making it difficult to differentiate models and diminishing their long-term value. In this study, we define benchmark saturation and analyze it across 60 language model benchmarks using 14 properties that relate to saturation. We find that nearly half of our benchmarks exhibit saturation, with rates increasing with age. Further, we find that resilience to saturation is impacted by expert curation, not by public test data. Our results suggest that design choices can extend benchmark longevity and inform more durable evaluation approaches.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

EnactToM benchmark reveals frontier AI models achieve 0% on functional Theory of Mind task completion in embodied multi-agent settings despite 45% average on literal belief probes.
The Generalized Turing Test: A Foundation for Comparing Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
cs.AI 2026-05 conditional novelty 6.0

EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
cs.AI 2026-04 unverdicted novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.