When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania · 2026 · cs.AI · arXiv 2602.16763

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, benchmarks quickly "saturate", making it difficult to differentiate models and diminishing their long-term value. In this study, we define benchmark saturation and analyze it across 60 language model benchmarks using 14 properties that relate to saturation. We find that nearly half of our benchmarks exhibit saturation, with rates increasing with age. Further, we find that resilience to saturation is impacted by expert-curation, not by public test data. Our results suggest that design choices can extend benchmark longevity and inform more durable evaluation approaches.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

cs.CY · 2026-05-29 · unverdicted · novelty 7.0

Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

SEAL revives saturated benchmarks via adaptive LLM meta-judging in elimination matches, matching full pairwise accuracy with roughly half the calls across code, math, QA, and agent tasks.

The Generalized Turing Test: A Foundation for Comparing Intelligence

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · novelty 6.0 · 2 refs

EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

cs.AI · 2026-06-29 · unverdicted · novelty 4.0

A hybrid survey and conceptual framework introduces EvalSafetyGap to organize evaluation and alignment proxy failures in LLMs, supported by an audit of 10 models showing indeterminate capability-robustness links and governance-driven safety gaps.

citing papers explorer

Showing 6 of 6 citing papers.

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

cs.CY · 2026-05-29 · unverdicted · none · ref 33 · internal anchor

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor

The Generalized Turing Test: A Foundation for Comparing Intelligence

cs.AI · 2026-05-11 · unverdicted · none · ref 10 · internal anchor

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · none · ref 21 · 2 links · internal anchor

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

cs.AI · 2026-04-09 · unverdicted · none · ref 4 · internal anchor

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

cs.AI · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
