
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, benchmarks quickly "saturate", making it difficult to differentiate models and diminishing their long-term value. In this study, we define benchmark saturation and analyze it across 60 language model benchmarks using 14 properties that relate to saturation. We find that nearly half of our benchmarks exhibit saturation, with rates increasing with age. Further, we find that resilience to saturation is affected by expert curation, not by public test data. Our results suggest that design choices can extend benchmark longevity and inform more durable evaluation approaches.

citation-role summary

background 1 method 1

citation-polarity summary

years

2026 7


representative citing papers
