Position: State-of-the-Art Claims Require State-of-the-Art Evidence

YongKyung Oh

arxiv: 2605.17273 · v2 · pith:4SODO3Y4new · submitted 2026-05-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 13:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords SOTA claims · AI benchmarking · leaderboard analysis · model comparison · effect size · consistency · robustness · outlier datasets

The pith

Top AI models often claim superiority based on average scores that do not hold up across tasks or datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the common practice of declaring state-of-the-art results from small average improvements on public benchmarks. It shows that these claims rest on unstated assumptions: that the outperformance is consistent across tasks, that the effect size is meaningful, and that the lead is stable when tasks are removed. Examination of ten cross-domain leaderboards reveals that more than half of top-model comparisons miss at least one of these properties, with gains frequently driven by outlier datasets. Readers should care because this gap affects how progress is measured and reported throughout machine learning research. The work calls not for new experiments but for reporting that matches the actual strength of the evidence.

Core claim

Analyzing ten cross-domain benchmarks from public leaderboards shows that in more than half of top-model comparisons at least one commonly assumed property of superiority fails to hold. These properties include meaningful effect size, consistency across tasks, and robustness to removal of individual datasets. Aggregate gains are frequently driven by outlier datasets even when the benchmark contains many tasks. The paper concludes that claim language should reflect the strength of the underlying evidence.

What carries the argument

The three properties used to test each top-model comparison: meaningful effect size in the score difference, consistency of superiority across most tasks, and robustness of the lead when any single dataset is removed from the average.
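
A minimal sketch of how the three checks could be computed for one pair of models from per-task scores. The function names, the paired form of Cohen's d, the leave-one-out lead check, and the effect-size cutoff of 0.2 are illustrative assumptions, not the paper's definitions; only the win-rate default of 0.6 appears in the Figure 2 caption. Higher scores are assumed to be better, and the scores are made up.

```python
from statistics import mean, stdev

def cohens_d_paired(a, b):
    """Paired effect size: mean per-task difference divided by its standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        # Identical differences on every task: report an unbounded or zero effect.
        if diffs[0] > 0:
            return float("inf")
        return float("-inf") if diffs[0] < 0 else 0.0
    return mean(diffs) / sd

def win_rate(a, b):
    """Fraction of tasks on which model A scores strictly higher than model B."""
    return sum(x > y for x, y in zip(a, b)) / len(a)

def lead_survives_leave_one_out(a, b):
    """True if A's mean stays above B's mean after dropping any single task."""
    for i in range(len(a)):
        a_rest = a[:i] + a[i + 1:]
        b_rest = b[:i] + b[i + 1:]
        if mean(a_rest) <= mean(b_rest):
            return False
    return True

def diagnose(a, b, d_min=0.2, w_min=0.6):
    """Apply the three checks; d_min is a placeholder, w_min follows Figure 2."""
    return {
        "magnitude": cohens_d_paired(a, b) >= d_min,
        "consistency": win_rate(a, b) >= w_min,
        "robustness": lead_survives_leave_one_out(a, b),
    }

# Hypothetical per-task scores: A has the higher mean only because of the last task.
scores_a = [0.70, 0.71, 0.69, 0.70, 0.95]
scores_b = [0.71, 0.72, 0.70, 0.71, 0.70]
report = diagnose(scores_a, scores_b)
print(report)
```

In this toy case model A has the higher mean yet wins on one task out of five, and its lead disappears once that task is removed, which is the pattern the paper attributes to outlier datasets.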

If this is right

Mean score rankings alone do not establish broad model superiority.
Aggregate improvements should be checked for dependence on individual tasks.
Papers should qualify state-of-the-art language to match the actual evidence shown.
The same fragility appears even in benchmarks that contain many tasks.
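
The check for dependence on individual tasks can be sketched as follows: recompute the mean gap with each task left out and list the tasks whose removal erases the lead. The task names and scores are hypothetical, and this is one simple way to run such a check rather than the paper's procedure.

```python
def leave_one_out_gaps(a, b, task_names):
    """Mean gap (A minus B) recomputed with each task removed in turn."""
    n = len(a)
    total = sum(x - y for x, y in zip(a, b))
    return {
        name: (total - (x - y)) / (n - 1)
        for name, x, y in zip(task_names, a, b)
    }

def lead_breaking_tasks(a, b, task_names):
    """Tasks whose removal leaves A with no positive mean gap over B."""
    gaps = leave_one_out_gaps(a, b, task_names)
    return [name for name, gap in gaps.items() if gap <= 0]

# Hypothetical benchmark with four tasks, scores in percent.
tasks = ["task_1", "task_2", "task_3", "task_4"]
model_a = [62.0, 58.5, 71.0, 90.0]
model_b = [63.0, 59.0, 72.0, 70.0]

gaps = leave_one_out_gaps(model_a, model_b, tasks)
breaking = lead_breaking_tasks(model_a, model_b, tasks)
print(gaps)
print(breaking)
```

A non-empty result means the aggregate improvement depends on the listed tasks and should be reported with that qualification.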

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gaps between aggregate metrics and claimed superiority could appear in other fields that rank systems by averages.
Requiring per-task breakdowns and outlier checks as standard reporting would change how models are compared in practice.
Existing published state-of-the-art results could be revisited with these three properties to assess how many hold.

Load-bearing premise

The ten selected benchmarks and the three checked properties stand in for what most state-of-the-art claims implicitly require across the wider literature.

What would settle it

Repeating the same analysis on the ten benchmarks or a fresh set and finding that the majority of top comparisons satisfy all three properties simultaneously would falsify the reported gap.
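
A sketch of the tally such a replication would need, using the fragility rate as described in the Figure 3 caption (the proportion of model pairs where at least one test fails). The per-pair results below are invented for illustration, and the dictionary keys are assumed names.

```python
def fragility_rate(pair_results):
    """Share of model pairs for which at least one diagnostic test fails."""
    fragile = sum(not all(result.values()) for result in pair_results)
    return fragile / len(pair_results)

# Hypothetical outcomes for four top-model comparisons (True = test passed).
pairs = [
    {"magnitude": True, "consistency": True, "robustness": True},
    {"magnitude": True, "consistency": False, "robustness": True},
    {"magnitude": False, "consistency": False, "robustness": True},
    {"magnitude": True, "consistency": True, "robustness": False},
]

rate = fragility_rate(pairs)
majority_pass_all = (1 - rate) > 0.5
print(rate, majority_pass_all)
```

Under this framing the reported gap would be contradicted only if `majority_pass_all` came out true on the replicated data.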

Figures

Figures reproduced from arXiv: 2605.17273 by YongKyung Oh.

**Figure 1.** Figure 1 illustrates the scale of the problem. The number of accepted papers at major AI conferences has grown rapidly. Throughout this expansion, a substantial proportion of papers, frequently exceeding 30% at major venues, explicitly claim sta…

Author affiliation: Medical & Imaging Informatics (MII), University of California, Los Angeles (UCLA). Correspondence to: YongKyung Oh <yongkyungoh@mednet.ucla.edu>. Preprint. May 19, 2026.

**Figure 2.** Distribution of diagnostic metrics across all pairwise comparisons on HELM MMLU. Bouthillier et al. (2021) recommend a threshold of P(A > B) ≥ 0.75 for claiming superiority, whereas our default τw = 0.6 is deliberately more lenient. The win rate measures how often the supposedly superior model wins across tasks. Failure indicates that A outperforms on only a minority of tasks despite a higher average scor…

**Figure 3.** Summary of violation rates for each diagnostic test on HELM MMLU. Fragility rate indicates the proportion of model pairs where at least one test fails. These distributions illustrate that the three tests capture distinct failure modes. For instance, a model pair may show consistent wins but a small effect size. Alternatively, it may exhibit large effects but poor stability. This heterogeneity … (footnote 1: https://crfm…)

**Figure 4.** Violation rates on HELM MMLU under varying analysis conditions. The purple line indicates the overall fragility rate.

**Figure 5.** Sensitivity to threshold selection on HELM MMLU. Each panel varies one threshold while holding others at defaults. … practices accurately communicate ranking uncertainty. We reconstructed pairwise comparison matrices using public performance logs with a cut-off date of December 31, 2025.

**Figure 6.** Violation rates by diagnostic test across ten benchmarks. The rightmost bar indicates the overall fragility rate.

**Figure 7.** Sensitivity analysis of violation rates to threshold selection on the HELM MMLU leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 8.** Sensitivity analysis of violation rates to threshold selection on the LiveBench leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 9.** Sensitivity analysis of violation rates to threshold selection on the Open ASR leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 10.** Sensitivity analysis of violation rates to threshold selection on the Open VLM leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 11.** Sensitivity analysis of violation rates to threshold selection on the VBench leaderboard.

**Figure 12.** Sensitivity analysis of violation rates to threshold selection on the TabArena Binary leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 13.** Sensitivity analysis of violation rates to threshold selection on the TabArena Multiclass leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 14.** Sensitivity analysis of violation rates to threshold selection on the TabArena Regression leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 15.** Sensitivity analysis of violation rates to threshold selection on the TSFM (MAE) leaderboard. Panels: (a) Magnitude vs. Cohen's d threshold d; (b) Consistency vs. win-rate threshold w; a further panel vs. Breakdown Point Ratio threshold b. Y-axis: violation rate, shown with the fragility rate.

**Figure 16.** Sensitivity analysis of violation rates to threshold selection on the TSFM (MSE) leaderboard.

Original abstract

State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA carries implicit assumptions beyond mean score superiority, suggesting that a model meaningfully outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly assumed property of superiority does not hold. These properties include meaningful effect size, consistency across tasks, or robustness to dataset removal. Instead, aggregate gains are frequently driven by outlier datasets. This fragility persists even in benchmarks with many tasks. We argue that claim language should reflect the strength of the underlying evidence. This requires no additional experiments, only honest reporting of what results actually show, enabling more precise and interpretable comparisons across models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that SOTA assertions in AI/ML rest on insufficient evidence from benchmark leaderboards, where marginal mean-score improvements are taken to imply meaningful superiority. Analysis of ten cross-domain benchmarks shows that in more than half of top-model comparisons at least one implicit superiority property fails (meaningful effect size, cross-task consistency, or robustness to single-dataset removal), with aggregate gains often driven by outliers. The authors conclude that claim language should be calibrated to the actual strength of the supporting evidence.

Significance. If the sampled benchmarks are representative and the chosen properties align with typical implicit assumptions in SOTA language, the work could encourage more precise reporting of benchmark results and reduce overstated claims across the literature. The empirical audit of public leaderboards without new experiments is a methodological strength that supports reproducibility.

major comments (2)

[Methods] Methods section (benchmark selection): The manuscript provides no explicit protocol or inclusion criteria for choosing the ten cross-domain benchmarks. Because the central claim generalizes from an observed >50% failure rate to a 'widespread' claim-evidence gap, the absence of a documented selection procedure leaves open the possibility that the sample is not representative of benchmarks routinely invoked in SOTA statements.
[Results] Results section (sensitivity of thresholds): No sensitivity table or analysis is reported for the specific operationalizations of the three properties (e.g., effect-size cutoff, consistency metric, or leave-one-out vs. leave-two-out robustness). The headline finding is therefore tied to one particular choice of thresholds; without this check the generalization to typical SOTA claims remains vulnerable.

minor comments (2)

[Abstract] Abstract: The phrase 'more than half' could be replaced by the exact count or proportion of failing comparisons to improve precision.
[Figures] Figure legends: Legends for any plots showing per-dataset contributions should explicitly label which property (effect size, consistency, or robustness) is being visualized in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We believe the suggested revisions will improve the clarity and robustness of our analysis. We address each major comment below.

Referee: [Methods] Methods section (benchmark selection): The manuscript provides no explicit protocol or inclusion criteria for choosing the ten cross-domain benchmarks. Because the central claim generalizes from an observed >50% failure rate to a 'widespread' claim-evidence gap, the absence of a documented selection procedure leaves open the possibility that the sample is not representative of benchmarks routinely invoked in SOTA statements.

Authors: We agree that documenting the selection process is necessary for transparency and to support the generalization. In the revised version, we will add an explicit protocol in the Methods section. The ten benchmarks were selected because they (i) are publicly accessible leaderboards, (ii) cover diverse AI domains including natural language processing, computer vision, and multimodal tasks, (iii) include a sufficient number of tasks (at least 5) to enable meaningful consistency and robustness evaluations, and (iv) are among the most commonly referenced in recent literature on SOTA model comparisons. While not exhaustive, this selection targets benchmarks that underpin many SOTA claims. We will also add a discussion of the limitations regarding representativeness.
revision: yes

Referee: [Results] Results section (sensitivity of thresholds): No sensitivity table or analysis is reported for the specific operationalizations of the three properties (e.g., effect-size cutoff, consistency metric, or leave-one-out vs. leave-two-out robustness). The headline finding is therefore tied to one particular choice of thresholds; without this check the generalization to typical SOTA claims remains vulnerable.

Authors: We concur that sensitivity analysis is valuable to demonstrate that the findings are not artifacts of specific threshold choices. We will incorporate a sensitivity analysis in the Results section or as an appendix. This will examine variations in the effect size threshold (e.g., small vs. medium effects), the proportion of tasks required for consistency (e.g., 60%, 70%, 80%), and robustness to removing one or two datasets. Preliminary checks indicate that the proportion of comparisons failing at least one property remains above 50% across reasonable variations, but we will report the full results to allow readers to assess the stability of the conclusions.
revision: yes
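
A threshold sweep of the kind described here can be sketched as below: one threshold is varied while the others stay at their defaults, following the description in the Figure 5 caption. The per-pair metric values are invented, the keys `d`, `w`, and `b` are assumed shorthand for Cohen's d, win rate, and Breakdown Point Ratio, and the default cutoffs for `d` and `b` are placeholders (only the win-rate default of 0.6 is stated in the paper's captions). Each metric is assumed to pass when it meets or exceeds its threshold.

```python
# Only w = 0.6 comes from the paper; the d and b defaults are placeholders.
DEFAULTS = {"d": 0.2, "w": 0.6, "b": 0.2}

def fragility_rate(pairs, thresholds):
    """Share of pairs whose metrics fall below at least one threshold."""
    fragile = sum(
        any(pair[key] < thresholds[key] for key in thresholds) for pair in pairs
    )
    return fragile / len(pairs)

def sweep(pairs, name, values, defaults=DEFAULTS):
    """Vary one threshold over `values` while holding the others at their defaults."""
    return {v: fragility_rate(pairs, {**defaults, name: v}) for v in values}

# Hypothetical precomputed metrics for four model pairs.
pairs = [
    {"d": 0.50, "w": 0.90, "b": 0.40},
    {"d": 0.30, "w": 0.65, "b": 0.30},
    {"d": 0.15, "w": 0.70, "b": 0.25},
    {"d": 0.45, "w": 0.55, "b": 0.10},
]

curve = sweep(pairs, "w", [0.6, 0.7, 0.8])
print(curve)
```

Reporting the whole curve rather than a single point lets readers see whether a headline fragility rate depends on where the cutoff was placed.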

Circularity Check

0 steps flagged

Empirical audit of leaderboards with no derivation or self-referential reduction

full rationale

The manuscript performs a direct empirical audit of ten public leaderboards, counting how often top-model comparisons fail one of three pre-specified properties (effect size, consistency, robustness to removal). No equations, fitted parameters, or mathematical derivations appear; the central observation is simply the fraction of cases meeting the failure criteria in the chosen data. The paper cites no prior work by the same authors to justify uniqueness or ansatz, and the selection of benchmarks and properties is presented as an operational choice rather than a result derived from the analysis itself. Because the findings rest on external leaderboard data rather than on any quantity constructed from the paper's own outputs or self-citations, the claim-evidence gap argument does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the choice of ten benchmarks and the definition of 'commonly assumed property of superiority'; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Mean score superiority on a benchmark implies an implicit claim of broad outperformance across tasks.
Stated in the abstract as the gap between claim language and evidence.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We apply three well-established concepts from statistics... Cohen’s d... Win Rate... Breakdown Point... fragility rate F

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Analyzing ten cross-domain benchmarks... more than half of top-model comparisons... at least one commonly assumed property of superiority does not hold

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.