The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Adil Amin

arXiv: 2605.18840 · v2 · pith:HFKFWDX4new · submitted 2026-05-13 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 21:49 UTC · model grok-4.3

keywords: frontier models · capability coupling · benchmark decomposition · h-field residual · SWE-bench · GPQA · scaling transitions · model releases

The pith

Across frontier models, coding and reasoning capabilities cooperate (r = +0.72) rather than trade off, and labs differ in how efficiently gains in one skill convert into gains in the other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes scores on coding and reasoning benchmarks into a shared population trend and lab-specific residuals to show that progress in one area tends to reinforce the other. This matters because separate leaderboards miss how labs actually steer emphasis and when benchmarks lose their ability to distinguish models. The analysis covers 34 models across 10 labs and uses out-of-sample releases to confirm the pattern holds while identifying saturation points and next measurements to add.

Core claim

Paired SWE-bench and GPQA Diamond scores across frontier releases break down into a population coupling trend with correlation r = +0.72 and a per-release h-field residual that tracks shifts in capability emphasis. Per-lab coupling slopes vary by a factor of five, with examples of reversal in emphasis, consistent focus, or oscillation. A second transition appears in open-weight models between 30B and 72B parameters, SWE-bench shows saturation while other tests retain spread, and five April 2026 releases raise the observed correlation to +0.75.

What carries the argument

The h-field residual obtained from linear decomposition of paired benchmark scores, which isolates per-release capability emphasis after removing the shared population coupling trend.
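As a minimal sketch of that decomposition, the snippet below computes the h-field as the vertical residual from the frozen population line quoted in the Figure 1 caption (GPQA = 0.513 · SWE + 46.4). The names `h_field` and `emphasis`, the tolerance band, and the example scores are hypothetical placeholders rather than anything taken from the paper. The sign convention (positive h means more GPQA than the trend predicts, i.e. reasoning-rich) is inferred from the Figure 1 labels.

```python
# Frozen population trend, as quoted in the Figure 1 caption.
SLOPE = 0.513      # GPQA points per SWE-bench point
INTERCEPT = 46.4   # GPQA points at SWE-bench = 0


def h_field(swe: float, gpqa: float) -> float:
    """Vertical residual (in pp) of a release from the frozen trend line."""
    predicted_gpqa = SLOPE * swe + INTERCEPT
    return gpqa - predicted_gpqa


def emphasis(h: float, tol: float = 1.0) -> str:
    """Label a residual; the tolerance band `tol` is an illustrative assumption."""
    if h > tol:
        return "reasoning-rich"
    if h < -tol:
        return "coding-rich"
    return "on-trend"


# Hypothetical (SWE-bench Verified, GPQA Diamond) scores, for illustration only.
releases = {
    "model_a": (70.0, 88.0),
    "model_b": (78.0, 80.0),
}

for name, (swe, gpqa) in releases.items():
    h = h_field(swe, gpqa)
    print(f"{name}: h = {h:+.1f} pp ({emphasis(h)})")
```

Because the coefficients are constants rather than refit values, a new release can be scored without touching the original sample.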

If this is right

Capabilities cooperate rather than trade off, so gains in coding reliably accompany gains in reasoning at the current frontier.
Per-lab coupling slopes differ up to fivefold, meaning some labs convert coding improvements into reasoning gains more efficiently than others.
SWE-bench is saturating while HLE and instruction-following retain spread, indicating the next axis rotation should prioritize those tests.
Open-weight models exhibit a second capability transition between 30B and 72B parameters.
The three-level playbook and per-lab priority table can guide which stress test to add next.
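The saturation point in the list above can be checked with a simple spread calculation. The sketch below assumes "spread" means the max-minus-min range of a benchmark among the top-k releases ranked by SWE-bench, which is how the Figure 4 caption reads; the function name `top_k_spread` and all scores are hypothetical.

```python
def top_k_spread(rows, rank_by, measure, k=5):
    """Max-minus-min spread of `measure` among the k rows ranked highest on `rank_by`."""
    top = sorted(rows, key=lambda row: row[rank_by], reverse=True)[:k]
    values = [row[measure] for row in top]
    return max(values) - min(values)


# Hypothetical scores (percentage points); not taken from the paper.
rows = [
    {"model": "a", "swe": 81.0, "gpqa": 91.0},
    {"model": "b", "swe": 80.5, "gpqa": 84.0},
    {"model": "c", "swe": 80.0, "gpqa": 88.5},
    {"model": "d", "swe": 79.8, "gpqa": 82.0},
    {"model": "e", "swe": 79.5, "gpqa": 90.0},
    {"model": "f", "swe": 72.0, "gpqa": 75.0},
]

swe_spread = top_k_spread(rows, rank_by="swe", measure="swe")
gpqa_spread = top_k_spread(rows, rank_by="swe", measure="gpqa")
print(f"top-5 SWE spread:  {swe_spread:.1f} pp")
print(f"top-5 GPQA spread: {gpqa_spread:.1f} pp")
```

A benchmark whose top-k spread shrinks toward measurement noise is no longer separating the leading models, which is the sense in which an axis "saturates" here.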

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Labs could deliberately adjust their training emphasis to target specific h-field values once the diagnostic is adopted.
The decomposition method could extend to other benchmark pairs to map multi-skill coupling surfaces.
If the observed cooperation persists, model releases may begin to include explicit statements of intended capability balance.
The interactive dashboard's phase classification could become a standard way to compare development trajectories across labs.

Load-bearing premise

The linear decomposition of paired benchmark scores into a population trend plus per-release residual accurately isolates capability emphasis without being driven by unmodeled factors such as test overlap or training data contamination.

What would settle it

A drop in the SWE-bench to GPQA correlation below 0.5 across the next wave of frontier releases, or failure of the April 2026 out-of-sample correlation to reach +0.75.
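The first criterion can be expressed as a small check. The sketch below computes a plain Pearson correlation on paired scores and compares it with the 0.5 threshold named above; the helper names and the example data are hypothetical, and how the "next wave" of releases would be selected is left open, as it is in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined: one variable is constant")
    return sxy / sqrt(sxx * syy)


def coupling_survives(swe, gpqa, threshold=0.5):
    """True if the observed coupling stays at or above the falsification threshold."""
    return pearson_r(swe, gpqa) >= threshold


# Hypothetical paired scores for a future wave of releases.
swe = [45.0, 55.0, 62.0, 70.0, 76.0, 80.0]
gpqa = [68.0, 75.0, 77.0, 84.0, 83.0, 89.0]
r = pearson_r(swe, gpqa)
print(f"r = {r:+.2f}, coupling survives: {coupling_survives(swe, gpqa)}")
```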

Figures

Figures reproduced from arXiv: 2605.18840 by Adil Amin.

**Figure 1.** Frontier coupling: 34 March models + 5 April post-cutoff, 10 labs. (a) SWE-bench Verified vs. GPQA Diamond with frozen regression (GPQA = 0.513 · SWE + 46.4, r = +0.72). Circles: March-frozen models. Diamonds (red edge): April post-cutoff (not used in fit). (b) Per-lab h-field residual (core models): Google reasoning-rich (h = +5.5), Anthropic coding-rich (h = −6.9). (c) Anthropic trajectory including post…

**Figure 2.** The capability cascade: four transitions, one pattern. At each critical scale, the active benchmark pair changes and coupling undergoes a qualitative shift. Nc1 (∼3.5B): HS-TQA coupling flips sign. Nc2 (∼30–72B): cooperation crashes 59%. Nc3 (∼114B): SWE saturates, HLE activates. Nc4 (∼200–400B, predicted): IFEval saturates, next axis TBD. Engineering levers differ at each transition.

**Figure 3.** Nc2 cascade: second capability transition at 30–72B. OPT (left): gradual rise → peak at 13B → drop at 30B → partial recovery at 66B. Llama-2 (right): flat maximum at 7B–13B → sharp crash at 70B. Same pattern, different mechanism, same net effect. The dimensional handoff is visible in the pairwise coupling structure. SWE–GPQA cooperate (r = +0.85), GPQA–HLE cooperate (r = +0.72), but SWE–HLE are decoupled (…

**Figure 4.** Asymmetric saturation at the frontier. (a) Among the top-5 SWE-bench models, coding scores compress to a 1.3-pp spread while GPQA retains 9.1 pp of variation: SWE is losing discriminatory power. (b) Benchmark spread among top-5 models: SWE is saturating, GPQA is active, and HLE (26.4 pp spread) may be the next activating axis.

**Figure 5.** Base-model foundation (from Amin [2026]). The coupling regime transition underlying CAPE: below a critical scale, reasoning and truthfulness anticorrelate; above, they cooperate. All frontier models sit in the cooperative regime.

Original abstract

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024–2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot: DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA–IFEval isocline). The $h$-field is not just diagnostic; it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
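The abstract's closing check (post-cutoff releases falling inside a 95% prediction interval) can be illustrated with the textbook prediction interval for simple linear regression. This is a sketch under stated assumptions: the abstract does not say how the interval was constructed, the training pairs below are hypothetical, and the Student-t critical value is passed in by hand (2.776 is the two-sided 95% value for 4 degrees of freedom, matching the six hypothetical training points).

```python
from math import sqrt


def prediction_interval(train, x0, t_crit):
    """Two-sided OLS prediction interval for a new observation at x0.

    train  : list of (x, y) pairs used to fit the line
    t_crit : Student-t critical value with len(train) - 2 degrees of freedom
    """
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in train)
    s = sqrt(sse / (n - 2))  # residual standard error
    half = t_crit * s * sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    centre = slope * x0 + intercept
    return centre - half, centre + half


def within_interval(train, new_point, t_crit):
    """True if a held-out (x, y) point lies inside the prediction interval."""
    x0, y0 = new_point
    lo, hi = prediction_interval(train, x0, t_crit)
    return lo <= y0 <= hi


# Hypothetical (SWE-bench, GPQA) training pairs; not the paper's 34 models.
train = [(40.0, 66.0), (50.0, 73.0), (60.0, 76.0),
         (65.0, 81.0), (70.0, 82.0), (75.0, 86.0)]
T_CRIT = 2.776  # two-sided 95% Student-t value for 4 degrees of freedom

print(within_interval(train, (68.0, 82.0), T_CRIT))
print(within_interval(train, (68.0, 70.0), T_CRIT))
```

The interval widens with distance from the training mean, so a release far beyond the fitted SWE-bench range passes this check more easily than one near the centre.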

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper decomposes paired coding and reasoning scores into a shared trend plus per-release residual to flag when benchmarks lose separation and to compare lab efficiency, with some out-of-sample support but open questions on what the residual actually captures.

The main point is that coding and reasoning scores move together across 34 frontier models rather than trading off, and the strength of that link differs by lab. The author splits the paired SWE-bench and GPQA numbers into a population-level linear trend and a residual called the h-field, then uses the residual to track emphasis shifts such as DeepSeek swinging 15.9 percentage points toward coding. The paper also notes SWE-bench saturation and gives lab-specific slopes that range fivefold, plus an out-of-sample check on five April 2026 releases where the correlation rises slightly, from +0.72 to +0.75. The dashboard and seven timestamped predictions add concrete next steps for anyone watching releases.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that decomposing paired SWE-bench and GPQA Diamond scores from 34 models across 10 labs into a population-level linear coupling trend and per-release h-field residuals reveals cooperating capabilities (r = +0.72, p < 10^{-6}) with lab-specific variations in coupling slopes (varying 5x) and emphasis shifts (e.g., DeepSeek's 15.9 percentage point swing). It provides out-of-sample confirmation on five April 2026 releases (r rising to +0.75), identifies benchmark saturation, and offers a three-level playbook, per-lab table, seven falsifiable predictions, and an interactive dashboard for phase classification and recommendations.

Significance. If the h-field residuals genuinely capture capability emphasis shifts independent of artifacts, this work could significantly advance the field by shifting focus from static leaderboards to dynamic diagnostics of model development trajectories. The out-of-sample validation, provision of falsifiable predictions with timestamped criteria, and open dashboard are notable strengths that enhance reproducibility and testability. It addresses a timely issue in frontier model evaluation as benchmarks saturate.

major comments (3)

[Abstract (decomposition description)] The h-field is defined as the residual after fitting the population coupling trend to the paired benchmark scores (as described in the abstract). Since the per-release emphasis measure is constructed from the same paired scores used to estimate the trend, this introduces circularity in the diagnostic. The reported out-of-sample confirmation on April 2026 releases mitigates this somewhat, but the manuscript must clarify whether the coupling trend parameters are held fixed from the training set of 34 models or refit, and provide the exact fitting procedure to evaluate if the residuals isolate emphasis without tautology.
[Results on per-lab variation] The claim that per-lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23) is central to quantifying recipe efficiency, but lacks reported standard errors, p-values for the difference, or details on the regression model used (e.g., whether it accounts for model size or other covariates). Without these, the 5× variation's robustness is unclear and could be sensitive to outlier models or selection criteria.
[Out-of-sample confirmation] The abstract states that five April 2026 releases confirm the diagnostic with r rising to +0.75, but provides no detail on error bars, exact model selection criteria for the original 34 models, or whether the h-field computation occurs before or after any post-hoc adjustments. These omissions make it difficult to assess the statistical support for the central claim that the decomposition accurately diagnoses emphasis shifts.

minor comments (2)

[Abstract] The notation for the h-field residual could be clarified with an explicit equation in the main text rather than relying on the parenthetical description.
[Playbook and predictions] The seven falsifiable predictions would benefit from a dedicated table listing each prediction, its timestamped criteria, and current status for easier reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and statistical rigor of our decomposition approach. We address each major point below and have revised the manuscript accordingly to provide the requested methodological details and statistical support.

Referee: [Abstract (decomposition description)] The h-field is defined as the residual after fitting the population coupling trend to the paired benchmark scores (as described in the abstract). Since the per-release emphasis measure is constructed from the same paired scores used to estimate the trend, this introduces circularity in the diagnostic. The reported out-of-sample confirmation on April 2026 releases mitigates this somewhat, but the manuscript must clarify whether the coupling trend parameters are held fixed from the training set of 34 models or refit, and provide the exact fitting procedure to evaluate if the residuals isolate emphasis without tautology.

Authors: We agree that the abstract and main text require explicit clarification on this point to eliminate any ambiguity regarding circularity. The population-level coupling trend (slope and intercept) is estimated once via ordinary least squares on the initial set of 34 models and then held strictly fixed for all subsequent calculations, including the out-of-sample April 2026 releases. The h-field residual for any model is computed as the vertical deviation from this fixed trend line; no refitting occurs. We have added a dedicated Methods subsection that specifies the exact OLS procedure, the regression equation (GPQA ~ SWE-bench), the software implementation, and the decision to freeze parameters after the training set. This revision ensures the residuals reflect genuine emphasis shifts rather than estimation artifacts. revision: yes
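The fit-once-then-freeze protocol described in this response can be sketched as follows. The class and function names (`FrozenTrend`, `fit_ols`) and all scores are hypothetical; the response does not name the software actually used, so this is plain-Python ordinary least squares for GPQA regressed on SWE-bench, offered as an illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: parameters cannot be reassigned after the fit
class FrozenTrend:
    slope: float
    intercept: float

    def residual(self, swe: float, gpqa: float) -> float:
        """h-field: vertical deviation of a release from the fixed trend line."""
        return gpqa - (self.slope * swe + self.intercept)


def fit_ols(swe, gpqa) -> FrozenTrend:
    """Ordinary least squares for GPQA ~ SWE-bench on the training set only."""
    n = len(swe)
    mx, my = sum(swe) / n, sum(gpqa) / n
    sxx = sum((x - mx) ** 2 for x in swe)
    sxy = sum((x - mx) * (y - my) for x, y in zip(swe, gpqa))
    slope = sxy / sxx
    return FrozenTrend(slope=slope, intercept=my - slope * mx)


# Hypothetical training scores; the paper's set has 34 models.
train_swe = [40.0, 50.0, 60.0, 65.0, 70.0, 75.0]
train_gpqa = [66.0, 73.0, 76.0, 81.0, 82.0, 86.0]
trend = fit_ols(train_swe, train_gpqa)  # estimated once

# Hypothetical held-out releases, scored against the frozen parameters.
held_out = {"new_a": (78.0, 90.0), "new_b": (82.0, 85.0)}
for name, (swe, gpqa) in held_out.items():
    print(f"{name}: h = {trend.residual(swe, gpqa):+.1f} pp")
```

Held-out releases never enter `fit_ols`, so their residuals cannot influence the line they are measured against.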
Referee: [Results on per-lab variation] The claim that per-lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23) is central to quantifying recipe efficiency, but lacks reported standard errors, p-values for the difference, or details on the regression model used (e.g., whether it accounts for model size or other covariates). Without these, the 5× variation's robustness is unclear and could be sensitive to outlier models or selection criteria.

Authors: We accept that the original presentation of the 5× slope variation was insufficiently supported by statistical detail. In the revised manuscript we now report bootstrap-derived standard errors for each lab-specific slope, along with p-values for the pairwise difference between the steepest (Google) and shallowest (DeepSeek) slopes. The per-lab regressions remain simple linear models without additional covariates, as consistent model-size metadata were unavailable across all releases; we explicitly note this limitation and test robustness by showing that the 5× range persists after sequential removal of any single model. These additions appear in a new table and accompanying text in the Results section. revision: yes
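The two robustness checks named in this response, bootstrap standard errors and sequential single-model removal, could look roughly like the sketch below. It assumes a nonparametric bootstrap that resamples releases with replacement; the response does not specify the resampling scheme, and the function names and the per-lab scores are hypothetical.

```python
import random
import statistics


def ols_slope(points):
    """Slope of the least-squares line through (x, y) pairs."""
    if len({x for x, _ in points}) < 2:
        raise ValueError("slope undefined: fewer than two distinct x values")
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx


def bootstrap_slope_se(points, n_boot=2000, seed=0):
    """Standard deviation of the slope over resamples drawn with replacement."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_boot:
        sample = rng.choices(points, k=len(points))
        try:
            slopes.append(ols_slope(sample))
        except ValueError:
            continue  # degenerate resample (a single repeated x); draw again
    return statistics.stdev(slopes)


def leave_one_out_slopes(points):
    """Slope recomputed with each release removed in turn."""
    return [ols_slope(points[:i] + points[i + 1:]) for i in range(len(points))]


# Hypothetical (SWE-bench, GPQA) pairs for a single lab.
lab_points = [(50.0, 70.0), (60.0, 74.0), (68.0, 80.0), (75.0, 82.0), (80.0, 88.0)]

print(f"slope        = {ols_slope(lab_points):.3f}")
print(f"bootstrap SE = {bootstrap_slope_se(lab_points):.3f}")
loo = leave_one_out_slopes(lab_points)
print(f"leave-one-out range = [{min(loo):.3f}, {max(loo):.3f}]")
```

With only a handful of releases per lab, the bootstrap distribution is coarse, so the leave-one-out range is a useful complement to the standard error.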
Referee: [Out-of-sample confirmation] The abstract states that five April 2026 releases confirm the diagnostic with r rising to +0.75, but provides no detail on error bars, exact model selection criteria for the original 34 models, or whether the h-field computation occurs before or after any post-hoc adjustments. These omissions make it difficult to assess the statistical support for the central claim that the decomposition accurately diagnoses emphasis shifts.

Authors: We acknowledge the need for greater transparency on the out-of-sample protocol. The original 34 models comprise every frontier release from 2024 through early 2026 for which both SWE-bench and GPQA Diamond scores were publicly reported; no post-hoc filtering was applied. The coupling trend parameters were estimated solely on these 34 models and then frozen. For the five April 2026 releases, h-field values and the updated correlation (r = +0.75) were computed using the fixed parameters, with no further adjustments. We have added error bars (95% bootstrap intervals) to the reported correlation, clarified the selection criteria in the Methods, and included a supplementary table listing the exact models and scores used in both the training and confirmation sets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; decomposition is standard residual analysis with independent out-of-sample validation

The paper performs a linear regression on paired SWE-bench and GPQA Diamond scores across 34 models to extract a population-level coupling trend (r = +0.72) and per-release residuals labeled as the h-field. These residuals are then interpreted as diagnostics of capability emphasis shifts, with per-lab slope variations reported separately. This is a conventional statistical decomposition rather than a self-referential construction: the trend is estimated from the full sample, residuals are the explicit deviations, and the central claims are supported by an independent out-of-sample test on five April 2026 releases where the correlation rises to +0.75. No load-bearing step reduces to a fitted parameter being renamed as a prediction, no self-citation chain justifies a uniqueness claim, and no ansatz is smuggled in. The derivation remains self-contained against external benchmarks and does not equate its outputs to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that benchmark scores can be linearly decomposed into a stable population trend plus independent residuals, plus standard statistical assumptions for correlation significance. No new physical entities are postulated.
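Written out, the assumed decomposition is a one-predictor linear model. The symbols $\alpha$, $\beta$, and the index $i$ are notation introduced here for illustration; the fitted values are the ones quoted in the Figure 1 caption.

```latex
\begin{align}
  \mathrm{GPQA}_i &= \beta\,\mathrm{SWE}_i + \alpha + h_i, \\
  h_i &= \mathrm{GPQA}_i - \bigl(\hat{\beta}\,\mathrm{SWE}_i + \hat{\alpha}\bigr),
  \qquad \hat{\beta} = 0.513,\quad \hat{\alpha} = 46.4 .
\end{align}
```

Here $h_i$ is the per-release h-field residual for release $i$, measured in percentage points of GPQA.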

free parameters (1)

population coupling slope
The linear trend relating SWE-bench and GPQA scores across the 34 models is estimated from the data itself.

axioms (1)

domain assumption: Paired benchmark scores are sufficiently independent of release timing and data contamination to allow clean residual extraction.
Invoked when interpreting h-field swings as genuine capability emphasis changes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual (h-field) ... r=+0.72, p<10^{-6}
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Per-lab coupling slopes vary 5× (Google 1.15 vs. DeepSeek 0.23)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.