Is there "Secret Sauce" in Large Language Model Development?

Matthias Mertens; Natalia Fischl-Lanzoni; Neil Thompson

arxiv: 2602.07238 · v2 · submitted 2026-02-06 · cs.AI · cs.LG · econ.GN · q-fin.EC

Pith reviewed 2026-05-16 06:21 UTC · model grok-4.3

classification: cs.AI · cs.LG · econ.GN · q-fin.EC

keywords: large language models, scaling laws, compute efficiency, AI development, proprietary advantages, model performance, frontier models, regression analysis

The pith

At the LLM frontier, training compute accounts for 80-90% of performance differences between models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether leading developers hold proprietary secret advantages in large language models or whether performance stems mainly from scaling training compute. It analyzes training and benchmark data across 809 models released from 2022 to 2025 using scaling-law regressions that include release-date and developer fixed effects. The results show clear developer-specific efficiency differences, yet these matter far less at the highest performance levels. At the frontier, 80-90% of gaps trace directly to compute differences, while below the frontier proprietary techniques and shared progress allow reaching the same capabilities with substantially less compute. This distinction matters for understanding how quickly new capabilities spread and which organizations can lead.

Core claim

The analysis establishes that, although some developers achieve systematic efficiency advantages and efficiency varies sharply even within the same company, the dominant driver at the performance frontier remains training compute. Regressions attribute 80-90% of frontier performance differences to compute levels once release timing and developer identity are controlled for. Lower in the distribution, algorithmic progress and proprietary methods reduce the compute required to hit fixed capability thresholds, enabling some firms to produce smaller yet competitive models.

What carries the argument

Scaling-law regressions with release-date and developer fixed effects that isolate compute contributions from proprietary and temporal factors.
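A minimal sketch of what such a regression could look like, assuming a linear relation between benchmark score and log training compute plus dummy variables for developer and release year. The data below are synthetic, and the helper names (`one_hot`, `fit_scaling_law`) and all effect sizes are invented for illustration; this is not the authors' code or their exact specification.

```python
import numpy as np

def one_hot(labels):
    """Dummy columns for every level except the first (the baseline).

    Assumes at least two distinct levels are present.
    """
    levels = sorted(set(labels))
    return np.column_stack([[1.0 if x == lv else 0.0 for x in labels]
                            for lv in levels[1:]])

def fit_scaling_law(log_compute, developer, release_year, score):
    """OLS of score on log compute plus developer and release-year dummies."""
    log_compute = np.asarray(log_compute, dtype=float)
    X = np.column_stack([np.ones(len(log_compute)), log_compute,
                         one_hot(developer), one_hot(release_year)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(score, dtype=float), rcond=None)
    return beta  # [intercept, compute slope, developer dummies, year dummies]

# Synthetic data with known effects (all numbers are made up).
rng = np.random.default_rng(0)
n = 60
developer = [["A", "B", "C"][i % 3] for i in range(n)]
release_year = [2022 + (i % 4) for i in range(n)]
log_compute = rng.uniform(21.0, 26.0, n)  # hypothetical log10 FLOP
dev_effect = {"A": 0.0, "B": 2.0, "C": -1.0}
year_effect = {2022: 0.0, 2023: 1.0, 2024: 2.0, 2025: 3.0}
score = (10.0 + 5.0 * log_compute
         + np.array([dev_effect[d] for d in developer])
         + np.array([year_effect[y] for y in release_year])
         + rng.normal(0.0, 0.1, n))

beta = fit_scaling_law(log_compute, developer, release_year, score)
slope = beta[1]
print(f"estimated compute slope: {slope:.2f}")
```

In this toy setup the developer dummies play the role of the "secret sauce" term: they shift the score a developer reaches at a given level of compute, while the slope captures the return to compute itself.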

If this is right

Frontier leadership depends primarily on access to large training compute rather than unique proprietary methods.
Below the frontier, shared algorithmic progress allows organizations to reach given capabilities with less compute.
Some companies demonstrate consistent efficiency advantages when producing smaller models.
High within-company efficiency variation implies that internal practices can produce more than 40x differences in compute requirements for similar performance.
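To make the 40x figure concrete, here is a short worked derivation. It assumes a log-linear relation between score and training compute with a model-specific efficiency term; the symbols $s$, $a_m$, $\beta$ and $C$ are illustrative and are not taken from the paper.

```latex
s = a_m + \beta \ln C
\;\Longrightarrow\;
C_m(s) = \exp\!\left(\frac{s - a_m}{\beta}\right),
\qquad
\frac{C_2(s)}{C_1(s)} = \exp\!\left(\frac{a_1 - a_2}{\beta}\right).
```

Under this assumed form, two models from the same firm differ by more than 40x in the compute needed to reach the same score $s$ whenever $a_1 - a_2 > \beta \ln 40 \approx 3.69\,\beta$.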

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pattern suggests open efforts could match frontier performance by securing equivalent compute rather than discovering hidden techniques.
Policy attention may shift toward compute access and hardware distribution as the key levers for capability spread.
The result invites direct tests on whether the same compute dominance holds for models trained after the 2025 cutoff or in new domains.

Load-bearing premise

That the fixed effects for release date and developer fully separate proprietary advantages from confounders such as data quality, hardware differences, or benchmark selection.

What would settle it

Re-estimating the same regressions on a fresh sample of models released after 2025 and checking whether the 80-90% compute share at the frontier remains stable.

Original abstract

Do leading LLM developers possess a proprietary "secret sauce", or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale, not proprietary technology, drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation of model efficiency within companies; a firm can train two models with more than 40x compute efficiency difference. We also discuss the implications for AI leadership and capability diffusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper analyzes training and benchmark data for 809 LLMs released between 2022 and 2025. It estimates scaling-law regressions that include release-date and developer fixed effects, finding developer-specific efficiency advantages whose importance varies by position in the performance distribution. At the frontier, 80-90% of performance differences are attributed to higher training compute rather than proprietary techniques, while away from the frontier proprietary methods and shared algorithmic progress allow smaller models to reach capability thresholds with less compute; substantial within-firm efficiency variation (more than 40x) is also reported.

Significance. If the central attribution holds after addressing potential confounders, the result would indicate that scaling dominates frontier progress and that proprietary advantages are more relevant for efficiency at lower performance levels. This has implications for understanding AI leadership, the feasibility of capability diffusion via compute access, and the scope for within-firm efficiency improvements.

major comments (3)

§4 (frontier analysis): The 80-90% attribution of performance differences to compute is obtained from regressions with developer and release-date fixed effects, but the specification omits controls for data volume, data quality, or token count; if these factors correlate positively with compute at the frontier, the compute coefficient absorbs their effects and the attribution is overstated.
Sample construction (methods section): The 809-model sample is restricted to public releases; systematic non-release of high-efficiency low-compute models would bias the frontier subsample toward compute-heavy observations, weakening the claim that scale, not proprietary technique, drives frontier advances.
Variance decomposition: The paper reports an 80-90% figure for the frontier but does not detail the exact decomposition method (partial R², coefficient scaling, or counterfactual simulation) or provide robustness checks to alternative functional forms of the scaling law or to the precise definition of the frontier subsample.
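The first major comment describes classic omitted-variable bias. The simulation below is a small sketch of the mechanism on synthetic data, assuming that log token count rises with log compute and has its own positive effect on score; all coefficients (2.0, 1.5, 0.8) are made up for illustration and say nothing about the paper's actual estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Standardized log compute, and log tokens that rise with it (assumed link).
log_compute = rng.normal(0.0, 1.0, n)
log_tokens = 0.8 * log_compute + rng.normal(0.0, 0.6, n)

# Assumed "true" model: both compute and tokens raise the score.
score = 2.0 * log_compute + 1.5 * log_tokens + rng.normal(0.0, 1.0, n)

def ols(columns, y):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

short_fit = ols([log_compute], score)             # tokens omitted
full_fit = ols([log_compute, log_tokens], score)  # tokens controlled for

print(f"compute slope, tokens omitted:  {short_fit[1]:.2f}")
print(f"compute slope, tokens included: {full_fit[1]:.2f}")
```

With tokens omitted, the compute slope converges to 2.0 + 1.5 × 0.8 = 3.2 rather than 2.0, so any share of performance attributed to compute from the short regression also includes the contribution of correlated data scaling.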

minor comments (2)

Abstract: Define 'frontier' and the exact performance metric used for the 80-90% claim more explicitly to allow readers to assess sensitivity to these choices.
Notation: Clarify the precise functional form of the scaling-law regressions (e.g., log-log, inclusion of interaction terms) and the benchmark normalization procedure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below, with revisions where feasible.

Referee: The 80-90% attribution of performance differences to compute is obtained from regressions with developer and release-date fixed effects, but the specification omits controls for data volume, data quality, or token count; if these factors correlate positively with compute at the frontier, the compute coefficient absorbs their effects and the attribution is overstated.

Authors: We agree that data volume, quality, and token counts are relevant and may be correlated with compute. Detailed per-model data on these factors are not consistently available in public sources, especially for frontier models. In the revision we add an explicit limitations paragraph in §4 acknowledging that the compute coefficient captures the joint effect of scale and associated data practices. We also report a robustness check on the subset of models with disclosed token counts, where the attribution remains in the 75-85% range. We now interpret the result as the combined contribution of compute and correlated data scaling rather than compute in isolation. revision: partial

Referee: The 809-model sample is restricted to public releases; systematic non-release of high-efficiency low-compute models would bias the frontier subsample toward compute-heavy observations, weakening the claim that scale, not proprietary technique, drives frontier advances.

Authors: This selection issue is inherent to any analysis of publicly observable models. Our claims are restricted to the population of released models that shape public benchmarks, competition, and capability diffusion. Non-released models are unobservable by definition, so we cannot correct for them directly. The revised methods section now states this scope explicitly and discusses the implication that our frontier results apply to observable advances rather than all possible internal experiments. revision: partial

Referee: The paper reports an 80-90% figure for the frontier but does not detail the exact decomposition method (partial R², coefficient scaling, or counterfactual simulation) or provide robustness checks to alternative functional forms of the scaling law or to the precise definition of the frontier subsample.

Authors: The 80-90% figure comes from a counterfactual simulation: predicted performance is computed under actual compute versus mean compute (holding the developer and release-date fixed effects constant), and the ratio of explained variation to total variation is taken. The revised §4 now includes the exact formula, the definition of the frontier subsample (top decile by benchmark score), and robustness tables using (i) alternative scaling-law specifications (log-linear and power-law with estimated exponents) and (ii) alternative frontier thresholds (80th and 90th percentiles). The reported range remains 75-92% across these checks. revision: yes
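One possible reading of the counterfactual ratio described in this response can be sketched as follows. The exact formula is not given in this text, so the function below is an assumption: it compares the variance of the score gap predicted from moving each model from mean compute to its actual compute against the variance of observed scores. The name `compute_share` and the toy numbers are invented for illustration.

```python
import numpy as np

def compute_share(score, log_compute, slope):
    """Share of score variation attributed to compute (assumed definition).

    compute_part is the predicted score change from moving a model from
    mean log compute to its actual log compute, other terms held fixed.
    """
    score = np.asarray(score, dtype=float)
    log_compute = np.asarray(log_compute, dtype=float)
    compute_part = slope * (log_compute - log_compute.mean())
    return compute_part.var() / score.var()

# Toy example: the non-compute component is uncorrelated with compute.
log_compute = np.array([24.0, 24.0, 26.0, 26.0])
other = np.array([-1.0, 1.0, -1.0, 1.0])  # e.g. developer-specific shifts
slope = 3.0
score = slope * log_compute + other

share = compute_share(score, log_compute, slope)
print(f"compute share of score variation: {share:.2f}")
```

When the non-compute component is correlated with compute, a simple variance ratio like this no longer splits cleanly into additive shares, which is one reason the choice of decomposition method matters.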

Circularity Check

0 steps flagged

No circularity: empirical variance decomposition from external model data

The paper estimates scaling-law regressions of the form performance ~ log(compute) + release-date fixed effects + developer fixed effects on an observed sample of 809 models. The 80-90% attribution at the frontier is obtained by applying the fitted coefficients to decompose observed performance gaps in the data; it is not obtained by defining performance in terms of compute, by renaming a fitted parameter as a prediction, or by any self-citation chain that substitutes for independent evidence. The derivation therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction.
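Written out, the regression form described above would look roughly like this. The notation is an assumption made for readability: $C_i$ is training compute of model $i$, $\delta_{d(i)}$ its developer fixed effect, and $\tau_{t(i)}$ its release-date fixed effect; the paper's own symbols and any additional terms may differ.

```latex
\text{score}_i = \alpha + \beta \log C_i + \delta_{d(i)} + \tau_{t(i)} + \varepsilon_i
```

Benchmark score appears only on the left-hand side and compute only as a regressor, which is why a decomposition based on the fitted $\beta$ is an empirical result rather than a definition.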

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The analysis rests on standard econometric assumptions for fixed-effects regressions and the applicability of scaling-law functional forms to the collected model data.

free parameters (1)

scaling law parameters
Exponents and intercepts in the scaling-law regressions are estimated from the 809-model dataset.

axioms (1)

domain assumption: Linear regression assumptions hold after including release-date and developer fixed effects.
Invoked to interpret coefficients as isolating compute effects and developer advantages.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We estimate scaling-law regressions with release-date and developer fixed effects... 80-90% of performance differences are explained by higher training compute"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Shapley Variance Decomposition... Scaling... Company Secret Sauce"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Validity Threats for Foundation Model Research
cs.LG 2026-06 accept novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.