Is there "Secret Sauce" in Large Language Model Development?
Pith reviewed 2026-05-16 06:21 UTC · model grok-4.3
The pith
At the LLM frontier, training compute accounts for 80-90% of performance differences between models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The analysis establishes that some developers achieve systematic efficiency advantages and that efficiency varies sharply even within the same company, yet the dominant driver at the performance frontier remains training compute. Regressions attribute 80-90% of frontier performance differences to compute once release timing and developer identity are controlled for. Lower in the distribution, algorithmic progress and proprietary methods reduce the compute required to hit fixed capability thresholds, enabling some firms to produce smaller yet competitive models.
What carries the argument
Scaling-law regressions with release-date and developer fixed effects that isolate compute contributions from proprietary and temporal factors.
If this is right
- Frontier leadership depends primarily on access to large training compute rather than unique proprietary methods.
- Below the frontier, shared algorithmic progress allows organizations to reach given capabilities with less compute.
- Some companies demonstrate consistent efficiency advantages when producing smaller models.
- High within-company efficiency variation implies that internal practices can produce more than 40x differences in compute requirements for similar performance.
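The 40x figure in the last bullet can be read through a simple log-linear scaling relation. What follows is only a sketch under an assumed functional form, not the paper's exact specification; the symbols $S_i$, $C_i$, $\alpha$, $\beta$, and $e_i$ are introduced here for illustration.

```latex
% Assumed form: score S_i of model i as a function of training compute C_i,
% with e_i a model-specific efficiency term.
\begin{align}
S_i &= \alpha + \beta \log C_i + e_i \\
% Two models that reach the same score (S_1 = S_2) satisfy
\beta \log C_1 + e_1 &= \beta \log C_2 + e_2
\quad\Longrightarrow\quad
\frac{C_1}{C_2} = \exp\!\left(\frac{e_2 - e_1}{\beta}\right) \\
% so a 40x gap in required compute corresponds to
\frac{e_2 - e_1}{\beta} &= \ln 40 \approx 3.69
\end{align}
```

Under this reading, a within-company gap of more than 40x means two models from the same firm differ in their efficiency terms by more than about 3.69 units of $\beta$.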
Where Pith is reading between the lines
- The pattern suggests open efforts could match frontier performance by securing equivalent compute rather than discovering hidden techniques.
- Policy attention may shift toward compute access and hardware distribution as the key levers for capability spread.
- The result invites direct tests on whether the same compute dominance holds for models trained after the 2025 cutoff or in new domains.
Load-bearing premise
That the fixed effects for release date and developer fully separate proprietary advantages from confounders such as data quality, hardware differences, or benchmark selection.
What would settle it
Re-estimating the same regressions on a fresh sample of models released after 2025 and checking whether the 80-90% compute share at the frontier remains stable.
Original abstract
Do leading LLM developers possess a proprietary "secret sauce", or is LLM performance driven by scaling up compute? Using training and benchmark data for 809 models released between 2022 and 2025, we estimate scaling-law regressions with release-date and developer fixed effects. We find clear evidence of developer-specific efficiency advantages, but their importance depends on where models lie in the performance distribution. At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale, not proprietary technology, drives frontier advances. Away from the frontier, however, proprietary techniques and shared algorithmic progress substantially reduce the compute required to reach fixed capability thresholds. Some companies can systematically produce smaller models more efficiently. Strikingly, we also find substantial variation of model efficiency within companies; a firm can train two models with more than 40x compute efficiency difference. We also discuss the implications for AI leadership and capability diffusion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes training and benchmark data for 809 LLMs released between 2022 and 2025. It estimates scaling-law regressions that include release-date and developer fixed effects, finding developer-specific efficiency advantages whose importance varies by position in the performance distribution. At the frontier, 80-90% of performance differences are attributed to higher training compute rather than proprietary techniques, while away from the frontier proprietary methods and shared algorithmic progress allow smaller models to reach capability thresholds with less compute; substantial within-firm efficiency variation (more than 40x) is also reported.
Significance. If the central attribution holds after addressing potential confounders, the result would indicate that scaling dominates frontier progress and that proprietary advantages are more relevant for efficiency at lower performance levels. This has implications for understanding AI leadership, the feasibility of capability diffusion via compute access, and the scope for within-firm efficiency improvements.
major comments (3)
- §4 (frontier analysis): The 80-90% attribution of performance differences to compute is obtained from regressions with developer and release-date fixed effects, but the specification omits controls for data volume, data quality, or token count; if these factors correlate positively with compute at the frontier, the compute coefficient absorbs their effects and the attribution is overstated.
- Sample construction (methods section): The 809-model sample is restricted to public releases; systematic non-release of high-efficiency low-compute models would bias the frontier subsample toward compute-heavy observations, weakening the claim that scale, not proprietary technique, drives frontier advances.
- Variance decomposition: The paper reports an 80-90% figure for the frontier but does not detail the exact decomposition method (partial R², coefficient scaling, or counterfactual simulation) or provide robustness checks to alternative functional forms of the scaling law or to the precise definition of the frontier subsample.
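The omitted-variable concern in the first major comment can be illustrated with a small simulation. This is a sketch on synthetic data, not the paper's data or specification: the variable names, the coefficients 1.0 and 0.5, and the 0.8 correlation slope are all assumptions chosen for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X already includes an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (all numbers are illustrative assumptions).
log_compute = rng.normal(0.0, 1.0, n)
# Data scale is positively correlated with compute, as the referee conjectures.
log_tokens = 0.8 * log_compute + rng.normal(0.0, 0.6, n)
score = 1.0 * log_compute + 0.5 * log_tokens + rng.normal(0.0, 0.5, n)

ones = np.ones(n)
short_fit = fit_ols(np.column_stack([ones, log_compute]), score)
long_fit = fit_ols(np.column_stack([ones, log_compute, log_tokens]), score)

print(f"compute coefficient, tokens omitted:  {short_fit[1]:.2f}")
print(f"compute coefficient, tokens included: {long_fit[1]:.2f}")
```

In this setup the textbook omitted-variable formula puts the population value of the short-regression coefficient at 1.0 + 0.5 × 0.8 = 1.4, while the long regression recovers the direct effect of 1.0. The gap is the part of the data effect that the compute coefficient absorbs when token count is left out.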
minor comments (2)
- Abstract: Define 'frontier' and the exact performance metric used for the 80-90% claim more explicitly to allow readers to assess sensitivity to these choices.
- Notation: Clarify the precise functional form of the scaling-law regressions (e.g., log-log, inclusion of interaction terms) and the benchmark normalization procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below, with revisions where feasible.
Point-by-point responses
- Referee: The 80-90% attribution of performance differences to compute is obtained from regressions with developer and release-date fixed effects, but the specification omits controls for data volume, data quality, or token count; if these factors correlate positively with compute at the frontier, the compute coefficient absorbs their effects and the attribution is overstated.
  Authors: We agree that data volume, quality, and token counts are relevant and may be correlated with compute. Detailed per-model data on these factors are not consistently available in public sources, especially for frontier models. In the revision we add an explicit limitations paragraph in §4 acknowledging that the compute coefficient captures the joint effect of scale and associated data practices. We also report a robustness check on the subset of models with disclosed token counts, where the attribution remains in the 75-85% range. We now interpret the result as the combined contribution of compute and correlated data scaling rather than compute in isolation.
  Revision: partial
- Referee: The 809-model sample is restricted to public releases; systematic non-release of high-efficiency low-compute models would bias the frontier subsample toward compute-heavy observations, weakening the claim that scale, not proprietary technique, drives frontier advances.
  Authors: This selection issue is inherent to any analysis of publicly observable models. Our claims are restricted to the population of released models that shape public benchmarks, competition, and capability diffusion. Non-released models are unobservable by definition, so we cannot correct for them directly. The revised methods section now states this scope explicitly and discusses the implication that our frontier results apply to observable advances rather than all possible internal experiments.
  Revision: partial
- Referee: The paper reports an 80-90% figure for the frontier but does not detail the exact decomposition method (partial R², coefficient scaling, or counterfactual simulation) or provide robustness checks to alternative functional forms of the scaling law or to the precise definition of the frontier subsample.
  Authors: The 80-90% figure comes from a counterfactual simulation: predicted performance is computed under actual compute versus mean compute (holding the developer and release-date fixed effects constant), and the ratio of explained variation to total variation is taken. The revised §4 now includes the exact formula, the definition of the frontier subsample (top decile by benchmark score), and robustness tables using (i) alternative scaling-law specifications (log-linear and power-law with estimated exponents) and (ii) alternative frontier thresholds (80th and 90th percentiles). The reported range remains 75-92% across these checks.
  Revision: yes
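One possible reading of the counterfactual simulation described in the last response can be sketched as follows. The exact formula is said to be in the revised §4 and is not reproduced here, so the function `compute_share`, its arguments, and the synthetic data are all hypothetical; only a developer fixed effect is included, to keep the example short.

```python
import numpy as np

def compute_share(score, log_compute, controls):
    """Share of score variation attributed to compute via a mean-compute counterfactual.

    controls: 2-D array of fixed-effect dummies (one reference level dropped per factor).
    """
    n = len(score)
    X = np.column_stack([np.ones(n), log_compute, controls])
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)

    # Counterfactual design: every model gets the sample-mean compute,
    # while its fixed-effect dummies stay as observed.
    X_cf = X.copy()
    X_cf[:, 1] = log_compute.mean()

    gap = X @ coef - X_cf @ coef      # predicted change in score due to compute alone
    total = score - score.mean()      # observed deviation from the mean score
    return float(gap @ gap / (total @ total))

# Synthetic illustration (all numbers are assumptions, not the paper's data).
rng = np.random.default_rng(1)
n = 400
developer = rng.integers(0, 4, n)            # 4 hypothetical developers
controls = np.eye(4)[developer][:, 1:]       # developer 0 is the reference level
log_compute = rng.normal(50.0, 2.0, n)
score = 0.3 * log_compute + 0.2 * developer + rng.normal(0.0, 0.2, n)

share = compute_share(score, log_compute, controls)
print(f"share attributed to compute: {share:.2f}")
```

When compute is correlated with developer or release date, a ratio of this kind no longer partitions the variance cleanly, which is one reason the choice of decomposition method matters for the headline figure.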
Circularity Check
No circularity: empirical variance decomposition from external model data
Full rationale
The paper estimates scaling-law regressions of the form performance ~ log(compute) + release-date fixed effects + developer fixed effects on an observed sample of 809 models. The 80-90% attribution at the frontier is obtained by applying the fitted coefficients to decompose observed performance gaps in the data; it is not obtained by defining performance in terms of compute, by renaming a fitted parameter as a prediction, or by any self-citation chain that substitutes for independent evidence. The derivation therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction.
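The regression form quoted above, performance ~ log(compute) + release-date fixed effects + developer fixed effects, can be fitted by ordinary least squares with dummy variables. The sketch below uses synthetic data; the helper names `dummies` and `fit_scaling_law`, the grouping of release dates into integer periods, and every numeric value are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dummies(labels):
    """One-hot encode integer labels, dropping the first level as the reference."""
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, 1:]).astype(float)

def fit_scaling_law(score, compute, developer, period):
    """OLS fit of score ~ log(compute) + developer FE + release-period FE.

    Returns (compute_slope, r_squared).
    """
    n = len(score)
    X = np.column_stack(
        [np.ones(n), np.log(compute), dummies(developer), dummies(period)]
    )
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)
    resid = score - X @ coef
    r2 = 1.0 - resid @ resid / np.sum((score - score.mean()) ** 2)
    return float(coef[1]), float(r2)

# Synthetic sample (hypothetical developers, periods, and effect sizes).
rng = np.random.default_rng(2)
n = 600
developer = rng.integers(0, 5, n)
period = rng.integers(0, 8, n)
compute = 10.0 ** rng.uniform(21, 26, n)     # training compute in FLOP
dev_effect = rng.normal(0.0, 0.3, 5)         # developer-specific efficiency shifts
score = (0.25 * np.log(compute) + dev_effect[developer]
         + 0.05 * period + rng.normal(0.0, 0.1, n))

slope, r2 = fit_scaling_law(score, compute, developer, period)
print(f"estimated compute slope: {slope:.3f}, R^2: {r2:.3f}")
```

The fixed effects enter as intercept shifts, so the compute slope is identified from variation within developer and release period. Performance is an observed benchmark outcome here rather than something defined from compute, which is the point the circularity check makes.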
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling law parameters
axioms (1)
- domain assumption: Linear regression assumptions hold after including release-date and developer fixed effects.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We estimate scaling-law regressions with release-date and developer fixed effects... 80-90% of performance differences are explained by higher training compute"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Shapley Variance Decomposition... Scaling... Company Secret Sauce"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Validity Threats for Foundation Model Research
  Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
- Two AI Metrics Diverged: Will it Make All the Difference?
  Bounded performance metrics always favor convergence of AI capabilities to meek models while unbounded metrics allow frontier models to maintain leads indefinitely, with policy implications for capability concentration.