Scaling Laws for Code: A More Data-Hungry Regime
Pith reviewed 2026-05-21 20:12 UTC · model grok-4.3
The pith
Code large language models require a substantially higher data-to-parameter ratio than natural language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that code represents a more data-hungry regime than natural language. When the Chinchilla and Farseer scaling laws are fitted to the experimental results, code requires a substantially higher data-to-parameter ratio for optimal performance. Code models continue to improve effectively as size increases, yet the optimal allocation shifts toward collecting and using more tokens per parameter than is typical for natural language. At higher compute budgets, training on pure code outperforms mixtures that include natural language data, while the reverse holds in resource-constrained settings.
What carries the argument
Fitted scaling laws (Chinchilla and Farseer) that relate model size, training tokens, and loss when applied to code data.
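As a rough illustration of how a fitted law of this kind yields a data-to-parameter ratio, the sketch below uses the Chinchilla form L(N, D) = E + A/N^a + B/D^b quoted later on this page. It assumes the common approximation that training compute is C = 6·N·D, which this page does not state, and it uses placeholder coefficients rather than the paper's fitted values. The function names are hypothetical.

```python
import math


def chinchilla_loss(n_params, n_tokens, E, A, a, B, b):
    """Chinchilla-form loss L(N, D) = E + A / N**a + B / D**b."""
    return E + A / n_params**a + B / n_tokens**b


def compute_optimal(flops, A, a, B, b):
    """Minimise the loss subject to flops = 6 * N * D (assumed cost model).

    Setting the derivative to zero along the constraint gives
    a * A / N**a = b * B / D**b, which has the closed form below.
    """
    k = flops / 6.0
    G = (a * A / (b * B)) ** (1.0 / (a + b))
    n_opt = G * k ** (b / (a + b))
    d_opt = k ** (a / (a + b)) / G
    return n_opt, d_opt


# Placeholder coefficients for illustration only; NOT the paper's fitted values.
E = 1.7
COEFFS = dict(A=400.0, a=0.34, B=410.0, b=0.28)

for flops in (1e20, 1e21, 1e22):
    n_opt, d_opt = compute_optimal(flops, **COEFFS)
    loss = chinchilla_loss(n_opt, d_opt, E, **COEFFS)
    print(f"C={flops:.0e}  N*={n_opt:.3e}  D*={d_opt:.3e}  "
          f"D*/N*={d_opt / n_opt:.1f}  L*={loss:.3f}")
```

The printed D*/N* column is the tokens-per-parameter ratio the fit implies at each budget. A "more data-hungry" regime corresponds to coefficients that push this ratio higher than the values reported for natural language.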
If this is right
- Training runs for code models should target a higher tokens-per-parameter ratio than is used for natural language models.
- At high compute budgets, pure code data yields better performance than mixtures that include natural language.
- Model size scaling remains an effective lever for code even though data requirements are larger.
- Natural language data should be used only when compute is limited; it becomes counterproductive once sufficient code data can be supplied.
Where Pith is reading between the lines
- If the higher data ratio persists at larger scales, code model training may need datasets many times larger than current collections.
- The mixture results suggest that training curricula for multi-domain models could switch from mixed to code-only data once a compute threshold is crossed.
- Architectural or objective changes that reduce code's data hunger would be a natural next target for experimentation.
- The findings imply that code-specific scaling studies at 10B+ parameter scales are needed to confirm whether the ratio remains constant.
Load-bearing premise
The scaling relationships measured on models up to 3.8 billion parameters and 128 billion tokens will continue to hold at larger scales or on different code distributions.
What would settle it
Training a model larger than 3.8B parameters at a data-to-parameter ratio well below the predicted optimum, and measuring whether its final loss deviates upward from the fitted curve, would test the claim.
Original abstract
Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from 117 training runs of code LLMs with model sizes 0.2B–3.8B parameters and 2B–128B tokens. It fits the Chinchilla and Farseer scaling laws, finds the Farseer form more accurate, concludes that code LLMs scale effectively with model size, and claims that code constitutes a more data-hungry regime requiring a substantially higher optimal data-to-parameter ratio than natural language. Additional mixture experiments indicate that adding NL data helps under resource constraints but harms performance at higher compute budgets.
Significance. If the fitted scaling parameters and the elevated data-to-parameter ratio for code prove robust, the work would supply actionable guidance for allocating pretraining resources to code models differently from NL models. The scale of the experimental campaign (117 runs) supplies a useful empirical foundation that is stronger than many prior scaling studies.
major comments (2)
- [Methods] Methods section: The fitting procedure for both the Chinchilla and Farseer laws is not described in adequate detail (optimization algorithm, loss weighting, data-point inclusion criteria, or convergence checks). Because the central claim of a substantially higher optimal D/N ratio rests directly on the location of the fitted minimum, this omission prevents verification that the reported ratio is not an artifact of the fitting choices.
- [Results] Results, scaling-law fits: The optimal data-to-parameter ratio is inferred from the full range of models (0.2B–3.8B). No sensitivity analysis is shown when the fit is restricted to the upper end of the range (e.g., models >1B). If the inferred exponents or the location of the loss minimum shift materially under this restriction, the direct comparison to published NL ratios and the “more data-hungry” conclusion would require qualification.
minor comments (2)
- [Figures] Figure captions and legends should explicitly distinguish code-only curves from code-NL mixture curves and state the number of runs underlying each plotted point.
- [Abstract] The abstract and introduction should cite the original Farseer-law reference at first mention for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address both major comments by expanding the description of the fitting procedure and adding a sensitivity analysis on model size. Our point-by-point responses follow.
- Referee: [Methods] Methods section: The fitting procedure for both the Chinchilla and Farseer laws is not described in adequate detail (optimization algorithm, loss weighting, data-point inclusion criteria, or convergence checks). Because the central claim of a substantially higher optimal D/N ratio rests directly on the location of the fitted minimum, this omission prevents verification that the reported ratio is not an artifact of the fitting choices.
Authors: We agree that additional detail on the fitting procedure is necessary for reproducibility and to allow verification of the optimal D/N ratio. In the revised manuscript we have expanded the Methods section to describe the optimization algorithm (scipy.optimize.curve_fit using the Levenberg-Marquardt method), the loss weighting (uniform weights across all points), the data-point inclusion criteria (all 117 runs included with no exclusions), and the convergence checks (multiple random initializations with parameter stability within 1% tolerance). These changes directly address the concern that the reported ratio could be an artifact of fitting choices.
Revision: yes
- Referee: [Results] Results, scaling-law fits: The optimal data-to-parameter ratio is inferred from the full range of models (0.2B–3.8B). No sensitivity analysis is shown when the fit is restricted to the upper end of the range (e.g., models >1B). If the inferred exponents or the location of the loss minimum shift materially under this restriction, the direct comparison to published NL ratios and the “more data-hungry” conclusion would require qualification.
Authors: We appreciate the suggestion to test robustness on the upper end of the model-size range. We have now performed the requested sensitivity analysis by refitting both laws on the subset of runs with models larger than 1B parameters. The exponents remain within 8% of the full-range values, and the location of the loss minimum shifts by less than 12%, which does not change the conclusion that the optimal data-to-parameter ratio for code is substantially higher than the NL baselines reported in the literature. The revised Results section includes the restricted-fit parameters, a brief discussion of the shifts, and an additional figure comparing the two fits.
Revision: yes
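The procedure the rebuttal describes (an unweighted scipy.optimize.curve_fit run with the Levenberg-Marquardt method, several random initializations, and a refit restricted to models above 1B parameters) can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It fits only the Chinchilla form, uses synthetic noise-free losses generated from hypothetical coefficients, and expresses N and D in billions. The helper names, starting guess, and grids are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def chinchilla(x, E, A, a, B, b):
    """L(N, D) = E + A / N**a + B / D**b, with N and D in billions."""
    n, d = x
    return E + A / n**a + B / d**b


def fit(n, d, loss, p0):
    """Unweighted Levenberg-Marquardt fit; returns the parameter vector."""
    popt, _ = curve_fit(chinchilla, np.vstack([n, d]), loss, p0=p0,
                        method="lm", maxfev=20000)
    return popt


def fit_multi_start(n, d, loss, guess, n_starts=8, seed=0):
    """Refit from perturbed starts; return the best fit and every fit."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_starts):
        p0 = guess * rng.uniform(0.8, 1.2, size=guess.size)
        try:
            popt = fit(n, d, loss, p0)
        except RuntimeError:  # curve_fit raises this when LM fails to converge
            continue
        sse = float(np.sum((chinchilla((n, d), *popt) - loss) ** 2))
        fits.append((sse, popt))
    fits.sort(key=lambda t: t[0])
    return fits[0][1], np.array([p for _, p in fits])


# Hypothetical ground truth (E, A, a, B, b); NOT the paper's coefficients.
TRUE = np.array([1.0, 0.5, 0.35, 0.8, 0.30])
sizes = np.array([0.2, 0.4, 0.8, 1.2, 1.8, 2.6, 3.8])       # billions of params
tokens = np.array([2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])  # billions of tokens
n, d = (g.ravel() for g in np.meshgrid(sizes, tokens))
loss = chinchilla((n, d), *TRUE)  # synthetic and noise-free

# Full-range fit from a rough starting guess, repeated from several starts.
guess = np.array([1.2, 0.4, 0.3, 1.0, 0.35])
full, all_fits = fit_multi_start(n, d, loss, guess)
spread = np.ptp(all_fits, axis=0) / np.abs(full)  # stability across starts

# Sensitivity check: refit on models above 1B only, warm-started at the full fit.
big = n > 1.0
restricted = fit(n[big], d[big], loss[big], p0=full)
shift = np.abs(restricted - full) / np.abs(full)

print("full fit       :", np.round(full, 4))
print("start spread   :", np.round(spread, 4))
print("restricted fit :", np.round(restricted, 4))
print("relative shift :", np.round(shift, 4))
```

Because the synthetic losses contain no noise, the restricted fit should land almost exactly on the full-range fit. With real training losses, the narrower range of model sizes makes the exponents harder to pin down, which is why the size of the reported shift matters for the referee's concern.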
Circularity Check
Empirical fits of Chinchilla/Farseer laws to 117 code runs; optimal D/N ratio compared to external NL benchmarks
full rationale
The paper reports 117 training runs (0.2B–3.8B parameters, 2B–128B tokens), fits the existing Chinchilla and Farseer functional forms to the measured loss values, extracts the implied optimal data-to-parameter ratio from the fitted parameters, and contrasts that ratio with previously published NL scaling results. This is ordinary empirical parameter estimation followed by cross-domain comparison; the fitted coefficients are not renamed as predictions, no self-citation supplies a uniqueness theorem or ansatz, and the central claim does not reduce to its own inputs by algebraic identity. The derivation therefore remains self-contained against external benchmarks and receives only the minor score associated with routine self-citation of scaling-law literature.
Axiom & Free-Parameter Ledger
free parameters (1)
- data-to-parameter ratio threshold
axioms (1)
- domain assumption: Chinchilla and Farseer laws are appropriate functional forms for modeling code LLM performance
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We fit the Chinchilla law and the Farseer law... code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "L(N, D) = E + A/N^a + B/D^b ... L(N, D) = exp(−0.0047·N^0.239−0.8188) + ..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.