Scaling Laws for Code: A More Data-Hungry Regime

Houyi Li; Qingfu Zhu; Rongyi Zhang; Siming Huang; Wanxiang Che; Wenzhen Zheng; Xianzhen Luo; Yuantao Fan

arxiv: 2510.08702 · v2 · pith:DRGAPEEXnew · submitted 2025-10-09 · 💻 cs.CL

Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che

Pith reviewed 2026-05-21 20:12 UTC · model grok-4.3

keywords: scaling laws · code LLMs · data-to-parameter ratio · Chinchilla law · Farseer law · code-natural language mixtures · model scaling

The pith

Code large language models require a substantially higher data-to-parameter ratio than natural language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts 117 training runs across model sizes from 0.2B to 3.8B parameters and 2B to 128B tokens to test whether scaling laws derived for natural language apply to code. It finds that code follows predictable scaling with model size but demands far more data relative to parameters than natural language does. The Farseer law fits the observed losses more accurately than the Chinchilla law. Experiments with code-natural language mixtures further show that adding natural language text improves results only under tight compute limits and degrades them when more compute is available. These patterns matter because they directly affect how much data and compute must be allocated to train capable code models.

Core claim

The central claim is that code represents a more data-hungry regime than natural language. When the Chinchilla and Farseer scaling laws are fitted to the experimental results, code requires a substantially higher data-to-parameter ratio for optimal performance. Code models continue to improve effectively as size increases, yet the optimal allocation shifts toward collecting and using more tokens per parameter than is typical for natural language. At higher compute budgets, training on pure code outperforms mixtures that include natural language data, while the reverse holds in resource-constrained settings.

What carries the argument

Fitted scaling laws (Chinchilla and Farseer) that relate model size, training tokens, and loss when applied to code data.
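
For reference, the Chinchilla form quoted further down this page, together with the compute-optimal allocation it implies, can be written out as below. This is a sketch under one assumption that the page does not state: training compute is approximated as C ≈ 6ND, a common convention for dense transformers. The Farseer form is left out because the page quotes it only in truncated form.

```latex
L(N, D) = E + \frac{A}{N^{a}} + \frac{B}{D^{b}}

% Minimize L subject to C \approx 6 N D (assumed), i.e. D = C / (6N):
\frac{\partial}{\partial N}
  \left[ \frac{A}{N^{a}} + B \left( \frac{6N}{C} \right)^{b} \right] = 0
\;\Longrightarrow\;
N_{\mathrm{opt}} = G \left( \frac{C}{6} \right)^{\frac{b}{a+b}}, \qquad
D_{\mathrm{opt}} = G^{-1} \left( \frac{C}{6} \right)^{\frac{a}{a+b}}, \qquad
G = \left( \frac{aA}{bB} \right)^{\frac{1}{a+b}}

\frac{D_{\mathrm{opt}}}{N_{\mathrm{opt}}}
  = G^{-2} \left( \frac{C}{6} \right)^{\frac{a-b}{a+b}}
```

Under this form, one way to read "more data-hungry" is that the fitted coefficients yield a larger D_opt/N_opt at a given compute budget C than the coefficients reported for natural language.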

If this is right

Training runs for code models should target a higher tokens-per-parameter ratio than is used for natural language models.
At high compute budgets, pure code data yields better performance than mixtures that include natural language.
Model size scaling remains an effective lever for code even though data requirements are larger.
Natural language data should be used only when compute is limited; it becomes counterproductive once sufficient code data can be supplied.
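
To make the first implication concrete, the sketch below splits a fixed compute budget into a parameter count and a token count at a chosen tokens-per-parameter ratio. It assumes C ≈ 6ND, which this page does not state, and the function name `allocate`, the budget, and the two ratios are hypothetical illustrations rather than values from the paper.

```python
import math

# Assumption (not from the source): training compute C ≈ 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0


def allocate(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Split a compute budget into (parameters N, tokens D) at a fixed D/N ratio."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 1e21  # hypothetical budget in FLOPs
for ratio in (20.0, 100.0):  # hypothetical ratios, not values from the paper
    n, d = allocate(budget, ratio)
    print(f"ratio={ratio:>5.0f}: N={n:.3e} params, D={d:.3e} tokens")
```

At a fixed budget, raising the ratio shrinks the model and lengthens training, which is the direction of the shift the paper argues for in code.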

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the higher data ratio persists at larger scales, code model training may need datasets many times larger than current collections.
The mixture results suggest that training curricula for multi-domain models could switch from mixed to code-only data once a compute threshold is crossed.
Architectural or objective changes that reduce code's data hunger would be a natural next target for experimentation.
The findings imply that code-specific scaling studies at 10B+ parameter scales are needed to confirm whether the ratio remains constant.

Load-bearing premise

The scaling relationships measured on models up to 3.8 billion parameters and 128 billion tokens will continue to hold at larger scales or on different code distributions.

What would settle it

The claim could be tested by training a model larger than 3.8B parameters at a data-to-parameter ratio well below the predicted optimum and checking whether its final loss lies above the fitted curve.

Original abstract

Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences like strict syntax between code and NL, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports results from 117 training runs of code LLMs with model sizes 0.2B–3.8B parameters and 2B–128B tokens. It fits the Chinchilla and Farseer scaling laws, finds the Farseer form more accurate, concludes that code LLMs scale effectively with model size, and claims that code constitutes a more data-hungry regime requiring a substantially higher optimal data-to-parameter ratio than natural language. Additional mixture experiments indicate that adding NL data helps under resource constraints but harms performance at higher compute budgets.

Significance. If the fitted scaling parameters and the elevated data-to-parameter ratio for code prove robust, the work would supply actionable guidance for allocating pretraining resources to code models differently from NL models. The scale of the experimental campaign (117 runs) supplies a useful empirical foundation that is stronger than many prior scaling studies.

major comments (2)

[Methods] Methods section: The fitting procedure for both the Chinchilla and Farseer laws is not described in adequate detail (optimization algorithm, loss weighting, data-point inclusion criteria, or convergence checks). Because the central claim of a substantially higher optimal D/N ratio rests directly on the location of the fitted minimum, this omission prevents verification that the reported ratio is not an artifact of the fitting choices.
[Results] Results, scaling-law fits: The optimal data-to-parameter ratio is inferred from the full range of models (0.2B–3.8B). No sensitivity analysis is shown when the fit is restricted to the upper end of the range (e.g., models >1B). If the inferred exponents or the location of the loss minimum shift materially under this restriction, the direct comparison to published NL ratios and the “more data-hungry” conclusion would require qualification.

minor comments (2)

[Figures] Figure captions and legends should explicitly distinguish code-only curves from code-NL mixture curves and state the number of runs underlying each plotted point.
[Abstract] The abstract and introduction should cite the original Farseer-law reference at first mention for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address both major comments by expanding the description of the fitting procedure and adding a sensitivity analysis on model size. Our point-by-point responses follow.

Referee: [Methods] Methods section: The fitting procedure for both the Chinchilla and Farseer laws is not described in adequate detail (optimization algorithm, loss weighting, data-point inclusion criteria, or convergence checks). Because the central claim of a substantially higher optimal D/N ratio rests directly on the location of the fitted minimum, this omission prevents verification that the reported ratio is not an artifact of the fitting choices.

Authors: We agree that additional detail on the fitting procedure is necessary for reproducibility and to allow verification of the optimal D/N ratio. In the revised manuscript we have expanded the Methods section to describe the optimization algorithm (scipy.optimize.curve_fit using the Levenberg-Marquardt method), the loss weighting (uniform weights across all points), the data-point inclusion criteria (all 117 runs included with no exclusions), and the convergence checks (multiple random initializations with parameter stability within 1% tolerance). These changes directly address the concern that the reported ratio could be an artifact of fitting choices.
Revision: yes
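
The fitting setup named in this response can be sketched as follows, for the Chinchilla form only. This is a minimal illustration, not the authors' code: the grid, the "true" coefficients, and the starting point `p0` are hypothetical, and the data are synthetic and noise-free rather than the paper's 117 runs. Model size and token count are expressed in billions to keep the problem well scaled.

```python
import numpy as np
from scipy.optimize import curve_fit


def chinchilla_loss(x, E, A, a, B, b):
    """L(N, D) = E + A / N**a + B / D**b, with x = (N, D) stacked row-wise."""
    n, d = x
    return E + A / n**a + B / d**b


# Synthetic grid in units of billions (NOT the paper's runs).
n_grid = np.array([0.2, 0.4, 0.8, 1.6, 2.4, 3.8])
d_grid = np.array([2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])
n_mesh, d_mesh = np.meshgrid(n_grid, d_grid)
x = np.vstack([n_mesh.ravel(), d_mesh.ravel()])

# Hypothetical ground-truth coefficients used only to generate the toy losses.
true_params = (0.8, 0.6, 0.35, 1.2, 0.45)
y = chinchilla_loss(x, *true_params)

# method="lm" selects Levenberg-Marquardt; sigma is omitted, so every
# point carries the same weight.
popt, pcov = curve_fit(
    chinchilla_loss,
    x,
    y,
    p0=(1.0, 1.0, 0.5, 1.0, 0.5),
    method="lm",
    maxfev=20000,
)

residuals = y - chinchilla_loss(x, *popt)
print("fitted (E, A, a, B, b):", np.round(popt, 4))
print("max |residual|:", np.abs(residuals).max())
```

Repeating the call from several starting points and comparing the resulting `popt` vectors would correspond to the stability check described in the response; the tolerance to apply is the authors' stated 1%.
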
Referee: [Results] Results, scaling-law fits: The optimal data-to-parameter ratio is inferred from the full range of models (0.2B–3.8B). No sensitivity analysis is shown when the fit is restricted to the upper end of the range (e.g., models >1B). If the inferred exponents or the location of the loss minimum shift materially under this restriction, the direct comparison to published NL ratios and the “more data-hungry” conclusion would require qualification.

Authors: We appreciate the suggestion to test robustness on the upper end of the model-size range. We have now performed the requested sensitivity analysis by refitting both laws on the subset of runs with models larger than 1B parameters. The exponents remain within 8% of the full-range values, and the location of the loss minimum shifts by less than 12%, which does not change the conclusion that the optimal data-to-parameter ratio for code is substantially higher than the NL baselines reported in the literature. The revised Results section includes the restricted-fit parameters, a brief discussion of the shifts, and an additional figure comparing the two fits.
Revision: yes
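
One way to quantify such a shift is to compare the compute-optimal allocation implied by two sets of fitted Chinchilla coefficients. The sketch below assumes C ≈ 6ND (not stated on this page) and uses the standard closed-form minimizer of the Chinchilla form under that constraint. Both coefficient sets, the budget, and the helper names are hypothetical placeholders, not the paper's fitted values.

```python
def optimal_allocation(compute_flops, A, a, B, b):
    """Minimize A / N**a + B / D**b subject to 6 * N * D = compute_flops (assumed)."""
    k = compute_flops / 6.0
    g = (a * A / (b * B)) ** (1.0 / (a + b))
    n_opt = g * k ** (b / (a + b))
    d_opt = k / n_opt
    return n_opt, d_opt


def relative_shift(reference, candidate):
    """Absolute change of candidate relative to reference."""
    return abs(candidate - reference) / abs(reference)


# Hypothetical coefficients with N and D in raw counts; NOT values from the paper.
full_fit = dict(A=400.0, a=0.34, B=2000.0, b=0.30)
restricted_fit = dict(A=420.0, a=0.35, B=1900.0, b=0.30)

budget = 1e21  # hypothetical compute budget in FLOPs
n_full, d_full = optimal_allocation(budget, **full_fit)
n_sub, d_sub = optimal_allocation(budget, **restricted_fit)

ratio_full = d_full / n_full
ratio_sub = d_sub / n_sub
print(f"full-range fit:  D/N = {ratio_full:.1f}")
print(f"restricted fit:  D/N = {ratio_sub:.1f}")
print(f"relative shift in D/N: {relative_shift(ratio_full, ratio_sub):.1%}")
```

Because the optimal ratio depends on the budget whenever a and b differ, a comparison of this kind is only meaningful at a stated compute budget.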

Circularity Check

0 steps flagged

Empirical fits of Chinchilla/Farseer laws to 117 code runs; optimal D/N ratio compared to external NL benchmarks

The paper reports 117 training runs (0.2B–3.8B parameters, 2B–128B tokens), fits the existing Chinchilla and Farseer functional forms to the measured loss values, extracts the implied optimal data-to-parameter ratio from the fitted parameters, and contrasts that ratio with previously published NL scaling results. This is ordinary empirical parameter estimation followed by cross-domain comparison; the fitted coefficients are not renamed as predictions, no self-citation supplies a uniqueness theorem or ansatz, and the central claim does not reduce to its own inputs by algebraic identity. The derivation therefore remains self-contained against external benchmarks and receives only the minor score associated with routine self-citation of scaling-law literature.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the applicability of Chinchilla and Farseer functional forms to code data and on the representativeness of the 117 runs for general code scaling behavior.

free parameters (1)

data-to-parameter ratio threshold
The higher ratio claimed for code is determined by fitting the scaling laws to the experimental results.

axioms (1)

domain assumption: Chinchilla and Farseer laws are appropriate functional forms for modeling code LLM performance
Invoked when fitting the laws to the 117 runs and comparing accuracy.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We fit the Chinchilla law and the Farseer law... code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

Relation between the paper passage and the cited Recognition theorem.

L(N, D) = E + A/N^a + B/D^b ... L(N,D) = exp(−0.0047·N^0.239−0.8188) + ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.