pith. machine review for the scientific record.

arxiv: 2605.01640 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.CL


Prescriptive Scaling Laws for Data Constrained Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:08 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords scaling laws · data constrained training · overfitting · compute allocation · weight decay · language models · pretraining

The pith

Excess loss from repeating data is captured by a simple penalty, changing optimal compute allocation from repetition to model capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the extra loss incurred when training on repeated data with an additive overfitting penalty that depends on a single coefficient. This extends standard scaling laws to regimes where high-quality data is scarce relative to available compute. A reader should care because it provides concrete guidance on whether to repeat data or scale up the model when tokens run out. If the model holds, following the law yields better-performing models: it avoids counterproductive repetition and motivates stronger regularization, such as weight decay, to lower the penalty. The approach also allows different training methods to be compared by how much they reduce this overfitting coefficient.

Core claim

We introduce a scaling law for data-constrained training by adding a simple overfitting penalty to the standard Chinchilla form that accounts for the excess loss from token repetition. This penalty term accurately fits observed model behavior across sizes and configurations. The resulting law predicts that after a certain point, increasing the number of repetitions raises loss more than it saves compute, so optimal allocation shifts to training larger models instead. We validate that configurations chosen according to this law outperform standard practice in data-limited settings, and that high weight decay lowers the penalty coefficient substantially.
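For concreteness, the figure captions (which reference a one-parameter Equation 6 and a four-parameter Equation 8, a coefficient P, repetition count R_D, and unique-token budget U_D) suggest a law along the following lines. This is a hedged reconstruction from those captions, not the paper's verbatim equation; the exponents δ and γ in particular are our guess at the four-parameter form.

```latex
% Hedged reconstruction, not the paper's stated equation.
L(N, D) \;=\;
  \underbrace{E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}}_{\text{Chinchilla form}}
  \;+\;
  \underbrace{P \, R_D^{\,\delta} \left(\frac{N}{U_D}\right)^{\gamma}}_{\text{additive overfitting penalty}}
% One-parameter variant (Eq. 6): fix delta = gamma = 1 and fit only P.
% Four-parameter variant (Eq. 8): fit the penalty exponents jointly with P.
```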

What carries the argument

An additive overfitting penalty with a single coefficient that isolates the effect of data repetition on excess loss.

If this is right

  • In data-constrained regimes, compute is better allocated to increasing model parameters rather than additional epochs on repeated data once a threshold is reached.
  • Following the recommended allocation from the scaling law yields lower final loss than repeating data maximally.
  • The overfitting penalty coefficient can be used to quantify and compare the effectiveness of regularization techniques such as weight decay.
  • Strong weight decay reduces the penalty by about 70%, explaining why larger decay values are optimal when data is limited.
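The allocation shift in the first two points can be sketched numerically. Everything below is illustrative: the penalty form P·R^δ·(N/U) is reconstructed from the figure captions, and the constants are invented rather than the paper's fitted values.

```python
# Illustrative sketch of the allocation shift. The penalty form
# P * R**delta * (N / U) is a reconstruction from the figure captions, and
# every constant here is invented for illustration, not fitted by the paper.

E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28  # assumed Chinchilla-style fit
delta = 1.5                                            # assumed repetition exponent

def loss(N, D, U, P):
    """Chinchilla-style loss plus an additive overfitting penalty."""
    R = max(D / U - 1.0, 0.0)                  # repetitions beyond the first epoch
    return E + A / N**alpha + B / D**beta + P * R**delta * (N / U)

def optimal_model_size(C, U, P):
    """Pick N on a coarse grid under a fixed compute budget C ~ 6*N*D."""
    grid = [2**k * 1e6 for k in range(4, 16)]  # 16M .. ~33B parameters
    return min(grid, key=lambda N: loss(N, C / (6 * N), U, P))

C, U = 1e18, 1e8                               # compute and unique-token budgets
n_free = optimal_model_size(C, U, P=0.0)       # repetition treated as free
n_taxed = optimal_model_size(C, U, P=0.02)     # repetition taxed by the penalty
# Taxing repetition shifts the optimum toward a larger model trained for
# fewer epochs, which is the qualitative reallocation the law predicts.
```

Only the direction of the shift, not its magnitude, is the point of the sketch; with the actual fitted constants the crossover would land elsewhere.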

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be extended to decide between repeating data and generating synthetic data when real data is exhausted.
  • It implies that monitoring the overfitting coefficient during training could guide dynamic adjustments to model size or regularization.
  • The law may apply to other domains like vision or reinforcement learning where data repetition is common.

Load-bearing premise

The excess loss from data repetition follows a simple additive form whose coefficient does not vary with model size or specific training setup.

What would settle it

Observe whether the measured excess loss on repeated datasets matches the predicted additive penalty for models of varying sizes without requiring adjustments to the coefficient for each size.
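A minimal version of that settling test, run here on synthetic data: fit the coefficient only on small-model configurations, then check its predictions on held-out larger models without refitting. The penalty form, grid, and constants are all hypothetical.

```python
# Synthetic held-out check of the single-coefficient penalty. The form
# P * R**delta * (N / U) and all numbers are hypothetical illustrations.

delta = 1.5            # assumed shared repetition exponent
true_P = 0.02          # ground-truth coefficient used to generate the data

def feature(N, R, U):
    """Regressor multiplying the coefficient in the assumed penalty form."""
    return R**delta * (N / U)

# "Observed" excess losses on a grid of (model size, repetitions, unique tokens);
# noiseless by construction, so an exact fit is recoverable.
configs = [(N, R, 1e8) for N in (1e8, 2.5e8, 5e8, 1e9) for R in (2, 4, 8, 16)]
observed = [true_P * feature(*c) for c in configs]

# Fit P by least squares (the model is linear in P) on small models only.
fit_pts = [(feature(*c), y) for c, y in zip(configs, observed) if c[0] <= 2.5e8]
P_hat = sum(x * y for x, y in fit_pts) / sum(x * x for x, _ in fit_pts)

# Predict excess loss on the held-out larger models without refitting.
held_out = [(c, y) for c, y in zip(configs, observed) if c[0] > 2.5e8]
rel_errors = [abs(P_hat * feature(*c) - y) / y for c, y in held_out]
# If the coefficient transfers across sizes, these errors stay small.
```

With real training curves the residuals would carry noise, so the pass criterion becomes "prediction error comparable to fit error" rather than exact recovery.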

Figures

Figures reproduced from arXiv: 2605.01640 by Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Shriya Sudhakar, Srivatsa Kundurthy.

Figure 1: Existing scaling laws fail to model overfitting. Predicted vs. observed validation loss under the Chinchilla baseline (treating repeated tokens as unique) and the Muennighoff et al. (2023) D′(U_D, R_D), N′(U_N, R_N) formulation across four model sizes at U_D = 200M. Both formulations fail to capture the loss increase at high repetition counts, systematically underpredicting loss as repetitions grow.
Figure 2: The cost of repeating data grows superlinearly. Residual between observed loss and the Chinchilla prediction (treating all repeated tokens as unique data) as a function of repetition count R_D, shown for three model sizes and three unique data budgets U_D. Dashed lines show power-law fits with a shared exponent δ across all configurations; the fitted δ > 1.0 indicates superlinear repetition damage.
Figure 3: Our additive overfitting law adapts to model size. Predicted vs. observed validation loss under different scaling laws.
Figure 4: Scaling law fit quality. Huber loss (left, lower is better) and R² (right, higher is better) for each scaling law formulation, evaluated on both our scaling sweep and the Muennighoff et al. (2023) data.
Figure 5: Compute-optimal allocation frontiers. (Left) Observed validation loss across our experimental grid (U_D = 100M), with contour lines showing the interpolated loss landscape. (Right) Compute-optimal allocation at C = 2×10^19 FLOPs and U_D = 500M, comparing the Chinchilla law, Muennighoff et al. (2023) D′, N′, and our law (Equation 8).
Figure 6: Strong weight decay incurs a single-epoch loss premium. (Left) Compute-optimal frontier: strong weight decay (λ = 1.0) achieves higher loss than the standard setting at every compute budget in the single-epoch regime. (Right) Compute-optimal allocation: strong weight decay favors larger models relative to data.
Figure 7: Strong weight decay improves robustness to data repetition. (Left) Fitted overfitting coefficient P for standard (λ = 0.1) and strong (λ = 1.0) weight decay; strong weight decay reduces P by approximately 70%. (Center, Right) Loss decomposition for a 250M-parameter model trained on 100M unique tokens across weight decay values.
Figure 8: Strong weight decay outperforms in high-compute regimes. (Left) Compute-optimal allocation frontiers for standard (λ = 0.1) and strong (λ = 1.0) weight decay at U_D = 250M. (Right) Compute-optimal loss as a function of compute budget. Strong weight decay incurs higher loss at low compute but crosses over at C ≈ 3.2×10^18 FLOPs.
Figure 9: Compute-optimal allocation reverses under refit Chinchilla base. Black lines: Chinchilla-optimal (no repetition). Solid lines: D′, N′-optimal. Yellow: published parameters from Muennighoff et al. (2023). Red: refit parameters. The published D′, N′ law allocates toward smaller models and more data than Chinchilla, while the refit law allocates toward larger models, reversing the original recommendation.
read the original abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ($\lambda=1.0$) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends the Chinchilla scaling law with a simple additive one-parameter overfitting penalty to model excess loss from data repetition in data-constrained regimes. It claims this form accurately describes observed behavior, yields qualitatively new compute-optimal advice (beyond a repetition threshold, allocate compute to model capacity rather than further repetition), demonstrates empirical performance gains from following the derived allocations, and shows that strong weight decay reduces the overfitting coefficient by ~70%, providing a scaling-law account for higher optimal regularization in such settings. The single coefficient is presented as enabling direct cross-configuration comparisons.

Significance. If the one-parameter overfitting term generalizes across scales, optimizers, and datasets, the work would offer a practical, interpretable tool for data-limited pretraining that shifts allocation strategy away from pure Chinchilla optima and supplies a quantitative explanation for recent empirical findings on weight decay. The simplicity of the form is a strength for enabling comparisons, though the prescriptive claims rest on the stability of the fitted coefficient.

major comments (2)
  1. [Abstract] The central prescriptive advice (stop repeating past a threshold and spend on capacity) and the empirical improvement claim depend on the overfitting coefficient generalizing sufficiently that the derived optimum does not shift materially. The abstract asserts that the form 'accurately describes model behavior' and enables cross-configuration comparison, but no explicit cross-validation, sensitivity analysis, or stability checks across model sizes/configurations/datasets are described; if the coefficient varies by more than ~20-30%, the repetition-vs-capacity crossover moves and the recommended configurations cease to be optimal.
  2. [Abstract] Validation of the scaling law appears to rely on fitting the single overfitting coefficient to the same data used to assess its descriptive accuracy, creating circularity that weakens the claim that the law 'accurately describes model behavior' independently of the fitting process. A held-out evaluation or out-of-distribution test of the coefficient's predictive power would be needed to support the prescriptive use.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the range of model sizes, datasets, and repetition factors used to fit and validate the law.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our extension of Chinchilla scaling laws to data-constrained regimes via a one-parameter overfitting penalty. The feedback highlights valid points about the stability of the coefficient and validation rigor, which we address directly below. We maintain that the simple additive form provides practical prescriptive value, but agree that additional analyses will strengthen the claims.

read point-by-point responses
  1. Referee: [Abstract] The central prescriptive advice (stop repeating past a threshold and spend on capacity) and the empirical improvement claim depend on the overfitting coefficient generalizing sufficiently that the derived optimum does not shift materially. The abstract asserts that the form 'accurately describes model behavior' and enables cross-configuration comparison, but no explicit cross-validation, sensitivity analysis, or stability checks across model sizes/configurations/datasets are described; if the coefficient varies by more than ~20-30%, the repetition-vs-capacity crossover moves and the recommended configurations cease to be optimal.

    Authors: We appreciate this emphasis on generalization. Our experiments already span model sizes from 100M to over 1B parameters and a range of repetition factors, with the fitted coefficient showing consistency within this regime. However, we agree that an explicit sensitivity analysis is warranted to support the prescriptive advice. In the revised manuscript, we will add a dedicated subsection that varies the overfitting coefficient by up to ±30% around the fitted value and recomputes the optimal allocation curves. This will demonstrate that the qualitative recommendation—shifting compute from repetition to capacity beyond a threshold—remains stable, thereby addressing the concern that small variations could invalidate the advice. revision: partial
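The proposed check can be sketched as follows, with an assumed penalty form and invented constants rather than the paper's fitted values; the question is whether the grid-optimal model size stays put as the coefficient is scaled across the ±30% band.

```python
# Illustrative ±30% sensitivity sweep on the overfitting coefficient.
# The penalty form P * R**delta * (N / U) and every constant below are
# assumptions for the sketch, not values fitted by the paper.

E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28  # assumed Chinchilla-style fit
delta, P_fit = 1.5, 0.02                               # assumed exponent and coefficient
C, U = 1e18, 1e8                                       # compute and unique-token budgets

def loss(N, P):
    D = C / (6 * N)                    # tokens under a fixed budget C ~ 6*N*D
    R = max(D / U - 1.0, 0.0)          # repetitions beyond the first epoch
    return E + A / N**alpha + B / D**beta + P * R**delta * (N / U)

grid = [2**k * 1e6 for k in range(4, 16)]  # 16M .. ~33B parameters
optima = {m: min(grid, key=lambda N: loss(N, m * P_fit)) for m in (0.7, 1.0, 1.3)}
# Robust advice means the optimal N moves by at most one grid step (a factor
# of two) as the coefficient varies over the ±30% band.
```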

  2. Referee: [Abstract] Validation of the scaling law appears to rely on fitting the single overfitting coefficient to the same data used to assess its descriptive accuracy, creating circularity that weakens the claim that the law 'accurately describes model behavior' independently of the fitting process. A held-out evaluation or out-of-distribution test of the coefficient's predictive power would be needed to support the prescriptive use.

    Authors: The referee is correct that our current validation fits the coefficient to the observed excess losses and then assesses fit quality on the same curves, which introduces a degree of circularity. To strengthen the claim of independent descriptive accuracy, we will revise the manuscript to include a held-out evaluation: the coefficient will be fitted exclusively on a subset of configurations (e.g., smaller models and lower repetition factors) and then used to predict excess loss on held-out larger models and unseen repetition schedules. We will report the prediction error on these held-out points to quantify the law's out-of-sample performance and support its use for prescriptive allocation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is empirical modeling with external validation

full rationale

The paper extends the Chinchilla scaling law by introducing an additive overfitting penalty term with a single fitted coefficient to capture repetition effects in data-constrained regimes. This is explicitly presented as a modeling choice ('we model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior'), fitted to data, and then optimized to yield allocation advice. The advice is validated by showing performance improvements when following the recommended configurations, and the coefficient is used for cross-configuration comparisons (e.g., weight decay effects). No equations or steps reduce the central result to its inputs by construction, no self-citation chains justify load-bearing premises, and no fitted parameter is renamed as an independent prediction. The form is not claimed as first-principles but as an empirical fit whose generalization is tested via case studies. This is standard non-circular empirical scaling law construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the modeling assumption that repetition-induced excess loss is well-approximated by a single additive term whose coefficient can be isolated and compared across runs.

free parameters (1)
  • overfitting penalty coefficient
    Single scalar parameter in the additive excess-loss term; fitted to observed performance under repetition and used both for allocation advice and for comparing regularization strength.
axioms (1)
  • domain assumption Excess loss from token repetition can be modeled as a simple additive penalty that is independent of the usual compute and data scaling terms.
    This is the core modeling decision stated in the abstract that enables the one-parameter form and the new allocation recommendations.

pith-pipeline@v0.9.0 · 5489 in / 1306 out tokens · 50083 ms · 2026-05-09T14:08:38.759025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need.
  2. [2] Scaling Laws for Neural Language Models. arXiv:2001.08361.
  3. [3] Scaling Data-Constrained Language Models. Advances in Neural Information Processing Systems.
  4. [4] Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems.
  5. [5] Pre-training under Infinite Compute. The Fourteenth International Conference on Learning Representations.
  6. [6] Scaling Laws and Interpretability of Learning from Repeated Data. arXiv:2205.10487.
  7. [7] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Advances in Neural Information Processing Systems.
  8. [8] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  9. [9] Olmo 3. arXiv:2512.13961.
  10. [10] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, et al. Smol…. Second Conference on Language Modeling.
  11. [11] Solving Quantitative Reasoning Problems with Language Models. Advances in Neural Information Processing Systems.
  12. [12] (Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning. The Thirteenth International Conference on Learning Representations.
  13. [13] Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
  14. [14] OLMES: A Standard for Language Model Evaluations. Findings of the Association for Computational Linguistics: NAACL 2025.
  15. [15] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation.
  16. [16] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.
  17. [17] Scaling Laws for Fine-Grained Mixture of Experts. Forty-first International Conference on Machine Learning.
  18. [18] Likelihood-Based Diffusion Language Models. Advances in Neural Information Processing Systems.
  19. [19] CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  20. [20] Reproducible Scaling Laws for Contrastive Language-Image Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  21. [21] HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  22. [22] WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Communications of the ACM, 2021.
  23. [23] PIQA: Reasoning about Physical Commonsense in Natural Language. Proceedings of the AAAI Conference on Artificial Intelligence.
  24. [24] Social IQa: Commonsense Reasoning about Social Interactions. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  25. [25] Crowdsourcing Multiple Choice Science Questions. Proceedings of the 3rd Workshop on Noisy User-generated Text.
  26. [26] A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  27. [27] Lab-Bench: Measuring Capabilities of Language Models for Biology Research. arXiv:2407.10362.
  28. [28] MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Conference on Health, Inference, and Learning, 2022.
  29. [29] What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 2021.
  30. [30] SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  31. [31] CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 2019.
  32. [32] DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning over Paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  33. [33] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natura… https://aclanthology.org/Q19-1026/
  34. [34] SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  35. [35] The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  36. [36] Scaling Laws for Precision. The Thirteenth International Conference on Learning Representations.
  37. [37] 200,000+ Jeopardy! Questions.
  38. [38] Scaling Laws for Generative Mixed-Modal Language Models. International Conference on Machine Learning, 2023.
  39. [39] Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Forty-first International Conference on Machine Learning.
  40. [40] Language Models Scale Reliably with Over-Training and on Downstream Tasks. The Thirteenth International Conference on Learning Representations.
  41. [41] Deep Double Descent: Where Bigger Models and More Data Hurt. International Conference on Learning Representations.