Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
Pith reviewed 2026-05-21 07:13 UTC · model grok-4.3
The pith
Repeating smaller datasets during training can accelerate learning compared to larger ones by creating sampling biases that drive better layer-wise growth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Repeating on fewer samples leads to compute savings during training compared to using a larger dataset. The speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. Using a smaller dataset with more repetitions is therefore not merely a fallback under data scarcity but can be proactively leveraged as a favorable inductive bias for optimization, particularly in reasoning tasks.
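To make the comparison concrete, the sketch below sets up the two regimes under one shared budget of optimizer steps. It is an illustration only, not the paper's protocol: the function names, the dataset sizes (512 vs. 32,000), the batch size, and the epoch-wise reshuffling scheme are all assumptions chosen for the example.

```python
import random
from collections import Counter

def index_stream(dataset_size, total_steps, batch_size, seed=0):
    """Yield `total_steps` batches of sample indices, reshuffling each epoch.

    Hypothetical helper: sampling is without replacement within an epoch.
    """
    rng = random.Random(seed)
    order = []
    for _ in range(total_steps):
        batch = []
        while len(batch) < batch_size:
            if not order:  # start a new epoch once the old one is exhausted
                order = list(range(dataset_size))
                rng.shuffle(order)
            batch.append(order.pop())
        yield batch

def repetition_stats(dataset_size, total_steps, batch_size, seed=0):
    """Count how many unique samples are seen and how often each repeats."""
    counts = Counter()
    for batch in index_stream(dataset_size, total_steps, batch_size, seed):
        counts.update(batch)
    return {
        "unique_seen": len(counts),
        "mean_repeats": sum(counts.values()) / len(counts),
    }

# Same step budget and batch size for both conditions (assumed values).
BUDGET_STEPS, BATCH = 1000, 32
small = repetition_stats(512, BUDGET_STEPS, BATCH)
large = repetition_stats(32000, BUDGET_STEPS, BATCH)
print("small repeated:", small)
print("large one-pass:", large)
```

Holding the step budget fixed means any difference in time-to-accuracy between the two streams reflects what is sampled, not how much compute is spent.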
What carries the argument
Sampling biases induced by repeating a smaller dataset, which promote suitable layer-wise growth during optimization.
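One plausible reading of "sampling bias", which this summary does not confirm, is the deviation of a finite sample's empirical statistics from the population's. Under that assumption, the sketch below draws uniform ±1 inputs (a hypothetical stand-in for an algorithmic task) and measures how far per-coordinate empirical means drift from the population mean of zero. The helper name and all sizes are illustrative.

```python
import random

def mean_abs_coordinate_bias(n_samples, dim, seed=0):
    """Average |empirical mean| over coordinates of uniform +/-1 inputs.

    The population mean of every coordinate is 0, so any nonzero value
    here is purely an artifact of drawing a finite dataset.
    """
    rng = random.Random(seed)
    sums = [0] * dim
    for _ in range(n_samples):
        for j in range(dim):
            sums[j] += rng.choice((-1, 1))
    return sum(abs(s) / n_samples for s in sums) / dim

bias_small = mean_abs_coordinate_bias(64, 50)    # small dataset
bias_large = mean_abs_coordinate_bias(4096, 50)  # large dataset
print(f"small: {bias_small:.4f}  large: {bias_large:.4f}")
```

The deviation shrinks roughly as one over the square root of the dataset size, which is consistent with the statement that the bias is more pronounced for smaller datasets. Repeating the small dataset keeps the same biased sample in front of the optimizer for the whole run.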
If this is right
- Smaller repeated datasets yield measurable compute savings over larger ones across multiple tasks, architectures, and optimizers.
- The layer-wise growth effect strengthens as the underlying dataset becomes smaller.
- Repeated smaller datasets function as a deliberate inductive bias rather than a default response to data limits.
- Both theoretical analysis and targeted interventions confirm the role of sampling biases in driving the observed acceleration.
Where Pith is reading between the lines
- Practitioners facing compute constraints could test repetition schedules as a first-line design choice instead of always maximizing unique data volume.
- The same bias mechanism may appear in non-reasoning domains once similar layer-wise measurements are applied.
- Hybrid schedules that start with small repeated sets and later expand unique data could combine the growth benefit with eventual diversity.
- If the effect scales, current data-scaling laws may need an explicit repetition term to predict optimal training regimes accurately.
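The hybrid schedule mentioned in the list above could look something like the following sketch. It is speculative and not taken from the paper: the function `pool_size`, the `warm_frac` parameter, and the linear growth rule are all hypothetical choices for illustration.

```python
def pool_size(step, total_steps, small_size, full_size, warm_frac=0.5):
    """Number of unique samples available for sampling at a given step.

    Hypothetical schedule: repeat a small pool for the first `warm_frac`
    of training, then grow the pool linearly up to the full dataset.
    """
    warm_steps = int(total_steps * warm_frac)
    if step < warm_steps:
        return small_size
    remaining = total_steps - warm_steps
    grown = (full_size - small_size) * (step - warm_steps + 1) // remaining
    return small_size + grown

# Example: 100 steps, 512-sample pool growing to 32,000 samples.
schedule = [pool_size(s, 100, 512, 32000) for s in range(100)]
print(schedule[0], schedule[49], schedule[50], schedule[99])
```

A training loop would draw each batch from the first `pool_size(step, ...)` samples of a fixed ordering, so early steps repeat a small set and later steps see progressively more unique data.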
Load-bearing premise
The speedup cannot be accounted for by prior theory and instead stems specifically from sampling biases enabling better layer-wise growth rather than other unmeasured factors.
What would settle it
An experiment that removes the sampling biases while preserving repetition and dataset size, or that blocks layer-wise growth differences, and still observes the same speedup would falsify the proposed mechanism.
Original abstract
This work investigates the "small-vs-large gap", where repeating on fewer samples can lead to compute savings during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures, and optimizers, and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive bias for optimization, particularly in reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that repeating training on smaller datasets yields compute savings compared to using larger datasets, an effect observed across algorithmic tasks, architectures, and optimizers that cannot be explained by prior theory. The speedup is attributed to sampling biases enabling appropriate layer-wise growth, with the effect being more pronounced at smaller dataset sizes. Theoretical analysis and empirical interventions are provided to support this, positioning smaller repeated datasets as a proactive inductive bias for optimization, especially in reasoning tasks.
Significance. If the central mechanism is isolated and the empirical controls are robust, the result would challenge standard data-scaling assumptions and offer a practical lever for faster training under data constraints. The combination of theory and cross-task evidence could influence how practitioners handle repetition versus dataset expansion, particularly if the layer-wise growth account proves distinct from simpler diversity or noise effects.
major comments (2)
- [§5.3] §5.3, ablation on gradient statistics: the interventions do not match total unique samples seen or explicitly control effective batch variance across the small-repeated vs. large conditions; without this, the attribution of speedup specifically to sampling-bias-driven layer-wise growth remains unisolated from reduced per-epoch diversity or altered noise.
- [§4.1] §4.1, theoretical derivation of layer-wise growth: the analysis assumes sampling bias directly modulates per-layer convergence rates, yet does not derive or bound the contribution relative to standard gradient-noise scaling; a concrete test (e.g., variance-matched runs) is needed to establish that the proposed mechanism is load-bearing rather than epiphenomenal.
minor comments (2)
- [Figure 3] Figure 3 caption: the legend does not clarify whether the x-axis is epochs or total compute; this obscures direct comparison of the claimed savings.
- [§2] Related-work section: prior results on epoch-wise repetition (e.g., in language-model scaling studies) are cited only briefly; a short paragraph contrasting the layer-growth account with those findings would improve context.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have reviewed the major comments carefully and provide point-by-point responses below. Where the concerns identify opportunities to strengthen isolation of the proposed mechanism, we indicate revisions that will be incorporated in the next version of the manuscript.
- Referee: [§5.3] §5.3, ablation on gradient statistics: the interventions do not match total unique samples seen or explicitly control effective batch variance across the small-repeated vs. large conditions; without this, the attribution of speedup specifically to sampling-bias-driven layer-wise growth remains unisolated from reduced per-epoch diversity or altered noise.
  Authors: We agree that the current ablations in §5.3 do not explicitly equate the total number of unique samples seen or control effective batch variance between the repeated-small and large-dataset regimes. This leaves open the possibility that observed differences partly reflect reduced per-epoch diversity or changes in noise statistics rather than sampling-bias effects on layer-wise growth. In the revision we will add new controlled experiments that (i) match the cumulative unique samples across conditions and (ii) explicitly equalize gradient variance (via batch-size adjustment or variance-matched resampling). These additions will allow a cleaner attribution to the proposed mechanism.
  Revision: yes
- Referee: [§4.1] §4.1, theoretical derivation of layer-wise growth: the analysis assumes sampling bias directly modulates per-layer convergence rates, yet does not derive or bound the contribution relative to standard gradient-noise scaling; a concrete test (e.g., variance-matched runs) is needed to establish that the proposed mechanism is load-bearing rather than epiphenomenal.
  Authors: The derivation in §4.1 shows how sampling bias can produce differential per-layer convergence rates, but it does not yet bound this effect against the well-known scaling of gradient noise with dataset size. We therefore accept the suggestion to include a direct empirical test. The revised manuscript will report variance-matched runs in which gradient variance is held constant while sampling bias is varied; the resulting layer-wise growth patterns and training-speed differences will be compared to the original conditions. This will clarify whether the sampling-bias account remains load-bearing once standard noise scaling is controlled.
  Revision: yes
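The variance-matching control discussed in both responses can be sketched on a toy problem. The example below is an assumption-laden illustration, not the authors' procedure: it uses a one-parameter least-squares model, assumes minibatches are drawn with replacement (so minibatch gradient variance equals per-example variance divided by batch size), and all helper names and constants are hypothetical.

```python
import random
import statistics

def make_data(n, rng, w_true=2.0, noise=0.5):
    """Toy regression data: y = w_true * x + Gaussian noise (assumed setup)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [w_true * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

def per_example_grads(xs, ys, w):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, per example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def matched_batch_size(var_ref, batch_ref, var_other):
    """Batch size giving `var_other / B` approximately `var_ref / batch_ref`.

    Relies on the with-replacement assumption stated above.
    """
    return max(1, round(batch_ref * var_other / var_ref))

rng = random.Random(0)
small = make_data(64, rng)     # small, repeated condition
large = make_data(4096, rng)   # large condition
w = 0.0                        # evaluate gradients at initialization

var_small = statistics.pvariance(per_example_grads(*small, w))
var_large = statistics.pvariance(per_example_grads(*large, w))
b_small = matched_batch_size(var_large, 32, var_small)
print(f"per-example variance: small={var_small:.3f}, large={var_large:.3f}")
print(f"batch size for the small condition to match B=32 on large: {b_small}")
```

In a real network the variance would need to be re-estimated during training and per layer, since the gradient distribution changes as parameters move. This sketch only shows the bookkeeping for a single snapshot.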
Circularity Check
No significant circularity; derivation grounded in empirical observations and new analysis
The paper's central claim rests on observed performance differences between small and large datasets under repetition, supported by theoretical analysis and empirical interventions that aim to isolate sampling biases and layer-wise growth effects. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional equivalence. The argument explicitly contrasts with prior theory rather than importing uniqueness from the authors' own prior work. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-referential definitions.