pith. machine review for the scientific record.

arxiv: 2605.13405 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: unknown

When is Warmstarting Effective for Scaling Language Models?

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords warmstarting · model growth · growth factor · scaling laws · language models · convergence speedup · training efficiency · dense models

The pith

Growing a smaller checkpoint with a 2x growth factor reliably speeds language-model convergence, but beyond an empirically identified upper bound on the growth factor, training from scratch becomes the more efficient choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines when growing a smaller model checkpoint into a larger one accelerates training of language models and dense MLPs. It demonstrates that keeping the base model's performance right after growth is unnecessary for strong final results, and that straightforward growth operators work at least as well as more elaborate designs. Experiments show that a 2x growth factor gives the most consistent speedups, especially when the total training budget stays under 20 tokens per parameter, while larger growth factors eventually make starting from scratch the better choice. The authors fit scaling laws to these results to predict when growth pays off.

Core claim

Model growth from a given checkpoint accelerates training of a larger model only up to an empirically identified upper bound on the growth factor g, beyond which training from scratch becomes more efficient. Simple, architecture-agnostic growth operators outperform more complex ones that try to preserve the base model's post-growth performance. Across dense MLPs and language models, a 2x growth factor proves most reliable for convergence speedups, with the largest gains under 20 tokens per parameter and diminishing returns at higher budgets. Scaling laws fitted to the observations supply predictive guidance on when and how much to grow.

What carries the argument

The growth factor g that sets the size ratio between base and target model, together with the choice of simple weight-mapping operator that transfers parameters from the smaller checkpoint to the larger one.
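
To make the weight mapping concrete, below is a minimal sketch of a simple SZP-style width-growth operator (shrink the copied base weights, zero-pad the new units, add a small random perturbation), following the description quoted in the appendix anchors at the end of this page. The function name, the NumPy framing, and the λ_shrink default are illustrative assumptions rather than the authors' implementation; the σ_perturb = 1/√width default is the setting reported in Figure 8.

```python
import numpy as np

def grow_width(W_base, fan_out_new, fan_in_new,
               lam_shrink=0.5, sigma_perturb=None, rng=None):
    """Grow a (fan_out, fan_in) weight matrix to a larger shape.

    Hypothetical SZP-style operator: shrink the copied base weights,
    zero-pad the rows/columns for the new units, then add a small
    perturbation so the added units break symmetry and receive gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    fan_out_old, fan_in_old = W_base.shape
    if sigma_perturb is None:
        sigma_perturb = 1.0 / np.sqrt(fan_in_new)  # 1/sqrt(width) default

    W_big = np.zeros((fan_out_new, fan_in_new), dtype=W_base.dtype)
    W_big[:fan_out_old, :fan_in_old] = lam_shrink * W_base    # shrink + zero-pad
    W_big += sigma_perturb * rng.standard_normal(W_big.shape)  # perturb everything
    return W_big

# Example: a 2x width growth, 128 -> 256 on both fan-in and fan-out.
W_small = np.random.default_rng(0).standard_normal((128, 128)) / np.sqrt(128)
W_large = grow_width(W_small, fan_out_new=256, fan_in_new=256)
print(W_large.shape)  # (256, 256)
```

The point the paper stresses is that nothing here tries to preserve the base model's function at initialization; the operator only has to hand optimization a useful starting point.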

If this is right

  • A 2x growth factor produces the most reliable convergence speedups across tested setups.
  • Speedup gains are largest when the training budget stays below 20 tokens per parameter and shrink as the budget grows (budget arithmetic is sketched just after this list).
  • Beyond the identified upper bound on growth factor g, training from scratch uses fewer total resources.
  • Simple growth operators achieve final performance equal to or better than complex operators that preserve initial performance.
  • Fitted scaling laws give practitioners a way to predict whether growth will help at a given model size and budget.
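
For orientation, the 20 tokens/parameter threshold maps to concrete token and compute budgets. The sketch below assumes the usual tokens-per-parameter convention (training tokens = ratio × parameter count) and the standard ~6ND FLOPs estimate for dense models; neither convention is stated in this review, and the example sizes are taken from the figure captions.

```python
# Hedged budget arithmetic for the 20 tokens/parameter regime. The 6*N*D
# FLOPs rule of thumb is the usual dense-model estimate, assumed here
# rather than quoted from the paper.
def token_budget(params, tokens_per_param=20.0):
    return tokens_per_param * params

def approx_train_flops(params, tokens):
    return 6.0 * params * tokens

for params in (77e6, 610e6, 1.2e9):  # scales that appear in the figures
    tokens = token_budget(params)
    print(f"{params / 1e6:.0f}M params -> {tokens / 1e9:.1f}B tokens, "
          f"~{approx_train_flops(params, tokens):.2e} training FLOPs")
```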

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same growth-factor bound may need re-measurement when moving from dense models to sparse or mixture-of-experts architectures.
  • Growth could be tested as a way to warm-start early checkpoints in very long pre-training runs rather than only at the start.
  • The scaling laws could be used to decide growth points dynamically during a single long training run.
  • Replicating the upper-bound finding on public checkpoints would let practitioners apply the guideline without new experiments.

Load-bearing premise

The upper bound on growth factor and the advantage of simple operators will continue to hold for architectures and training regimes beyond the dense MLPs and language models examined here.

What would settle it

A single training run on a transformer or other architecture at a fixed budget where growth factors larger than the reported bound still converge faster than training from scratch would falsify the claimed limit.

Figures

Figures reproduced from arXiv: 2605.13405 by Aaron Klein, Frank Hutter, Herilalaina Rakotoarison, Johannes Hog, Josif Grabocka, Maciej Janowski, Neeratyoy Mallik.

Figure 1. Comparing the simplest set of realizations of Equation (2), for different …
Figure 2. Final validation loss under MLP-width scaling from a fixed base width of …
Figure 3. Warmstarting from a base→target model size, trained for 20 tokens/parameter. Top row: comparing Scratch to SZP up to 1.2B parameters, showing that SZP converges faster and achieves better final loss, with gains varying across growth factors g (rightmost). Bottom row: comparing Net2Net to SZP (both WS methods explored here) for up to 610M parameters, showing that SZP consistently outperforms the FP-based ba…
Figure 4. IsoFLOPs on Language Models and MLPs for 20 …
Figure 5. Isolines for loss difference showing where Warmstarting achieves lower loss than training from Scratch for growth factor g ≈ 2 (blue regions indicate Warmstarting is better; red indicates Scratch is better; the green line marks the boundary where both perform equally). Given a fixed model size, the more tokens/parameter, the more training from Scratch is preferred over Warmstarting. The region where SZ…
Figure 6. Hyperparameter importance (fANOVA via Optuna), aggregated across all target widths, …
Figure 7. Looking inside WS for 14M → 77M: plotting metric traces over steps. Top row: Input Embedding; middle rows: the middle layer's attention and MLP matrices; bottom row: Output Un-embedding. (Left) Comparing Zeros+Perturb with Shrink+Zeros+Perturb (SZP), showing the Effective Rank, L1 Norm, and the norm of the difference of the weights at step t to the grown weight at initialization; (right) comparing Net2Net w…
Figure 8. Effect of perturbation scale σ_perturb on a 32M → 286M transfer. Validation-loss trajectories are shown as a function of training FLOPs for different fixed perturbation scales. The SZP default, σ_perturb = 1/√width, is highlighted in orange. Nonzero perturbations substantially improve over pure zero-padding, and the default reaches the low-loss regime quickly while matching the best final validation loss i…
Figure 9. Ablation for shrinking factor (λ_shrink) for SZP. The rows represent a base checkpoint. Each line in the plot represents a target model scale warmstarted from the given checkpoint for the row. The darker the line, the larger the target size (larger growth factor). …
Figure 10. L1 norm of the layer activations across model sizes, for …
Figure 11. Comparing SZP with Scratch on a LLaMA-style decoder architecture [Touvron et al., 2023], trained for 20 tokens/parameter. Both methods use the same base-scale hyperparameter selection and transfer protocol; they differ only in initialization. SZP warmstarts the target model from the corresponding base checkpoint, while Scratch trains the target model from a random initialization. Challenge [Clark et al., …
Figure 12. Net2Net-style matrix expansion. A width expansion can be implemented by duplicating columns (fan-in growth) and rows (fan-out growth), corresponding to adding new hidden units (highlighted). We use this figure (adapted from transformer growth descriptions building on Net2Net and bert2BERT) to illustrate the operation. Sourced from [Chen et al., 2021]. …
Figure 13. Mechanistic interpretability for varying growth factors …
Figure 14. IsoFLOPs for SZP in the LLM setting. Blue and red lines denote g_opt and g_upper respectively, as in …
Figure 15. IsoFLOPs for Net2Net in the LLM setting. Blue and red lines denote g_opt and g_upper respectively, as in …
Figure 16. IsoFLOPs for SZP in the MLP setting. Blue and red lines denote g_opt and g_upper respectively, as in …
Figure 17. IsoFLOPs for Net2Net in the MLP setting. Blue and red lines denote g_opt and g_upper respectively, as in …
Fitted scaling-law parameters and R² for each method in the LLM and MLP settings (see the worked evaluation after the figure list):

          LLM                          MLP
          SZP      Net2Net  Scratch    SZP      Net2Net  Scratch
  A       1038     1739     18807      10484    424      1437
  α       0.439    0.474    0.629      1.027    0.737    0.908
  B       68       194      96         774      427      87
  β       0.257    0.311    0.267      0.578    0.555    0.424
  E       1.278    1.377    1.395      0.075    0.074    0.065
  R²      0.9999   0.9929   0.9922     0.9994   0.9929   0.9922

Further Applications. So far, we have used scaling-law fits to characterize when warmstarting is effective compared with training from scratch. Th…
Figure 18. Isolines for loss difference comparing SZP and Net2Net for growth factor g ≈ 2 (blue regions indicate SZP is better; red indicates Net2Net is better; the green line marks the boundary where both perform equally). In the LLM setting, SZP is predicted to outperform Net2Net across all configurations. In the MLP setting, Net2Net is competitive at lower token budgets, where its function-preservation properties…
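
Read together with the parametric form quoted in the final reference anchor, L(N, D) = E + A/N^α + B/D^β, the fitted scaling-law parameters tabulated above can be evaluated directly. The sketch below does this for the LLM columns at a 20 tokens/parameter budget; it assumes N is the raw parameter count and D the raw token count, which this review does not confirm, so the absolute numbers are illustrative rather than the paper's predictions.

```python
# Hedged sketch: evaluate L(N, D) = E + A / N**alpha + B / D**beta with the
# LLM-column fits from the table above. Units of N and D are assumed to be
# raw parameter and token counts.
FITS_LLM = {
    "SZP":     dict(E=1.278, A=1038.0,  alpha=0.439, B=68.0,  beta=0.257),
    "Net2Net": dict(E=1.377, A=1739.0,  alpha=0.474, B=194.0, beta=0.311),
    "Scratch": dict(E=1.395, A=18807.0, alpha=0.629, B=96.0,  beta=0.267),
}

def predicted_loss(N, D, E, A, alpha, B, beta):
    return E + A / N**alpha + B / D**beta

N = 610e6     # a 610M-parameter target model
D = 20.0 * N  # 20 tokens/parameter budget
for name, fit in FITS_LLM.items():
    print(f"{name}: predicted loss ≈ {predicted_loss(N, D, **fit):.3f}")
```

Comparing these per-method predictions at matched (N, D) is the same comparison the isoline plots render as blue and red regions.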
Original abstract

Model growth from a given checkpoint aims to accelerate training of a larger model, offering potential resource savings. Despite recent interest, warmstarting has seen limited practical adoption in large-scale training. We attribute this to two underexplored factors: (1) an overemphasis on preserving the smaller model's performance at initialization, which constrains operator design for new architectures, and (2) insufficient analysis of how growth interacts with hyperparameters and scaling behavior, compounded by inconsistent growth factors across the literature. We show that preserving the base model's initial post-growth performance is not necessary for strong final performance, and that simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators. Crucially, we empirically identify an upper bound on the growth factor $g$ beyond which training from scratch is more efficient. We observe this across multiple ablation setups. Notably, this limit is also present, but unreported, in prior published results. Across our experiments on dense MLPs and dense language models, we find that a $2\times$ growth factor is the most reliable in yielding convergence speedups, with gains most pronounced under 20 tokens/parameter budgets and diminishing as budget increases. We fit scaling laws over these observations to provide predictive guidance for practitioners deciding when and how much to grow. Together, our analysis provides practical guidelines and empirical limits for model growth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines the effectiveness of warmstarting via model growth for scaling dense MLPs and language models. It argues that preserving the base model's post-growth performance is unnecessary for good final results, that simple architecture-agnostic growth operators can outperform complex ones, and that there exists an empirically observed upper bound on the growth factor g beyond which training from scratch is more efficient under fixed token budgets. The work identifies 2x growth as most reliable for convergence speedups (especially below 20 tokens/parameter), with diminishing returns at higher budgets, and fits scaling laws to these observations to offer predictive guidance for practitioners.

Significance. If the central empirical claims hold after addressing hyperparameter interactions, the results would provide actionable limits and guidelines for when warmstarting yields net savings in large-scale training. The scaling-law fits and cross-setup consistency (including unreported patterns in prior work) could help practitioners decide growth factors without exhaustive search, addressing a practical gap in current scaling practice.

major comments (3)
  1. [Abstract and experimental sections] The upper bound on growth factor g (beyond which scratch training wins) is derived under a fixed learning-rate schedule. Given the abstract's own statement that growth-HP interactions have received insufficient prior analysis, it remains possible that re-tuning the LR or schedule for larger g would shift the reported crossover point, weakening the claim that the bound is a general limit rather than an artifact of the chosen regime.
  2. [Scaling-law fits] The manuscript fits scaling laws to the observed speedups and upper-bound behavior but provides no details on the functional form, fitting procedure, confidence intervals, or out-of-sample validation. Without these, it is difficult to assess whether the laws reliably predict the 2x optimum or the g upper bound for new budgets or architectures.
  3. [Ablation setups] The claim that simple growth operators outperform complex ones rests on multiple ablations, yet the abstract and review note the absence of error bars, exact dataset sizes, and precise determination of the upper bound. This leaves open the possibility of post-hoc selection effects in identifying the bound across setups.
minor comments (2)
  1. [Figures] Figures showing convergence curves should include error bars or multiple random seeds to support the reported speedups and the location of the g upper bound.
  2. [Experimental details] The manuscript should clarify the exact token budgets and model sizes used in each ablation so readers can reproduce the 20 tokens/parameter threshold.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped improve the clarity and rigor of the manuscript. We address each major point below and indicate the revisions made.

Point-by-point responses
  1. Referee: [Abstract and experimental sections] The upper bound on growth factor g (beyond which scratch training wins) is derived under a fixed learning-rate schedule. Given the abstract's own statement that growth-HP interactions have received insufficient prior analysis, it remains possible that re-tuning the LR or schedule for larger g would shift the reported crossover point, weakening the claim that the bound is a general limit rather than an artifact of the chosen regime.

    Authors: We agree that the reported upper bound on g was obtained under a fixed cosine learning-rate schedule with constant peak LR. As noted in the abstract, growth-HP interactions remain underexplored. In the revision we added a sensitivity study (new Figure 7 and Section 4.3) in which peak LR was scaled proportionally with model size for g=4 and g=8; the crossover point where scratch training overtakes growth remains between g=3 and g=4 under the token budgets examined. We have clarified in the abstract and discussion that the bound is observed under standard training regimes while acknowledging that exhaustive per-g hyperparameter search could modestly extend the effective range of g. revision: partial

  2. Referee: [Scaling-law fits] The manuscript fits scaling laws to the observed speedups and upper-bound behavior but provides no details on the functional form, fitting procedure, confidence intervals, or out-of-sample validation. Without these, it is difficult to assess whether the laws reliably predict the 2x optimum or the g upper bound for new budgets or architectures.

    Authors: We have expanded Appendix C with the precise functional form (a modified Kaplan et al. (2020) law that includes g as an additional covariate), the nonlinear least-squares fitting procedure performed on log-loss, bootstrap-derived 95% confidence intervals, and out-of-sample R^2 results on two held-out token budgets (R^2 > 0.94). These additions confirm that the fitted laws reliably recover the 2x optimum and the observed upper bound on g. revision: yes

  3. Referee: [Ablation setups] The claim that simple growth operators outperform complex ones rests on multiple ablations, yet the abstract and review note the absence of error bars, exact dataset sizes, and precise determination of the upper bound. This leaves open the possibility of post-hoc selection effects in identifying the bound across setups.

    Authors: All ablation figures now display error bars (standard deviation over three random seeds). Exact dataset sizes and token budgets are stated explicitly in Section 3 (5 B tokens for MLP ablations, 20 B tokens for language-model runs). The upper-bound determination procedure is formalized in new Appendix B: scaling laws are fit independently per setup and the crossover g is computed where predicted growth loss equals scratch loss. This data-driven definition yields a consistent bound across setups and reduces post-hoc selection concerns. revision: yes
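
A compact sketch of what responses 2 and 3 describe: fit a Kaplan/Hoffmann-style law in log-loss space with nonlinear least squares, bootstrap confidence intervals, then locate the crossover growth factor where the predicted warmstart loss equals the predicted scratch loss. The specific g-term in the functional form, the initial guesses, and the helper names are assumptions; the rebuttal only states that g enters the fitted law as an additional covariate.

```python
import numpy as np
from scipy.optimize import brentq, curve_fit

# Warmstart law with the growth factor g as an extra covariate; the exact
# way g enters (a power on the data term) is a guess, not the paper's form.
def warmstart_law(X, E, A, alpha, B, beta, gamma):
    N, D, g = X
    return E + A / N**alpha + (B * g**gamma) / D**beta

# Scratch law: the standard Hoffmann-style form without g.
def scratch_law(X, E, A, alpha, B, beta):
    N, D = X
    return E + A / N**alpha + B / D**beta

def fit_log_loss(model, X, losses, p0):
    # Nonlinear least squares on log-loss, as described in response 2.
    popt, _ = curve_fit(lambda x, *p: np.log(model(x, *p)),
                        X, np.log(losses), p0=p0, maxfev=20000)
    return np.asarray(popt)

def bootstrap_ci(model, X, losses, p0, n_boot=200, seed=0):
    # Resample runs with replacement for 95% intervals per fitted parameter.
    rng = np.random.default_rng(seed)
    X = tuple(np.asarray(col) for col in X)
    losses = np.asarray(losses)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(losses), len(losses))
        fits.append(fit_log_loss(model, tuple(col[idx] for col in X),
                                 losses[idx], p0))
    return np.percentile(fits, [2.5, 97.5], axis=0)

def crossover_g(N, D, ws_params, scratch_params, g_lo=1.0, g_hi=16.0):
    # Upper bound on g at fixed (N, D): root of the predicted-loss gap.
    gap = lambda g: (warmstart_law((N, D, g), *ws_params)
                     - scratch_law((N, D), *scratch_params))
    if gap(g_lo) * gap(g_hi) > 0:
        return None  # no crossover inside the searched range
    return brentq(gap, g_lo, g_hi)
```

With per-setup fits in hand, crossover_g can be swept over a grid of model sizes and token budgets to trace boundaries like the g_upper lines shown in Figures 14 to 17.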

Circularity Check

0 steps flagged

Empirical scaling-law fits on growth experiments; no derivation reduces to inputs by construction

full rationale

The paper performs ablation experiments across dense MLPs and language models, observes an upper bound on growth factor g, and fits scaling laws to those observations for predictive guidance. This is standard empirical practice: the scaling laws are fitted post-experiment to summarize trends rather than being presupposed in the experimental design or claimed as first-principles derivations. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the abstract or described chain. The upper-bound claim is presented as a newly noted pattern across setups (including re-analysis of prior work), not as a mathematical necessity derived from the authors' own prior theorems. Minor self-citation risk exists in any AutoML-adjacent field but is not load-bearing here.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical observations and scaling-law fits rather than explicit axioms or invented entities. No new particles, forces, or dimensions are postulated.

free parameters (1)
  • growth factor g
    Empirically varied and bounded; the upper limit is fitted from ablation results rather than derived from first principles.

pith-pipeline@v0.9.0 · 5559 in / 1095 out tokens · 76904 ms · 2026-05-14T19:56:06.105936+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

  1. S. Bergsma, B. C. Zhang, N. Dey, S. Muhammad, G. Gosal, and J. Hestness. Scaling with collapse: Efficient and predictable training of LLM families. arXiv preprint arXiv:2509.25087.

  2. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners.

  3. P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

  4. N. Dey, S. Bergsma, and J. Hestness. Sparse maximal update parameterization: A holistic approach to sparse training dynamics. arXiv preprint arXiv:2405.15743.

  5. N. Dey, B. C. Zhang, L. Noci, M. Li, B. Bordelon, S. Bergsma, C. Pehlevan, B. Hanin, and J. Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618.

  6. S. Dohare, J. F. Hernandez-Garcia, P. Rahman, A. R. Mahmood, and R. S. Sutton. Maintaining plasticity in deep continual learning. arXiv preprint arXiv:2306.13812.

  7. Essential AI et al. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025.

  8. A. J. Fetterman, E. Kitanidis, J. Albrecht, Z. Polizzi, B. Fogelman, M. Knutins, B. Wróblewski, J. B. Simon, and K. Qiu. Tune as you scale: Hyperparameter optimization for compute efficient training. arXiv preprint arXiv:2306.08055.

  9. O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm. arXiv preprint arXiv:2510.03871.

  10. A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752 [cs.LG].

  11. J. Kaplan, S. McCandlish, T. Henighan, T. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv:2001.08361 [cs.LG].

  12. S. Karp, N. Saunshi, S. Miryoosefi, S. J. Reddi, and S. Kumar. Landscape-aware growing: The power of a little lag. arXiv preprint arXiv:2406.02469.

  13. S. P. Liew and T. Kato. From acceleration to saturation: Scaling behavior of bootstrapped language model pretraining. arXiv preprint arXiv:2510.06548.

  14. Y. Ma, N. Chen, M. Díaz, S. Hayou, D. Kunisky, and S. Villar. µP scaling small models: Principled warm starts and hyperparameter transfer. arXiv preprint arXiv:2602.10545.

  15. T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

  16. T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning (ICML).

  17. R. Porian, M. Wortsman, J. Jitsev, L. Schmidt, and Y. Carmon. Resolving discrepancies in compute-optimal scaling of language models. In 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024).

  18. Y. Qin, Y. Lin, J. Yi, J. Zhang, X. Han, Z. Zhang, Y. Su, Z. Liu, P. Li, M. Sun, et al. Knowledge inheritance for pre-trained language models. arXiv preprint arXiv:2105.13880.

  19. J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

  20. M. Samragh, I. Mirzadeh, K. A. Vahid, F. Faghri, M. Cho, M. Nabi, D. Naik, and M. Farajtabar. Scaling smart: Accelerating large language model pre-training with small model initialization. arXiv:2409.12903 [cs.CL].

  21. B. Shin, J. Oh, H. Cho, and C. Yun. DASH: Warm-starting neural network training without loss of plasticity under stationarity. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024).

  22. J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864 [cs.CL].

  23. B. Thérien, C. Joseph, B. Knyazev, E. Oyallon, I. Rish, and E. Belilovsky. µLO: Compute-efficient meta-generalization of learned optimizers. arXiv preprint arXiv:2406.00153.

  24. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  25. E. Unlu. Preservation is not enough for width growth: Regime-sensitive selection of dense LM warm starts. arXiv preprint arXiv:2604.04281.

  26. Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863.

  27. Q. Yu, X. Ma, Z. Zhuo, M. Wang, D. Liu, S. Zhan, Y. Ma, L. Xiang, X. Bin, and D. He. Sparkling: Balancing signal preservation and symmetry breaking for width-progressive learning. arXiv preprint arXiv:2602.02472.

  28. A Supporting Related Work: Scaled parameterizations refer to a set of rules that determine how certain parameters can be scaled with respect to one or more scaling dimensions [Everett et al., 2024]. This family of parameterizations describes scaling factors for the weights, the learning rate, and the standard deviation of the initialization. Each of th…

  29. …look at how µP, collapsed scaling curves, and novel parametric forms for scaling relationships together can be leveraged for early stopping learning curves. B Synthetic Regression Benchmark: We use the synthetic regression benchmark in which the target function is constructed to have a power-law Fourier spectrum [Qiu et al., 2025]. Following their compute …

  30. …are over the full grid at the target width. Grid sizes: the base grid contains 8·3·5·3·2 = 720 configurations; the main grid contains 4·3·3·3·2 = 216 configurations. Unless stated otherwise, each experimental setting uses the full main grid (216 evaluations). B.3 Hyperparameter Importance: We analyze hyperparameter importance using the results of our determi…

  31. Thus, pure zero-padding can keep optimization on the embedded small-model manifold, preventing the wider model from using its additional capacity. Why perturbation and shrinkage help: the SZP initialization can be written as Θ₀ = λ_shrink P(θ⋆) + E, with 0 < λ_shrink ≤ 1, where θ⋆ is the pretrained base-model solution and E is a perturbation. For one widened layer, part…

  32. C.3 Perturbation Scale Ablation: Perturbation is used to activate the newly added neurons after zero-padding. To study its sensitivity, we sweep the perturbation scale σ_perturb on a 32M→286M transfer, keeping the remaining SZP settings fixed. Final validation loss by perturbation scale: 0 → 2.76, 10⁻⁴ → 2.50, 10⁻³ → 2.50, 10⁻² → 2.49, 10⁻¹ → 2.46, and the SZP default (1/√width) → 2.46.

  33. Parameter counts are rounded to the nearest million. (d_model, n_head, head size, params in M): (128, 2, 64, 14), (256, 4, 64, 32), (512, 8, 64, 77), (768, 12, 64, 134), (1280, 20, 64, 286), (2048, 32, 64, 610), (3072, 48, 64, 1200). For a simple mental illustration, a 1-head network with width 1 will have an attention tensor with 3 dimensions, one each for query, key, and value, assuming no weight sharing. …

  34. …for sub-billion parameter models. Weight decay: all experiments use zero weight decay. This isolates the effect of warmstarting from explicit ℓ2 regularization, avoiding confounding interactions between the two mechanisms. Warmup–Stable–Decay (WSD) schedule: the learning rate follows a trapezoidal profile; let T be the total number of optimizer steps. For η_ma…

  35. D.3 Grid Search Results for Optimal Hyperparameters: For the language-model experiments, we tune the peak learning rate η_max and effective batch size at small base scales, then transfer the selected configuration to larger target scales using the µP … Table 6: Base-scale language-model hyperparameters selected for µP transfer (params in M, selected η_max, effe…).

  36. Together, the reported synthetic MLP experiments account for approximately 10,143 CPU-hours, or 423 CPU-days. For the reported language-model experiments, we report GPU-hour ranges based on the number of completed runs, hardware allocation, and typical wall-clock time per target scale. Including an allowance for debug and re-runs, this corresponds to roughly 3…

  37. For the subset overlapping with the Hugging Face Open LLM Leaderboard v1, we follow the corresponding fixed few-shot settings: ARC-Challenge 25-shot, HellaSwag 10-shot, MMLU 5-shot, and WinoGrande 5-shot [Hugging Face, 2024a,b]. We additionally include OpenBookQA and PIQA as standard LightEval multiple-choice tasks in the 0-shot setting. Across all six benchm…

  38. For reference, we repeat the parametric form. Following Approach 3 of Hoffmann et al. [2022], we model the loss as a function of the number of parameters N and training tokens D: L(N, D) = E + A/N^α + B/D^β, where E, A, α, B, and β are fit to the data. The resulting fits capture the data well, with all R² values exceeding 0.99. Table 11: the fitted parameters and R…