Small Initialization Matters for Large Language Models
Pith reviewed 2026-06-27 00:55 UTC · model grok-4.3
The pith
Reducing the initialization scale improves pretraining of large language models, with the largest gains on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Parameter initialization scale determines model capacity by setting a distinct developmental trajectory in which weights first condense into low-complexity structures and subsequently expand into richer representations; this path yields consistent pretraining gains that are largest on reasoning tasks and concentrated on non-trivial, context-constrained token predictions, while a critical scale balances reasoning performance against training stability.
What carries the argument
The condensation-then-expansion trajectory of parameters under small initialization, which supplies a concrete mechanism for the claim that compression precedes richer intelligence.
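To make "condensation" concrete, here is a minimal sketch of one way it could be measured. It assumes condensation means that the weight vectors of different neurons in a layer align along a small number of directions, and it uses the mean absolute pairwise cosine similarity as a score. The function names, the toy matrices, and the choice of metric are all illustrative assumptions, not the paper's measurement.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def condensation_score(weight_rows):
    """Mean |cosine| over all pairs of neuron weight vectors.

    Values near 1 mean the neurons share a few directions (condensed);
    values near 0 mean the directions are spread out.
    """
    n = len(weight_rows)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(cosine(weight_rows[i], weight_rows[j]))
            pairs += 1
    return total / pairs


rng = random.Random(0)
# Hypothetical "spread" layer: 16 neurons with independent Gaussian weights.
spread = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(16)]
# Hypothetical "condensed" layer: each neuron is a signed, scaled copy of
# one shared direction plus a little noise.
direction = [rng.gauss(0.0, 1.0) for _ in range(64)]
scales = [rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0) for _ in range(16)]
condensed = [[s * d + rng.gauss(0.0, 0.01) for d in direction] for s in scales]

print("spread:", round(condensation_score(spread), 3))
print("condensed:", round(condensation_score(condensed), 3))
```

Tracking a score like this over training steps would be one way to see whether it first rises (condensation) and later falls (expansion).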
If this is right
- Small initialization produces consistent pretraining gains across model scales.
- The largest improvements appear on reasoning-demanding tasks rather than uniform token prediction.
- Gains concentrate on non-trivial predictions that depend on context rather than all tokens equally.
- A critical initialization scale exists that trades off reasoning performance against training stability.
- A gamma-initialization rule lets practitioners adopt small initialization by default at negligible cost.
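The summary does not spell out the γ-initialization rule, so the sketch below assumes one plausible form: weights drawn from a zero-mean Gaussian with standard deviation fan_in^(-γ). Under that assumption γ = 0.5 gives the familiar 1/sqrt(fan_in) scale and larger γ gives smaller initial weights. The function `gamma_init` and its parameters are hypothetical names used only for illustration.

```python
import math
import random


def gamma_init(fan_in, fan_out, gamma=0.5, seed=0):
    """Sample a fan_out x fan_in weight matrix with std = fan_in ** (-gamma).

    ASSUMPTION: this parameterization is illustrative and is not confirmed
    by the summary above. gamma is the exposed "knob".
    """
    rng = random.Random(seed)
    std = fan_in ** (-gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_std(matrix):
    """Standard deviation of all entries in a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((w - mean) ** 2 for w in values) / len(values))


standard = gamma_init(256, 64, gamma=0.5)  # std close to 256 ** -0.5
small = gamma_init(256, 64, gamma=0.8)     # smaller initial weights
print(empirical_std(standard), empirical_std(small))
```

Exposing γ as a single argument keeps the change to a training script small: only the initializer call differs between a standard and a small-initialization run.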
Where Pith is reading between the lines
- The same condense-then-expand dynamic may appear in other neural architectures and training regimes beyond transformers.
- Initialization scale could be scheduled or adapted during training rather than fixed at the start to further exploit the trajectory.
- The finding reframes capacity as partly determined by starting conditions instead of emerging only from scale, data, and architecture.
- Token-level diagnostics of the sort used here could be applied to diagnose whether other interventions also operate through non-uniform prediction improvements.
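A token-level diagnostic of the kind mentioned in the last bullet can be sketched as follows. The sketch assumes per-token losses (negative log-likelihoods) are available for the same tokens from a baseline run and a candidate run, and it uses baseline loss as a rough proxy for how non-trivial a token is. The bucketing scheme, the function name `bucketed_gain`, and the numbers are made up for illustration.

```python
def bucketed_gain(baseline_loss, candidate_loss, edges):
    """Mean per-token loss reduction, grouped by baseline loss.

    baseline_loss, candidate_loss: per-token losses for the same tokens.
    edges: ascending bucket boundaries on the baseline loss.
    Returns a list of (bucket_label, token_count, mean_gain) tuples,
    where gain = baseline loss - candidate loss.
    """
    if len(baseline_loss) != len(candidate_loss):
        raise ValueError("loss lists must have the same length")
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    sums = [0.0] * (len(bounds) - 1)
    counts = [0] * (len(bounds) - 1)
    for b, c in zip(baseline_loss, candidate_loss):
        for k in range(len(bounds) - 1):
            if bounds[k] <= b < bounds[k + 1]:
                sums[k] += b - c
                counts[k] += 1
                break
    report = []
    for k in range(len(bounds) - 1):
        label = f"[{bounds[k]}, {bounds[k + 1]})"
        mean_gain = sums[k] / counts[k] if counts[k] else 0.0
        report.append((label, counts[k], mean_gain))
    return report


# Made-up losses: three easy tokens and three hard tokens.
baseline = [0.05, 0.10, 0.08, 2.0, 3.5, 2.8]
candidate = [0.05, 0.09, 0.08, 1.6, 3.0, 2.5]
report = bucketed_gain(baseline, candidate, edges=[1.0])
for label, count, gain in report:
    print(label, count, round(gain, 3))
```

Conditioning on baseline loss can exaggerate gains through regression to the mean, so a real analysis might bucket by a third reference model's loss or by predictive entropy instead.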
Load-bearing premise
Two widely used empirical settings are the main factors that have hidden the benefit of small initialization, and relaxing those settings is what restores favorable scaling.
What would settle it
A controlled comparison in which small initialization is applied after the two settings are relaxed yet produces no gain or a loss on reasoning benchmarks would falsify the claim that the trajectory and scaling benefit are recovered.
Original abstract
Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that reducing the initialization scale for large language models consistently improves pretraining performance, with the largest gains on reasoning-demanding tasks. It identifies two widely used empirical settings that restrain this advantage and shows that relaxing them restores favorable scaling behavior. Mechanistically, small initialization induces a developmental trajectory in which parameters first condense into low-complexity structures and later expand into richer representations. Token-level analyses indicate that gains concentrate on non-trivial, context-constrained predictions. The work proposes a simple γ-initialization rule to expose initialization range as an explicit training knob.
Significance. If the empirical results and mechanistic claims hold after addressing controls, the work would be significant for establishing initialization scale as a low-cost, high-impact determinant of LLM capacity and reasoning. The condensation-then-expansion trajectory supplies a concrete mechanism supporting the compression-is-intelligence hypothesis, and the task-specific token gains offer falsifiable predictions that could guide future training studies. The γ-initialization proposal is a practical contribution that could be adopted with minimal overhead if shown to be robust.
major comments (1)
- [Identification of empirical settings and ablations] The central claim that relaxing the two identified empirical settings restores favorable scaling for small initialization is load-bearing. The ablations must isolate these settings from correlated variables such as effective learning-rate scale, gradient clipping thresholds, or data-ordering effects that interact with initialization variance; otherwise the observed trajectory and task-specific gains could be regime-specific artifacts rather than a general developmental mechanism (§ on identification of empirical settings and associated ablations).
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., perplexity delta or reasoning benchmark improvement with error bars) to allow readers to gauge effect size immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the potential significance of initialization scale as a determinant of LLM training dynamics. We address the single major comment below and will strengthen the ablations in revision to better isolate the claimed effects.
Point-by-point responses
- Referee: [Identification of empirical settings and ablations] The central claim that relaxing the two identified empirical settings restores favorable scaling for small initialization is load-bearing. The ablations must isolate these settings from correlated variables such as effective learning-rate scale, gradient clipping thresholds, or data-ordering effects that interact with initialization variance; otherwise the observed trajectory and task-specific gains could be regime-specific artifacts rather than a general developmental mechanism (§ on identification of empirical settings and associated ablations).
Authors: We agree that isolating the two empirical settings from the listed confounders is essential for establishing the developmental mechanism as general rather than regime-specific. In the submitted manuscript we already held the learning-rate schedule fixed and verified that per-parameter gradient norms differ systematically with initialization scale; we also reported results across multiple random seeds. However, these controls are insufficient to fully rule out interactions. In the revised manuscript we will add targeted ablations that (i) explicitly rescale the base learning rate so that initial gradient norms are matched across initialization scales, (ii) sweep gradient-clipping thresholds while keeping all other hyperparameters constant, and (iii) fix the data order (identical seed and shuffling) while varying only the initialization scale. We will present these results in an expanded version of the section on empirical settings, together with statistical tests confirming that the condensation-then-expansion trajectory and the concentration of gains on non-trivial tokens remain statistically significant under the stricter controls. These additions directly address the load-bearing concern without altering the core claims.
Revision: yes
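One possible reading of ablation (i) is sketched below: scale the base learning rate for each initialization scale so that the product of learning rate and step-0 gradient norm matches a reference run. This is an interpretation under a plain-SGD view of the update size; with adaptive optimizers the matching condition would likely need to be stated differently. The function name and all numbers are hypothetical.

```python
def matched_learning_rate(base_lr, ref_grad_norm, grad_norm):
    """Return lr such that lr * grad_norm == base_lr * ref_grad_norm."""
    if grad_norm <= 0:
        raise ValueError("grad_norm must be positive")
    return base_lr * ref_grad_norm / grad_norm


# Hypothetical step-0 gradient norms measured for each initialization scale.
initial_grad_norms = {"gamma=0.5": 1.20, "gamma=0.7": 0.30, "gamma=0.9": 0.06}
base_lr = 3e-4
ref = initial_grad_norms["gamma=0.5"]  # reference run: standard initialization

matched = {
    name: matched_learning_rate(base_lr, ref, norm)
    for name, norm in initial_grad_norms.items()
}
for name, lr in matched.items():
    print(name, lr)
```

Running each initialization scale once with the shared base learning rate and once with its matched rate would separate the effect of the initialization itself from the effect of a changed effective step size.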
Circularity Check
No significant circularity; empirical observations are self-contained.
Full rationale
The paper's central claims rest on experimental results showing performance gains from reduced initialization scale, identification of two empirical settings, and observed developmental trajectories (condensation then expansion). No equations, fitted parameters, or self-citations are presented in the abstract or described text that would reduce any prediction or mechanistic claim to a definitional loop or input by construction. The γ-initialization rule is motivated as a practical recommendation from observations rather than derived from prior self-work or ansatz smuggling. The derivation chain is therefore independent of the listed circularity patterns.
Axiom & Free-Parameter Ledger