Small Initialization Matters for Large Language Models

Feiyu Xiong; Hongkang Yang; Junjie Yao; Liangkai Hang; Zhi-Qin John Xu; Zhiyu Li

arxiv: 2606.17945 · v2 · pith:Q6N54BJLnew · submitted 2026-06-16 · 💻 cs.AI

Small Initialization Matters for Large Language Models

Liangkai Hang , Junjie Yao , Zhiyu Li , Feiyu Xiong , Hongkang Yang , Zhi-Qin John Xu This is my paper

Pith reviewed 2026-06-27 00:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords parameter initializationlarge language modelspretrainingreasoning tasksdevelopmental trajectorymodel capacityinitialization scale

0 comments

The pith

Reducing the initialization scale improves pretraining of large language models with largest gains on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the scale at which parameters are randomly initialized functions as a gene-like control on how large language models develop during training. Smaller scales produce better final models, with the strongest improvements appearing on tasks that require reasoning rather than rote pattern matching. This advantage appears because parameters follow a two-phase path: they first collapse into simple low-complexity forms and only later expand into richer structures. Two common training choices have hidden this benefit in past work; relaxing them lets the improvement grow with model size. The authors therefore propose treating initialization scale as an explicit, almost cost-free control knob via a simple gamma rule.

Core claim

Parameter initialization scale determines model capacity by setting a distinct developmental trajectory in which weights first condense into low-complexity structures and subsequently expand into richer representations; this path yields consistent pretraining gains that are largest on reasoning tasks and concentrated on non-trivial, context-constrained token predictions, while a critical scale balances reasoning performance against training stability.

What carries the argument

The condensation-then-expansion trajectory of parameters under small initialization, which supplies a concrete mechanism for the claim that compression precedes richer intelligence.

If this is right

Small initialization produces consistent pretraining gains across model scales.
The largest improvements appear on reasoning-demanding tasks rather than uniform token prediction.
Gains concentrate on non-trivial predictions that depend on context rather than all tokens equally.
A critical initialization scale exists that trades off reasoning performance against training stability.
A gamma-initialization rule lets practitioners adopt small initialization by default at negligible cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same condense-then-expand dynamic may appear in other neural architectures and training regimes beyond transformers.
Initialization scale could be scheduled or adapted during training rather than fixed at the start to further exploit the trajectory.
The finding reframes capacity as partly determined by starting conditions instead of emerging only from scale, data, and architecture.
Token-level diagnostics of the sort used here could be applied to diagnose whether other interventions also operate through non-uniform prediction improvements.

Load-bearing premise

Two widely used empirical settings are the main factors that have hidden the benefit of small initialization, and relaxing those settings is what restores favorable scaling.

What would settle it

A controlled comparison in which small initialization is applied after the two settings are relaxed yet produces no gain or a loss on reasoning benchmarks would falsify the claim that the trajectory and scaling benefit are recovered.

read the original abstract

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization rage as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Small initialization improves pretraining and reasoning via a condensation-then-expansion trajectory, but the paper needs tighter controls to show the two settings are the main drivers rather than correlated factors.

read the letter

The main point is that smaller initialization scales appear to boost LLM pretraining, with bigger lifts on reasoning tasks, by pushing parameters through a low-complexity condensation phase before they expand into richer structures. The gamma rule is presented as a cheap way to expose initialization as a tunable knob.

What comes across as new is the explicit developmental trajectory and the token-level breakdown showing gains mostly on context-constrained predictions instead of blanket improvements. The paper does a reasonable job documenting consistent gains across scales once the two common settings are relaxed, and it frames initialization as more than a minor hyperparameter.

The soft spot is the claim that those two settings are the primary restraints. The experiments would need to demonstrate that the observed trajectory and task gains survive when learning-rate scaling, gradient clipping, or data ordering are varied independently; otherwise the advantage could be an artifact of the chosen regime rather than a general mechanism. The abstract gives no numbers or error bars, so effect sizes and robustness are hard to judge without the full tables.

This is aimed at people running pretraining runs or studying how capabilities emerge. A practitioner looking for low-cost tweaks or a researcher tracking scaling behavior would get something concrete to test.

It is worth sending to referees. The core empirical observation is straightforward to check and the intervention is cheap enough that even partial confirmation would matter.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that reducing the initialization scale for large language models consistently improves pretraining performance, with the largest gains on reasoning-demanding tasks. It identifies two widely used empirical settings that restrain this advantage and shows that relaxing them restores favorable scaling behavior. Mechanistically, small initialization induces a developmental trajectory in which parameters first condense into low-complexity structures and later expand into richer representations. Token-level analyses indicate that gains concentrate on non-trivial, context-constrained predictions. The work proposes a simple γ-initialization rule to expose initialization range as an explicit training knob.

Significance. If the empirical results and mechanistic claims hold after addressing controls, the work would be significant for establishing initialization scale as a low-cost, high-impact determinant of LLM capacity and reasoning. The condensation-then-expansion trajectory supplies a concrete mechanism supporting the compression-is-intelligence hypothesis, and the task-specific token gains offer falsifiable predictions that could guide future training studies. The γ-initialization proposal is a practical contribution that could be adopted with minimal overhead if shown to be robust.

major comments (1)

[Identification of empirical settings and ablations] The central claim that relaxing the two identified empirical settings restores favorable scaling for small initialization is load-bearing. The ablations must isolate these settings from correlated variables such as effective learning-rate scale, gradient clipping thresholds, or data-ordering effects that interact with initialization variance; otherwise the observed trajectory and task-specific gains could be regime-specific artifacts rather than a general developmental mechanism (§ on identification of empirical settings and associated ablations).

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., perplexity delta or reasoning benchmark improvement with error bars) to allow readers to gauge effect size immediately.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the potential significance of initialization scale as a determinant of LLM training dynamics. We address the single major comment below and will strengthen the ablations in revision to better isolate the claimed effects.

read point-by-point responses

Referee: [Identification of empirical settings and ablations] The central claim that relaxing the two identified empirical settings restores favorable scaling for small initialization is load-bearing. The ablations must isolate these settings from correlated variables such as effective learning-rate scale, gradient clipping thresholds, or data-ordering effects that interact with initialization variance; otherwise the observed trajectory and task-specific gains could be regime-specific artifacts rather than a general developmental mechanism (§ on identification of empirical settings and associated ablations).

Authors: We agree that isolating the two empirical settings from the listed confounders is essential for establishing the developmental mechanism as general rather than regime-specific. In the submitted manuscript we already held the learning-rate schedule fixed and verified that per-parameter gradient norms differ systematically with initialization scale; we also reported results across multiple random seeds. However, these controls are insufficient to fully rule out interactions. In the revised manuscript we will add targeted ablations that (i) explicitly rescale the base learning rate so that initial gradient norms are matched across initialization scales, (ii) sweep gradient-clipping thresholds while keeping all other hyperparameters constant, and (iii) fix the data order (identical seed and shuffling) while varying only the initialization scale. We will present these results in an expanded version of the section on empirical settings, together with statistical tests confirming that the condensation-then-expansion trajectory and the concentration of gains on non-trivial tokens remain statistically significant under the stricter controls. These additions directly address the load-bearing concern without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observations are self-contained.

full rationale

The paper's central claims rest on experimental results showing performance gains from reduced initialization scale, identification of two empirical settings, and observed developmental trajectories (condensation then expansion). No equations, fitted parameters, or self-citations are presented in the abstract or described text that would reduce any prediction or mechanistic claim to a definitional loop or input by construction. The γ-initialization rule is motivated as a practical recommendation from observations rather than derived from prior self-work or ansatz smuggling. The derivation chain is therefore independent of the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or can be inferred from the provided text.

pith-pipeline@v0.9.1-grok · 5727 in / 1066 out tokens · 37307 ms · 2026-06-27T00:55:57.005739+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 9 canonical work pages · 8 internal anchors

[1]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A.,et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

1901
[2]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[3]

In: Advances in Neural Information Processing Systems (2022)

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J., Sifre, L.: Training compute optimal large language models. In: Advances in Neural Inf...

2022
[4]

International Conference on Learning Representations (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. International Conference on Learning Representations (2022)

2022
[5]

In: International Conference on Learning Representations, vol

Liu, H., Li, Z., Hall, D., Liang, P., Ma, T.: Sophia: A scalable stochastic second- order optimizer for language model pre-training. In: International Conference on Learning Representations, vol. 2024, pp. 1621–1650 (2024)

2024
[6]

Nature645(8081), 633–638 (2025)

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P.,et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645(8081), 633–638 (2025)

2025
[7]

Advances in neural information processing systems (2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems (2017)

2017
[8]

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and Short Papers), pp. 4171–4186 (2019)

2019
[9]

Journal of Machine Learning Research23(120), 1–39 (2022)

Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity. Journal of Machine Learning Research23(120), 1–39 (2022)

2022
[10]

In: First Conference on Language Modeling (2024)

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024)

2024
[11]

nature521(7553), 436–444 23 (2015)

LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature521(7553), 436–444 23 (2015)

2015
[12]

In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp

Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010). JMLR Workshop and Conference Proceedings

2010
[13]

In: Neural Networks: Tricks of the Trade, (2012)

LeCun, Y.A., Bottou, L., Orr, G.B., M¨ uller, K.-R.: Efficient backprop. In: Neural Networks: Tricks of the Trade, (2012)

2012
[14]

In: Proceedings of the IEEE International Conference on Computer Vision, pp

He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

2015
[15]

Advances in Neural Information Processing Systems 31(2018)

Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: Convergence and gener- alization in neural networks. Advances in Neural Information Processing Systems 31(2018)

2018
[16]

Advances in neural information processing systems32(2019)

Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. Advances in neural information processing systems32(2019)

2019
[17]

In: Conference on Learning Theory, pp

Woodworth, B., Gunasekar, S., Lee, J.D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., Srebro, N.: Kernel and rich regimes in overparametrized models. In: Conference on Learning Theory, pp. 3635–3673 (2020). PMLR

2020
[18]

Journal of Machine Learning Research22(71), 1–47 (2021)

Luo, T., Xu, Z.-Q.J., Ma, Z., Zhang, Y.: Phase diagram for two-layer ReLU neural networks at infinite-width limit. Journal of Machine Learning Research22(71), 1–47 (2021)

2021
[19]

In: Advances in Neural Information Processing Systems (2022)

Zhou, H., Zhou, Q., Luo, T., Zhang, Y., Xu, Z.-Q.J.: Towards understanding the condensation of neural networks at initial training. In: Advances in Neural Information Processing Systems (2022)

2022
[20]

Advances in Neural Information Processing Systems37, 81157–81203 (2024)

Kunin, D., Ravent´ os, A., Domin´ e, C., Chen, F., Klindt, D., Saxe, A., Ganguli, S.: Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. Advances in Neural Information Processing Systems37, 81157–81203 (2024)

2024
[21]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

Zhang, Z., Lin, P., Wang, Z., Zhang, Y., Xu, Z.-Q.J.: Initialization is critical to whether transformers fit composite functions by reasoning or memorizing. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

2024
[22]

In: International Conference on Machine Learning, pp

Yao, J., Zhang, Z., Xu, Z.-Q.J.: An analysis for reasoning bias of language models with small initialization. In: International Conference on Machine Learning, pp. 71834–71864 (2025). PMLR 24

2025
[23]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Zhang, Z., Lin, P., Wang, Z., Zhang, Y., Xu, Z.-Q.J.: Complexity control facilitates reasoning-based compositional generalization in transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025
[24]

Layer Normalization

Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[25]

In: International Conference on Learning Representations (2024)

Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming lan- guage models with attention sinks. In: International Conference on Learning Representations (2024)

2024
[26]

Advances in Neural Information Processing Systems38, 100092–100118 (2025)

Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S.,et al.: Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. Advances in Neural Information Processing Systems38, 100092–100118 (2025)

2025
[27]

An overview of condensation phenomenon in deep learning

Xu, Z.-Q.J., Zhang, Y., Zhou, Z.: An overview of condensation phenomenon in deep learning. arXiv preprint arXiv:2504.09484 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Barbero, F., Arroyo, ´A., Gu, X., Perivolaropoulos, C., Bronstein, M., Veliˇ ckovi´ c, P., Pascanu, R.: Why do llms attend to the first token? (2025) arXiv:2504.02732

work page arXiv 2025
[29]

In: International Conference on Learning Representations, vol

Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. In: International Conference on Learning Representations, vol. 2025, pp. 97114–97144 (2025)

2025
[30]

In: Proceedings of the 41st International Conference on Machine Learning (2024)

Yu, Z., Wang, Z., Fu, Y., Shi, H., Shaikh, K., Lin, Y.C.: Unveiling and harnessing hidden attention sinks: enhancing large language models without training through attention calibration. In: Proceedings of the 41st International Conference on Machine Learning (2024)

2024
[31]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp

Joshi, M., Choi, E., Weld, D., Zettlemoyer, L.: TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611 (2017)

2017
[33]

Proceedings of the International Conference on Learning Representations (ICLR) (2021)

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., Steinhardt, J.: Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR) (2021)

2021
[34]

Proceedings of the International Conference on Learning Representations (ICLR) (2021) 25

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Stein- hardt, J.: Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021) 25

2021
[35]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: Hellaswag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)

2019
[36]

In: Findings of the Association for Computational Linguistics: ACL 2023, pp

Suzgun, M., Scales, N., Sch¨ arli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowd- hery, A., Le, Q., Chi, E., Zhou, D.,et al.: Challenging big-bench tasks and whether chain-of-thought can solve them. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051 (2023)

2023
[37]

Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social iqa: Commonsense reasoning about social interactions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463– 4473 (2019)

2019
[38]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[39]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[40]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

Chen, Z.-A., Luo, T.: From condensation to rank collapse: A two-stage analysis of transformer training dynamics. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

2026
[41]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1909
[43]

In: Interna- tional Conference on Learning Representations (2019)

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Interna- tional Conference on Learning Representations (2019)

2019
[44]

Zenodo (2024) 26

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: The Language Model Evaluation Harness. Zenodo (2024) 26

2024

[1] [1]

Advances in neural information processing systems33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A.,et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

1901

[2] [2]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[3] [3]

In: Advances in Neural Information Processing Systems (2022)

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J., Sifre, L.: Training compute optimal large language models. In: Advances in Neural Inf...

2022

[4] [4]

International Conference on Learning Representations (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. International Conference on Learning Representations (2022)

2022

[5] [5]

In: International Conference on Learning Representations, vol

Liu, H., Li, Z., Hall, D., Liang, P., Ma, T.: Sophia: A scalable stochastic second- order optimizer for language model pre-training. In: International Conference on Learning Representations, vol. 2024, pp. 1621–1650 (2024)

2024

[6] [6]

Nature645(8081), 633–638 (2025)

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P.,et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645(8081), 633–638 (2025)

2025

[7] [7]

Advances in neural information processing systems (2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems (2017)

2017

[8] [8]

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and Short Papers), pp. 4171–4186 (2019)

2019

[9] [9]

Journal of Machine Learning Research23(120), 1–39 (2022)

Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity. Journal of Machine Learning Research23(120), 1–39 (2022)

2022

[10] [10]

In: First Conference on Language Modeling (2024)

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024)

2024

[11] [11]

nature521(7553), 436–444 23 (2015)

LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature521(7553), 436–444 23 (2015)

2015

[12] [12]

In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp

Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010). JMLR Workshop and Conference Proceedings

2010

[13] [13]

In: Neural Networks: Tricks of the Trade, (2012)

LeCun, Y.A., Bottou, L., Orr, G.B., M¨ uller, K.-R.: Efficient backprop. In: Neural Networks: Tricks of the Trade, (2012)

2012

[14] [14]

In: Proceedings of the IEEE International Conference on Computer Vision, pp

He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)

2015

[15] [15]

Advances in Neural Information Processing Systems 31(2018)

Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: Convergence and gener- alization in neural networks. Advances in Neural Information Processing Systems 31(2018)

2018

[16] [16]

Advances in neural information processing systems32(2019)

Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. Advances in neural information processing systems32(2019)

2019

[17] [17]

In: Conference on Learning Theory, pp

Woodworth, B., Gunasekar, S., Lee, J.D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., Srebro, N.: Kernel and rich regimes in overparametrized models. In: Conference on Learning Theory, pp. 3635–3673 (2020). PMLR

2020

[18] [18]

Journal of Machine Learning Research22(71), 1–47 (2021)

Luo, T., Xu, Z.-Q.J., Ma, Z., Zhang, Y.: Phase diagram for two-layer ReLU neural networks at infinite-width limit. Journal of Machine Learning Research22(71), 1–47 (2021)

2021

[19] [19]

In: Advances in Neural Information Processing Systems (2022)

Zhou, H., Zhou, Q., Luo, T., Zhang, Y., Xu, Z.-Q.J.: Towards understanding the condensation of neural networks at initial training. In: Advances in Neural Information Processing Systems (2022)

2022

[20] [20]

Advances in Neural Information Processing Systems37, 81157–81203 (2024)

Kunin, D., Ravent´ os, A., Domin´ e, C., Chen, F., Klindt, D., Saxe, A., Ganguli, S.: Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. Advances in Neural Information Processing Systems37, 81157–81203 (2024)

2024

[21] [21]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

Zhang, Z., Lin, P., Wang, Z., Zhang, Y., Xu, Z.-Q.J.: Initialization is critical to whether transformers fit composite functions by reasoning or memorizing. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

2024

[22] [22]

In: International Conference on Machine Learning, pp

Yao, J., Zhang, Z., Xu, Z.-Q.J.: An analysis for reasoning bias of language models with small initialization. In: International Conference on Machine Learning, pp. 71834–71864 (2025). PMLR 24

2025

[23] [23]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Zhang, Z., Lin, P., Wang, Z., Zhang, Y., Xu, Z.-Q.J.: Complexity control facilitates reasoning-based compositional generalization in transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025

[24] [24]

Layer Normalization

Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[25] [25]

In: International Conference on Learning Representations (2024)

Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming lan- guage models with attention sinks. In: International Conference on Learning Representations (2024)

2024

[26] [26]

Advances in Neural Information Processing Systems38, 100092–100118 (2025)

Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S.,et al.: Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. Advances in Neural Information Processing Systems38, 100092–100118 (2025)

2025

[27] [27]

An overview of condensation phenomenon in deep learning

Xu, Z.-Q.J., Zhang, Y., Zhou, Z.: An overview of condensation phenomenon in deep learning. arXiv preprint arXiv:2504.09484 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Barbero, F., Arroyo, ´A., Gu, X., Perivolaropoulos, C., Bronstein, M., Veliˇ ckovi´ c, P., Pascanu, R.: Why do llms attend to the first token? (2025) arXiv:2504.02732

work page arXiv 2025

[29] [29]

In: International Conference on Learning Representations, vol

Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. In: International Conference on Learning Representations, vol. 2025, pp. 97114–97144 (2025)

2025

[30] [30]

In: Proceedings of the 41st International Conference on Machine Learning (2024)

Yu, Z., Wang, Z., Fu, Y., Shi, H., Shaikh, K., Lin, Y.C.: Unveiling and harnessing hidden attention sinks: enhancing large language models without training through attention calibration. In: Proceedings of the 41st International Conference on Machine Learning (2024)

2024

[31] [31]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp

Joshi, M., Choi, E., Weld, D., Zettlemoyer, L.: TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611 (2017)

2017

[33] [33]

Proceedings of the International Conference on Learning Representations (ICLR) (2021)

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., Steinhardt, J.: Aligning ai with shared human values. Proceedings of the International Conference on Learning Representations (ICLR) (2021)

2021

[34] [34]

Proceedings of the International Conference on Learning Representations (ICLR) (2021) 25

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Stein- hardt, J.: Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021) 25

2021

[35] [35]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: Hellaswag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)

2019

[36] [36]

In: Findings of the Association for Computational Linguistics: ACL 2023, pp

Suzgun, M., Scales, N., Sch¨ arli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowd- hery, A., Le, Q., Chi, E., Zhou, D.,et al.: Challenging big-bench tasks and whether chain-of-thought can solve them. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051 (2023)

2023

[37] [37]

Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social iqa: Commonsense reasoning about social interactions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463– 4473 (2019)

2019

[38] [38]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[39] [39]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[40] [40]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

Chen, Z.-A., Luo, T.: From condensation to rank collapse: A two-stage analysis of transformer training dynamics. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

2026

[41] [41]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1909

[43] [43]

In: Interna- tional Conference on Learning Representations (2019)

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Interna- tional Conference on Learning Representations (2019)

2019

[44] [44]

Zenodo (2024) 26

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: The Language Model Evaluation Harness. Zenodo (2024) 26

2024