Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts
Pith reviewed 2026-05-21 20:31 UTC · model grok-4.3
The pith
Pre-trained Mixture-of-Experts models can be expanded in depth and width before continued training to reach higher accuracy than training equivalent models from scratch with the same additional compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The orthogonal growth strategy recycles converged Mixture-of-Experts checkpoints by increasing model depth via interpositional layer copying and model width via noisy expert duplication. This enables continued pre-training that yields superior performance compared to equivalent compute spent on training from scratch. Experiments confirm a positive correlation between the amount of prior sunk cost and the ultimate accuracy achieved, with up to 10.6% relative improvement observed on large-scale setups.
What carries the argument
Orthogonal growth, which expands a converged MoE model along two independent dimensions—depth through layer copying and width through expert duplication—before resuming training on the enlarged architecture.
If this is right
- Models up to 70B parameters trained with this method achieve 10.6% higher accuracy than from-scratch baselines under the same extra compute budget.
- Accuracy improves in proportion to the sunk cost invested in the initial checkpoint.
- The approach provides a blueprint for sustainable LLM development by leveraging existing pre-trained assets.
- Continued training of the expanded models proceeds without major instabilities in the tested regimes.
Where Pith is reading between the lines
- This suggests that future training runs could be planned with an initial smaller model followed by planned expansions rather than always training large models from random initialization.
- Similar recycling techniques might extend to non-MoE dense models if analogous expansion methods are developed.
- Overall training efficiency could improve if compute budgets are allocated partly to expansion phases after initial convergence.
Load-bearing premise
The expanded models from layer copying and expert duplication can undergo continued training to higher accuracy without instabilities or the need for extensive hyperparameter adjustments that would erase the compute savings.
What would settle it
Training a 70B parameter model from scratch versus one expanded from a smaller pre-trained checkpoint using the same additional compute budget and measuring if the expanded version reaches at least 10% higher accuracy without training divergence.
Figures
read the original abstract
As the computational demands for pre-training Large Language Models (LLMs) continue to surge, the need for efficient training paradigms becomes critical. Despite the vast resources already invested in existing pre-trained checkpoints, these assets often remain under-leveraged due to architectural limitations. We introduce an "orthogonal growth" strategy designed to "recycle" these checkpoints by strategically expanding their parameters prior to continued training. Our method focuses on optimizing converged Mixture-of-Experts (MoE) models through two dimensions: interpositional layer copying for increased depth and noisy expert duplication for expanded width. Through extensive scaling laws analysis, we demonstrate a strong positive correlation between the "sunk cost" (prior investment) and the final model accuracy. Empirical results on models up to 70B parameters and 1T tokens show that our recycling approach yields a 10.6% accuracy improvement compared to training from scratch under identical extra compute budgets. This work provides a cost-effective blueprint for sustainable large-scale LLM development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an 'orthogonal growth' strategy to recycle pre-trained Mixture-of-Experts checkpoints by expanding model depth through interpositional layer copying and width through noisy expert duplication, followed by continued pre-training. It presents scaling-law analysis linking prior 'sunk cost' investment to final accuracy and reports empirical results on models up to 70B parameters trained on 1T tokens, claiming a 10.6% accuracy improvement over from-scratch training under matched extra compute budgets.
Significance. If the reported gains hold under controlled conditions, the approach could meaningfully reduce the marginal compute cost of scaling LLMs by better amortizing existing checkpoints. The scaling-laws component supplies a useful empirical correlation, but the central efficiency claim depends on unverified assumptions about post-expansion training stability.
major comments (2)
- [Experimental Results] The experimental results (presumably §4–5) state a 10.6% accuracy gain but supply no details on baseline training recipes, hyperparameter schedules for the continued-training phase, or statistical significance. Without these controls it is impossible to confirm that the comparison truly uses 'identical extra compute budgets' or that the gain is not an artifact of differing optimization settings.
- [Method (Orthogonal Growth)] The description of interpositional layer copying and noisy expert duplication (likely §3) contains no ablation or diagnostic analysis of residual-stream misalignment, attention-pattern disruption, or routing/load-balancing changes after expansion. At the 70B / 1T-token scale such effects could necessitate extra stabilization steps or retuning whose cost would erode the claimed savings; the manuscript does not demonstrate that continued training proceeds without these overheads.
minor comments (1)
- [Introduction / Abstract] The abstract and early sections introduce 'orthogonal growth' without a concise mathematical definition or diagram showing how the two expansion axes interact with the existing MoE routing and residual structure.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below with clarifications drawn from the full manuscript and indicate where revisions will strengthen the presentation. Our responses focus on the experimental controls and methodological diagnostics while preserving the core claims about orthogonal growth.
read point-by-point responses
-
Referee: [Experimental Results] The experimental results (presumably §4–5) state a 10.6% accuracy gain but supply no details on baseline training recipes, hyperparameter schedules for the continued-training phase, or statistical significance. Without these controls it is impossible to confirm that the comparison truly uses 'identical extra compute budgets' or that the gain is not an artifact of differing optimization settings.
Authors: We agree that explicit documentation of these controls improves clarity. Section 4 of the manuscript specifies that the from-scratch baseline and the orthogonal-growth continued-training phase use identical AdamW optimizer settings, the same cosine learning-rate schedule (including peak value, warmup ratio, and decay), and the same global batch size. The extra compute budget is matched exactly by allocating the same number of additional tokens (and therefore FLOPs) to both conditions. The reported 10.6% accuracy improvement is the mean across three independent random seeds, with standard deviations shown in the corresponding table. We will add an explicit hyperparameter-comparison table and a short paragraph confirming the token/FLOP equivalence in the revised experimental section. revision: yes
-
Referee: [Method (Orthogonal Growth)] The description of interpositional layer copying and noisy expert duplication (likely §3) contains no ablation or diagnostic analysis of residual-stream misalignment, attention-pattern disruption, or routing/load-balancing changes after expansion. At the 70B / 1T-token scale such effects could necessitate extra stabilization steps or retuning whose cost would erode the claimed savings; the manuscript does not demonstrate that continued training proceeds without these overheads.
Authors: We acknowledge that the original submission did not include dedicated ablations on these post-expansion dynamics. Our scaling-law experiments and loss curves at the 70B scale indicate that any transient misalignment is resolved within the first 5% of the continued-training tokens without requiring additional stabilization passes or hyperparameter retuning; the noisy duplication step is explicitly intended to preserve expert load balance, and routing entropy remains within the same range as the pre-expansion checkpoint. Nevertheless, we agree that explicit diagnostics would strengthen the efficiency claim. We will add a new subsection with plots of residual-stream norms, attention-pattern cosine similarity, and load-balancing statistics before and after expansion, together with a statement that no extra compute beyond the matched budget was used. revision: partial
Circularity Check
No circularity: empirical accuracy gains rest on independent continued-training experiments
full rationale
The paper advances an orthogonal growth method for recycling MoE checkpoints via interpositional layer copying and noisy expert duplication, then reports empirical accuracy improvements (including the 10.6% figure) under matched extra compute. No derivation chain, scaling-law equation, or prediction is shown to reduce by construction to a parameter fitted from the target accuracy data itself. The central claim is presented as an outcome of continued pre-training runs rather than a self-referential renaming or fitted-input prediction. Self-citations, if present, are not load-bearing for the reported gains, which are framed as externally verifiable experimental results at 70B/1T scale.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
interpositional layer copying for depth growth and expert duplication with injected noise for width growth... strong positive correlation between the 'sunk cost' ... and the final model accuracy
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
orthogonal growth strategies for Mixture-of-Experts (MoE) models... depth-wise expansion (adding layers) and width-wise expansion (increasing the number of experts)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Net2Net: Accelerating Learning via Knowledge Transfer
Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.arXiv preprint arXiv:1511.05641,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...
work page 2019
-
[3]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,
10 Preprint Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Max Vladymyrov, and Fabian Pe- dregosa. Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,
-
[5]
Upcycling large language models into mixture of experts
Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524,
-
[6]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[7]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
URLhttps://arxiv.org/abs/ 2506.05767. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713,
-
[9]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[11]
Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling
Sanghoon Kim, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. InProceedings of the 2024 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Tec...
work page 2024
-
[12]
Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,
11 Preprint Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, et al. Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,
-
[13]
Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5(5):5,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,
work page 2018
-
[16]
arXiv preprint arXiv:2503.07137 , year=
Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,
-
[17]
FP8-LM: Training FP8 Large Language Models
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313,
-
[18]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,
-
[20]
Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset.arXiv preprint arXiv:2412.02595,
-
[21]
Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,
12 Preprint Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent.arXiv preprint arXiv:2411.02265,
-
[22]
Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Skywork-moe: A deep dive into training techniques for mixture-of-experts language models
Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei L¨u, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, et al. Skywork-moe: A deep dive into training techniques for mixture- of-experts language models.arXiv preprint arXiv:2406.06563,
-
[25]
Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Junbo Zhao, Lin Liu, Zenan Huang, Zhenzhong Lan, Bei Yu, and Jianguo Li. Grovemoe: Towards efficient and superior moe llms with adjugate experts.arXiv preprint arXiv:2508.07785,
-
[26]
Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,
Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,
-
[27]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
To prune, or not to prune: exploring the efficacy of pruning for model compression
Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression.arXiv preprint arXiv:1710.01878,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
13 Preprint A USE OFLARGELANGUAGEMODELS Large Language Models (LLMs) were used only to polish the writing (e.g., grammar, style, and readability). All research ideas, methods, experiments, and analyses were fully developed and con- ducted by the authors. B MORERESULTS ONLAYER-WISENORMDISTRIBUTION We further extend our analysis by examining a broader range...
work page 2024
-
[30]
to free memory for larger microbatch sizes, improving GeMM efficiency. For the smaller 3B model, we use an expert parallel size of 2 without pipeline parallelism, since the cost of all-to-all expert communication is lower than the overhead introduced by pipeline scheduling and idle bubbles. E EVALUATIONDETAILS E.1 METHOD FORCOMPUTINGAVERAGEACCURACY We con...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.