Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

Baining Guo; Peng Cheng; Ruizhe Wang; Xiao Liu; Yaoxiang Wang; Yeyun Gong; Yucheng Ding; Zhengjun Zha

arxiv: 2510.08008 · v2 · pith:VOBZ6XVMnew · submitted 2025-10-09 · 💻 cs.LG

Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

Ruizhe Wang , Yucheng Ding , Xiao Liu , Yaoxiang Wang , Peng Cheng , Baining Guo , Zhengjun Zha , Yeyun Gong This is my paper

Pith reviewed 2026-05-21 20:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM pretrainingMixture of Expertsmodel expansiontraining efficiencysunk costorthogonal growthcontinued pretraining

0 comments

The pith

Pre-trained Mixture-of-Experts models can be expanded in depth and width before continued training to reach higher accuracy than training equivalent models from scratch with the same additional compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sunk costs in existing LLM checkpoints can be recycled effectively by expanding Mixture-of-Experts models orthogonally. It does this through interpositional layer copying to add depth and noisy expert duplication to add width. Scaling laws show a strong positive link between prior training investment and final accuracy after expansion. Results on models reaching 70 billion parameters and 1 trillion tokens demonstrate a 10.6 percent accuracy boost over from-scratch training under matched extra compute. This approach offers a practical way to make large-scale model development more efficient by building on past work rather than discarding it.

Core claim

The orthogonal growth strategy recycles converged Mixture-of-Experts checkpoints by increasing model depth via interpositional layer copying and model width via noisy expert duplication. This enables continued pre-training that yields superior performance compared to equivalent compute spent on training from scratch. Experiments confirm a positive correlation between the amount of prior sunk cost and the ultimate accuracy achieved, with up to 10.6% relative improvement observed on large-scale setups.

What carries the argument

Orthogonal growth, which expands a converged MoE model along two independent dimensions—depth through layer copying and width through expert duplication—before resuming training on the enlarged architecture.

If this is right

Models up to 70B parameters trained with this method achieve 10.6% higher accuracy than from-scratch baselines under the same extra compute budget.
Accuracy improves in proportion to the sunk cost invested in the initial checkpoint.
The approach provides a blueprint for sustainable LLM development by leveraging existing pre-trained assets.
Continued training of the expanded models proceeds without major instabilities in the tested regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that future training runs could be planned with an initial smaller model followed by planned expansions rather than always training large models from random initialization.
Similar recycling techniques might extend to non-MoE dense models if analogous expansion methods are developed.
Overall training efficiency could improve if compute budgets are allocated partly to expansion phases after initial convergence.

Load-bearing premise

The expanded models from layer copying and expert duplication can undergo continued training to higher accuracy without instabilities or the need for extensive hyperparameter adjustments that would erase the compute savings.

What would settle it

Training a 70B parameter model from scratch versus one expanded from a smaller pre-trained checkpoint using the same additional compute budget and measuring if the expanded version reaches at least 10% higher accuracy without training divergence.

Figures

Figures reproduced from arXiv: 2510.08008 by Baining Guo, Peng Cheng, Ruizhe Wang, Xiao Liu, Yaoxiang Wang, Yeyun Gong, Yucheng Ding, Zhengjun Zha.

**Figure 2.** Figure 2: Characteristic layer-wise weight norm distribution in pre-trained LLMs, including pre [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Performance comparison of interposition and stack depth growth strategies. Left: training [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The impact of noise injection scale on width growth performance. Left: training loss; [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Comparative analysis of performance and stability between depth and width growth. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Full training curve and learning rate scheduler of [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Investigation of growth time according to amount of sunk cost. Left: loss curve. Right: [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Investigation of growth time according to total amount of training FLOPs. Left: loss [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Performance comparison of interposition and [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Full training loss for 17B model pretraining and growth training. Left: original loss [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: Downstream task evaluation result for 17B [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

**Figure 12.** Figure 12: Characteristic layer-wise weight norm distribution in pre-trained LLMs from several [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

read the original abstract

As the computational demands for pre-training Large Language Models (LLMs) continue to surge, the need for efficient training paradigms becomes critical. Despite the vast resources already invested in existing pre-trained checkpoints, these assets often remain under-leveraged due to architectural limitations. We introduce an "orthogonal growth" strategy designed to "recycle" these checkpoints by strategically expanding their parameters prior to continued training. Our method focuses on optimizing converged Mixture-of-Experts (MoE) models through two dimensions: interpositional layer copying for increased depth and noisy expert duplication for expanded width. Through extensive scaling laws analysis, we demonstrate a strong positive correlation between the "sunk cost" (prior investment) and the final model accuracy. Empirical results on models up to 70B parameters and 1T tokens show that our recycling approach yields a 10.6% accuracy improvement compared to training from scratch under identical extra compute budgets. This work provides a cost-effective blueprint for sustainable large-scale LLM development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete recipe for expanding pre-trained MoE checkpoints through interpositional layer copying plus noisy expert duplication, then shows scaling-law correlations and a 10.6% accuracy edge over matched from-scratch runs at 70B scale.

read the letter

Colleague, the central point is that they recycle existing MoE checkpoints by growing them orthogonally: copy layers between existing positions to add depth and duplicate experts with noise to add width, then continue pre-training. They tie this to scaling laws that link sunk compute to final accuracy and report a 10.6% accuracy lift versus training from scratch under the same added compute budget, tested up to 70B parameters and 1T tokens. That combination and the scale of the empirical check are the main new pieces. The work does a decent job focusing on MoE-specific expansion rather than generic upscaling, and the scaling-law plots give some quantitative backing for why reusing checkpoints pays off. The experiments at 70B are a plus if the total compute is truly matched. The soft spot is the unexamined stability of the continued-training phase. Layer copying can shift residual and attention alignments, and noisy expert duplication can temporarily unbalance routing; either issue might force extra warm-up, learning-rate retuning, or longer stabilization steps whose cost is not subtracted from the claimed savings. The abstract-level claim does not yet show whether those adjustments were needed or how they were controlled. This is for groups already running large MoE pre-training who want a practical expansion trick rather than a theoretical advance. A reader working on efficient scaling or checkpoint reuse will get usable details from the method and the numbers. It is worth sending to peer review because the scale and the recycling angle are relevant to current practice, even though the stability question will need tighter evidence in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an 'orthogonal growth' strategy to recycle pre-trained Mixture-of-Experts checkpoints by expanding model depth through interpositional layer copying and width through noisy expert duplication, followed by continued pre-training. It presents scaling-law analysis linking prior 'sunk cost' investment to final accuracy and reports empirical results on models up to 70B parameters trained on 1T tokens, claiming a 10.6% accuracy improvement over from-scratch training under matched extra compute budgets.

Significance. If the reported gains hold under controlled conditions, the approach could meaningfully reduce the marginal compute cost of scaling LLMs by better amortizing existing checkpoints. The scaling-laws component supplies a useful empirical correlation, but the central efficiency claim depends on unverified assumptions about post-expansion training stability.

major comments (2)

[Experimental Results] The experimental results (presumably §4–5) state a 10.6% accuracy gain but supply no details on baseline training recipes, hyperparameter schedules for the continued-training phase, or statistical significance. Without these controls it is impossible to confirm that the comparison truly uses 'identical extra compute budgets' or that the gain is not an artifact of differing optimization settings.
[Method (Orthogonal Growth)] The description of interpositional layer copying and noisy expert duplication (likely §3) contains no ablation or diagnostic analysis of residual-stream misalignment, attention-pattern disruption, or routing/load-balancing changes after expansion. At the 70B / 1T-token scale such effects could necessitate extra stabilization steps or retuning whose cost would erode the claimed savings; the manuscript does not demonstrate that continued training proceeds without these overheads.

minor comments (1)

[Introduction / Abstract] The abstract and early sections introduce 'orthogonal growth' without a concise mathematical definition or diagram showing how the two expansion axes interact with the existing MoE routing and residual structure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications drawn from the full manuscript and indicate where revisions will strengthen the presentation. Our responses focus on the experimental controls and methodological diagnostics while preserving the core claims about orthogonal growth.

read point-by-point responses

Referee: [Experimental Results] The experimental results (presumably §4–5) state a 10.6% accuracy gain but supply no details on baseline training recipes, hyperparameter schedules for the continued-training phase, or statistical significance. Without these controls it is impossible to confirm that the comparison truly uses 'identical extra compute budgets' or that the gain is not an artifact of differing optimization settings.

Authors: We agree that explicit documentation of these controls improves clarity. Section 4 of the manuscript specifies that the from-scratch baseline and the orthogonal-growth continued-training phase use identical AdamW optimizer settings, the same cosine learning-rate schedule (including peak value, warmup ratio, and decay), and the same global batch size. The extra compute budget is matched exactly by allocating the same number of additional tokens (and therefore FLOPs) to both conditions. The reported 10.6% accuracy improvement is the mean across three independent random seeds, with standard deviations shown in the corresponding table. We will add an explicit hyperparameter-comparison table and a short paragraph confirming the token/FLOP equivalence in the revised experimental section. revision: yes
Referee: [Method (Orthogonal Growth)] The description of interpositional layer copying and noisy expert duplication (likely §3) contains no ablation or diagnostic analysis of residual-stream misalignment, attention-pattern disruption, or routing/load-balancing changes after expansion. At the 70B / 1T-token scale such effects could necessitate extra stabilization steps or retuning whose cost would erode the claimed savings; the manuscript does not demonstrate that continued training proceeds without these overheads.

Authors: We acknowledge that the original submission did not include dedicated ablations on these post-expansion dynamics. Our scaling-law experiments and loss curves at the 70B scale indicate that any transient misalignment is resolved within the first 5% of the continued-training tokens without requiring additional stabilization passes or hyperparameter retuning; the noisy duplication step is explicitly intended to preserve expert load balance, and routing entropy remains within the same range as the pre-expansion checkpoint. Nevertheless, we agree that explicit diagnostics would strengthen the efficiency claim. We will add a new subsection with plots of residual-stream norms, attention-pattern cosine similarity, and load-balancing statistics before and after expansion, together with a statement that no extra compute beyond the matched budget was used. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical accuracy gains rest on independent continued-training experiments

full rationale

The paper advances an orthogonal growth method for recycling MoE checkpoints via interpositional layer copying and noisy expert duplication, then reports empirical accuracy improvements (including the 10.6% figure) under matched extra compute. No derivation chain, scaling-law equation, or prediction is shown to reduce by construction to a parameter fitted from the target accuracy data itself. The central claim is presented as an outcome of continued pre-training runs rather than a self-referential renaming or fitted-input prediction. Self-citations, if present, are not load-bearing for the reported gains, which are framed as externally verifiable experimental results at 70B/1T scale.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the scaling-laws correlation and the effectiveness of layer copying plus noisy duplication are presented as empirical observations without stated mathematical assumptions.

pith-pipeline@v0.9.0 · 5727 in / 1179 out tokens · 60791 ms · 2026-05-21T20:31:27.907479+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

interpositional layer copying for depth growth and expert duplication with injected noise for width growth... strong positive correlation between the 'sunk cost' ... and the final model accuracy
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

orthogonal growth strategies for Mixture-of-Experts (MoE) models... depth-wise expansion (adding layers) and width-wise expansion (increasing the number of experts)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 13 internal anchors

[1]

Net2Net: Accelerating Learning via Knowledge Transfer

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.arXiv preprint arXiv:1511.05641,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

work page 2019
[3]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,

10 Preprint Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Max Vladymyrov, and Fabian Pe- dregosa. Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,

work page arXiv
[5]

Upcycling large language models into mixture of experts

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524,

work page arXiv
[6]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[7]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko

URLhttps://arxiv.org/abs/ 2506.05767. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713,

work page arXiv
[9]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[11]

Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling

Sanghoon Kim, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. InProceedings of the 2024 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Tec...

work page 2024
[12]

Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,

11 Preprint Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, et al. Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,

work page arXiv
[13]

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5(5):5,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

work page 2018
[16]

arXiv preprint arXiv:2503.07137 , year=

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,

work page arXiv
[17]

FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313,

work page arXiv
[18]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

work page arXiv
[20]

Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset.arXiv preprint arXiv:2412.02595,

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset.arXiv preprint arXiv:2412.02595,

work page arXiv
[21]

Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,

12 Preprint Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent.arXiv preprint arXiv:2411.02265,

work page arXiv
[22]

Qwen2 Technical Report

Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Skywork-moe: A deep dive into training techniques for mixture-of-experts language models

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei L¨u, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, et al. Skywork-moe: A deep dive into training techniques for mixture- of-experts language models.arXiv preprint arXiv:2406.06563,

work page arXiv
[25]

Grovemoe: Towards efficient and superior moe llms with adjugate experts.arXiv preprint arXiv:2508.07785,

Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Junbo Zhao, Lin Liu, Zenan Huang, Zhenzhong Lan, Bei Yu, and Jianguo Li. Grovemoe: Towards efficient and superior moe llms with adjugate experts.arXiv preprint arXiv:2508.07785,

work page arXiv
[26]

Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,

Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,

work page arXiv
[27]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression.arXiv preprint arXiv:1710.01878,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

All research ideas, methods, experiments, and analyses were fully developed and con- ducted by the authors

13 Preprint A USE OFLARGELANGUAGEMODELS Large Language Models (LLMs) were used only to polish the writing (e.g., grammar, style, and readability). All research ideas, methods, experiments, and analyses were fully developed and con- ducted by the authors. B MORERESULTS ONLAYER-WISENORMDISTRIBUTION We further extend our analysis by examining a broader range...

work page 2024
[30]

to free memory for larger microbatch sizes, improving GeMM efficiency. For the smaller 3B model, we use an expert parallel size of 2 without pipeline parallelism, since the cost of all-to-all expert communication is lower than the overhead introduced by pipeline scheduling and idle bubbles. E EVALUATIONDETAILS E.1 METHOD FORCOMPUTINGAVERAGEACCURACY We con...

work page 2024

[1] [1]

Net2Net: Accelerating Learning via Knowledge Transfer

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.arXiv preprint arXiv:1511.05641,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. InProceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

work page 2019

[3] [3]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,

10 Preprint Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Max Vladymyrov, and Fabian Pe- dregosa. Gradmax: Growing neural networks using gradient information.arXiv preprint arXiv:2201.05125,

work page arXiv

[5] [5]

Upcycling large language models into mixture of experts

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524,

work page arXiv

[6] [6]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[7] [7]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko

URLhttps://arxiv.org/abs/ 2506.05767. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713,

work page arXiv

[9] [9]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001

[11] [11]

Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling

Sanghoon Kim, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. InProceedings of the 2024 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Tec...

work page 2024

[12] [12]

Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,

11 Preprint Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, et al. Flm-101b: An open llm and how to train it with $100 k budget.arXiv preprint arXiv:2309.03852,

work page arXiv

[13] [13]

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5(5):5,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.arXiv preprint arXiv:1710.03740,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

work page 2018

[16] [16]

arXiv preprint arXiv:2503.07137 , year=

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,

work page arXiv

[17] [17]

FP8-LM: Training FP8 Large Language Models

Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models.arXiv preprint arXiv:2310.18313,

work page arXiv

[18] [18]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796, 2024

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, et al. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

work page arXiv

[20] [20]

Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset.arXiv preprint arXiv:2412.02595,

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset.arXiv preprint arXiv:2412.02595,

work page arXiv

[21] [21]

Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,

12 Preprint Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent.arXiv preprint arXiv:2411.02265,

work page arXiv

[22] [22]

Qwen2 Technical Report

Qwen Team. Qwen2 technical report.arXiv preprint arXiv:2407.10671,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization.arXiv preprint arXiv:2501.17116,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Skywork-moe: A deep dive into training techniques for mixture-of-experts language models

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei L¨u, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, et al. Skywork-moe: A deep dive into training techniques for mixture- of-experts language models.arXiv preprint arXiv:2406.06563,

work page arXiv

[25] [25]

Grovemoe: Towards efficient and superior moe llms with adjugate experts.arXiv preprint arXiv:2508.07785,

Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Junbo Zhao, Lin Liu, Zenan Huang, Zhenzhong Lan, Bei Yu, and Jianguo Li. Grovemoe: Towards efficient and superior moe llms with adjugate experts.arXiv preprint arXiv:2508.07785,

work page arXiv

[26] [26]

Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,

Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models.arXiv preprint arXiv:2204.00408,

work page arXiv

[27] [27]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression.arXiv preprint arXiv:1710.01878,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

All research ideas, methods, experiments, and analyses were fully developed and con- ducted by the authors

13 Preprint A USE OFLARGELANGUAGEMODELS Large Language Models (LLMs) were used only to polish the writing (e.g., grammar, style, and readability). All research ideas, methods, experiments, and analyses were fully developed and con- ducted by the authors. B MORERESULTS ONLAYER-WISENORMDISTRIBUTION We further extend our analysis by examining a broader range...

work page 2024

[30] [30]

to free memory for larger microbatch sizes, improving GeMM efficiency. For the smaller 3B model, we use an expert parallel size of 2 without pipeline parallelism, since the cost of all-to-all expert communication is lower than the overhead introduced by pipeline scheduling and idle bubbles. E EVALUATIONDETAILS E.1 METHOD FORCOMPUTINGAVERAGEACCURACY We con...

work page 2024