arxiv: 2511.04981 · v2 · pith:NZ3U7IAKnew · submitted 2025-11-07 · 💻 cs.LG

Scaling depth capacity via zero/one-layer model expansion

classification 💻 cs.LG
keywords model · expansion · models · training · achieve · depth · learning · loss

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To train models efficiently at scale, progressive training (also known as model expansion) scales up model capacity during training and significantly reduces computation with little performance degradation. In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training to achieve an optimal tradeoff between computation and loss, with comprehensive ablations of our expansion strategy. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently achieve an $\approx 5\times$ acceleration, while attaining a loss comparable to that of a fully trained 60-layer model with 7B parameters, thus demonstrating a mixing behavior in terms of loss. Furthermore, scaling laws on LLAMA3 and DeepSeekV3 models show a $3\sim 5\times$ improvement in compute efficiency, with an increasing advantage at larger scales.
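
The abstract treats "saving $\approx 80\%$ compute" and "an $\approx 5\times$ acceleration" as equivalent. The short sketch below spells out that arithmetic, assuming "acceleration" means the ratio of baseline compute to progressive-training compute at comparable loss. The function names are hypothetical and are not taken from the paper.

```python
def speedup_from_savings(saved_fraction: float) -> float:
    """Speedup implied by saving a given fraction of compute.

    Assumption (not from the paper): speedup = baseline compute / reduced
    compute, so saving a fraction s leaves (1 - s) of the baseline.
    """
    if not 0.0 <= saved_fraction < 1.0:
        raise ValueError("saved_fraction must be in [0, 1)")
    return 1.0 / (1.0 - saved_fraction)


def savings_from_speedup(speedup: float) -> float:
    """Inverse mapping: fraction of compute saved for a given speedup."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # 80% savings corresponds to a 5x acceleration.
    print(f"{speedup_from_savings(0.80):.2f}x")
    # The 3x to 5x range quoted for the scaling laws, expressed as savings.
    for s in (3.0, 5.0):
        print(f"{s:.0f}x -> {savings_from_speedup(s):.1%} compute saved")
```

Under this reading, the $3\sim 5\times$ compute-efficiency range reported for the scaling laws corresponds to roughly 67% to 80% of compute saved.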

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  2. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.