arxiv: 2507.07129 · v3 · submitted 2025-07-08 · 💻 cs.LG · cs.CL

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Pith reviewed 2026-05-19 05:29 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: growing transformers, modular composition, frozen substrate, layer-wise expansion, continued learning, parameter efficiency, LoRA, decoder-only transformers

The pith

Transformers can grow in depth by training only new blocks on a frozen token interface while keeping active trainable parameters constant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether decoder-only Transformers can continue to improve by stacking additional layers without reopening or retraining earlier parts of the model. It maintains a roughly constant number of active trainable parameters as the model deepens, using a fixed token interface that remains frozen. This regime includes optional limited readjustments via LoRA under the same parameter budget. A sympathetic reader would care because it points to a potentially more efficient way to scale models incrementally rather than retraining large monolithic networks from scratch each time.

Core claim

In a constrained training regime for decoder-only Transformers, the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, new blocks are stacked and only the newest blocks and the LM head are trained; optional LoRA phases provide limited global readjustment under the same budget. In a 9-layer study the constructive model uses 105.0M active parameters versus 180.5M for the interface-matched monolithic baseline. Even with a frozen 16-dim binary token-ID code lifted to d_model, a 16-layer 269.7M model trained on 68.9B tokens reaches 28.92% MMLU after a final

What carries the argument

Layer-wise expansion on a frozen substrate, where new transformer blocks are added and trained while the token embedding matrix and all prior blocks remain fixed under a bounded active-parameter budget.
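
The growth rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the names `Block` and `GrowingStack` are hypothetical, and the step size of three blocks per stage is chosen only so that three stages end at nine layers. The frozen token interface and the always-trained LM head are left implicit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Stand-in for one Transformer block; only its training status is tracked."""
    index: int
    trainable: bool = True


@dataclass
class GrowingStack:
    """Hypothetical container; the frozen embedding and the LM head are implicit."""
    blocks: List[Block] = field(default_factory=list)

    def grow(self, n_new: int) -> None:
        # Freeze everything trained so far; earlier blocks are never reopened.
        for block in self.blocks:
            block.trainable = False
        start = len(self.blocks)
        # Stack the new blocks on top; only these (plus the LM head) are trained.
        self.blocks.extend(Block(index=start + i) for i in range(n_new))

    def trainable_indices(self) -> List[int]:
        return [b.index for b in self.blocks if b.trainable]


stack = GrowingStack()
history = []
for stage in range(3):  # assumed schedule: 3 stages of 3 blocks each
    stack.grow(n_new=3)
    history.append((len(stack.blocks), stack.trainable_indices()))

for depth, trainable in history:
    print(f"depth={depth} trainable blocks={trainable}")
```

Depth grows from 3 to 9 while the number of trainable blocks stays at three per stage, which is the sense in which the active set is bounded in this sketch.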

Load-bearing premise

Restricting training to only the newest blocks plus optional LoRA under a constant active-parameter budget is sufficient to produce useful continued learning without reopening or retraining earlier dense blocks.

What would settle it

A controlled experiment training both the grown model and an equal-sized monolithic model on the exact same data mixture throughout, then comparing final perplexity and MMLU scores.

Figures

Figures reproduced from arXiv: 2507.07129 by A. Bochkov.

Figure 1: Training dynamics for the 'best_bvv_moe' model. The low starting loss indicates successful
Figure 2: Training dynamics for the 'max_bvv_moe' model, mirroring the successful convergence pattern
Figure 3: Performance comparison: The merged MoE model (best_bvv_moe) demonstrates synergistic
Figure 4: Training dynamics during progressive layer-wise growth. Each loss spike marks the stacking of a
Figure 5: Benchmark performance as a function of model depth. Note the significant jump in SQuAD
Figure 6: MMLU performance on select subjects as a function of model depth, illustrating how different

Original abstract

We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, we stack new blocks and train only the newest blocks and the LM head; optional LoRA phases provide limited global readjustment under the same active-parameter budget. The paper asks a feasibility/tradeoff question, not whether this regime matches tuned monolithic pretraining. In a common-protocol 9-layer study on a frozen Unicode substrate, the constructive frozen-Unicode model uses 105.0M active trainable parameters, compared with 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline. We then consider an extreme fixed interface: each token is represented only by a frozen 16-dim binary token-ID code, deterministically lifted to d_model, so the resulting token embedding matrix has rank at most 16. Even in this setting, continued growth remains viable. In a 68.9B-token run on FineWeb-Edu + Cosmopedia, a 16-layer 269.7M model trained above this fixed interface reaches 28.92% MMLU after an interleaved LoRA stage. Reported final metrics are measured after merging the last-stage LoRA adapters into the 269.7M base model. Because the data mixture changes across stages in this long-horizon run, we interpret it as a viability demonstration rather than a clean causal comparison. Overall, the evidence supports a narrow claim: useful continued learning can proceed above a frozen minimal interface under a bounded active trainable-parameter budget, with a clear tradeoff against dense monolithic training in final perplexity.
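
The rank bound stated in the abstract can be checked numerically with a small NumPy sketch. Two things here are assumptions rather than details from the paper: the vocabulary size of 2^16 and the lifting rule, which is taken to be simple tiling of the 16 bits up to a hypothetical d_model of 64. The abstract only says the lift is deterministic; any fixed linear lift gives the same bound. The function names are illustrative.

```python
import numpy as np


def binary_token_codes(vocab_size: int, n_bits: int = 16) -> np.ndarray:
    """Row i holds the n_bits-bit binary expansion of token ID i (least significant bit first)."""
    if vocab_size > 2 ** n_bits:
        raise ValueError("vocab_size does not fit in n_bits")
    ids = np.arange(vocab_size, dtype=np.int64)[:, None]
    shifts = np.arange(n_bits, dtype=np.int64)[None, :]
    return ((ids >> shifts) & 1).astype(np.float64)


def lift_by_tiling(codes: np.ndarray, d_model: int) -> np.ndarray:
    """Assumed lift: repeat the code along the feature axis until it has d_model columns."""
    n_bits = codes.shape[1]
    if d_model % n_bits != 0:
        raise ValueError("d_model must be a multiple of the code width")
    return np.tile(codes, (1, d_model // n_bits))


codes = binary_token_codes(vocab_size=2 ** 16)  # hypothetical vocabulary size
embedding = lift_by_tiling(codes, d_model=64)   # hypothetical d_model
rank = int(np.linalg.matrix_rank(embedding))

print(f"embedding shape: {embedding.shape}, rank: {rank}")
```

Because every column of the lifted matrix is a copy of one of the 16 code columns, the rank cannot exceed 16 however large d_model is made.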

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies a constrained regime for decoder-only Transformers in which the token interface remains fixed, prior dense blocks stay frozen, and the active trainable parameter count is held approximately constant while depth grows by stacking and training only the newest blocks (plus optional LoRA under the same budget). It reports a 9-layer study on a frozen Unicode substrate (105M active parameters vs. 180.5M and 247.6M monolithic baselines) and an extreme 16-layer run on 68.9B tokens with a frozen 16-dim binary token embedding (rank at most 16) that reaches 28.92% MMLU after LoRA merging. The central claim is framed as a viability demonstration rather than a clean causal comparison, given explicit data-mixture changes across stages.
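
The LoRA merging step mentioned in the summary relies on the standard identity from the LoRA paper: a low-rank update B A, scaled by alpha / r, can be folded into the frozen weight without changing the layer's output. The NumPy sketch below checks that identity for a single linear layer. All sizes and the value of alpha are hypothetical, and nothing here reflects the paper's actual adapter configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 32, 48, 4, 8.0  # hypothetical sizes
scale = alpha / rank

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(rank, d_in))   # LoRA down-projection
B = rng.normal(size=(d_out, rank))  # LoRA up-projection (nonzero, as after training)
x = rng.normal(size=(5, d_in))      # batch of 5 input vectors

# Adapter kept separate: base path plus scaled low-rank path.
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Adapter merged into the base weight: W' = W + (alpha / r) * B A.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

max_abs_diff = float(np.max(np.abs(y_adapter - y_merged)))
print(f"max abs difference: {max_abs_diff:.2e}")
```

After merging, the layer has the same shape and parameter count as the base layer, which is why final metrics can be reported on a model of unchanged total size.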

Significance. If the results hold, the work shows that continued learning remains viable above a frozen minimal interface under a bounded active-parameter budget, with a measurable perplexity tradeoff versus dense monolithic training. Concrete strengths include the reported parameter counts, the large-token extreme-interface experiment, and the explicit acknowledgment that data changes limit causal claims. This supplies empirical grounding for modular growth strategies that avoid reopening earlier layers.

major comments (2)
  1. [Long-horizon experiment] Long-horizon run (68.9B tokens, 16-layer model): the viability claim rests on the final 28.92% MMLU after LoRA merging, yet the absence of an ablation that holds data mixture fixed while varying only the growth mechanism (new blocks + LoRA vs. continued training of the base) leaves the contribution of the modular construction under-specified for the central feasibility argument.
  2. [9-layer study] 9-layer study: the 105.0M active-parameter figure is compared to the 180.5M and 247.6M baselines, but the manuscript does not provide an explicit per-stage breakdown confirming that the active budget remains constant after each stacking step and after LoRA merging; this detail is load-bearing for the “bounded active trainable-parameter budget” claim.
minor comments (2)
  1. [Abstract] Abstract: the total parameter count of the final 269.7M model is stated but not contrasted with the 105M active figure in the same sentence; adding this contrast would clarify the active-vs-total distinction for readers.
  2. [Notation and terminology] Notation: ensure “active trainable parameters” is used uniformly when describing both the constructive model and the LoRA phases to prevent conflation with total model size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We respond to each major comment below. Where appropriate, we have revised the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Long-horizon experiment] Long-horizon run (68.9B tokens, 16-layer model): the viability claim rests on the final 28.92% MMLU after LoRA merging, yet the absence of an ablation that holds data mixture fixed while varying only the growth mechanism (new blocks + LoRA vs. continued training of the base) leaves the contribution of the modular construction under-specified for the central feasibility argument.

    Authors: We agree that the absence of a data-mixture-fixed ablation limits the ability to isolate the contribution of the modular growth. As stated in the manuscript, the long-horizon experiment is intended as a viability demonstration given the data changes across stages. We have updated the text to more explicitly highlight this limitation and the scope of our claims. Performing the suggested ablation would require significant additional resources and is planned for future work.
    Revision: partial

  2. Referee: [9-layer study] 9-layer study: the 105.0M active-parameter figure is compared to the 180.5M and 247.6M baselines, but the manuscript does not provide an explicit per-stage breakdown confirming that the active budget remains constant after each stacking step and after LoRA merging; this detail is load-bearing for the “bounded active trainable-parameter budget” claim.

    Authors: Thank you for this observation. We will incorporate an explicit per-stage breakdown of the active trainable parameters in the revised manuscript. This will include details after each stacking step and following LoRA merging to confirm that the active budget is maintained approximately constant, thereby strengthening support for the bounded active-parameter claim.
    Revision: yes
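
A per-stage breakdown of the kind requested in the second comment could be tabulated along the following lines. This is a rough sketch with hypothetical settings (d_model, vocabulary size, two new blocks per stage, LoRA rank 16 on four attention projections per block) and the common 12 * d_model^2 approximation for a dense block. It is not intended to reproduce the paper's 105.0M or 269.7M figures.

```python
def block_params(d_model: int) -> int:
    """Rough dense block size: 4*d^2 (attention) + 8*d^2 (4x MLP); biases and norms ignored."""
    return 12 * d_model * d_model


def lora_params(d_model: int, rank: int, n_matrices: int = 4) -> int:
    """LoRA on n_matrices square d x d projections: each adds rank * (d + d) weights."""
    return n_matrices * rank * 2 * d_model


def stage_table(d_model: int, vocab_size: int, n_new: int, n_stages: int, lora_rank: int):
    head = d_model * vocab_size  # LM head, assumed trained at every growth stage
    rows = []
    for stage in range(1, n_stages + 1):
        depth = stage * n_new
        rows.append({
            "stage": stage,
            "depth": depth,
            # Growth phase: only the newest blocks and the LM head are trainable.
            "active_growth": n_new * block_params(d_model) + head,
            # LoRA phase (assumed): adapters on every block, nothing else trainable.
            "active_lora": depth * lora_params(d_model, lora_rank),
            # Total counts blocks and head; the frozen token code has no learned weights.
            "total": depth * block_params(d_model) + head,
        })
    return rows


rows = stage_table(d_model=1024, vocab_size=65536, n_new=2, n_stages=3, lora_rank=16)
for r in rows:
    print(f"stage {r['stage']}: depth={r['depth']} "
          f"active_growth={r['active_growth']:,} "
          f"active_lora={r['active_lora']:,} total={r['total']:,}")
```

Under these assumptions the growth-phase active count is identical at every stage while the total rises with depth, and the LoRA-phase count stays well below the growth-phase budget.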

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper is an empirical feasibility study of continued training under a frozen-substrate regime with bounded active parameters. All central claims rest on reported training runs, explicit parameter budgets (e.g., 105.0M vs. 180.5M), and final metrics (MMLU, perplexity) measured after LoRA merging. No derivation chain, first-principles prediction, or equation is present that reduces by construction to fitted inputs or self-citations; the work explicitly frames itself as a viability demonstration rather than a causal comparison or theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard Transformer architecture and training assumptions; beyond the single domain assumption listed below, no new free parameters, axioms, or invented entities are introduced or fitted in the reported abstract.

axioms (1)
  • Domain assumption: Decoder-only Transformer blocks can be stacked and trained incrementally while keeping earlier blocks frozen.
    Invoked by the constructive growth procedure described in the abstract.



