Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Pith reviewed 2026-05-19 05:29 UTC · model grok-4.3
The pith
Transformers can grow in depth by training only new blocks on a frozen token interface while keeping active trainable parameters constant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a constrained training regime for decoder-only Transformers, the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, new blocks are stacked and only the newest blocks and the LM head are trained; optional LoRA phases provide limited global readjustment under the same budget. In a 9-layer study the constructive model uses 105.0M active parameters versus 180.5M for the interface-matched monolithic baseline. Even with a frozen 16-dim binary token-ID code lifted to d_model, a 16-layer 269.7M model trained on 68.9B tokens reaches 28.92% MMLU after a final
What carries the argument
Layer-wise expansion on a frozen substrate, where new transformer blocks are added and trained while the token embedding matrix and all prior blocks remain fixed under a bounded active-parameter budget.
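The bookkeeping behind this regime can be sketched in a few lines. The snippet below is an illustration only, not the paper's training code: the block, head, and embedding sizes are hypothetical round numbers, and the `Block`, `grow`, `active_params`, and `total_params` names are invented for this sketch. It shows how freezing all earlier blocks before each stacking step keeps the active trainable count flat while total size grows.

```python
from dataclasses import dataclass


@dataclass
class Block:
    n_params: int
    trainable: bool = True


def grow(blocks, n_new, params_per_block):
    """Freeze every existing block, then stack n_new trainable blocks on top."""
    for block in blocks:
        block.trainable = False
    blocks.extend(Block(params_per_block) for _ in range(n_new))
    return blocks


def active_params(blocks, head_params):
    """Trainable parameters: the LM head plus any unfrozen blocks."""
    return head_params + sum(b.n_params for b in blocks if b.trainable)


def total_params(blocks, head_params, embed_params):
    """All parameters, including the frozen token interface."""
    return embed_params + head_params + sum(b.n_params for b in blocks)


# Hypothetical sizes for illustration only (not the paper's configuration).
PARAMS_PER_BLOCK = 12_000_000
HEAD_PARAMS = 30_000_000
EMBED_PARAMS = 30_000_000  # frozen token interface, never counted as active

blocks = []
history = []  # (depth, active, total) after each growth stage
for stage in range(4):
    grow(blocks, n_new=2, params_per_block=PARAMS_PER_BLOCK)
    history.append(
        (
            len(blocks),
            active_params(blocks, HEAD_PARAMS),
            total_params(blocks, HEAD_PARAMS, EMBED_PARAMS),
        )
    )

for depth, active, total in history:
    print(f"depth={depth:2d} active={active / 1e6:6.1f}M total={total / 1e6:6.1f}M")
```

Under these assumed sizes the active count is identical at every stage while the total grows linearly with depth, which is the active-versus-total distinction the review turns on.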
Load-bearing premise
Restricting training to only the newest blocks plus optional LoRA under a constant active-parameter budget is sufficient to produce useful continued learning without reopening or retraining earlier dense blocks.
What would settle it
A controlled experiment training both the grown model and an equal-sized monolithic model on the exact same data mixture throughout, then comparing final perplexity and MMLU scores.
Original abstract
We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, we stack new blocks and train only the newest blocks and the LM head; optional LoRA phases provide limited global readjustment under the same active-parameter budget. The paper asks a feasibility/tradeoff question, not whether this regime matches tuned monolithic pretraining. In a common-protocol 9-layer study on a frozen Unicode substrate, the constructive frozen-Unicode model uses 105.0M active trainable parameters, compared with 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline. We then consider an extreme fixed interface: each token is represented only by a frozen 16-dim binary token-ID code, deterministically lifted to d_model, so the resulting token embedding matrix has rank at most 16. Even in this setting, continued growth remains viable. In a 68.9B-token run on FineWeb-Edu + Cosmopedia, a 16-layer 269.7M model trained above this fixed interface reaches 28.92% MMLU after an interleaved LoRA stage. Reported final metrics are measured after merging the last-stage LoRA adapters into the 269.7M base model. Because the data mixture changes across stages in this long-horizon run, we interpret it as a viability demonstration rather than a clean causal comparison. Overall, the evidence supports a narrow claim: useful continued learning can proceed above a frozen minimal interface under a bounded active trainable-parameter budget, with a clear tradeoff against dense monolithic training in final perplexity.
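The rank bound on the binary interface can be checked with a small example. The sketch below makes several assumptions: the abstract does not say how the 16-bit code is lifted to d_model, so the `lift` function here simply tiles the code (one possible fixed linear map); `D_MODEL = 64` is a hypothetical width chosen to keep the example fast; and all function names are invented for illustration. Because any fixed linear lift of a 16-dim code cannot raise the rank above 16, the conclusion does not depend on the tiling choice.

```python
from fractions import Fraction

CODE_BITS = 16  # from the abstract: a frozen 16-dim binary token-ID code


def binary_code(token_id, bits=CODE_BITS):
    """Return the token ID as a list of 0/1 bits, least significant bit first."""
    if not 0 <= token_id < 2 ** bits:
        raise ValueError("token_id does not fit in the code")
    return [(token_id >> i) & 1 for i in range(bits)]


def lift(code, d_model):
    """Assumed lift (not from the paper): tile the code to fill d_model dims."""
    return [code[i % len(code)] for i in range(d_model)]


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


D_MODEL = 64  # hypothetical width, kept small so the example runs quickly

# Include every power of two so that all 16 code dimensions are exercised.
token_ids = [1 << i for i in range(CODE_BITS)] + list(range(200))
embedding = [lift(binary_code(t), D_MODEL) for t in token_ids]
rank = matrix_rank(embedding)
print(f"{len(embedding)} x {D_MODEL} embedding matrix, rank {rank}")
```

A 16-bit code can distinguish at most 65,536 token IDs, so distinct tokens still receive distinct embeddings even though the matrix is low-rank.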
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies a constrained regime for decoder-only Transformers in which the token interface remains fixed, prior dense blocks stay frozen, and the active trainable parameter count is held approximately constant while depth grows by stacking and training only the newest blocks (plus optional LoRA under the same budget). It reports a 9-layer study on a frozen Unicode substrate (105M active parameters vs. 180.5M and 247.6M monolithic baselines) and an extreme 16-layer run on 68.9B tokens with a frozen 16-dim binary token embedding (rank at most 16) that reaches 28.92% MMLU after LoRA merging. The central claim is framed as a viability demonstration rather than a clean causal comparison, given explicit data-mixture changes across stages.
Significance. If the results hold, the work shows that continued learning remains viable above a frozen minimal interface under a bounded active-parameter budget, with a measurable perplexity tradeoff versus dense monolithic training. Concrete strengths include the reported parameter counts, the large-token extreme-interface experiment, and the explicit acknowledgment that data changes limit causal claims. This supplies empirical grounding for modular growth strategies that avoid reopening earlier layers.
major comments (2)
- [Long-horizon experiment] Long-horizon run (68.9B tokens, 16-layer model): the viability claim rests on the final 28.92% MMLU after LoRA merging, yet the absence of an ablation that holds data mixture fixed while varying only the growth mechanism (new blocks + LoRA vs. continued training of the base) leaves the contribution of the modular construction under-specified for the central feasibility argument.
- [9-layer study] 9-layer study: the 105.0M active-parameter figure is compared to the 180.5M and 247.6M baselines, but the manuscript does not provide an explicit per-stage breakdown confirming that the active budget remains constant after each stacking step and after LoRA merging; this detail is load-bearing for the “bounded active trainable-parameter budget” claim.
minor comments (2)
- [Abstract] Abstract: the total parameter count of the final 269.7M model is stated but not contrasted with the 105M active figure in the same sentence; adding this contrast would clarify the active-vs-total distinction for readers.
- [Notation and terminology] Notation: ensure “active trainable parameters” is used uniformly when describing both the constructive model and the LoRA phases to prevent conflation with total model size.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We respond to each major comment below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: [Long-horizon experiment] Long-horizon run (68.9B tokens, 16-layer model): the viability claim rests on the final 28.92% MMLU after LoRA merging, yet the absence of an ablation that holds data mixture fixed while varying only the growth mechanism (new blocks + LoRA vs. continued training of the base) leaves the contribution of the modular construction under-specified for the central feasibility argument.
  Authors: We agree that the absence of a data-mixture-fixed ablation limits the ability to isolate the contribution of the modular growth. As stated in the manuscript, the long-horizon experiment is intended as a viability demonstration given the data changes across stages. We have updated the text to more explicitly highlight this limitation and the scope of our claims. Performing the suggested ablation would require significant additional resources and is planned for future work.
  Revision: partial
- Referee: [9-layer study] 9-layer study: the 105.0M active-parameter figure is compared to the 180.5M and 247.6M baselines, but the manuscript does not provide an explicit per-stage breakdown confirming that the active budget remains constant after each stacking step and after LoRA merging; this detail is load-bearing for the “bounded active trainable-parameter budget” claim.
  Authors: Thank you for this observation. We will incorporate an explicit per-stage breakdown of the active trainable parameters in the revised manuscript. This will include details after each stacking step and following LoRA merging to confirm that the active budget is maintained approximately constant, thereby strengthening support for the bounded active-parameter claim.
  Revision: yes
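Why LoRA merging leaves the base parameter count unchanged can be illustrated with the standard LoRA update, W' = W + (alpha / r) · B · A. The sketch below is a toy example in plain Python, not the authors' code: the matrices, the `alpha` value, and all function names are hypothetical. It checks that the merged weight reproduces the adapter's output and that the trainable adapter is much smaller than the dense matrix it adjusts.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def lora_merge(W, A, B, alpha):
    """Fold a LoRA adapter into the dense weight: W + (alpha / r) * B @ A."""
    scale = alpha / len(A)  # r is the number of rows of A
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_trainable_params(d_in, d_out, r):
    """Adapter parameters for one matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Toy values (hypothetical): d_out = 3, d_in = 4, rank r = 2.
W = [[1, 0, 2, -1], [0, 3, 1, 0], [2, 1, 0, 1]]  # frozen dense weight
A = [[1, 0, -1, 2], [0, 1, 1, 0]]                # trainable, r x d_in
B = [[1, 0], [0, 2], [-1, 1]]                    # trainable, d_out x r
alpha = 4
x = [1, 2, 3, 4]

scale = alpha / len(A)
adapter_out = [
    base + scale * low_rank
    for base, low_rank in zip(matvec(W, x), matvec(B, matvec(A, x)))
]
merged = lora_merge(W, A, B, alpha)
merged_out = matvec(merged, x)

print(adapter_out == merged_out)
print(lora_trainable_params(1024, 1024, 8), "adapter params vs", 1024 * 1024, "dense")
```

The merged matrix has the same shape as W, so merging adds nothing to the total; a per-stage breakdown of the kind the referee requests would therefore count adapter parameters as active only while a LoRA phase is running.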
Circularity Check
No significant circularity identified
The paper is an empirical feasibility study of continued training under a frozen-substrate regime with bounded active parameters. All central claims rest on reported training runs, explicit parameter budgets (e.g., 105.0M vs. 180.5M), and final metrics (MMLU, perplexity) measured after LoRA merging. No derivation chain, first-principles prediction, or equation is present that reduces by construction to fitted inputs or self-citations; the work explicitly frames itself as a viability demonstration rather than a causal comparison or theoretical derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Decoder-only Transformer blocks can be stacked and trained incrementally while keeping earlier blocks frozen.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: progressive layer-wise growth... train only the newest blocks and the LM head; optional LoRA phases... frozen Unicode substrate... fixed 16-dim binary token-ID code
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: logit averaging... seamless modular composition... emergent property of model depth
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.