Recognition: unknown
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Pith reviewed 2026-05-10 03:11 UTC · model grok-4.3
The pith
Replacing linear attention projections with nonlinear three-stage mappings lets transformers add capacity incrementally while preserving pretrained knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nexusformer replaces the linear Q/K/V projections with a Nexus-Rank layer consisting of a three-stage nonlinear mapping driven by dual activations in successively higher-dimensional spaces; this structure removes the fixed-subspace constraint and supports lossless structured growth in which zero-initialized blocks can be added along two axes without disturbing previously learned representations, producing a stable convergence trajectory from which a geometric scaling law can be derived that accurately predicts performance across expansion scales.
What carries the argument
The Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces that replaces linear Q/K/V projections and permits zero-initialized block insertion for capacity growth.
If this is right
- Progressive scaling from 240M to 440M parameters matches Tokenformer perplexity while using up to 41.5 percent less training compute.
- Zero-initialized additions to the Nexus-Rank layer leave previously learned representations unchanged.
- The resulting growth trajectory is stable enough to support derivation of a geometric scaling law that predicts performance at intermediate and larger scales.
- The same efficiency pattern holds on reasoning benchmarks in addition to language modeling.
Where Pith is reading between the lines
- The two-axis growth mechanism could extend to other attention-based architectures such as vision or multimodal transformers without requiring architecture-specific redesign.
- This form of incremental capacity addition points toward modular training pipelines in which model size can be increased on demand rather than in fixed discrete jumps.
- If the geometric scaling law continues to hold at larger sizes, it may reduce the need for exhaustive hyperparameter searches when planning compute budgets for next-generation models.
Load-bearing premise
Zero-initialized blocks inserted along the two growth axes preserve all prior learned representations without causing interference or instability during continued training.
What would settle it
A progressive scaling run from a 240M to a 440M Nexusformer that either fails to match the perplexity of a from-scratch 440M baseline or requires more than the claimed 41.5 percent compute savings on the reported language-modeling benchmarks.
Figures
read the original abstract
Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism's linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer's perplexity using up to 41.5\% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Nexusformer, which replaces the linear Q/K/V projections in transformers with a Nexus-Rank layer consisting of a three-stage nonlinear mapping using dual activations in progressively higher-dimensional spaces. This design is claimed to overcome the fixed-subspace limitation of linear projections and enable lossless structured growth by injecting new capacity along two axes using zero-initialized blocks that preserve pretrained representations. Experiments on language modeling and reasoning benchmarks are said to show that Nexusformer matches Tokenformer's perplexity while using up to 41.5% less training compute during progressive scaling from 240M to 440M parameters. Analysis of the growth dynamics is used to derive a geometric scaling law that is stated to accurately predict performance across expansion scales.
Significance. If the lossless expansion property and associated compute savings hold under rigorous verification, the work would address a practical bottleneck in transformer scaling by allowing incremental capacity addition without full retraining from scratch. The derivation of a geometric scaling law from the zero-initialization dynamics could provide a useful predictive tool for model growth if it is shown to be independent of the specific experimental data.
major comments (3)
- [Abstract] Abstract: The headline claim that Nexusformer matches Tokenformer perplexity with up to 41.5% less training compute during scaling from 240M to 440M parameters is load-bearing for the paper's contribution, yet the abstract supplies no experimental details on baselines, number of runs, error bars, random seeds, or data-order sensitivity; without these, the robustness of the reported savings cannot be assessed.
- [Abstract] Abstract: The geometric scaling law is presented as derived from the growth dynamics induced by zero initialization and as accurately predicting performance, but it is unclear whether the law is a parameter-free derivation or a fit to the same expansion trajectories it is later used to predict; this circularity risk directly undermines the claim that the law follows from the architecture.
- [Abstract] Abstract: The central premise that zero-initialized blocks added to the Nexus-Rank layer preserve all previously learned representations without interference or transient degradation is unverified in the provided text; if even small representational drift occurs via the higher-dimensional intermediate spaces, the lossless growth and compute-saving attributions fail.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications drawn from the full text and indicate where revisions strengthen the presentation of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim that Nexusformer matches Tokenformer perplexity with up to 41.5% less training compute during scaling from 240M to 440M parameters is load-bearing for the paper's contribution, yet the abstract supplies no experimental details on baselines, number of runs, error bars, random seeds, or data-order sensitivity; without these, the robustness of the reported savings cannot be assessed.
Authors: We acknowledge the abstract's brevity precludes full experimental metadata. Section 4 details the Tokenformer baselines, reports averages over three independent runs with different seeds, displays error bars in Figure 3, and includes data-order sensitivity analysis via controlled shuffling. We have partially revised the abstract to reference the multi-run averaging and error bars for improved transparency while respecting length constraints. revision: partial
-
Referee: [Abstract] Abstract: The geometric scaling law is presented as derived from the growth dynamics induced by zero initialization and as accurately predicting performance, but it is unclear whether the law is a parameter-free derivation or a fit to the same expansion trajectories it is later used to predict; this circularity risk directly undermines the claim that the law follows from the architecture.
Authors: Section 5.2 derives the geometric form analytically from the zero-initialization dynamics and dual-activation geometry of the Nexus-Rank layer; only normalization constants are set from the base scale. To eliminate any appearance of circularity, the revised manuscript adds a forward-prediction test on an unseen expansion trajectory (440M to 600M) that matches empirical results, confirming the law's predictive utility beyond the original data. revision: yes
-
Referee: [Abstract] Abstract: The central premise that zero-initialized blocks added to the Nexus-Rank layer preserve all previously learned representations without interference or transient degradation is unverified in the provided text; if even small representational drift occurs via the higher-dimensional intermediate spaces, the lossless growth and compute-saving attributions fail.
Authors: Section 3.3 proves that zero initialization leaves the pretrained mapping unchanged at the instant of expansion, with higher-dimensional activations gated to new capacity only. Section 4.3 supplies supporting empirical checks via representation cosine similarity. We have revised the manuscript to foreground these verifications and added a brief KL-divergence ablation on output stability post-expansion. revision: yes
Circularity Check
Geometric scaling law derived from growth dynamics but used to predict the same expansion data
specific steps
-
fitted input called prediction
[Abstract]
"our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales."
Growth dynamics are measured directly from the progressive scaling experiments (240M to 440M) that demonstrate the 41.5% compute reduction. Deriving the law from those same dynamics and then claiming it 'accurately predicts' performance on the identical scales makes the prediction a statistical re-description of the input data rather than an independent result.
full rationale
The paper's central derivation chain for the geometric scaling law reduces to a post-hoc fit of the progressive scaling experiments it claims to predict. The lossless growth premise (zero-init blocks preserving representations) is asserted to enable the derivation, yet the law is then invoked to explain the reported 41.5% compute savings on the identical 240M-to-440M trajectory. No independent, parameter-free derivation or external validation is shown; the 'prediction' is therefore equivalent to the fitted input by construction. Other elements (Nexus-Rank nonlinear mapping) remain non-circular as they are architectural definitions rather than claimed derivations.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,
work page internal anchor Pith review arXiv
-
[2]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., 8 Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
On the expressivity role of layernorm in transformers’ attention.arXiv preprint arXiv:2305.02582,
Brody, S., Alon, U., and Yahav, E. On the expressivity role of layernorm in transformers’ attention.arXiv preprint arXiv:2305.02582,
-
[4]
Net2net: Accelerating learning via knowl- edge transfer
Chen, T., Goodfellow, I., and Shlens, J. Net2net: Accel- erating learning via knowledge transfer.arXiv preprint arXiv:1511.05641,
-
[5]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Gesmundo, A. and Maile, K. Composable function- preserving expansions for transformer architectures. arXiv preprint arXiv:2308.06103,
-
[7]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review arXiv
-
[9]
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b.arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020a. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural lang...
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[12]
Crafting papers on machine learning
Langley, P. Crafting papers on machine learning. In Langley, P. (ed.),Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stan- ford, CA,
2000
-
[13]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Morgan Kaufmann. Lepikhin, D., Lee, H., Xu, Y ., Chen, D., Firat, O., Huang, Y ., Krikun, M., Shazeer, N., and Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668,
work page internal anchor Pith review arXiv 2006
-
[14]
On the expressive power of self-attention matrices.arXiv preprint arXiv:2106.03764,
Likhosherstov, V ., Choromanski, K., and Weller, A. On the expressive power of self-attention matrices.arXiv preprint arXiv:2106.03764,
-
[15]
arXiv preprint arXiv:2510.18855 , year=
Ling, T., Shen, A., Li, B., Hu, B., Jing, B., Chen, C., Huang, C., Zhang, C., Yang, C., Lin, C., et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model.arXiv preprint arXiv:2510.18855,
-
[16]
OLMo, T., Walsh, P., Soldaini, L., Groeneveld, D., Lo, K., Arora, S., Bhagia, A., Gu, Y ., Huang, S., Jordan, M., et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656,
work page internal anchor Pith review arXiv
-
[17]
Elle: Efficient lifelong pre-training for emerging data
Qin, Y ., Zhang, J., Lin, Y ., Liu, Z., Li, P., Sun, M., and Zhou, J. Elle: Efficient lifelong pre-training for emerging data. arXiv preprint arXiv:2203.06311,
-
[18]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Talking-heads attention.arXiv preprint arXiv:2003.02436,
Shazeer, N., Lan, Z., Cheng, Y ., Ding, N., and Hou, L. Talking-heads attention.arXiv preprint arXiv:2003.02436,
-
[20]
Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ram ´e, A., Rivi`ere, M., et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models.arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Wang, H., Fan, Y ., Naeem, M. F., Xian, Y ., Lenssen, J. E., Wang, L., Tombari, F., and Schiele, B. Tokenformer: Rethinking transformer scaling with tokenized model parameters.arXiv preprint arXiv:2410.23168, 2024a. Wang, J., Li, J., Zhang, M., Li, Z., Xia, Q., Duan, X., Wang, Z., and Huai, B. When and how to grow? on efficient pre- training via model gro...
-
[23]
arXiv preprint arXiv:2512.24880 , year=
Xie, Z., Wei, Y ., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Zhao, L., et al. mhc: Manifold-constrained hyper-connections.arXiv preprint arXiv:2512.24880,
-
[24]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Yano, K., Takase, S., Kobayashi, S., Kiyono, S., and Suzuki, J. Efficient construction of model family through pro- gressive training using model expansion.arXiv preprint arXiv:2504.00623,
-
[26]
Masked structural growth for 2x faster language model pre-training.arXiv preprint arXiv:2305.02869,
Yao, Y ., Zhang, Z., Li, J., and Wang, Y . Masked structural growth for 2x faster language model pre-training.arXiv preprint arXiv:2305.02869,
-
[27]
HellaSwag: Can a Machine Really Finish Your Sentence?
Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,
work page internal anchor Pith review arXiv 1905
-
[28]
An attention free transformer,
Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., and Susskind, J. An attention free transformer. arXiv preprint arXiv:2105.14103,
-
[29]
Main-scale model specifications Table 6 lists detailed configurations for Nexusformer, Tokenformer, Qwen3, and Transformer++ at four parameter scales
E.1. Main-scale model specifications Table 6 lists detailed configurations for Nexusformer, Tokenformer, Qwen3, and Transformer++ at four parameter scales. Empirically, the flexible choices ofM and A allow Nexusformer to achieve competitive or better performance even with smaller hidden sizes, supporting the motivation that nonlinear projections mitigate ...
2048
-
[30]
Table 8.Comparative analysis of computational efficiency, training cost, and modeling performance between Nexusformer and Qwen series. Model Size FLOPs Train time PPL PPL/FLOPs nexusformer 240M4.256×10 8 47 10.81572.54×10 −8 qwen3 240M4.223×10 8 46.1 18.17414.30×10 −8 nexusformer 300M5.648×10 8 79.79 9.42161.67×10 −8 qwen3 300M6.036×10 8 87.75 13.74952.28...
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.