Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
Masked structural growth for 2x faster language model pre-training.arXiv preprint arXiv:2305.02869
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.