Masked structural growth for 2x faster language model pre-training. arXiv preprint arXiv:2305.02869

Yao, Y · arXiv 2305.02869

1 Pith paper cites this work. Polarity classification is still being indexed.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

citing papers explorer

Showing 1 of 1 citing paper.

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

cs.LG · 2026-04-21 · unverdicted · none · ref 26
