Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
Demonstrates that Transformers can continue learning when grown modularly above a frozen minimal token interface under a fixed active-parameter budget, with reported viability in 9-layer and 16-layer experiments.