DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
Impact of layer norm on memorization and generalization in transformers
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.