Stronger normalization-free transformers
3 Pith papers cite this work.
Fields: cs.LG (3)
Years: 2026 (3)
Citing papers
- Subcritical Signal Propagation at Initialization in Normalization-Free Transformers
  Transformers using elementwise tanh-like nonlinearities instead of LayerNorm show stretched-exponential APJN growth at large depth, indicating subcritical signal propagation unlike the power-law growth in pre-LayerNorm designs.
- Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
  Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
- When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
  DyT improves validation loss by 27% at 64M params/1M tokens but worsens it by 19% at 118M tokens, with saturation levels predicting the sign of the effect.