Stronger normalization-free transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu · 2025 · arXiv 2512.10938

3 Pith papers cite this work. Polarity classification is still in progress.

representative citing papers

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Transformers that use elementwise tanh-like nonlinearities instead of LayerNorm show stretched-exponential growth of the averaged partial Jacobian norm (APJN) at large depth, indicating subcritical signal propagation, unlike the power-law growth in pre-LayerNorm designs.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

cs.LG · 2026-04-23 · conditional · novelty 6.0

Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

Dynamic Tanh (DyT) lowers validation loss by 27% at 64M params/1M tokens but raises it by 19% at 118M tokens, with saturation levels predicting the sign of the effect.

citing papers explorer

Showing 3 of 3 citing papers.

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers · cs.LG · 2026-04-13 · unverdicted · none · ref 1
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning · cs.LG · 2026-04-23 · conditional · none · ref 12
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer · cs.LG · 2026-04-25 · unverdicted · none · ref 7
