Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1, method 1
years: 2026 (4)
Representative citing papers
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
DyT improves validation loss by 27% at 64M parameters and 1M tokens but worsens it by 19% at 118M tokens, with saturation levels predicting the sign of the effect.