Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.
A multiscale visualization of attention in the transformer model
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
stat.ML 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
Analytical theory of signal propagation in deep transformers at initialization yields quantitative prescriptions for weights and residuals to avoid rank and entropy collapse via Random Energy Model analogy.