Setting the Record Straight on Transformer Oversmoothing

Gbètondji J-S Dovonon; Matt J. Kusner; Michael M. Bronstein

arxiv: 2401.04301 · v3 · pith:XQVQSU2Knew · submitted 2024-01-09 · 💻 cs.LG


classification 💻 cs.LG

keywords empirically, models, theoretically, transformers, weights, become, behavior, findings



Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, this work argues that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.
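
The dependence of smoothing on the eigenspectrum can be illustrated with a toy experiment. The sketch below is not the paper's setup: it assumes a linear residual update X ← X + A X W with a fixed row-stochastic attention matrix A and a symmetric matrix W standing in for the combined value and projection weights. These choices, and all names (`run_layers`, `make_W`, `mean_cosine_similarity`), are hypothetical simplifications for illustration. It compares the average pairwise cosine similarity between tokens after many layers when W has only positive eigenvalues and when it has only negative ones.

```python
import numpy as np


def mean_cosine_similarity(X):
    """Average pairwise cosine similarity between the rows (tokens) of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = X.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    return (S.sum() - n) / (n * (n - 1))


def run_layers(X, A, W, depth):
    """Apply the toy update X <- X + A X W `depth` times (assumed model)."""
    for _ in range(depth):
        X = X + A @ X @ W
        # Global rescaling only keeps numbers bounded; it does not
        # change cosine similarities between rows.
        X = X / np.linalg.norm(X)
    return X


rng = np.random.default_rng(0)
n_tokens, dim, depth = 8, 16, 200

# Fixed attention matrix: row-wise softmax of random scores (assumption).
scores = rng.normal(size=(n_tokens, n_tokens))
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)

# Symmetric W with a prescribed eigenspectrum: W = Q diag(eigs) Q^T.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))


def make_W(eigs):
    return Q @ np.diag(eigs) @ Q.T


W_pos = make_W(np.linspace(0.1, 0.9, dim))    # all eigenvalues positive
W_neg = make_W(-np.linspace(0.1, 0.9, dim))   # all eigenvalues negative

X0 = rng.normal(size=(n_tokens, dim))
sim_pos = mean_cosine_similarity(run_layers(X0, A, W_pos, depth))
sim_neg = mean_cosine_similarity(run_layers(X0, A, W_neg, depth))

print(f"positive eigenvalues: mean cosine similarity = {sim_pos:.4f}")
print(f"negative eigenvalues: mean cosine similarity = {sim_neg:.4f}")
```

In this toy model the update acts on each eigen-direction with a factor 1 + λ_A λ_W, where λ_A and λ_W are eigenvalues of A and W. With positive λ_W the largest factor belongs to the constant direction of A (λ_A = 1), so tokens are driven toward one another. With negative λ_W that direction is damped more strongly than the others, so token similarity does not approach 1. Real Transformers have input-dependent attention, nonlinearities, and layer normalization, so this is only a qualitative picture.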



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
cs.LG 2026-04 unverdicted novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.

Analogies between Transformer Layers and Power Method
cs.LG 2026-05 unverdicted novelty 6.0

Transformer layers are analogous to power method steps, tilting tokens toward the principal eigenvector of the output-value weight product, with stronger analytical and empirical alignment in shared-weight models and ...