Vision Transformers Need Better Token Interaction
Pith reviewed 2026-05-25 04:43 UTC · model grok-4.3
The pith
Vision transformers suffer from semantic diffusion, which degrades patch tokens for dense tasks; the paper counters it by switching to entmax-1.5 attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic diffusion, an optimization shortcut that lets global semantic information spread through patch tokens beyond what is locally justified, is the main driver of dense degradation in ViTs. Replacing softmax attention with entmax-1.5 produces more selective mixing, preserves ImageNet linear probing accuracy, and raises VOC mIoU from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87 on DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K.
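To put the three reported lifts on a common scale, the arithmetic can be sketched as follows. The mIoU values are the ones quoted above; the helper name `gains` and the dictionary layout are illustrative choices, not anything from the paper.

```python
# mIoU values quoted in the core claim: (softmax baseline, entmax-1.5 variant).
reported = {
    "VOC": (42.80, 48.78),
    "ADE20K": (19.85, 21.97),
    "Cityscapes": (36.79, 37.87),
}

def gains(baseline, variant):
    """Return (absolute gain in mIoU points, relative gain in percent)."""
    absolute = variant - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)

summary = {name: gains(*pair) for name, pair in reported.items()}
for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain:.2f} mIoU ({rel_gain:.2f}% relative)")
```

The gains are uneven: close to 6 points on VOC, about 2 on ADE20K, and about 1 on Cityscapes, so the smallest lift is the one most exposed to run-to-run noise.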
What carries the argument
semantic diffusion: the optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified; countered by replacing softmax attention with entmax-1.5 sparse attention.
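For readers unfamiliar with entmax-1.5, here is a minimal sketch of how it differs from softmax on a single row of attention scores. It assumes the standard definition, p_i = max(s_i / 2 - tau, 0)^2 with tau chosen so the weights sum to 1, and finds tau by bisection. This is not the paper's implementation, which is not shown on this page; the function names and the example scores are illustrative.

```python
import math

def softmax(scores):
    """Dense baseline: every weight is strictly positive."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entmax15(scores, n_iter=60):
    """entmax-1.5 of one row of scores, by bisection on the threshold tau.

    Solves sum_i max(s_i / 2 - tau, 0)^2 = 1 and returns
    p_i = max(s_i / 2 - tau, 0)^2.
    """
    z = [s / 2.0 for s in scores]
    # At tau = max(z) - 1 the total mass is >= 1; at tau = max(z) it is 0.
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        mass = sum(max(v - tau, 0.0) ** 2 for v in z)
        if mass >= 1.0:
            lo = tau  # too much mass: raise the threshold
        else:
            hi = tau  # too little mass: lower the threshold
    tau = (lo + hi) / 2.0
    p = [max(v - tau, 0.0) ** 2 for v in z]
    total = sum(p)  # renormalize away the residual bisection error
    return [v / total for v in p]

scores = [2.0, 1.0, -1.0]  # illustrative scores for one query token
dense = softmax(scores)
sparse = entmax15(scores)
```

On these scores softmax gives every key a nonzero weight, while entmax-1.5 assigns exactly zero to the lowest-scoring key. Any key can still receive weight if its score is high enough, which is the sense in which global connectivity is preserved.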
If this is right
- Dense prediction quality improves without sacrificing global representation strength.
- Shallow features remain better localized yet still underperform deeper features for dense tasks.
- CLS token features stay complementary to patch tokens even after the attention change.
- Sparse yet globally connected attention is sufficient to mitigate the degradation.
- The same intervention works across multiple segmentation benchmarks without extra data or architecture changes.
Where Pith is reading between the lines
- The same selective-mixing bias could be tested on object detection or instance segmentation heads.
- If semantic diffusion is the core issue, other sparse or low-entropy attention variants should produce similar gains.
- Longer training schedules might amplify the benefit of entmax, suggesting a scaling law between training length and required selectivity.
- The finding implies that future ViT designs should optimize the selectivity of token mixing rather than simply increasing or decreasing global context.
Load-bearing premise
The measured segmentation gains come specifically from reduced semantic diffusion rather than from incidental differences in optimization dynamics or other properties of the entmax function.
What would settle it
Train identical ViT models with entmax-1.5 and with softmax, then measure whether the segmentation gains disappear once learning-rate schedules, gradient norms, and attention sparsity levels are matched exactly.
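Matching attention sparsity between the two models presupposes a way to measure it. A minimal sketch of two per-row diagnostics follows, assuming each attention row is a list of non-negative weights that sum to 1. The function names and the `eps` tolerance are hypothetical, not taken from the paper.

```python
import math

def sparsity(row, eps=0.0):
    """Fraction of attention weights in a row that are at or below eps."""
    return sum(1 for p in row if p <= eps) / len(row)

def entropy(row):
    """Shannon entropy in nats; zero-weight entries contribute nothing."""
    return -sum(p * math.log(p) for p in row if p > 0.0)

uniform_row = [0.25, 0.25, 0.25, 0.25]  # maximally diffuse attention
sparse_row = [0.5, 0.5, 0.0, 0.0]       # half of the keys ignored
```

Softmax rows have zero sparsity under `eps=0.0` by construction, so a matched comparison would need either a small positive `eps` or an entropy target instead.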
Original abstract
Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Vision Transformers suffer from 'semantic diffusion' during prolonged self-supervised training, where global semantics spread excessively through patch tokens, degrading dense prediction performance. It argues that the solution is more selective token interactions rather than removing global context, and demonstrates that replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 (trained 200 epochs on ImageNet-1K) preserves ImageNet linear probing accuracy while raising semantic segmentation mIoU on VOC from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87.
Significance. If the reported segmentation gains are causally attributable to the selective mixing induced by entmax-1.5, the result would be significant: a minimal, architecture-preserving intervention that improves dense ViT representations without sacrificing image-level performance or global connectivity. The approach is simple and the benchmarks are standard, but the current evidence does not yet isolate the proposed mechanism from optimization confounds.
major comments (2)
- The central claim that segmentation gains arise from reduced semantic diffusion via selective token mixing is not supported by controls that isolate this mechanism. The experiments consist solely of a direct comparison between the unmodified DINOv1 softmax baseline and the entmax-1.5 variant; no ablations (e.g., top-k attention to match sparsity, gradient-norm equalization, or learning-rate sweeps for the entmax model) are reported to rule out incidental optimization effects.
- The paper introduces and characterizes 'semantic diffusion' as the key phenomenon, yet the empirical results do not include a direct measurement or quantification of this diffusion (or its reduction) that is shown to correlate with the observed mIoU lifts; the attribution therefore remains interpretive rather than demonstrated.
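The top-k control named in the first major comment can be sketched as a softmax restricted to the k highest-scoring keys. This is one common way to build such a baseline, shown here for a single row of scores; how `k` should be chosen to match entmax-1.5 sparsity is left open, and the function name is illustrative.

```python
import math

def topk_softmax(scores, k):
    """Softmax over the k largest scores; all other weights are exactly zero."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

weights = topk_softmax([2.0, 1.0, -1.0], k=2)
```

Unlike entmax-1.5, the number of nonzero weights here is fixed in advance rather than determined by the scores, which is what makes it useful as a control: it reproduces the sparsity without the adaptive thresholding.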
minor comments (2)
- The abstract and results section would benefit from reporting standard deviations or multiple runs for the mIoU numbers, as single-run comparisons on segmentation benchmarks can be sensitive to training stochasticity.
- Clarify whether the entmax-1.5 variant required any hyper-parameter retuning relative to the softmax baseline, and if so, document those changes explicitly.
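The multi-run reporting requested in the first minor comment amounts to a mean and a sample standard deviation per benchmark. A minimal sketch, assuming the inputs are per-seed mIoU values that the authors would need to supply; the function name is hypothetical.

```python
import statistics

def summarize_runs(values):
    """Mean and sample standard deviation (n - 1 denominator) over repeated runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)
```

A gain is easier to trust when it exceeds the spread of both the baseline and the variant, which matters most for the smallest reported lift, on Cityscapes.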
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The concerns regarding mechanistic isolation and direct quantification of semantic diffusion can be addressed with additional experiments and analysis in a revision.
- Referee: The central claim that segmentation gains arise from reduced semantic diffusion via selective token mixing is not supported by controls that isolate this mechanism. The experiments consist solely of a direct comparison between the unmodified DINOv1 softmax baseline and the entmax-1.5 variant; no ablations (e.g., top-k attention to match sparsity, gradient-norm equalization, or learning-rate sweeps for the entmax model) are reported to rule out incidental optimization effects.
  Authors: We agree that additional controls would provide stronger isolation of the selective-mixing mechanism from potential optimization differences. The reported experiments use identical training protocols for both models, which already controls for most hyperparameters. To further address the concern, we will add learning-rate sweeps for the entmax-1.5 model and a top-k attention baseline matched for sparsity level in the revised manuscript.
  Revision: yes
- Referee: The paper introduces and characterizes 'semantic diffusion' as the key phenomenon, yet the empirical results do not include a direct measurement or quantification of this diffusion (or its reduction) that is shown to correlate with the observed mIoU lifts; the attribution therefore remains interpretive rather than demonstrated.
  Authors: Section 3 of the manuscript already characterizes semantic diffusion via multiple observations, including that shallow features can remain locally aligned yet underperform deeper ones and that [CLS] tokens remain complementary for dense tasks. We acknowledge that an explicit scalar metric of diffusion reduction correlated with mIoU would make the link more direct. We will introduce such a quantification (e.g., foreground token similarity or attention entropy) and report its correlation with the segmentation gains in the revision.
  Revision: yes
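One way the promised scalar metric could look is the mean pairwise cosine similarity among patch tokens, sketched below. This is a simplified stand-in for the "foreground token similarity" the authors mention, not their definition; it assumes every token vector is nonzero, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_similarity(tokens):
    """Average cosine similarity over all unordered pairs of patch tokens."""
    pairs = [(i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))]
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)
```

Under the semantic-diffusion account, patch tokens that have absorbed the same global signal would score closer to 1, so a lower value for the entmax-1.5 model, tracked against mIoU across checkpoints, would be the kind of evidence the referee asks for.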
Circularity Check
No circularity: empirical comparisons on external benchmarks
The paper's central results consist of measured mIoU and accuracy numbers obtained by training DINOv1 ViT-S/16 models on ImageNet-1K for 200 epochs and evaluating linear probing plus segmentation on VOC, ADE20K and Cityscapes. No equations, fitted parameters, or predictions are shown to reduce to quantities defined inside the same work; the reported gains (e.g., VOC mIoU 42.80 → 48.78) are direct experimental outcomes rather than self-definitional or self-citation reductions. The analysis of semantic diffusion is interpretive framing around these measurements and does not invoke load-bearing self-citations or ansatzes that collapse the claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the DINOv1 training protocol on ImageNet-1K produces representations whose dense quality can be measured by linear probing and segmentation mIoU
invented entities (1)
- semantic diffusion: no independent evidence