Vision Transformers Need Better Token Interaction

Linxiang Su

arxiv: 2605.23868 · v1 · pith:KR6YZGW4new · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision transformers · semantic diffusion · entmax attention · dense prediction · semantic segmentation · token interaction · sparse attention · DINO

The pith

Vision transformers suffer semantic diffusion that degrades patch tokens for dense tasks, fixed by switching to entmax-1.5 attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that ViT patch-level representations become less effective during long training because global semantics diffuse too freely across tokens. This shortcut hurts dense prediction even when global classification stays strong. The author argues that making token interactions more selective with entmax-1.5 attention reduces this diffusion while keeping full connectivity. On a standard DINO ViT the change leaves ImageNet linear probing accuracy unchanged but raises semantic segmentation mIoU on three datasets: by about 6 points on VOC, 2 on ADE20K, and 1 on Cityscapes. A reader would care because many practical uses of ViTs require accurate per-pixel or per-region outputs rather than image-level labels alone.

Core claim

Semantic diffusion—an optimization shortcut that lets global semantic information spread through patch tokens beyond local justification—is the main driver of dense degradation in ViTs. Replacing softmax attention with entmax-1.5 produces more selective mixing, preserves ImageNet linear probing accuracy, and raises VOC mIoU from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87 on DINOv1 ViT-S/16 trained 200 epochs on ImageNet-1K.
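To make the size of these gains explicit, the reported mIoU values can be converted into absolute and relative improvements. The following is a small arithmetic sketch rather than anything from the paper; the `gains` helper and the dictionary layout are illustrative names, and only the six mIoU values come from the text above.

```python
# Reported mIoU (softmax baseline, entmax-1.5) for DINOv1 ViT-S/16, from the text above.
reported = {
    "VOC": (42.80, 48.78),
    "ADE20K": (19.85, 21.97),
    "Cityscapes": (36.79, 37.87),
}


def gains(results):
    """Return {dataset: (absolute gain in mIoU points, relative gain in percent)}."""
    out = {}
    for name, (baseline, entmax) in results.items():
        delta = entmax - baseline
        out[name] = (round(delta, 2), round(100.0 * delta / baseline, 1))
    return out


for name, (abs_gain, rel_gain) in gains(reported).items():
    print(f"{name}: +{abs_gain:.2f} mIoU ({rel_gain:.1f}% relative)")
```

The gains are uneven across datasets: largest on VOC and smallest on Cityscapes, which is worth keeping in mind when reading the summary claim of improvement "across three datasets".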

What carries the argument

semantic diffusion: the optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified; countered by replacing softmax attention with entmax-1.5 sparse attention.
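For readers unfamiliar with entmax-1.5, the following is a minimal pure-Python sketch of the exact sort-based computation from Peters et al. (2019), applied to a single row of attention scores. It is an illustration under stated assumptions, not the paper's implementation: the name `entmax15` is chosen here for convenience, and a real ViT would apply the same mapping per head and per query row on tensors.

```python
import math


def entmax15(scores):
    """Exact entmax-1.5 for one row of scores (sort-based threshold search).

    The output has the form p_i = max(z_i / 2 - tau, 0) ** 2, where tau is
    chosen so that the weights sum to 1. Low-scoring entries get exactly 0.
    """
    m = max(scores)
    z = [(s - m) / 2.0 for s in scores]      # shift for stability, then halve
    z_sorted = sorted(z, reverse=True)

    taus = []
    support = 0
    cum = cum_sq = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cum += zk
        cum_sq += zk * zk
        mean = cum / k
        ss = k * (cum_sq / k - mean * mean)  # sum of squared deviations of top-k
        delta = max((1.0 - ss) / k, 0.0)
        tau = mean - math.sqrt(delta)        # candidate threshold for support size k
        taus.append(tau)
        if tau <= zk:                        # k-th largest entry stays in the support
            support += 1

    tau_star = taus[support - 1]
    return [max(v - tau_star, 0.0) ** 2 for v in z]


print(entmax15([4.0, 0.0]))   # the low score receives exactly zero weight
print(entmax15([1.0, 0.0]))   # closer scores both keep nonzero weight
```

Unlike softmax, which always assigns strictly positive weight to every token, this mapping can zero out weakly matching tokens while any token remains reachable if its score is high enough. That is the sense in which attention becomes "sparse yet globally connected".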

If this is right

Dense prediction quality improves without sacrificing global representation strength.
Shallow features remain better localized yet still underperform deeper features for dense tasks.
CLS token features stay complementary to patch tokens even after the attention change.
Sparse yet globally connected attention is sufficient to mitigate the degradation.
The same intervention works across multiple segmentation benchmarks without extra data or architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-mixing bias could be tested on object detection or instance segmentation heads.
If semantic diffusion is the core issue, other sparse or low-entropy attention variants should produce similar gains.
Longer training schedules might amplify the benefit of entmax, suggesting a scaling law between training length and required selectivity.
The finding implies that future ViT designs should optimize the selectivity of token mixing rather than simply increasing or decreasing global context.

Load-bearing premise

The measured segmentation gains come specifically from reduced semantic diffusion rather than from incidental differences in optimization dynamics or other properties of the entmax function.

What would settle it

Train identical ViT models with entmax-1.5 and with softmax, then measure whether the segmentation gains disappear once learning-rate schedules, gradient norms, and attention sparsity levels are matched exactly.
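Matching attention sparsity between two models first requires measuring it. One simple option, sketched here as an assumption rather than as the paper's protocol, is the fraction of exactly-zero weights in an attention matrix. The helper name `zero_fraction` and the two small matrices are illustrative; the numbers are hand-written examples, not measurements from any model.

```python
def zero_fraction(attention_rows):
    """Fraction of attention weights that are exactly zero across all rows."""
    total = sum(len(row) for row in attention_rows)
    if total == 0:
        raise ValueError("attention matrix is empty")
    zeros = sum(1 for row in attention_rows for w in row if w == 0.0)
    return zeros / total


# Hand-written illustrative rows (each sums to 1); not data from the paper.
dense_like = [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
sparse_like = [[1.0, 0.0, 0.0, 0.0], [0.6, 0.4, 0.0, 0.0]]

print(zero_fraction(dense_like))   # softmax-style rows have no exact zeros
print(zero_fraction(sparse_like))  # sparse rows have a measurable zero fraction
```

Because softmax weights are strictly positive in exact arithmetic, a softmax model can only be matched on this statistic through some added sparsification, which is why a sparsity-matched control needs a second sparse attention variant rather than a retuned softmax.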

Figures

Figures reproduced from arXiv: 2605.23868 by Linxiang Su.

**Figure 1.** Visualization of the impact of our proposed sparse attention, compared with other baselines. We display the ...

**Figure 2.** Relationship between CLS spatial grounding and dense prediction quality across training. Left: PiB heatmap ...

Original abstract

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax-1.5 while preserving global token connectivity. On DINOv1 ViT-S/16 trained for 200 epochs on ImageNet-1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Entmax-1.5 attention on DINO ViT-S/16 delivers clear mIoU lifts on three segmentation benchmarks while holding ImageNet accuracy steady, but the paper supplies no controls that tie those lifts to reduced semantic diffusion rather than other properties of the substitution.

The new piece here is the specific substitution result: after 200 epochs of DINOv1 training on ImageNet-1K, swapping softmax for entmax-1.5 keeps linear-probe accuracy unchanged and raises VOC mIoU from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. That is a drop-in change with measurable effect on dense tasks that matter for driving and medical imaging. The paper also walks through why locality alone does not explain the degradation and notes that CLS features stay useful, which is a reasonable observation even if not fully quantified. Those two elements are the parts that could be cited or tried by someone already running ViT dense-prediction pipelines.

The framing of semantic diffusion is mostly a re-description of known attention spreading; entmax itself is not new. The main weakness is the missing mechanism check. The experiments compare only the baseline against the entmax variant; there are no ablations that match sparsity another way, equalize gradient scale, or sweep learning-rate schedules for the new attention. Without those, the mIoU deltas cannot be confidently attributed to more selective token mixing instead of incidental optimization differences. The analysis of shallow versus deep features is presented but likewise lacks the controls that would make the interpretation robust.

A reader who wants a practical tweak to test on their own segmentation head would still find the numbers useful to know about. Someone who needs a verified explanation of why the change works would not. The work is coherent on its own terms and reports reproducible public-benchmark numbers, so it clears the bar for peer review even though the causal claim needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that Vision Transformers suffer from 'semantic diffusion' during prolonged self-supervised training, where global semantics spread excessively through patch tokens, degrading dense prediction performance. It argues that the solution is more selective token interactions rather than removing global context, and demonstrates that replacing softmax attention with entmax-1.5 in DINOv1 ViT-S/16 (trained 200 epochs on ImageNet-1K) preserves ImageNet linear probing accuracy while raising semantic segmentation mIoU on VOC from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87.

Significance. If the reported segmentation gains are causally attributable to the selective mixing induced by entmax-1.5, the result would be significant: a minimal, architecture-preserving intervention that improves dense ViT representations without sacrificing image-level performance or global connectivity. The approach is simple and the benchmarks are standard, but the current evidence does not yet isolate the proposed mechanism from optimization confounds.

major comments (2)

The central claim that segmentation gains arise from reduced semantic diffusion via selective token mixing is not supported by controls that isolate this mechanism. The experiments consist solely of a direct comparison between the unmodified DINOv1 softmax baseline and the entmax-1.5 variant; no ablations (e.g., top-k attention to match sparsity, gradient-norm equalization, or learning-rate sweeps for the entmax model) are reported to rule out incidental optimization effects.
The paper introduces and characterizes 'semantic diffusion' as the key phenomenon, yet the empirical results do not include a direct measurement or quantification of this diffusion (or its reduction) that is shown to correlate with the observed mIoU lifts; the attribution therefore remains interpretive rather than demonstrated.
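The top-k control mentioned in the first major comment can be sketched as follows. This is a hypothetical baseline written for illustration, assuming "top-k attention" means a softmax restricted to the k highest-scoring keys for each query; the name `topk_softmax` and the choice of tie-breaking by position are assumptions, not details from the paper or the report.

```python
import math


def topk_softmax(scores, k):
    """Softmax over the k largest scores; every other weight is exactly 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Indices of the k largest scores; ties keep their original order.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)          # subtract the max for stability
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


weights = topk_softmax([2.0, 1.0, 0.0, -1.0], k=2)
print(weights)  # only the two highest-scoring positions keep nonzero weight
```

Such a baseline fixes the support size by hand, whereas entmax-1.5 lets it vary with the scores. If a top-k model with comparable average sparsity reproduced the segmentation gains, that would support sparsity itself as the operative factor; if it did not, the specific shape of the entmax mapping or its optimization behaviour would deserve closer scrutiny.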

minor comments (2)

The abstract and results section would benefit from reporting standard deviations or multiple runs for the mIoU numbers, as single-run comparisons on segmentation benchmarks can be sensitive to training stochasticity.
Clarify whether the entmax-1.5 variant required any hyper-parameter retuning relative to the softmax baseline, and if so, document those changes explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns regarding mechanistic isolation and direct quantification of semantic diffusion can be addressed with additional experiments and analysis in a revision.

Referee: The central claim that segmentation gains arise from reduced semantic diffusion via selective token mixing is not supported by controls that isolate this mechanism. The experiments consist solely of a direct comparison between the unmodified DINOv1 softmax baseline and the entmax-1.5 variant; no ablations (e.g., top-k attention to match sparsity, gradient-norm equalization, or learning-rate sweeps for the entmax model) are reported to rule out incidental optimization effects.

Authors: We agree that additional controls would provide stronger isolation of the selective-mixing mechanism from potential optimization differences. The reported experiments use identical training protocols for both models, which already controls for most hyperparameters. To further address the concern, we will add learning-rate sweeps for the entmax-1.5 model and a top-k attention baseline matched for sparsity level in the revised manuscript. revision: yes
Referee: The paper introduces and characterizes 'semantic diffusion' as the key phenomenon, yet the empirical results do not include a direct measurement or quantification of this diffusion (or its reduction) that is shown to correlate with the observed mIoU lifts; the attribution therefore remains interpretive rather than demonstrated.

Authors: Section 3 of the manuscript already characterizes semantic diffusion via multiple observations, including that shallow features can remain locally aligned yet underperform deeper ones and that [CLS] tokens remain complementary for dense tasks. We acknowledge that an explicit scalar metric of diffusion reduction correlated with mIoU would make the link more direct. We will introduce such a quantification (e.g., foreground token similarity or attention entropy) and report its correlation with the segmentation gains in the revision. revision: yes
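The attention-entropy quantification proposed in the second response could look like the sketch below. It assumes the metric is the Shannon entropy of each query's attention distribution, averaged over rows; the function names and the example rows are illustrative and do not come from the paper.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; zero weights add nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_attention_entropy(attention_rows):
    """Average row entropy over an attention matrix given as a list of rows."""
    return sum(attention_entropy(row) for row in attention_rows) / len(attention_rows)


# Illustrative rows only: a fully diffuse row and a fully selective row.
diffuse = [0.25, 0.25, 0.25, 0.25]
selective = [1.0, 0.0, 0.0, 0.0]

print(attention_entropy(diffuse))    # equals log(4), the maximum for 4 tokens
print(attention_entropy(selective))  # zero: all mass on one token
print(mean_attention_entropy([diffuse, selective]))
```

Lower entropy indicates more selective mixing. On its own it does not show that semantic content is diffusing less, so it would still need to be correlated with the mIoU changes, as the response proposes.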

Circularity Check

0 steps flagged

No circularity: empirical comparisons on external benchmarks

The paper's central results consist of measured mIoU and accuracy numbers obtained by training DINOv1 ViT-S/16 models on ImageNet-1K for 200 epochs and evaluating linear probing plus segmentation on VOC, ADE20K and Cityscapes. No equations, fitted parameters, or predictions are shown to reduce to quantities defined inside the same work; the reported gains (e.g., VOC mIoU 42.80 → 48.78) are direct experimental outcomes rather than self-definitional or self-citation reductions. The analysis of semantic diffusion is interpretive framing around these measurements and does not invoke load-bearing self-citations or ansatzes that collapse the claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on an empirical comparison whose validity depends on standard self-supervised training assumptions and on the untested attribution of gains to semantic-diffusion reduction.

axioms (1)

domain assumption: the DINOv1 training protocol on ImageNet-1K produces representations whose global quality can be measured by linear probing and whose dense quality can be measured by segmentation mIoU
The experiment uses this protocol as the baseline.

invented entities (1)

semantic diffusion (no independent evidence)
purpose: Characterizes the optimization shortcut causing dense degradation beyond high-norm artifacts
Introduced in the abstract as the spreading of global semantic information through patch tokens beyond local justification.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021

[3] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric... (2024)

[4] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations, 2024

[5] Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers, 2026

[6] Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V. Vo. Revisiting [CLS] and patch token interaction in vision transformers. In International Conference on Learning Representations, 2026

[7] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... (2025)

[8] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021

[9] Ben Peters, Vlad Niculae, and André F. T. Martins. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[10] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

[11] Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models, 2025

[12] Nabil Ibtehaz, Ning Yan, Masood Mortazavi, and Daisuke Kihara. Acc-vit: Atrous convolution's comeback in vision transformers, 2024

[13] Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer, 2023

[14] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015

[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

[16] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017

[17] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset, 2018

[18] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[19] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

[20] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

[21] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013

[22] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025

[23] Nuno Gonçalves, Marcos V Treviso, and Andre Martins. Adasplash: Adaptive sparse flash attention. In Forty-second International Conference on Machine Learning, 2025

[24] Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, and Marcos V Treviso. Adasplash-2: Faster differentiable sparse attention. In Forty-third International Conference on Machine Learning, 2026
