ClustViT: Clustering-based Token Merging for Semantic Segmentation

Fabio Montello; Lazaros Nalpantidis; Ronja Güldenring

arxiv: 2510.01948 · v2 · submitted 2025-10-02 · 💻 cs.CV

Pith reviewed 2026-05-18 10:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: ClustViT · token merging · semantic segmentation · Vision Transformer · clustering · efficiency · pseudo-clusters · computational reduction

The pith

ClustViT merges tokens in Vision Transformers using pseudo-clusters from masks to cut computation for semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ClustViT to make Vision Transformers practical for semantic segmentation on robotic systems. It introduces a trainable Cluster module that merges similar tokens guided by pseudo-clusters derived from segmentation masks. A Regenerator module then restores fine details needed by the segmentation head. This leads to substantial reductions in computation and faster inference without losing accuracy. The approach addresses the quadratic complexity issue that limits dense prediction tasks in real-world applications.

Core claim

By expanding the ViT backbone with a Cluster module for merging tokens along the network guided by pseudo-clusters from segmentation masks and a Regenerator module to restore fine details, the method achieves up to 2.18 times fewer GFLOPs and 1.64 times faster inference on three datasets while maintaining comparable segmentation accuracy.
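To see why cutting the token count partway through the backbone reduces compute, the following is a rough back-of-the-envelope sketch in plain Python. It uses a common approximation of the multiply-accumulate (MAC) count of a ViT encoder block. The embedding width, depth, token counts, and merge position are hypothetical values chosen for illustration, not figures from the paper, and the estimate ignores the cost of the Cluster module, the Regenerator, and the segmentation head.

```python
# Approximate multiply-accumulate (MAC) count for one ViT encoder block.
# This is a textbook-style estimate, not a measurement from the paper.
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    linear = 4 * num_tokens * dim * dim            # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim  # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # two fully connected MLP layers
    return linear + attention + mlp


def backbone_macs(tokens_per_block, dim: int) -> int:
    # Sum the per-block cost, allowing a different token count in each block.
    return sum(block_macs(n, dim) for n in tokens_per_block)


# Hypothetical configuration (assumptions, not values reported by the authors).
DIM, DEPTH = 768, 12
FULL_TOKENS, MERGED_TOKENS, MERGE_AFTER = 1024, 400, 3

baseline = backbone_macs([FULL_TOKENS] * DEPTH, DIM)
reduced = backbone_macs(
    [FULL_TOKENS] * MERGE_AFTER + [MERGED_TOKENS] * (DEPTH - MERGE_AFTER), DIM
)
ratio = baseline / reduced
print(f"baseline / reduced MACs: {ratio:.2f}x")
```

Because the attention term grows with the square of the token count, doubling the tokens more than doubles the per-block cost, which is why merging early in the network pays off most.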

What carries the argument

The Cluster module merges similar tokens guided by pseudo-clusters from segmentation masks, with the Regenerator restoring fine details for downstream heads.
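The merge-then-restore idea can be illustrated with a toy sketch in plain Python. It assumes that each token already carries a cluster id, that tokens sharing an id are averaged into one token, and that an id of -1 (a hypothetical convention used here) marks a token that stays unmerged. The `regenerate` function simply copies each merged token back to its original positions; it is a stand-in for the paper's trainable Regenerator, whose internals are not described on this page.

```python
def merge_tokens(tokens, cluster_ids):
    """Average tokens that share a cluster id; id -1 keeps a token on its own."""
    slot_of_cluster = {}  # cluster id -> index of its merged token
    members = []          # members[s] = original token indices pooled into slot s
    mapping = []          # mapping[i] = merged slot that original token i went to
    for i, cid in enumerate(cluster_ids):
        if cid < 0:
            # Unclustered token: give it a slot of its own.
            mapping.append(len(members))
            members.append([i])
        else:
            if cid not in slot_of_cluster:
                slot_of_cluster[cid] = len(members)
                members.append([])
            slot = slot_of_cluster[cid]
            members[slot].append(i)
            mapping.append(slot)
    dim = len(tokens[0])
    merged = [
        [sum(tokens[i][d] for i in idx) / len(idx) for d in range(dim)]
        for idx in members
    ]
    return merged, mapping


def regenerate(merged, mapping):
    """Restore the original sequence length by copying merged tokens back."""
    return [list(merged[slot]) for slot in mapping]


# Four 2-D tokens; tokens 0, 1 and 3 belong to cluster 0, token 2 is unmerged.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [5.0, 2.0]]
cluster_ids = [0, 0, -1, 0]
merged, mapping = merge_tokens(tokens, cluster_ids)
restored = regenerate(merged, mapping)
print(len(tokens), "->", len(merged), "->", len(restored))
```

The blocks between merging and regeneration would operate on `merged` only, which is where the savings come from; the `mapping` list is the bookkeeping needed to return to a dense token grid.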

Load-bearing premise

Pseudo-clusters extracted from segmentation masks provide reliable guidance for merging tokens without removing critical information needed by the downstream segmentation head to maintain accuracy.
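One plausible way to derive patch-level pseudo-clusters from a segmentation mask is sketched below in plain Python. The rule used here is an assumption made for illustration and is not confirmed as the paper's procedure: a patch whose pixels all share one class is assigned that class as its cluster id, and any patch containing more than one class is marked -1 so that it is left unmerged. The function name, the -1 convention, and the requirement that the mask dimensions be divisible by the patch size are all hypothetical.

```python
def pseudo_clusters(mask, patch):
    """Assign one cluster id per patch from a 2-D list of class labels.

    Assumes class labels are non-negative integers and that the mask height
    and width are multiples of `patch`. Returns ids in row-major patch order.
    """
    height, width = len(mask), len(mask[0])
    ids = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            classes = {
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            }
            # Pure patch: safe to pool with other patches of the same class.
            # Mixed patch: likely on a boundary, so keep it unmerged (-1).
            ids.append(classes.pop() if len(classes) == 1 else -1)
    return ids


# 4x4 toy mask split into 2x2 patches.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [3, 3, 0, 0],
    [3, 3, 0, 0],
]
patch_ids = pseudo_clusters(mask, patch=2)
print(patch_ids)
```

Under this rule, boundary patches are never merged, which is one way the premise could hold; whether coarse or noisy masks still yield safe merges is exactly what the premise leaves open.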

What would settle it

Observing a substantial decrease in mean intersection over union (mIoU) on the benchmark datasets when token merging is applied, relative to the same backbone without merging, would falsify the claim of comparable accuracy.
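For reference, mIoU can be computed as in the minimal plain-Python sketch below. It assumes flattened lists of predicted and ground-truth class labels and skips classes that appear in neither list, which is one common convention; benchmark toolkits may differ in how they handle absent or ignored classes.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) appears in the inputs")
    return sum(ious) / len(ious)


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3, class 2 is absent.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

Comparing this score for the merged and unmerged models on the same test images is the check that the falsification criterion describes.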

Figures

Figures reproduced from arXiv: 2510.01948 by Fabio Montello, Lazaros Nalpantidis, Ronja Güldenring.

**Figure 1.** Comparison of segmentation speed (img/s) across three datasets (ADE20K, SUIM, and RumexWeeds). Each plot shows results for different segmentation backbones: Segmenter (top) and UPerNet (bottom). For each dataset, we compare three models: ViT, CTS, and our model. Across both backbones and all datasets, our model consistently achieves the highest image throughput. The improvements are most pronounced for dat…

**Figure 2.** Examples from the ADE20K [5] (top), SUIM [6] (middle), and RumexWeeds [7] (bottom) datasets. Columns: (a) Input image, (b) Ground truth semantic segmentation, (c) Model prediction, (d) Mask for the token clustering generated from the ground truth, (e) Predicted cluster for each token. Starting from the output of (e), regions with the same non-black color belong to the same cluster and get merged into a sin…

**Figure 3.** ClustViT overview. The standard Transformer pipeline is executed (center, from bottom to top) through the tokenizer and few Transformer blocks until the Cluster module is encountered. Subsequently, the Transformer backbone proceeds with a reduced amount of tokens. Before being passed to the segmentation head, the tokens are reconstructed by the Regenerator module. Cluster module (left): 1 An MLP predicts t…

**Figure 4.** Distribution of token counts and class diversity across test sets. Each row shows the histogram of tokens used by ClustViT-bk3,ip3 (left) and the average number of classes per image (right). ADE20K exhibits a symmetric token distribution being a dataset with high class diversity, SUIM is moderately left-skewed being of moderate diversity, while RumexWeeds is sharply peaked and is composed of low class dive…

Original abstract

Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClustViT adds mask-guided clustering and a regenerator to token merging so ViTs can run semantic segmentation faster on limited hardware, but the accuracy claims rest on unshown experiments.

ClustViT adapts token merging for Vision Transformers in semantic segmentation. The new parts are a trainable cluster module that uses pseudo-clusters from segmentation masks to guide which tokens to merge, and a regenerator module that brings back fine details for the downstream head. This setup addresses a real gap because token merging works better for classification than for dense tasks like segmentation.

The paper reports solid efficiency improvements: up to 2.18x reduction in GFLOPs and 1.64x faster inference on three datasets, while keeping segmentation accuracy comparable. That kind of result would be helpful for deploying these models on resource-limited robotic platforms.

The main soft spot is the reliance on those pseudo-clusters. If the masks are not precise enough or if their similarity doesn't align well with what the ViT features need for accurate boundaries, some critical tokens could get merged away before the regenerator can fix it. The abstract itself flags that merging is trickier for dense prediction, so this needs careful checking in the experiments. The work also lacks detailed experimental info in the abstract, like specific baselines, datasets, or variance in results. Once the full paper is out, that should clarify things.

This paper is for people building efficient vision systems for robotics or edge computing. Anyone working on ViT optimizations for segmentation would find it relevant. I think it deserves peer review. The idea is straightforward and the efficiency angle is practical, so referees can sort out whether the accuracy holds and how robust the clustering is.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ClustViT, an extension of the Vision Transformer backbone for semantic segmentation. It introduces a trainable Cluster module that merges similar tokens along the network, guided by pseudo-clusters extracted from segmentation masks, followed by a Regenerator module to restore fine details for the downstream segmentation head. The central claim is that this yields up to 2.18× fewer GFLOPs and 1.64× faster inference on three datasets while maintaining comparable segmentation accuracy.

Significance. If the efficiency-accuracy tradeoff is validated with rigorous experiments, the work would be significant for enabling Vision Transformers in real-world robotic systems, where quadratic attention complexity currently limits deployment. Adapting token merging specifically for dense prediction tasks, rather than classification, addresses a noted limitation in prior work and could improve practical applicability if the pseudo-cluster guidance proves reliable.

major comments (2)

[Abstract] Abstract: The headline claims of 2.18× GFLOPs reduction, 1.64× faster inference, and comparable accuracy on three datasets are presented without any experimental details, baselines, variance, dataset descriptions, or metrics. This makes the data-to-claim link unverifiable from the text and directly undermines assessment of the central efficiency result.
[Method] Method (Cluster module description): The approach extracts pseudo-clusters from segmentation masks to decide token merges, resting on the assumption that mask-space similarity implies safe redundancy in ViT feature space. This is load-bearing for the 'comparable accuracy' claim in dense prediction, as the abstract itself notes that token merging works less well for dense tasks than classification; if masks are coarse or noisy, boundary/small-object tokens may be lost before the Regenerator can recover them, with no apparent ablation on mask quality or error correlation provided.

minor comments (2)

[Abstract] Abstract: The phrasing 'expand upon the Vision Transformer (ViT) backbone' is imprecise; explicitly state the integration points of the Cluster and Regenerator modules with standard ViT layers.
[Experiments] The promise to release code and models is positive for reproducibility, but the manuscript should include a brief description of the three datasets used to support the efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We believe the suggested revisions will strengthen the paper. We address each major comment below and indicate the changes made to the revised version.

Referee: [Abstract] Abstract: The headline claims of 2.18× GFLOPs reduction, 1.64× faster inference, and comparable accuracy on three datasets are presented without any experimental details, baselines, variance, dataset descriptions, or metrics. This makes the data-to-claim link unverifiable from the text and directly undermines assessment of the central efficiency result.

Authors: We agree with this observation. The original abstract was kept concise, but this came at the cost of omitting key experimental context. In the revised manuscript, we have updated the abstract to specify the three datasets (ADE20K, SUIM, RumexWeeds), the baseline architectures, the mIoU metric for accuracy, and a reference to the main results table for variance and detailed comparisons. This makes the efficiency claims directly verifiable. revision: yes
Referee: [Method] Method (Cluster module description): The approach extracts pseudo-clusters from segmentation masks to decide token merges, resting on the assumption that mask-space similarity implies safe redundancy in ViT feature space. This is load-bearing for the 'comparable accuracy' claim in dense prediction, as the abstract itself notes that token merging works less well for dense tasks than classification; if masks are coarse or noisy, boundary/small-object tokens may be lost before the Regenerator can recover them, with no apparent ablation on mask quality or error correlation provided.

Authors: The referee raises a valid point regarding the core assumption of our Cluster module. We acknowledge that the manuscript does not include an explicit ablation on pseudo-label quality or error correlation with boundary tokens. To address this, we have added a new ablation study in the supplementary material. This study varies the quality of the pseudo-clusters by using different levels of label noise and reports the resulting mIoU degradation and GFLOPs savings. The results indicate that the Regenerator module helps mitigate errors in boundary regions, maintaining accuracy within 1-2% even with moderate noise. We believe this addition strengthens the justification for the approach. revision: yes

Circularity Check

0 steps flagged

Empirical architecture with no circular derivations or self-referential claims

The paper presents ClustViT as an architectural extension to ViT for semantic segmentation, using a trainable Cluster module guided by pseudo-clusters from segmentation masks followed by a Regenerator. Central claims of efficiency gains (up to 2.18x fewer GFLOPs and 1.64x faster inference) with comparable accuracy rest entirely on experimental measurements across three datasets rather than any derivation, prediction, or first-principles result. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the approach is validated empirically against external benchmarks and does not reduce to its own inputs by construction. This is the expected outcome for a standard empirical CV architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach depends on the domain assumption that mask-derived pseudo-clusters can steer merging without harming segmentation semantics; two new architectural components are introduced whose value is shown only empirically.

axioms (1)

domain assumption Pseudo-clusters from segmentation masks supply useful guidance for deciding which tokens to merge while preserving task-relevant information.
This premise directly motivates the design of the Cluster module.

invented entities (2)

Cluster module no independent evidence
purpose: Trainable component that merges similar tokens along the ViT layers using pseudo-cluster guidance.
New module added to the backbone to achieve token reduction for segmentation.
Regenerator module no independent evidence
purpose: Restores fine spatial details after token merging so that downstream heads can produce accurate dense predictions.
Compensates for information lost during merging.

pith-pipeline@v0.9.0 · 5672 in / 1327 out tokens · 41873 ms · 2026-05-18T10:34:43.273343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy

[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.

[3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, vol. 12346, pp. 213–229.

[4] B. Zhang, Z. Tian, Q. Tang, X. Chu, X. Wei, C. Shen, et al., “Segvit: Semantic segmentation with plain vision transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 4971–4982, 2022.

[5] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene Parsing through ADE20K Dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, July 2017, pp. 5122–5130.

[6] M. J. Islam, C. Edge, Y. Xiao, P. Luo, M. Mehtaz, C. Morse, S. S. Enan, and J. Sattar, “Semantic Segmentation of Underwater Imagery: Dataset and Benchmark,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 1769–1776.

[7] R. Güldenring, F. K. Van Evert, and L. Nalpantidis, “Rumexweeds: A grassland dataset for agricultural robotics,” Journal of Field Robotics, vol. 40, no. 6, pp. 1639–1656, 2023.

[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241.

[9] J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “OneFormer: One Transformer to Rule Universal Image Segmentation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, June 2023, pp. 2989–2998.

[10] X. Li, H. Ding, H. Yuan, W. Zhang, J. Pang, G. Cheng, K. Chen, Z. Liu, and C. C. Loy, “Transformer-Based Visual Segmentation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 138–10 163, Dec. 2024.

[11] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, June 2021, pp. 6877–6886.

[12] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for Semantic Segmentation,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, Oct. 2021, pp. 7242–7252.

[13] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 12 077–12 090.

[14] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention Mask Transformer for Universal Image Segmentation,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, June 2022, pp. 1280–1289.

[15] F. Montello, R. Güldenring, S. Scardapane, and L. Nalpantidis, “A Survey on Dynamic Neural Networks: From Computer Vision to Multi-modal Sensor Fusion,” arXiv preprint arXiv:2501.07451, Jan. 2025.

[16] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 13 937–13 949.

[17] B. Pan, R. Panda, Y. Jiang, Z. Wang, R. Feris, and A. Oliva, “IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 24 898–24 911.

[18] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992.

[19] L. Li, D. Thorsley, and J. Hassoun, “SaiT: Sparse Vision Transformers through Adaptive Token Pruning,” arXiv preprint arXiv:2210.05832, Sept. 2022.

[20] X. Xu, S. Wang, Y. Chen, Y. Zheng, Z. Wei, and J. Liu, “GTP-ViT: Efficient Vision Transformers via Graph-Based Token Propagation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 86–95.

[21] Y. Xu, Z. Zhang, M. Zhang, K. Sheng, K. Li, W. Dong, L. Zhang, C. Xu, and X. Sun, “Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 2964–2972, June 2022.

[22] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” in International Conference on Learning Representations, 2023.

[23] Y. Yuan, W. Liang, H. Ding, Z. Liang, C. Zhang, and H. Hu, “Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 250–266, Jan. 2024.

[24] Q. Tang, B. Zhang, J. Liu, F. Liu, and Y. Liu, “Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE, Oct. 2023, pp. 777–786.

[25] C. Lu, D. De Geus, and G. Dubbelman, “Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, June 2023, pp. 23 631–23 640.

[26] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified Perceptual Parsing for Scene Understanding,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, vol. 11209, pp. 432–448.

[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
