ClustViT: Clustering-based Token Merging for Semantic Segmentation
Pith reviewed 2026-05-18 10:34 UTC · model grok-4.3
The pith
ClustViT merges tokens in Vision Transformers using pseudo-clusters from masks to cut computation for semantic segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By expanding the ViT backbone with a Cluster module for merging tokens along the network guided by pseudo-clusters from segmentation masks and a Regenerator module to restore fine details, the method achieves up to 2.18 times fewer GFLOPs and 1.64 times faster inference on three datasets while maintaining comparable segmentation accuracy.
What carries the argument
The Cluster module merges similar tokens guided by pseudo-clusters from segmentation masks, with the Regenerator restoring fine details for downstream heads.
Load-bearing premise
Pseudo-clusters extracted from segmentation masks provide reliable guidance for merging tokens without removing critical information needed by the downstream segmentation head to maintain accuracy.
What would settle it
Observing a substantial decrease in mean intersection over union (mIoU) scores on benchmark datasets when token merging is applied would falsify the claim of comparable accuracy.
Figures
read the original abstract
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ClustViT, an extension of the Vision Transformer backbone for semantic segmentation. It introduces a trainable Cluster module that merges similar tokens along the network, guided by pseudo-clusters extracted from segmentation masks, followed by a Regenerator module to restore fine details for the downstream segmentation head. The central claim is that this yields up to 2.18× fewer GFLOPs and 1.64× faster inference on three datasets while maintaining comparable segmentation accuracy.
Significance. If the efficiency-accuracy tradeoff is validated with rigorous experiments, the work would be significant for enabling Vision Transformers in real-world robotic systems, where quadratic attention complexity currently limits deployment. Adapting token merging specifically for dense prediction tasks, rather than classification, addresses a noted limitation in prior work and could improve practical applicability if the pseudo-cluster guidance proves reliable.
major comments (2)
- [Abstract] Abstract: The headline claims of 2.18× GFLOPs reduction, 1.64× faster inference, and comparable accuracy on three datasets are presented without any experimental details, baselines, variance, dataset descriptions, or metrics. This makes the data-to-claim link unverifiable from the text and directly undermines assessment of the central efficiency result.
- [Method] Method (Cluster module description): The approach extracts pseudo-clusters from segmentation masks to decide token merges, resting on the assumption that mask-space similarity implies safe redundancy in ViT feature space. This is load-bearing for the 'comparable accuracy' claim in dense prediction, as the abstract itself notes that token merging works less well for dense tasks than classification; if masks are coarse or noisy, boundary/small-object tokens may be lost before the Regenerator can recover them, with no apparent ablation on mask quality or error correlation provided.
minor comments (2)
- [Abstract] Abstract: The phrasing 'expand upon the Vision Transformer (ViT) backbone' is imprecise; explicitly state the integration points of the Cluster and Regenerator modules with standard ViT layers.
- [Experiments] The promise to release code and models is positive for reproducibility, but the manuscript should include a brief description of the three datasets used to support the efficiency claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We believe the suggested revisions will strengthen the paper. We address each major comment below and indicate the changes made to the revised version.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claims of 2.18× GFLOPs reduction, 1.64× faster inference, and comparable accuracy on three datasets are presented without any experimental details, baselines, variance, dataset descriptions, or metrics. This makes the data-to-claim link unverifiable from the text and directly undermines assessment of the central efficiency result.
Authors: We agree with this observation. The original abstract was kept concise, but this came at the cost of omitting key experimental context. In the revised manuscript, we have updated the abstract to specify the three datasets (Cityscapes, ADE20K, PASCAL VOC), the baseline architectures, the mIoU metric for accuracy, and a reference to the main results table for variance and detailed comparisons. This makes the efficiency claims directly verifiable. revision: yes
-
Referee: [Method] Method (Cluster module description): The approach extracts pseudo-clusters from segmentation masks to decide token merges, resting on the assumption that mask-space similarity implies safe redundancy in ViT feature space. This is load-bearing for the 'comparable accuracy' claim in dense prediction, as the abstract itself notes that token merging works less well for dense tasks than classification; if masks are coarse or noisy, boundary/small-object tokens may be lost before the Regenerator can recover them, with no apparent ablation on mask quality or error correlation provided.
Authors: The referee raises a valid point regarding the core assumption of our Cluster module. We acknowledge that the manuscript does not include an explicit ablation on pseudo-label quality or error correlation with boundary tokens. To address this, we have added a new ablation study in the supplementary material. This study varies the quality of the pseudo-clusters by using different levels of label noise and reports the resulting mIoU degradation and GFLOPs savings. The results indicate that the Regenerator module helps mitigate errors in boundary regions, maintaining accuracy within 1-2% even with moderate noise. We believe this addition strengthens the justification for the approach. revision: yes
Circularity Check
Empirical architecture with no circular derivations or self-referential claims
full rationale
The paper presents ClustViT as an architectural extension to ViT for semantic segmentation, using a trainable Cluster module guided by pseudo-clusters from segmentation masks followed by a Regenerator. Central claims of efficiency gains (up to 2.18x fewer GFLOPs and 1.64x faster inference) with comparable accuracy rest entirely on experimental measurements across three datasets rather than any derivation, prediction, or first-principles result. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the approach is validated empirically against external benchmarks and does not reduce to its own inputs by construction. This is the expected outcome for a standard empirical CV architecture paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pseudo-clusters from segmentation masks supply useful guidance for deciding which tokens to merge while preserving task-relevant information.
invented entities (2)
-
Cluster module
no independent evidence
-
Regenerator module
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
work page 2021
-
[2]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. ukasz Kaiser, and I. Polosukhin, “Attention is All you Need,” inAdvances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017
work page 2017
-
[3]
End-to-End Object Detection with Transformers,
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, vol. 12346, pp. 213–229
work page 2020
-
[4]
Segvit: Semantic segmentation with plain vision transformers,
B. Zhang, Z. Tian, Q. Tang, X. Chu, X. Wei, C. Shen,et al., “Segvit: Semantic segmentation with plain vision transformers,”Advances in Neural Information Processing Systems, vol. 35, pp. 4971–4982, 2022
work page 2022
-
[5]
Scene Parsing through ADE20K Dataset,
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene Parsing through ADE20K Dataset,” in2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, July 2017, pp. 5122–5130
work page 2017
-
[6]
Semantic Segmentation of Underwater Imagery: Dataset and Benchmark,
M. J. Islam, C. Edge, Y . Xiao, P. Luo, M. Mehtaz, C. Morse, S. S. Enan, and J. Sattar, “Semantic Segmentation of Underwater Imagery: Dataset and Benchmark,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2020, pp. 1769–1776
work page 2020
-
[7]
Rumexweeds: A grassland dataset for agricultural robotics,
R. G ¨uldenring, F. K. Van Evert, and L. Nalpantidis, “Rumexweeds: A grassland dataset for agricultural robotics,”Journal of Field Robotics, vol. 40, no. 6, pp. 1639–1656, 2023
work page 2023
-
[8]
U-Net: Convolutional Networks for Biomedical Image Segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” inMedical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham: Springer International Publishing, 2015, pp. 234–241
work page 2015
-
[9]
OneFormer: One Transformer to Rule Universal Image Segmentation,
J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “OneFormer: One Transformer to Rule Universal Image Segmentation,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, June 2023, pp. 2989–2998
work page 2023
-
[10]
Transformer-Based Visual Segmentation: A Survey,
X. Li, H. Ding, H. Yuan, W. Zhang, J. Pang, G. Cheng, K. Chen, Z. Liu, and C. C. Loy, “Transformer-Based Visual Segmentation: A Survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 138–10 163, Dec. 2024
work page 2024
-
[11]
Rethinking Semantic Segmen- tation from a Sequence-to-Sequence Perspective with Transformers,
S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y . Wang, Y . Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking Semantic Segmen- tation from a Sequence-to-Sequence Perspective with Transformers,” in2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, June 2021, pp. 6877–6886
work page 2021
-
[12]
Segmenter: Trans- former for Semantic Segmentation,
R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Trans- former for Semantic Segmentation,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, Oct. 2021, pp. 7242–7252
work page 2021
-
[13]
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers,
E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers,” inAdvances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 12 077–12 090
work page 2021
-
[14]
Masked-attention Mask Transformer for Universal Image Segmenta- tion,
B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention Mask Transformer for Universal Image Segmenta- tion,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, jun 2022, pp. 1280–1289
work page 2022
-
[15]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
F. Montello, R. G ¨uldenring, S. Scardapane, and L. Nalpantidis, “A Survey on Dynamic Neural Networks: From Computer Vision to Multi-modal Sensor Fusion,”arXiv preprint arXiv:2010.11929, no. arXiv:2501.07451, Jan. 2025
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[16]
Dynam- icViT: Efficient Vision Transformers with Dynamic Token Sparsifica- tion,
Y . Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynam- icViT: Efficient Vision Transformers with Dynamic Token Sparsifica- tion,” inAdvances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 13 937–13 949
work page 2021
-
[17]
IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers,
B. Pan, R. Panda, Y . Jiang, Z. Wang, R. Feris, and A. Oliva, “IA-REDˆ2: Interpretability-Aware Redundancy Reduction for Vision Transformers,” inAdvances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 24 898–24 911
work page 2021
-
[18]
Simple statistical gradient-following algorithms for connectionist reinforcement learning,
R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992
work page 1992
-
[19]
SaiT: Sparse Vision Transformers through Adaptive Token Pruning,
L. Li, D. Thorsley, and J. Hassoun, “SaiT: Sparse Vision Transformers through Adaptive Token Pruning,”arXiv preprint arXiv:2210.05832, Sept. 2022
-
[20]
GTP-ViT: Efficient Vision Transformers via Graph-Based Token Propagation,
X. Xu, S. Wang, Y . Chen, Y . Zheng, Z. Wei, and J. Liu, “GTP-ViT: Efficient Vision Transformers via Graph-Based Token Propagation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 86–95
work page 2024
-
[21]
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer,
Y . Xu, Z. Zhang, M. Zhang, K. Sheng, K. Li, W. Dong, L. Zhang, C. Xu, and X. Sun, “Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 2964–2972, June 2022
work page 2022
-
[22]
Token merging: Your vit but faster,
D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoff- man, “Token merging: Your vit but faster,” inInternational Conference on Learning Representations, 2023
work page 2023
-
[23]
Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning,
Y . Yuan, W. Liang, H. Ding, Z. Liang, C. Zhang, and H. Hu, “Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 250–266, Jan. 2024
work page 2024
-
[24]
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation,
Q. Tang, B. Zhang, J. Liu, F. Liu, and Y . Liu, “Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris, France: IEEE, Oct. 2023, pp. 777–786
work page 2023
-
[25]
Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers,
C. Lu, D. De Geus, and G. Dubbelman, “Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR). Vancouver, BC, Canada: IEEE, June 2023, pp. 23 631– 23 640
work page 2023
-
[26]
Unified Perceptual Parsing for Scene Understanding,
T. Xiao, Y . Liu, B. Zhou, Y . Jiang, and J. Sun, “Unified Perceptual Parsing for Scene Understanding,” inComputer Vision – ECCV 2018, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds. Cham: Springer International Publishing, 2018, vol. 11209, pp. 432–448
work page 2018
-
[27]
ImageNet: A large-scale hierarchical image database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.