Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3
The pith
A distilled confidence predictor ranks tokens by uncertainty so that low-confidence ones can be merged, accelerating visual geometric transformers by up to 21.5x without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers.
What carries the argument
The distilled lightweight confidence predictor that ranks tokens by uncertainty to guide selective merging of low-confidence tokens, thereby shortening the sequence length processed by the transformer.
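The rank-then-merge idea can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the function name `merge_low_confidence`, the `keep_ratio` and `group` parameters, and the fixed-size averaging rule are hypothetical and are not taken from the paper, which this page does not describe at that level of detail.

```python
import numpy as np

def merge_low_confidence(tokens, confidence, keep_ratio=0.25, group=4):
    """Keep the most confident tokens; average the rest in fixed-size groups.

    tokens: (N, D) array, confidence: (N,) array of per-token scores.
    The grouping rule is an illustrative assumption, not the paper's method.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    # Rank tokens from most to least confident.
    order = np.argsort(-confidence, kind="stable")
    # Sort indices again so kept and merged tokens retain their original order.
    keep_idx = np.sort(order[:n_keep])
    merge_idx = np.sort(order[n_keep:])
    kept = tokens[keep_idx]
    # Each group of low-confidence tokens collapses into one averaged token.
    merged = [tokens[merge_idx[i:i + group]].mean(axis=0)
              for i in range(0, len(merge_idx), group)]
    if merged:
        return np.concatenate([kept, np.stack(merged)], axis=0)
    return kept
```

With 8 input tokens, `keep_ratio=0.25`, and `group=4`, the sketch returns 4 tokens: 2 kept as they are plus 2 averaged groups. A dense prediction model would also need a step that maps merged tokens back to their original positions, which this sketch leaves out.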
If this is right
- Up to 21.5x speedup on VGGT and 20.4x speedup on Pi3 with no retraining of the base model.
- The method works across multi-view and streaming visual geometric transformer setups without architecture changes.
- Computation drops while spatial coverage is preserved, outperforming similarity-based merging or pruning.
- Speedups increase with longer input sequences typical of 3D perception and reconstruction tasks.
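Why the speedup would grow with sequence length can be seen from a simplified cost model. The model is an assumption introduced here, not a formula from the paper: per-layer cost is taken as a quadratic attention term plus a linear term, with hypothetical constants $a$ and $b$, a kept fraction $\rho$ of the $N$ input tokens, and the overhead of the confidence predictor and of merging itself ignored.

```latex
% Assumed per-layer cost for N tokens: attention (quadratic) + projections/MLP (linear)
C(N) = a N^{2} + b N, \qquad a, b > 0

% Speedup when merging leaves \rho N tokens, 0 < \rho < 1
S(N) = \frac{C(N)}{C(\rho N)}
     = \frac{a N^{2} + b N}{a \rho^{2} N^{2} + b \rho N}
     = \frac{a N + b}{\rho \,(a \rho N + b)}

% S is increasing in N, since
\frac{dS}{dN} = \frac{a b \rho (1 - \rho)}{\left(a \rho^{2} N + b \rho\right)^{2}} > 0

% Limits
\lim_{N \to 0} S(N) = \frac{1}{\rho}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{\rho^{2}}
```

Under these assumptions, keeping a quarter of the tokens ($\rho = 0.25$) gives a speedup between 4x for short sequences and 16x as the quadratic term dominates. Measured speedups depend on kernels, hardware, and the overheads omitted here.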
Where Pith is reading between the lines
- The same confidence-ranking idea could be tested on other vision transformers to reduce latency on edge hardware for real-time 3D tasks.
- If the predictor generalizes across models, it might offer a plug-in efficiency layer for any attention-based geometric network.
- Streaming applications could further benefit by updating the confidence scores incrementally rather than recomputing them each frame.
Load-bearing premise
The lightweight confidence predictor can rank tokens by uncertainty in a way that matches the spatial regions the transformer emphasizes during inference.
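One way to probe this premise is to compare the predictor's ranking against a reference signal from the base model. The sketch below is a hedged illustration only: the names `spearman_rank_agreement` and `low_conf_overlap` are hypothetical, and the existence of a per-token reference confidence to compare against is an assumption rather than something this page confirms.

```python
import numpy as np

def spearman_rank_agreement(predicted, reference):
    """Spearman correlation between two score vectors (assumes no tied scores)."""
    def ranks(x):
        r = np.empty(len(x), dtype=float)
        r[np.argsort(x)] = np.arange(len(x))  # rank 0 = lowest score
        return r
    rp = ranks(np.asarray(predicted))
    rr = ranks(np.asarray(reference))
    # Pearson correlation of the ranks equals Spearman correlation.
    return float(np.corrcoef(rp, rr)[0, 1])

def low_conf_overlap(predicted, reference, merge_ratio=0.5):
    """Share of the reference's lowest-scored tokens that the predictor also scores lowest."""
    k = max(1, int(len(predicted) * merge_ratio))
    low_p = set(np.argsort(predicted)[:k].tolist())
    low_r = set(np.argsort(reference)[:k].tolist())
    return len(low_p & low_r) / k
```

The overlap measure is arguably the more relevant of the two for merging, because only membership in the low-confidence set decides which tokens are merged, not the exact order within it.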
What would settle it
Applying the merging to VGGT or Pi3 at the settings that produce the reported speedups and measuring a clear drop in accuracy on standard 3D reconstruction or pose estimation benchmarks would falsify the performance claim.
Original abstract
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces Co-Me, a method for accelerating visual geometric transformers by distilling a lightweight confidence predictor that ranks and merges low-confidence tokens. It claims to achieve up to 21.5x speedup on VGGT and 20.4x on Pi3 without retraining or performance loss, by better preserving spatial coverage and geometric fidelity in multi-view and streaming settings compared to similarity-based approaches.
Significance. If the results hold, this work could make visual geometric transformers practical for real-time 3D tasks by offering a scalable, model-agnostic acceleration technique grounded in uncertainty rather than token similarity.
major comments (2)
- Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.
- Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.
minor comments (1)
- The distinction between confidence-guided merging and similarity-based merging would benefit from a concrete example or diagram to clarify the claimed advantage in spatial coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and completeness that we will address to strengthen the manuscript. We respond to each major comment below.
- Referee: Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.
  Authors: We agree that the abstract, constrained by length, presents the claims at a high level without supporting experimental specifics. The full manuscript contains these details in the Experiments section, including direct comparisons to similarity-based baselines, error bars from repeated runs, and ablations on geometric fidelity metrics across multi-view and streaming settings. To improve accessibility, we will revise the abstract to concisely reference the evaluation protocol, key baselines, and quantitative preservation of performance.
  Revision: partial
- Referee: Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.
  Authors: We acknowledge that additional methodological specifics are needed for full reproducibility and to substantiate the alignment claim. In the revised manuscript we will expand the relevant subsection to detail the training procedure (including loss formulation and optimization), the exact data sources and uncertainty signals used for distillation, and the validation experiments (with supporting visualizations) that demonstrate correspondence to regions emphasized by the base transformer.
  Revision: yes
Circularity Check
No significant circularity detected
The paper introduces Co-Me as an external distilled lightweight confidence predictor applied to existing models like VGGT and Pi3 without retraining. Speedup claims derive from token reduction scaling with sequence length and empirical measurements on target tasks, not from any fitted parameter or self-defined quantity that is then renamed as a prediction. No load-bearing step reduces by construction to inputs via self-citation, uniqueness theorem, or ansatz smuggling; the confidence signal is presented as independently trained on uncertainty and validated separately from evaluation metrics. The derivation chain remains self-contained against external benchmarks.