arxiv: 2511.14751 · v2 · pith:SCF7EQVEnew · submitted 2025-11-18 · 💻 cs.CV · cs.RO

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords: token merging, confidence prediction, visual geometric transformers, model acceleration, 3D perception, multi-view reconstruction, real-time inference, sequence reduction


The pith

A distilled confidence predictor ranks tokens by uncertainty and merges the low-confidence ones, accelerating visual geometric transformers by up to 21.5x without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Co-Me as a method that distills a lightweight predictor to score token uncertainty and then merges the low-confidence tokens in visual geometric transformers. This selective merging shortens the input sequence and cuts computation while preserving the spatial coverage that the transformer relies on. Unlike similarity-based pruning, the confidence signal aligns more closely with the regions the model actually emphasizes during multi-view and streaming inference. The approach requires no changes to the base model weights and scales with longer sequences typical in 3D perception tasks.

Core claim

Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers.

What carries the argument

The distilled lightweight confidence predictor that ranks tokens by uncertainty to guide selective merging of low-confidence tokens, thereby shortening the sequence length processed by the transformer.
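
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`build_merge_mask`, `merge`, `split`), the fixed-size grouping of consecutive tokens, and the averaging rule are all assumptions chosen to mirror the mask generation, merge, and split operators named in the Figure 3 caption. It also assumes the token count is divisible by `group_size`.

```python
from statistics import mean


def build_merge_mask(confidence, group_size, merge_ratio):
    """Return one flag per group; True marks a low-confidence group to merge.

    Assumption: tokens are grouped in consecutive runs of `group_size`.
    """
    n_groups = len(confidence) // group_size
    group_conf = [
        mean(confidence[g * group_size:(g + 1) * group_size])
        for g in range(n_groups)
    ]
    n_merge = int(n_groups * merge_ratio)
    # Rank groups from least to most confident and take the lowest ones.
    lowest = sorted(range(n_groups), key=lambda g: group_conf[g])[:n_merge]
    mask = [False] * n_groups
    for g in lowest:
        mask[g] = True
    return mask


def merge(tokens, mask, group_size):
    """Replace each flagged group by the average of its tokens."""
    merged = []
    for g, is_merged in enumerate(mask):
        group = tokens[g * group_size:(g + 1) * group_size]
        if is_merged:
            merged.append([mean(column) for column in zip(*group)])
        else:
            merged.extend(group)
    return merged


def split(merged, mask, group_size):
    """Restore the original sequence length by copying merged tokens back."""
    restored, i = [], 0
    for is_merged in mask:
        if is_merged:
            restored.extend(list(merged[i]) for _ in range(group_size))
            i += 1
        else:
            restored.extend(merged[i:i + group_size])
            i += group_size
    return restored


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    confidence = [0.9, 0.8, 0.1, 0.2]
    mask = build_merge_mask(confidence, group_size=2, merge_ratio=0.5)
    short = merge(tokens, mask, group_size=2)
    print(len(tokens), "->", len(short), "->", len(split(short, mask, 2)))
```

Because every original position is still represented after `split`, a sketch like this keeps spatial coverage while the transformer blocks between `merge` and `split` see a shorter sequence.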

If this is right

Up to 21.5x speedup on VGGT and 20.4x speedup on Pi3 with no retraining of the base model.
The method works across multi-view and streaming visual geometric transformer setups without architecture changes.
Computation drops while spatial coverage is preserved, outperforming similarity-based merging or pruning.
Speedups increase with longer input sequences typical of 3D perception and reconstruction tasks.
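
One plausible reading of why speedup grows with sequence length is that self-attention cost is quadratic in the number of tokens while the pointwise layers are linear. The toy cost model below illustrates that reasoning only; `block_cost`, `estimated_speedup`, the weights, and the default `dim` are hypothetical and do not reproduce the paper's measurements.

```python
def block_cost(n_tokens, dim, attn_weight=1.0, mlp_weight=1.0):
    """Rough per-block cost: quadratic attention term plus linear pointwise term."""
    attention = attn_weight * n_tokens * n_tokens * dim
    pointwise = mlp_weight * n_tokens * dim * dim
    return attention + pointwise


def estimated_speedup(n_tokens, keep_fraction, dim=1024):
    """Ratio of full-sequence cost to cost after keeping `keep_fraction` of tokens."""
    kept = int(n_tokens * keep_fraction)
    return block_cost(n_tokens, dim) / block_cost(kept, dim)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, round(estimated_speedup(n, keep_fraction=0.5), 2))
```

Under this model, keeping half the tokens gives a speedup that climbs toward 4x (the inverse square of the kept fraction) as the sequence gets longer, because the quadratic term dominates. Real speedups also depend on kernel efficiency and on the cost of the predictor and the merge and split steps, which this sketch ignores.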

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same confidence-ranking idea could be tested on other vision transformers to reduce latency on edge hardware for real-time 3D tasks.
If the predictor generalizes across models, it might offer a plug-in efficiency layer for any attention-based geometric network.
Streaming applications could further benefit by updating the confidence scores incrementally rather than recomputing them each frame.

Load-bearing premise

The lightweight confidence predictor can rank tokens by uncertainty in a way that matches the spatial regions the transformer emphasizes during inference.
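
One way to probe this premise is to measure how well the predictor's ranking agrees with a reference ranking of the same tokens. The sketch below computes a Spearman rank correlation in plain Python. It is only an illustration: `predicted_confidence` and `reference_emphasis` are hypothetical inputs (the page does not say how "emphasis" is measured), and the closed-form formula used here assumes no tied values.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(predicted_confidence, reference_emphasis):
    """Spearman rank correlation between two equally long score lists (n >= 2)."""
    n = len(predicted_confidence)
    rank_a = ranks(predicted_confidence)
    rank_b = ranks(reference_emphasis)
    squared_diff = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * squared_diff / (n * (n * n - 1))


if __name__ == "__main__":
    print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))
```

A value near 1 would support the premise for the tokens tested; a value near 0 or below would suggest the predictor ranks tokens differently from the reference signal.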

What would settle it

If applying Co-Me to VGGT or Pi3 at the merge ratios needed for the reported speedups produced a clear accuracy drop on standard 3D reconstruction or pose estimation benchmarks, the performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2511.14751 by Ali Agha, Jay Patrikar, Ruogu Li, Sebastian Scherer, Shayegan Omidshafiei, Yuheng Qiu, Yutian Chen.

**Figure 1.** Co-Me accelerates visual geometric transformers by selectively merging low-confidence tokens guided by a distilled confidence predictor. When applied to VGGT and MapAnything, Co-Me achieves up to 11.3× and 7.2× speedup without retraining or architectural changes to the ViT backbone, turning geometric transformers into real-time-capable models for 3D perception.

**Figure 2.** Overview of Co-Me. A lightweight module distilled from the frozen ViT backbone predicts per-token confidence from interme…

**Figure 3.** The proposed mask generation (left), merge (middle), and split (right) operators. Each sample generates an individual merge…

**Figure 4.** Effect of attention bias correction. Merging tokens dis…

**Figure 5.** Acceleration ratio of Co-Me-accelerated VGGT across…

**Figure 6.** Performance v. Speedup trade off curves on multi-view…

**Figure 7.** Adding attention bias correction improves performance…

**Figure 8.** 3D reconstruction with camera trajectory (left) and pre…

**Figure 11.** Qualitative comparison between MapAnything (left)…

**Figure 13.** Distillation loss of confidence predictors distilled from…

**Figure 14.** Confidence distillation with ranking loss achieves sig…

**Figure 15.** Speedup-accuracy trade-off of Co-Me-accelerated…

**Figure 16.** NVIDIA Jetson Thor and the Zed 2i stereo camera payload for real-world deployment test. We run MapAnything on chunks of 4 images and stack the results under the world coordinate frame to simulate a visual odometry.

Original abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Co-Me uses a distilled confidence predictor to guide token merging in geometric vision transformers and claims 20x speedups without retraining, but the abstract leaves the actual performance numbers and ablations thin.


The main takeaway is that this paper introduces confidence-guided token merging for models like VGGT and Pi3. Instead of merging based on feature similarity, the authors distill a small predictor to score tokens by uncertainty and merge the low-confidence ones, which they say preserves the parts the transformer actually uses. This runs without changing the base model and reportedly scales well for multi-view and streaming cases, hitting speedups of up to 21.5x and 20.4x.

Referee Report

2 major / 1 minor

Summary. This paper introduces Co-Me, a method for accelerating visual geometric transformers by distilling a lightweight confidence predictor that ranks and merges low-confidence tokens. It claims to achieve up to 21.5x speedup on VGGT and 20.4x on Pi3 without retraining or performance loss, by better preserving spatial coverage and geometric fidelity in multi-view and streaming settings compared to similarity-based approaches.

Significance. If the results hold, this work could make visual geometric transformers practical for real-time 3D tasks by offering a scalable, model-agnostic acceleration technique grounded in uncertainty rather than token similarity.

major comments (2)

Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.
Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.

minor comments (1)

The distinction between confidence-guided merging and similarity-based merging would benefit from a concrete example or diagram to clarify the claimed advantage in spatial coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and completeness that we will address to strengthen the manuscript. We respond to each major comment below.


Referee: Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.

Authors: We agree that the abstract, constrained by length, presents the claims at a high level without supporting experimental specifics. The full manuscript contains these details in the Experiments section, including direct comparisons to similarity-based baselines, error bars from repeated runs, and ablations on geometric fidelity metrics across multi-view and streaming settings. To improve accessibility, we will revise the abstract to concisely reference the evaluation protocol, key baselines, and quantitative preservation of performance.

Revision: partial

Referee: Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.

Authors: We acknowledge that additional methodological specifics are needed for full reproducibility and to substantiate the alignment claim. In the revised manuscript we will expand the relevant subsection to detail the training procedure (including loss formulation and optimization), the exact data sources and uncertainty signals used for distillation, and the validation experiments (with supporting visualizations) that demonstrate correspondence to regions emphasized by the base transformer.

Revision: yes
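
Since the loss formulation is not given here, the following is only an illustration of what a ranking-based distillation objective (the kind named in the Figure 14 caption) might look like. It uses a generic pairwise logistic ranking loss, which is an assumption rather than the paper's stated loss; `pairwise_ranking_loss`, `student_scores`, and `teacher_confidence` are hypothetical names.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(student_scores, teacher_confidence):
    """Mean logistic loss over token pairs, penalising misordered pairs.

    For each pair the teacher orders strictly, the student is pushed to score
    the teacher-preferred token higher. Pairs the teacher ties are skipped.
    """
    losses = []
    for i, j in combinations(range(len(student_scores)), 2):
        if teacher_confidence[i] == teacher_confidence[j]:
            continue
        sign = 1.0 if teacher_confidence[i] > teacher_confidence[j] else -1.0
        margin = sign * (student_scores[i] - student_scores[j])
        # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    teacher = [0.9, 0.5, 0.1]
    print(pairwise_ranking_loss([2.0, 1.0, 0.0], teacher))  # student agrees
    print(pairwise_ranking_loss([0.0, 1.0, 2.0], teacher))  # student reversed
```

A ranking objective of this kind only asks the student to reproduce the ordering of tokens, not the teacher's absolute confidence values, which matches how the scores are used: to pick the lowest-ranked tokens for merging.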

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces Co-Me as an external distilled lightweight confidence predictor applied to existing models like VGGT and Pi3 without retraining. Speedup claims derive from token reduction scaling with sequence length and empirical measurements on target tasks, not from any fitted parameter or self-defined quantity that is then renamed as a prediction. No load-bearing step reduces by construction to inputs via self-citation, uniqueness theorem, or ansatz smuggling; the confidence signal is presented as independently trained on uncertainty and validated separately from evaluation metrics. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the confidence predictor is described as distilled but its architecture, loss, or training data details are absent.


