pith. machine review for the scientific record.

arxiv: 2511.14751 · v2 · pith:SCF7EQVEnew · submitted 2025-11-18 · 💻 cs.CV · cs.RO

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords token merging · confidence prediction · visual geometric transformers · model acceleration · 3D perception · multi-view reconstruction · real-time inference · sequence reduction

The pith

A distilled confidence predictor ranks tokens and merges the low-confidence ones, accelerating visual geometric transformers by up to 21.5x without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Co-Me as a method that distills a lightweight predictor to score token uncertainty and then merges the low-confidence tokens in visual geometric transformers. This selective merging shortens the input sequence and cuts computation while preserving the spatial coverage that the transformer relies on. Unlike similarity-based pruning, the confidence signal aligns more closely with the regions the model actually emphasizes during multi-view and streaming inference. The approach requires no changes to the base model weights and scales with longer sequences typical in 3D perception tasks.

Core claim

Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers.

What carries the argument

The distilled lightweight confidence predictor that ranks tokens by uncertainty to guide selective merging of low-confidence tokens, thereby shortening the sequence length processed by the transformer.
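
The mechanism described above can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not in the source: the names `merge_low_confidence` and `split`, the `merge_ratio` and `group_size` parameters, and mean pooling as the merge rule are all hypothetical, and the confidence scores are taken as given rather than produced by a distilled predictor. It shows only the general shape of the idea (rank, merge the least confident tokens, copy merged tokens back to restore length), not the paper's operators.

```python
import numpy as np


def merge_low_confidence(tokens, confidence, merge_ratio=0.5, group_size=4):
    """Average the lowest-confidence tokens in groups; keep the rest as-is.

    tokens:     (N, D) array of token features.
    confidence: (N,) array, higher means more confident (assumed given).
    Returns (merged_tokens, assignment), where assignment[i] is the row of
    merged_tokens that original token i maps to.
    """
    n = tokens.shape[0]
    # Number of tokens to merge, rounded down to a multiple of group_size.
    n_merge = int(n * merge_ratio) // group_size * group_size
    order = np.argsort(confidence, kind="stable")  # least confident first

    # Sort both sets by position so kept tokens stay in sequence order and
    # each merge group holds tokens that are close in index order.
    low = np.sort(order[:n_merge])
    high = np.sort(order[n_merge:])

    groups = low.reshape(-1, group_size)
    merged = tokens[groups].mean(axis=1)  # (n_merge / group_size, D)

    out = np.concatenate([tokens[high], merged], axis=0)
    assignment = np.empty(n, dtype=np.int64)
    assignment[high] = np.arange(len(high))
    assignment[groups] = len(high) + np.arange(len(groups))[:, None]
    return out, assignment


def split(merged_tokens, assignment):
    """Restore the original sequence length by copying merged tokens back."""
    return merged_tokens[assignment]
```

In this sketch the transformer would process the shortened `out` sequence, and `split` would expand the result back to one feature per original token. High-confidence tokens pass through unchanged, so only the merged regions lose per-token detail.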

If this is right

  • Up to 21.5x speedup on VGGT and 20.4x speedup on Pi3 with no retraining of the base model.
  • The method works across multi-view and streaming visual geometric transformer setups without architecture changes.
  • Computation drops while spatial coverage is preserved, outperforming similarity-based merging or pruning.
  • Speedups increase with longer input sequences typical of 3D perception and reconstruction tasks.
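
A back-of-envelope cost model suggests why speedups could grow with sequence length: attention cost scales with the square of the token count, while projections and MLP layers scale linearly. The sketch below is an illustration under stated assumptions, not the paper's measurement. The function name `estimated_speedup`, the `keep_ratio` and `dim` parameters, and the operation counts (a standard transformer layer with a 4x MLP expansion, ignoring merge overhead and memory effects) are all hypothetical.

```python
def estimated_speedup(n_tokens, keep_ratio, dim=1024):
    """Ratio of per-layer cost before and after shortening the sequence.

    Cost model (an assumption, not from the paper), in multiply-adds:
      attention: 2 * N^2 * D   (QK^T plus the weighted sum over V)
      linear:    12 * N * D^2  (QKV and output projections, 4x MLP)
    """
    def layer_cost(n):
        attention = 2 * n * n * dim
        linear = 12 * n * dim * dim
        return attention + linear

    kept = max(1, int(n_tokens * keep_ratio))
    return layer_cost(n_tokens) / layer_cost(kept)
```

Under this model, keeping a quarter of the tokens gives a ratio near 4 when the linear term dominates (short sequences) and approaches 16 as the quadratic attention term takes over (long sequences). Measured speedups depend on kernels, hardware, and how many layers see the shortened sequence.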

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same confidence-ranking idea could be tested on other vision transformers to reduce latency on edge hardware for real-time 3D tasks.
  • If the predictor generalizes across models, it might offer a plug-in efficiency layer for any attention-based geometric network.
  • Streaming applications could further benefit by updating the confidence scores incrementally rather than recomputing them each frame.

Load-bearing premise

The lightweight confidence predictor can rank tokens by uncertainty in a way that matches the spatial regions the transformer emphasizes during inference.
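
One way to probe this premise is to measure how well the predictor's token ranking agrees with a reference confidence signal. The sketch below computes a Spearman rank correlation with NumPy. It is a hypothetical check, not the validation used in the paper: the function name `spearman_rank_agreement` is invented for illustration, ties are broken by position rather than averaged, and the inputs need at least two tokens.

```python
import numpy as np


def spearman_rank_agreement(predicted, reference):
    """Spearman correlation between two score vectors (ties broken by order)."""
    def ranks(x):
        order = np.argsort(x, kind="stable")
        r = np.empty(len(x), dtype=np.float64)
        r[order] = np.arange(len(x))
        return r

    rp = ranks(np.asarray(predicted))
    rr = ranks(np.asarray(reference))
    # Centre the ranks, then take the normalised dot product.
    rp -= rp.mean()
    rr -= rr.mean()
    return float((rp @ rr) / np.sqrt((rp @ rp) * (rr @ rr)))
```

A value near 1 would mean the predictor orders tokens almost exactly as the reference does. A low value would indicate that the merge decisions are driven by a ranking that differs from the reference signal.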

What would settle it

Applying the merging to VGGT or Pi3 at the settings that yield the reported speedups and measuring a clear drop in accuracy on standard 3D reconstruction or pose estimation benchmarks would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2511.14751 by Ali Agha, Jay Patrikar, Ruogu Li, Sebastian Scherer, Shayegan Omidshafiei, Yuheng Qiu, Yutian Chen.

Figure 1: Co-Me accelerates visual geometric transformers by selectively merging low-confidence tokens guided by a distilled confidence predictor. When applied to VGGT and MapAnything, Co-Me achieves up to 11.3× and 7.2× speedup without retraining or architectural changes to the ViT backbone, turning geometric transformers into real-time-capable models for 3D perception.
Figure 2: Overview of Co-Me. A lightweight module distilled from the frozen ViT backbone predicts per-token confidence from interme…
Figure 3: The proposed mask generation (left), merge (middle), and split (right) operators. Each sample generates an individual merge…
Figure 4: Effect of attention bias correction. Merging tokens dis…
Figure 5: Acceleration ratio of Co-Me-accelerated VGGT across…
Figure 6: Performance vs. speedup trade-off curves on multi-view…
Figure 7: Adding attention bias correction improves performance…
Figure 8: 3D reconstruction with camera trajectory (left) and pre…
Figure 11: Qualitative comparison between MapAnything (left)…
Figure 13: Distillation loss of confidence predictors distilled from…
Figure 14: Confidence distillation with ranking loss achieves sig…
Figure 15: Speedup-accuracy trade-off of Co-Me-accelerated…
Figure 16: NVIDIA Jetson Thor and the Zed 2i stereo camera payload for real-world deployment test. We run MapAnything on chunks of 4 images and stack the results under the world coordinate frame to simulate a visual odometry.
Original abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper introduces Co-Me, a method for accelerating visual geometric transformers by distilling a lightweight confidence predictor that ranks and merges low-confidence tokens. It claims to achieve up to 21.5x speedup on VGGT and 20.4x on Pi3 without retraining or performance loss, by better preserving spatial coverage and geometric fidelity in multi-view and streaming settings compared to similarity-based approaches.

Significance. If the results hold, this work could make visual geometric transformers practical for real-time 3D tasks by offering a scalable, model-agnostic acceleration technique grounded in uncertainty rather than token similarity.

major comments (2)
  1. Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.
  2. Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.
minor comments (1)
  1. The distinction between confidence-guided merging and similarity-based merging would benefit from a concrete example or diagram to clarify the claimed advantage in spatial coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and completeness that we will address to strengthen the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: Abstract: The central claims of substantial speedups (up to 21.5x for VGGT and 20.4x for Pi3) without performance degradation are asserted without any experimental details, baselines, error bars, or ablation results. This absence makes it impossible to evaluate whether the confidence signal reliably preserves geometric fidelity.

    Authors: We agree that the abstract, constrained by length, presents the claims at a high level without supporting experimental specifics. The full manuscript contains these details in the Experiments section, including direct comparisons to similarity-based baselines, error bars from repeated runs, and ablations on geometric fidelity metrics across multi-view and streaming settings. To improve accessibility, we will revise the abstract to concisely reference the evaluation protocol, key baselines, and quantitative preservation of performance. revision: partial

  2. Referee: Method: The distilled lightweight confidence predictor is described as trained on uncertainty signals, but no details are given on the training procedure, data sources, or how it is validated to match regions emphasized by the base transformer. This is load-bearing for the claim that Co-Me avoids degradation across models and setups.

    Authors: We acknowledge that additional methodological specifics are needed for full reproducibility and to substantiate the alignment claim. In the revised manuscript we will expand the relevant subsection to detail the training procedure (including loss formulation and optimization), the exact data sources and uncertainty signals used for distillation, and the validation experiments (with supporting visualizations) that demonstrate correspondence to regions emphasized by the base transformer. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces Co-Me as an external distilled lightweight confidence predictor applied to existing models like VGGT and Pi3 without retraining. Speedup claims derive from token reduction scaling with sequence length and empirical measurements on target tasks, not from any fitted parameter or self-defined quantity that is then renamed as a prediction. No load-bearing step reduces by construction to inputs via self-citation, uniqueness theorem, or ansatz smuggling; the confidence signal is presented as independently trained on uncertainty and validated separately from evaluation metrics. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the confidence predictor is described as distilled but its architecture, loss, or training data details are absent.

pith-pipeline@v0.9.0 · 5451 in / 1013 out tokens · 47841 ms · 2026-05-17T20:32:47.511581+00:00 · methodology
