Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3
The pith
A two-stage token selection strategy speeds up visual geometry transformers by over 85 percent on large scenes while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Restricting the number of key/value tokens each query attends to during global attention accelerates visual geometry transformers by over 85 percent on scenes with 500 images while maintaining or improving baseline performance. Selection runs in two stages: inter-frame diversity selection, followed by intra-frame, layer-aware sparsification guided by attention entropy.
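To make the cost argument concrete, the sketch below counts the query-key pairs scored in one global attention layer, with and without a restricted key/value set. The function name, the 1,000 tokens per frame, and both keep ratios are illustrative assumptions, not values from the paper, and the pair count is only a proxy for attention cost, not a prediction of the reported wall-clock speedup.

```python
def global_attention_pairs(n_frames, tokens_per_frame, frame_keep=1.0, token_keep=1.0):
    """Count query-key pairs scored in one global attention layer.

    frame_keep and token_keep are hypothetical keep ratios in (0, 1]:
    the fraction of frames kept as key/value sources, and the fraction
    of tokens kept inside each retained frame.
    """
    n_queries = n_frames * tokens_per_frame           # every token still issues a query
    kept_frames = max(1, round(n_frames * frame_keep))
    kept_tokens = max(1, round(tokens_per_frame * token_keep))
    n_keys = kept_frames * kept_tokens                # only selected tokens serve as K/V
    return n_queries * n_keys


# Illustrative numbers only: 500 frames, an assumed 1,000 tokens per frame.
full = global_attention_pairs(500, 1000)
sparse = global_attention_pairs(500, 1000, frame_keep=0.2, token_keep=0.5)
print(full, sparse, sparse / full)
```

Because queries are left untouched, the pair count falls linearly in each keep ratio; the quadratic term in the sequence length is replaced by the product of the full query count and the reduced key/value count.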
What carries the argument
A two-stage token selection framework: a diversity criterion chooses which frames to keep (inter-frame), then the entropy of each layer's global attention pattern guides which tokens to keep within those frames (intra-frame).
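One common way to realize a frame-level diversity criterion is greedy farthest-point sampling over per-frame descriptors. The sketch below is a minimal illustration under that assumption; the summary here does not say which diversity measure or which descriptors the paper uses, and `select_diverse_frames` and its arguments are hypothetical names.

```python
import math


def select_diverse_frames(descriptors, budget, start=0):
    """Greedy farthest-point sampling over per-frame descriptors.

    descriptors: list of equal-length float lists, one per frame
    (hypothetical global features; their source is an assumption).
    Returns indices of up to `budget` frames that are mutually far apart.
    """
    n = len(descriptors)
    budget = min(budget, n)
    if budget <= 0:
        return []
    selected = [start]
    # Distance from every frame to its nearest already-selected frame.
    nearest = [math.dist(d, descriptors[start]) for d in descriptors]
    while len(selected) < budget:
        # Pick the frame least covered by the current selection.
        nxt = max(range(n), key=lambda i: nearest[i])
        selected.append(nxt)
        for i in range(n):
            nearest[i] = min(nearest[i], math.dist(descriptors[i], descriptors[nxt]))
    return selected


# Toy 2D descriptors: two near-duplicate pairs plus one outlier view.
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [0.0, 5.0]]
picked = select_diverse_frames(frames, budget=3)
print(picked)
```

In the toy example the near-duplicate frames are skipped in favor of views that cover new regions of descriptor space, which is the "broad coverage of the scene" behavior the abstract attributes to diversity-based selection.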
If this is right
- Processing of 500-plus image scenes becomes practical within existing hardware budgets.
- Many tokens in these models carry redundant geometric information that can be removed without accuracy loss.
- Layer-aware selection outperforms uniform pruning strategies.
- The same selection logic can be applied to other attention-based multi-view models.
- Token budgets can be set once per scene rather than recomputed per query.
Where Pith is reading between the lines
- The approach may extend to other transformer architectures that process unordered image sets for geometric tasks.
- Adaptive token counts based on estimated scene redundancy could be learned instead of using fixed diversity thresholds.
- Combining this pruning with existing model compression methods could produce further speed gains.
- The entropy signal might serve as a diagnostic for which layers contribute most to geometric reasoning.
Load-bearing premise
That a diversity criterion at the frame level combined with per-layer attention-entropy selection will reliably discard only redundant tokens without discarding geometrically critical information across varied scenes, camera motions, and model architectures.
What would settle it
Experiments on scenes with rapid camera motion or complex geometry in which applying the selection degrades 3D reconstruction metrics by more than the baseline's run-to-run variance.
Original abstract
Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.
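The entropy signal mentioned in the abstract can be illustrated with a short sketch that computes the mean Shannon entropy of one layer's attention rows and maps it to a per-layer keep ratio. The abstract only states that selection is guided by attention entropy; the linear mapping, its direction (more diffuse attention keeps more tokens), and the names `keep_ratio_from_entropy`, `min_keep`, and `max_keep` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_attention_entropy(score_rows):
    """Average Shannon entropy (in nats) over the attention rows of one layer.

    score_rows: one list of raw attention logits per query.
    """
    total = 0.0
    for row in score_rows:
        probs = softmax(row)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(score_rows)


def keep_ratio_from_entropy(entropy, n_keys, min_keep=0.1, max_keep=1.0):
    """Hypothetical rule: map normalized entropy linearly to a keep ratio.

    Assumes n_keys > 1. Normalized entropy is 0 for one-hot attention
    and 1 for uniform attention.
    """
    normalized = min(1.0, max(0.0, entropy / math.log(n_keys)))
    return min_keep + (max_keep - min_keep) * normalized


# Two toy "layers" with four keys each: one diffuse, one sharply peaked.
diffuse_layer = [[0.0, 0.0, 0.0, 0.0]]
peaked_layer = [[100.0, 0.0, 0.0, 0.0]]
for name, layer in (("diffuse", diffuse_layer), ("peaked", peaked_layer)):
    h = mean_attention_entropy(layer)
    print(name, h, keep_ratio_from_entropy(h, n_keys=4))
```

Whatever the exact rule, a per-layer statistic of this kind is what makes the sparsification layer-aware: layers with different attention patterns receive different token budgets rather than one uniform pruning rate.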
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage token selection framework for visual geometry transformers to address quadratic attention costs: an inter-frame stage using a diversity criterion at the frame level to preserve scene coverage, followed by an intra-frame stage applying layer-aware sparsification guided by the entropy of global attention patterns. The central claim, supported by extensive experiments, is that this yields over 85% acceleration on 500-image scenes while maintaining or improving baseline performance on multi-view 3D reconstruction tasks.
Significance. If the empirical results hold, the work would meaningfully advance scalability of feed-forward visual geometry models by enabling efficient processing of large image sets without accuracy degradation. The analysis-driven selection rules (diversity + attention entropy) and the reported speed-accuracy trade-off represent a practical contribution, though the absence of machine-checked elements or parameter-free derivations limits the strength of the assessment.
major comments (2)
- [Abstract / Experiments] Abstract and experimental evaluation: the assertion of 'over 85% acceleration ... while maintaining, or even improving, baseline performance' is presented without baseline details, error bars, dataset statistics, ablation tables, or quantitative comparisons, which is load-bearing for the central empirical claim of a superior trade-off.
- [Method (inter-frame and intra-frame selection)] Method (two-stage framework description): the inter-frame diversity criterion combined with per-layer attention-entropy selection is motivated by attention pattern analysis, but no test or analysis demonstrates that retained tokens preserve geometrically critical information (e.g., epipolar or multi-view consistency cues) under rapid motion, large baselines, or low-texture conditions; this directly underpins the no-accuracy-loss guarantee.
minor comments (1)
- [Abstract] The project website is referenced but no statement on code or model release is provided, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to strengthen the presentation of results and method justification.
- Referee: [Abstract / Experiments] Abstract and experimental evaluation: the assertion of 'over 85% acceleration ... while maintaining, or even improving, baseline performance' is presented without baseline details, error bars, dataset statistics, ablation tables, or quantitative comparisons, which is load-bearing for the central empirical claim of a superior trade-off.
  Authors: The full experimental evaluation in Section 4 provides the requested details: baseline comparisons against prior token selection methods, error bars from repeated runs, dataset statistics (e.g., image counts and scene characteristics from DTU and Tanks&Temples), ablation tables for inter- and intra-frame stages, and quantitative speed-accuracy curves. The abstract summarizes the headline result. We will revise the abstract to briefly reference the primary evaluation datasets and direct readers to the experiments section for ablations and comparisons.
  Revision: yes
- Referee: [Method (inter-frame and intra-frame selection)] Method (two-stage framework description): the inter-frame diversity criterion combined with per-layer attention-entropy selection is motivated by attention pattern analysis, but no test or analysis demonstrates that retained tokens preserve geometrically critical information (e.g., epipolar or multi-view consistency cues) under rapid motion, large baselines, or low-texture conditions; this directly underpins the no-accuracy-loss guarantee.
  Authors: Section 3 motivates the criteria from attention pattern analysis. Preservation of geometric cues is supported by the end-to-end multi-view reconstruction results, which require epipolar consistency and multi-view agreement; the evaluated scenes encompass varying motion, baselines, and texture levels, and accuracy is maintained or improved at high reduction ratios. We will add a discussion paragraph in the method section explicitly linking the diversity criterion to scene coverage and the entropy criterion to retention of high-attention tokens, with reference to qualitative attention visualizations already present in the experiments.
  Revision: partial
Circularity Check
No circularity: empirical token-selection rules validated by experiments
The paper introduces an inter-frame diversity selector and intra-frame attention-entropy sparsifier, then reports empirical speed/accuracy results on visual geometry transformers. No equations, fitted parameters, or self-citations are shown that would make the claimed 85%+ speedup equivalent to the selection criteria by construction. The method is motivated by attention-pattern analysis and tested on held-out scenes; the central claim therefore rests on external experimental outcomes rather than definitional or self-referential reduction.