Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Erik Sandstr\"om; Federico Tombari; Igor Gilitschenski; Marie-Julie Rakotosaona; Michael Oechsle; Shuhong Zheng

REVIEW 2 major objections 1 minor 116 references

A two-stage token selection strategy speeds up visual geometry transformers by over 85 percent on large scenes while preserving accuracy.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-05-25 04:25 UTC pith:KLP2RQ4B

load-bearing objection The paper's two-stage token selection strategy for visual geometry transformers is a sensible engineering fix for quadratic attention costs, but its reliability in edge cases needs more scrutiny. the 2 major comments →

arxiv 2605.23892 v1 pith:KLP2RQ4B submitted 2026-05-22 cs.CV cs.AIcs.GRcs.LGcs.RO

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng , Michael Oechsle , Erik Sandstr\"om , Marie-Julie Rakotosaona , Federico Tombari , Igor Gilitschenski This is my paper

classification cs.CV cs.AIcs.GRcs.LGcs.RO

keywords token selectionvisual geometry transformersmulti-view 3D reconstructionattention efficiencytoken pruningtransformer accelerationsparse attention

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the quadratic scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. It proposes restricting key and value tokens through a two-stage process: first selecting diverse frames to ensure scene coverage, then pruning redundant tokens inside those frames using per-layer attention entropy. The goal is to reduce computation for sequences of hundreds of images without losing the models ability to jointly predict 3D attributes. A sympathetic reader would care because the method targets the main barrier to applying these transformers on real-world scale data.

Core claim

Restricting the number of key/value tokens each query interacts with during global attention via inter-frame diversity selection followed by intra-frame layer-aware sparsification guided by attention entropy accelerates visual geometry transformers by over 85 percent for scenes with 500 images while maintaining or improving baseline performance.

What carries the argument

Two-stage token selection framework with inter-frame diversity criterion at the frame level and intra-frame selection by entropy of the global attention pattern at each layer.

Load-bearing premise

That a diversity criterion at the frame level combined with per-layer attention-entropy selection will reliably discard only redundant tokens without discarding geometrically critical information across varied scenes, camera motions, and model architectures.

What would settle it

Experiments on scenes with rapid camera motion or complex geometry that show 3D reconstruction metrics dropping below baseline variance when the selection is applied.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

Processing of 500-plus image scenes becomes practical within existing hardware budgets.
Many tokens in these models carry redundant geometric information that can be removed without accuracy loss.
Layer-aware selection outperforms uniform pruning strategies.
The same selection logic can be applied to other attention-based multi-view models.
Token budgets can be set once per scene rather than recomputed per query.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other transformer architectures that process unordered image sets for geometric tasks.
Adaptive token counts based on estimated scene redundancy could be learned instead of using fixed diversity thresholds.
Combining this pruning with existing model compression methods could produce further speed gains.
The entropy signal might serve as a diagnostic for which layers contribute most to geometric reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage token selection framework for visual geometry transformers to address quadratic attention costs: an inter-frame stage using a diversity criterion at the frame level to preserve scene coverage, followed by an intra-frame stage applying layer-aware sparsification guided by the entropy of global attention patterns. The central claim, supported by extensive experiments, is that this yields over 85% acceleration on 500-image scenes while maintaining or improving baseline performance on multi-view 3D reconstruction tasks.

Significance. If the empirical results hold, the work would meaningfully advance scalability of feed-forward visual geometry models by enabling efficient processing of large image sets without accuracy degradation. The analysis-driven selection rules (diversity + attention entropy) and the reported speed-accuracy trade-off represent a practical contribution, though the absence of machine-checked elements or parameter-free derivations limits the strength of the assessment.

major comments (2)

[Abstract / Experiments] Abstract and experimental evaluation: the assertion of 'over 85% acceleration ... while maintaining, or even improving, baseline performance' is presented without baseline details, error bars, dataset statistics, ablation tables, or quantitative comparisons, which is load-bearing for the central empirical claim of a superior trade-off.
[Method (inter-frame and intra-frame selection)] Method (two-stage framework description): the inter-frame diversity criterion combined with per-layer attention-entropy selection is motivated by attention pattern analysis, but no test or analysis demonstrates that retained tokens preserve geometrically critical information (e.g., epipolar or multi-view consistency cues) under rapid motion, large baselines, or low-texture conditions; this directly underpins the no-accuracy-loss guarantee.

minor comments (1)

[Abstract] The project website is referenced but no statement on code or model release is provided, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to strengthen the presentation of results and method justification.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and experimental evaluation: the assertion of 'over 85% acceleration ... while maintaining, or even improving, baseline performance' is presented without baseline details, error bars, dataset statistics, ablation tables, or quantitative comparisons, which is load-bearing for the central empirical claim of a superior trade-off.

Authors: The full experimental evaluation in Section 4 provides the requested details: baseline comparisons against prior token selection methods, error bars from repeated runs, dataset statistics (e.g., image counts and scene characteristics from DTU and Tanks&Temples), ablation tables for inter- and intra-frame stages, and quantitative speed-accuracy curves. The abstract summarizes the headline result. We will revise the abstract to briefly reference the primary evaluation datasets and direct readers to the experiments section for ablations and comparisons. revision: yes
Referee: [Method (inter-frame and intra-frame selection)] Method (two-stage framework description): the inter-frame diversity criterion combined with per-layer attention-entropy selection is motivated by attention pattern analysis, but no test or analysis demonstrates that retained tokens preserve geometrically critical information (e.g., epipolar or multi-view consistency cues) under rapid motion, large baselines, or low-texture conditions; this directly underpins the no-accuracy-loss guarantee.

Authors: Section 3 motivates the criteria from attention pattern analysis. Preservation of geometric cues is supported by the end-to-end multi-view reconstruction results, which require epipolar consistency and multi-view agreement; the evaluated scenes encompass varying motion, baselines, and texture levels, and accuracy is maintained or improved at high reduction ratios. We will add a discussion paragraph in the method section explicitly linking the diversity criterion to scene coverage and the entropy criterion to retention of high-attention tokens, with reference to qualitative attention visualizations already present in the experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical token-selection rules validated by experiments

full rationale

The paper introduces an inter-frame diversity selector and intra-frame attention-entropy sparsifier, then reports empirical speed/accuracy results on visual geometry transformers. No equations, fitted parameters, or self-citations are shown that would make the claimed 85%+ speedup equivalent to the selection criteria by construction. The method is motivated by attention-pattern analysis and tested on held-out scenes; the central claim therefore rests on external experimental outcomes rather than definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the method is presented as an empirical selection heuristic.

pith-pipeline@v0.9.0 · 5822 in / 966 out tokens · 34067 ms · 2026-05-25T04:25:24.761680+00:00 · methodology

0 comments

read the original abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints that how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.

Figures

Figures reproduced from arXiv: 2605.23892 by Erik Sandstr\"om, Federico Tombari, Igor Gilitschenski, Marie-Julie Rakotosaona, Michael Oechsle, Shuhong Zheng.

**Figure 2.** Figure 2: Pipeline of GoToHunt. Token selection is performed in the K/V space prior to the global attention layers, to determine which key/value tokens each query token interacts with. Our approach follows a two-stage hierarchical design: inter-frame selection first conducts frame-level selection, while intra-frame selection subsequently discard more tokens within each selected frame. tokens per frame), along with a… view at source ↗

**Figure 3.** Figure 3: Illustration of inter-frame selection with K = 10: the selected views (red) form a diverse subset of the full set of views (blue), maximizing view-space coverage under a limited budget. Diversity-based Frame Selection. In contrast to the above strategies, our intuition is to select a set of frames, within a given budget, that can maximize view-space coverage. Formally, given N images with d-dimensional f… view at source ↗

**Figure 4.** Figure 4: Attention pattern analysis of global attention layers (0-23) within VGGT [ [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

discussion (0)

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 10 internal anchors

[1]

Seitz, and Richard Szeliski

Sameer Agarwal, Noah Snavely, Steven M. Seitz, and Richard Szeliski. Bundle adjustment in the large. InECCV, 2010. 3

work page 2010
[2]

DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

Vivek Alumootil and Tuan-Anh Vu. DePT3R: Joint dense point tracking and 3D reconstruction of dynamic scenes in a single forward pass.arXiv preprint arXiv:2512.13122, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Goldman, Matthias Nießner, and Justus Thies

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. InCVPR, 2022. 7, 18, 19, 20, 21, 23

work page 2022
[4]

MegaLoc: One retrieval to place them all

Gabriele Berton and Carlo Masone. MegaLoc: One retrieval to place them all. InCVPR Workshops, 2025. 4, 16

work page 2025
[5]

MUSt3R: Multi-view network for stereo 3D reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. InCVPR, 2025. 3

work page 2025
[6]

Geometric Context Transformer for Streaming 3D Reconstruction

Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, and Yinghao Xu. Geometric context transformer for streaming 3D reconstruction.arXiv preprint arXiv:2604.14141, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Easi3R: Estimating disentangled motion from DUSt3R without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3R: Estimating disentangled motion from DUSt3R without training. InICCV, 2025. 3

work page 2025
[8]

TTT3R: 3D reconstruction as test-time training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. InICLR, 2026. 3

work page 2026
[9]

Human3R: Everyone everywhere all at once

Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Gerard Pons-Moll. Human3R: Everyone everywhere all at once. InICLR, 2026. 3

work page 2026
[10]

Co-Me: Confidence-guided token merging for visual geometric transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, and Sebastian Scherer. Co-Me: Confidence-guided token merging for visual geometric transformers. InCVPR, 2026. 7, 8

work page 2026
[11]

LONG3R: Long sequence streaming 3D reconstruction

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. LONG3R: Long sequence streaming 3D reconstruction. InICCV, 2025. 3

work page 2025
[12]

StereoVGGT: A training-free visual geometry transformer for stereo vision.arXiv preprint arXiv:2603.29368, 2026

Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, and Liujuan Cao. StereoVGGT: A training-free visual geometry transformer for stereo vision.arXiv preprint arXiv:2603.29368, 2026. 3

work page arXiv 2026
[13]

LongStream: Long-sequence streaming autoregressive visual geometry

Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, and Hao Wang. LongStream: Long-sequence streaming autoregressive visual geometry. InCVPR, 2026. 3

work page 2026
[14]

MERG3R: A divide-and-conquer approach to large-scale neural visual geometry

Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, and Nandita Vijaykumar. MERG3R: A divide-and-conquer approach to large-scale neural visual geometry. InCVPR, 2026. 7

work page 2026
[15]

Attention alignment and flexible positional embeddings improve transformer length extrapolation

Ta-Chung Chi, Ting-Han Fan, and Alexander Rudnicky. Attention alignment and flexible positional embeddings improve transformer length extrapolation. InNAACL Findings, 2024. 2

work page 2024
[16]

Keyframe-based feed-forward visual odometry.arXiv preprint arXiv:2601.16020, 2026

Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, and Wanzeng Kong. Keyframe-based feed-forward visual odometry.arXiv preprint arXiv:2601.16020, 2026. 3

work page arXiv 2026
[17]

VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s limits on kilometer-scale long RGB sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s limits on kilometer-scale long RGB sequences. InICRA, 2026. 24

work page 2026
[18]

Reloc-VGGT: Visual re-localization with geometry grounded transformer.arXiv preprint arXiv:2512.21883, 2025

Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, and Hesheng Wang. Reloc-VGGT: Visual re-localization with geometry grounded transformer.arXiv preprint arXiv:2512.21883, 2025. 3 10

work page arXiv 2025
[19]

LASER: Layer-wise scale alignment for training-free streaming 4D reconstruction

Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, and Huaizu Jiang. LASER: Layer-wise scale alignment for training-free streaming 4D reconstruction. InCVPR, 2026. 3

work page 2026
[20]

Build- ing temporally coherent 3D maps with VGGT for memory-efficient semantic SLAM.arXiv preprint arXiv:2511.16282, 2025

Gergely Dinya, Péter Halász, András L ˝orincz, Kristóf Karacs, and Anna Gelencsér-Horváth. Build- ing temporally coherent 3D maps with VGGT for memory-efficient semantic SLAM.arXiv preprint arXiv:2511.16282, 2025. 3

work page arXiv 2025
[21]

MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, and Yan Wang. MeMix: Writing less, remembering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, 2026. 3

work page arXiv 2026
[22]

MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion

Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025. 3

work page 2025
[23]

VGG-T3: Offline feed-forward 3D reconstruction at scale

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, and Aljosa Osep. VGG-T3: Offline feed-forward 3D reconstruction at scale. InCVPR, 2026. 24

work page 2026
[24]

MoRe: Motion-aware feed-forward 4D reconstruction transformer

Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, and Yu-Shen Liu. MoRe: Motion-aware feed-forward 4D reconstruction transformer. InCVPR, 2026. 3

work page 2026
[25]

IncVGGT: Incremental VGGT for memory-bounded long-range 3D reconstruction

Keyu Fang, Changchun Zhou, Yuzhe Fu, Hai Helen Li, and Yiran Chen. IncVGGT: Incremental VGGT for memory-bounded long-range 3D reconstruction. InICLR, 2026. 3

work page 2026
[26]

Dens3R: A foundation model for 3D geometry prediction

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiao-Mu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lv. Dens3R: A foundation model for 3D geometry prediction. InICLR, 2026. 3

work page 2026
[27]

Quantized visual geometry grounded transformer

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, and Yongjun Xu. Quantized visual geometry grounded transformer. InICLR, 2026. 3

work page 2026
[28]

MoRE: 3D visual geometry reconstruction meets mixture-of-experts

Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, and Yichao Yan. MoRE: 3D visual geometry reconstruction meets mixture-of-experts. InCVPR, 2026. 3

work page 2026
[29]

SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation.arXiv preprint arXiv:2602.15899, 2026

Anna Gelencsér-Horváth, Gergely Dinya, Dorka Boglárka Er˝os, Péter Halász, Islam Muhammad Muqsit, and Kristóf Karacs. SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation.arXiv preprint arXiv:2602.15899, 2026. 3

work page arXiv 2026
[30]

Gonzalez

Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance.Theoretical Computer Science, 38:293–306, 1985. 5

work page 1985
[31]

Emergent outlier view rejection in visual geometry grounded transformers

Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers. In CVPR, 2026. 3

work page 2026
[32]

DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, Jia Cai, and Xiangyang Xue. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026. 3

work page arXiv 2026
[33]

arXiv preprint arXiv:2511.19971 (2025)

Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction.arXiv preprint arXiv:2511.19971, 2025. 3

work page arXiv 2025
[34]

Pow3R: Empowering unconstrained 3d reconstruction with camera and scene priors

Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering unconstrained 3d reconstruction with camera and scene priors. InCVPR, 2025. 3

work page 2025
[35]

Drivevggt: Visual geometry transformer for autonomous driving,

Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong, and Junchi Yan. DriveVGGT: Visual geometry transformer for autonomous driving.arXiv preprint arXiv:2511.22264, 2025. 3

work page arXiv 2025
[36]

Geo4D: Leveraging video generators for geometric 4D scene reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4D: Leveraging video generators for geometric 4D scene reconstruction. InICCV, 2025. 3

work page 2025
[37]

Barron, Noah Snavely, and Aleksander Holynski

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3D reconstruction via test-time training. InCVPR, 2026. 24

work page 2026
[38]

FILT3R: Latent state adaptive Kalman filter for streaming 3D reconstruction.arXiv preprint arXiv:2603.18493, 2026

Seonghyun Jin and Jong Chul Ye. FILT3R: Latent state adaptive Kalman filter for streaming 3D reconstruction.arXiv preprint arXiv:2603.18493, 2026. 3

work page arXiv 2026
[39]

Any4D: Unified feed-forward metric 4D reconstruction

Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4D reconstruction. InCVPR, 2026. 3 11

work page 2026
[40]

Keyframe-based visual-inertial online SLAM with relocalization

Anton Kasyanov, Francis Engelmann, Jörg Stückler, and Bastian Leibe. Keyframe-based visual-inertial online SLAM with relocalization. InIROS, 2017. 2

work page 2017
[41]

MapAnything: Universal feed-forward metric 3D reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fis- cher, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez- Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. ...

work page 2026
[42]

G-CUT3R: Guided 3D reconstruction with camera and depth prior integration.arXiv preprint arXiv:2508.11379,

Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, and Evgeny Burnaev. G-CUT3R: Guided 3D reconstruction with camera and depth prior integration.arXiv preprint arXiv:2508.11379,

work page arXiv
[43]

HeSS: Head sensitivity score for sparsity redistribution in VGGT

Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, and Sungroh Yoon. HeSS: Head sensitivity score for sparsity redistribution in VGGT. InCVPR, 2026. 3

work page 2026
[44]

STream3R: Scalable sequential 3D reconstruction with causal transformer

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. STream3R: Scalable sequential 3D reconstruction with causal transformer. InICLR, 2026. 3

work page 2026
[45]

Distill3R: A pipeline for democratizing 3D foundation models on commodity hardware.arXiv preprint arXiv:2602.00865, 2026

Brandon Leblanc and Charalambos Poullis. Distill3R: A pipeline for democratizing 3D foundation models on commodity hardware.arXiv preprint arXiv:2602.00865, 2026. 24

work page arXiv 2026
[46]

SwiftVGGT: A scalable visual geometry grounded transformer for large-scale scenes

Jungho Lee, Minhyeok Lee, Sunghun Yang, Minseok Kang, and Sangyoun Lee. SwiftVGGT: A scalable visual geometry grounded transformer for large-scale scenes. InCVPR Findings, 2026. 3

work page 2026
[47]

Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024. 3

work page 2024
[48]

Keyframe-based visual-inertial odometry using nonlinear optimization.IJRR, 2015

Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization.IJRR, 2015. 2

work page 2015
[49]

Analyzing the mechanism of attention collapse in vggt from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

Huan Li, Longjun Luo, Yuling Shi, and Xiaodong Gu. Analyzing the mechanism of attention collapse in VGGT from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025. 3

work page arXiv 2025
[50]

IGGT: Instance-grounded geometry transformer for semantic 3D reconstruction

Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, and Ziwei Liu. IGGT: Instance-grounded geometry transformer for semantic 3D reconstruction. InICLR, 2026. 3

work page 2026
[51]

WinT3R: Window-based streaming reconstruction with camera token pool

Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. WinT3R: Window-based streaming reconstruction with camera token pool. InICLR, 2026. 3

work page 2026
[52]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. InICLR, 2026. 1, 3

work page 2026
[53]

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo, and Luca Ballan. Mem3R: Streaming 3D reconstruction with hybrid memory via test-time training.arXiv preprint arXiv:2604.07279, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Xuanyi Liu, Chunan Yu, Deyi Ji, Qi Zhu, Lingyun Sun, Xuanfu Li, Jin Ma, Tianrun Chen, and Lanyun Zhu. StreamCacheVGGT: Streaming visual geometry transformers with robust scoring and hybrid cache compression.arXiv preprint arXiv:2604.15237, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

VGGT-X: When VGGT meets dense novel view synthesis.arXiv preprint arXiv:2509.25191, 2025

Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, and Zhaoxiang Zhang. VGGT-X: When VGGT meets dense novel view synthesis.arXiv preprint arXiv:2509.25191, 2025. 3

work page arXiv 2025
[56]

Align3R: Aligned monocular depth estimation for dynamic videos

Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, and Yuan Liu. Align3R: Aligned monocular depth estimation for dynamic videos. In CVPR, 2025. 3

work page 2025
[57]

OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, and Yung-Yao Chen. OVGGT: O(1) constant-cost streaming visual geometry transformer.arXiv preprint arXiv:2603.05959, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

UniScale: Unified scale-aware 3D reconstruction for multi-view understanding via prior injection for robotic perception.arXiv preprint arXiv:2602.23224, 2026

Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, and Bingbing Liu. UniScale: Unified scale-aware 3D reconstruction for multi-view understanding via prior injection for robotic perception.arXiv preprint arXiv:2602.23224, 2026. 3 12

work page arXiv 2026
[59]

Evict3r: Training-free token eviction for memory-bounded streaming visual geometry trans- formers.arXiv preprint arXiv:2509.17650, 2025

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-free token eviction for memory-bounded streaming visual geometry transformers.arXiv preprint arXiv:2509.17650, 2025. 3

work page arXiv 2025
[60]

TrajVG: 3D trajectory-coupled visual geometry learning.arXiv preprint arXiv:2602.04439, 2026

Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, and Junting Dong. TrajVG: 3D trajectory-coupled visual geometry learning.arXiv preprint arXiv:2602.04439, 2026. 3

work page arXiv 2026
[61]

ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals

Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguère, and Cyrill Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. InIROS, 2019. 7, 8, 17, 23

work page 2019
[62]

Hyvggt-vo: Tightly coupled hy- brid dense visual odometry with feed-forward models,

Junxiang Pan, Lipu Zhou, and Baojie Chen. HyVGGT-VO: Tightly coupled hybrid dense visual odometry with feed-forward models.arXiv preprint arXiv:2604.02107, 2026. 3

work page arXiv 2026
[63]

Tail-aware post-training quantization for 3d geometry models.arXiv preprint arXiv:2602.01741,

Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang, Jiawei Li, Bin Chen, Shu-Tao Xia, and Zhi Wang. Tail-aware post-training quantization for 3D geometry models.arXiv preprint arXiv:2602.01741, 2026. 3

work page arXiv 2026
[64]

OmniVGGT: Omni-modality driven visual geometry grounded transformer

Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, and Ziwei Liu. OmniVGGT: Omni-modality driven visual geometry grounded transformer. InCVPR, 2026. 3

work page 2026
[65]

Flow4r: Unifying 4d reconstruction and tracking with scene flow

Shenhan Qian, Ganlin Zhang, Shangzhe Wu, and Daniel Cremers. Flow4R: Unifying 4D reconstruction and tracking with scene flow.arXiv preprint arXiv:2602.14021, 2026. 3

work page arXiv 2026
[66]

SLARM: Streaming and language-aligned reconstruction model for dynamic scenes.arXiv preprint arXiv:2603.22893, 2026

Zhicheng Qiu, Jiarui Meng, Tong an Luo, Yican Huang, Xuan Feng, Xuanfu Li, and Zhan Xu. SLARM: Streaming and language-aligned reconstruction model for dynamic scenes.arXiv preprint arXiv:2603.22893, 2026. 3

work page arXiv 2026
[67]

Speed3R: Sparse feed-forward 3D reconstruction models

Weining Ren, Xiao Tan, and Kai Han. Speed3R: Sparse feed-forward 3D reconstruction models. In CVPR Findings, 2026. 3, 7, 8

work page 2026
[68]

Schönberger and Jan-Michael Frahm

Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 1, 3

work page 2016
[69]

MUT3R: Motion-aware updating transformer for dynamic 3D reconstruction.arXiv preprint arXiv:2512.03939, 2025

Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, and Jingchuan Wang. MUT3R: Motion-aware updating transformer for dynamic 3D reconstruction.arXiv preprint arXiv:2512.03939, 2025. 3

work page arXiv 2025
[70]

FastVGGT: Fast visual geometry transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Fast visual geometry transformer. InICLR, 2026. 2, 3, 4, 7, 8

work page 2026
[71]

Scene coordinate regression forests for camera relocalization in RGB-D images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. InCVPR, 2013. 4, 5, 7, 18, 20, 21, 22, 23

work page 2013
[72]

LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging

Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging. InCVPR, 2026. 2, 3, 5, 7, 8

work page 2026
[73]

A benchmark for the evaluation of RGB-D SLAM systems

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. InIROS, 2012. 7, 23

work page 2012
[74]

Dynamic point maps: A versatile representation for dynamic 3D reconstruction

Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3D reconstruction. InICCV, 2025. 3

work page 2025
[75]

Lilienthaland, and Martin Magnusson

Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthaland, and Martin Magnusson. Dense dynamic scene reconstruction and camera pose estimation from multi-view videos.arXiv preprint arXiv:2603.12064, 2026. 3

work page arXiv 2026
[76]

A VGGT: Rethinking global attention for accelerating VGGT

Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, and Jianfu Zhang. A VGGT: Rethinking global attention for accelerating VGGT. InCVPR, 2026. 3, 5, 6

work page 2026
[77]

Learning to (learn at test time): RNNs with expressive hidden states

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InICML, 2025. 24

work page 2025
[78]

MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 3 13

work page 2025
[79]

tttLRM: Test-time training for long context and autoregressive 3D reconstruction

Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttLRM: Test-time training for long context and autoregressive 3D reconstruction. InCVPR,

work page
[80]

Block-sparse global attention for efficient multi-view geometry transformers

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, and Bastian Leibe. Block-sparse global attention for efficient multi-view geometry transformers. InCVPR, 2026. 2, 7, 8

work page 2026

Showing first 80 references.

[1] [1]

Seitz, and Richard Szeliski

Sameer Agarwal, Noah Snavely, Steven M. Seitz, and Richard Szeliski. Bundle adjustment in the large. InECCV, 2010. 3

work page 2010

[2] [2]

DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

Vivek Alumootil and Tuan-Anh Vu. DePT3R: Joint dense point tracking and 3D reconstruction of dynamic scenes in a single forward pass.arXiv preprint arXiv:2512.13122, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Goldman, Matthias Nießner, and Justus Thies

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. InCVPR, 2022. 7, 18, 19, 20, 21, 23

work page 2022

[4] [4]

MegaLoc: One retrieval to place them all

Gabriele Berton and Carlo Masone. MegaLoc: One retrieval to place them all. InCVPR Workshops, 2025. 4, 16

work page 2025

[5] [5]

MUSt3R: Multi-view network for stereo 3D reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. InCVPR, 2025. 3

work page 2025

[6] [6]

Geometric Context Transformer for Streaming 3D Reconstruction

Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, and Yinghao Xu. Geometric context transformer for streaming 3D reconstruction.arXiv preprint arXiv:2604.14141, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Easi3R: Estimating disentangled motion from DUSt3R without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3R: Estimating disentangled motion from DUSt3R without training. InICCV, 2025. 3

work page 2025

[8] [8]

TTT3R: 3D reconstruction as test-time training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. InICLR, 2026. 3

work page 2026

[9] [9]

Human3R: Everyone everywhere all at once

Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Gerard Pons-Moll. Human3R: Everyone everywhere all at once. InICLR, 2026. 3

work page 2026

[10] [10]

Co-Me: Confidence-guided token merging for visual geometric transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, and Sebastian Scherer. Co-Me: Confidence-guided token merging for visual geometric transformers. InCVPR, 2026. 7, 8

work page 2026

[11] [11]

LONG3R: Long sequence streaming 3D reconstruction

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. LONG3R: Long sequence streaming 3D reconstruction. InICCV, 2025. 3

work page 2025

[12] [12]

StereoVGGT: A training-free visual geometry transformer for stereo vision.arXiv preprint arXiv:2603.29368, 2026

Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, and Liujuan Cao. StereoVGGT: A training-free visual geometry transformer for stereo vision.arXiv preprint arXiv:2603.29368, 2026. 3

work page arXiv 2026

[13] [13]

LongStream: Long-sequence streaming autoregressive visual geometry

Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, and Hao Wang. LongStream: Long-sequence streaming autoregressive visual geometry. InCVPR, 2026. 3

work page 2026

[14] [14]

MERG3R: A divide-and-conquer approach to large-scale neural visual geometry

Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, and Nandita Vijaykumar. MERG3R: A divide-and-conquer approach to large-scale neural visual geometry. InCVPR, 2026. 7

work page 2026

[15] [15]

Attention alignment and flexible positional embeddings improve transformer length extrapolation

Ta-Chung Chi, Ting-Han Fan, and Alexander Rudnicky. Attention alignment and flexible positional embeddings improve transformer length extrapolation. InNAACL Findings, 2024. 2

work page 2024

[16] [16]

Keyframe-based feed-forward visual odometry.arXiv preprint arXiv:2601.16020, 2026

Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, and Wanzeng Kong. Keyframe-based feed-forward visual odometry.arXiv preprint arXiv:2601.16020, 2026. 3

work page arXiv 2026

[17] [17]

VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s limits on kilometer-scale long RGB sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT’s limits on kilometer-scale long RGB sequences. InICRA, 2026. 24

work page 2026

[18] [18]

Reloc-VGGT: Visual re-localization with geometry grounded transformer.arXiv preprint arXiv:2512.21883, 2025

Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, and Hesheng Wang. Reloc-VGGT: Visual re-localization with geometry grounded transformer.arXiv preprint arXiv:2512.21883, 2025. 3 10

work page arXiv 2025

[19] [19]

LASER: Layer-wise scale alignment for training-free streaming 4D reconstruction

Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, and Huaizu Jiang. LASER: Layer-wise scale alignment for training-free streaming 4D reconstruction. InCVPR, 2026. 3

work page 2026

[20] [20]

Build- ing temporally coherent 3D maps with VGGT for memory-efficient semantic SLAM.arXiv preprint arXiv:2511.16282, 2025

Gergely Dinya, Péter Halász, András L ˝orincz, Kristóf Karacs, and Anna Gelencsér-Horváth. Build- ing temporally coherent 3D maps with VGGT for memory-efficient semantic SLAM.arXiv preprint arXiv:2511.16282, 2025. 3

work page arXiv 2025

[21] [21]

MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, and Yan Wang. MeMix: Writing less, remembering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, 2026. 3

work page arXiv 2026

[22] [22]

MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion

Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025. 3

work page 2025

[23] [23]

VGG-T3: Offline feed-forward 3D reconstruction at scale

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, and Aljosa Osep. VGG-T3: Offline feed-forward 3D reconstruction at scale. InCVPR, 2026. 24

work page 2026

[24] [24]

MoRe: Motion-aware feed-forward 4D reconstruction transformer

Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, and Yu-Shen Liu. MoRe: Motion-aware feed-forward 4D reconstruction transformer. InCVPR, 2026. 3

work page 2026

[25] [25]

IncVGGT: Incremental VGGT for memory-bounded long-range 3D reconstruction

Keyu Fang, Changchun Zhou, Yuzhe Fu, Hai Helen Li, and Yiran Chen. IncVGGT: Incremental VGGT for memory-bounded long-range 3D reconstruction. InICLR, 2026. 3

work page 2026

[26] [26]

Dens3R: A foundation model for 3D geometry prediction

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiao-Mu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, and Chengfei Lv. Dens3R: A foundation model for 3D geometry prediction. InICLR, 2026. 3

work page 2026

[27] [27]

Quantized visual geometry grounded transformer

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, and Yongjun Xu. Quantized visual geometry grounded transformer. InICLR, 2026. 3

work page 2026

[28] [28]

MoRE: 3D visual geometry reconstruction meets mixture-of-experts

Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, and Yichao Yan. MoRE: 3D visual geometry reconstruction meets mixture-of-experts. InCVPR, 2026. 3

work page 2026

[29] [29]

SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation.arXiv preprint arXiv:2602.15899, 2026

Anna Gelencsér-Horváth, Gergely Dinya, Dorka Boglárka Er˝os, Péter Halász, Islam Muhammad Muqsit, and Kristóf Karacs. SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation.arXiv preprint arXiv:2602.15899, 2026. 3

work page arXiv 2026

[30] [30]

Gonzalez

Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance.Theoretical Computer Science, 38:293–306, 1985. 5

work page 1985

[31] [31]

Emergent outlier view rejection in visual geometry grounded transformers

Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers. In CVPR, 2026. 3

work page 2026

[32] [32]

DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, Jia Cai, and Xiangyang Xue. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026. 3

work page arXiv 2026

[33] [33]

arXiv preprint arXiv:2511.19971 (2025)

Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction.arXiv preprint arXiv:2511.19971, 2025. 3

work page arXiv 2025

[34] [34]

Pow3R: Empowering unconstrained 3d reconstruction with camera and scene priors

Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering unconstrained 3d reconstruction with camera and scene priors. InCVPR, 2025. 3

work page 2025

[35] [35]

Drivevggt: Visual geometry transformer for autonomous driving,

Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong, and Junchi Yan. DriveVGGT: Visual geometry transformer for autonomous driving.arXiv preprint arXiv:2511.22264, 2025. 3

work page arXiv 2025

[36] [36]

Geo4D: Leveraging video generators for geometric 4D scene reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4D: Leveraging video generators for geometric 4D scene reconstruction. InICCV, 2025. 3

work page 2025

[37] [37]

Barron, Noah Snavely, and Aleksander Holynski

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3D reconstruction via test-time training. InCVPR, 2026. 24

work page 2026

[38] [38]

FILT3R: Latent state adaptive Kalman filter for streaming 3D reconstruction.arXiv preprint arXiv:2603.18493, 2026

Seonghyun Jin and Jong Chul Ye. FILT3R: Latent state adaptive Kalman filter for streaming 3D reconstruction.arXiv preprint arXiv:2603.18493, 2026. 3

work page arXiv 2026

[39] [39]

Any4D: Unified feed-forward metric 4D reconstruction

Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4D reconstruction. InCVPR, 2026. 3 11

work page 2026

[40] [40]

Keyframe-based visual-inertial online SLAM with relocalization

Anton Kasyanov, Francis Engelmann, Jörg Stückler, and Bastian Leibe. Keyframe-based visual-inertial online SLAM with relocalization. InIROS, 2017. 2

work page 2017

[41] [41]

MapAnything: Universal feed-forward metric 3D reconstruction

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fis- cher, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez- Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. ...

work page 2026

[42] [42]

G-CUT3R: Guided 3D reconstruction with camera and depth prior integration.arXiv preprint arXiv:2508.11379,

Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, and Evgeny Burnaev. G-CUT3R: Guided 3D reconstruction with camera and depth prior integration.arXiv preprint arXiv:2508.11379,

work page arXiv

[43] [43]

HeSS: Head sensitivity score for sparsity redistribution in VGGT

Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee, and Sungroh Yoon. HeSS: Head sensitivity score for sparsity redistribution in VGGT. InCVPR, 2026. 3

work page 2026

[44] [44]

STream3R: Scalable sequential 3D reconstruction with causal transformer

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. STream3R: Scalable sequential 3D reconstruction with causal transformer. InICLR, 2026. 3

work page 2026

[45] [45]

Distill3R: A pipeline for democratizing 3D foundation models on commodity hardware.arXiv preprint arXiv:2602.00865, 2026

Brandon Leblanc and Charalambos Poullis. Distill3R: A pipeline for democratizing 3D foundation models on commodity hardware.arXiv preprint arXiv:2602.00865, 2026. 24

work page arXiv 2026

[46] [46]

SwiftVGGT: A scalable visual geometry grounded transformer for large-scale scenes

Jungho Lee, Minhyeok Lee, Sunghun Yang, Minseok Kang, and Sangyoun Lee. SwiftVGGT: A scalable visual geometry grounded transformer for large-scale scenes. InCVPR Findings, 2026. 3

work page 2026

[47] [47]

Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024. 3

work page 2024

[48] [48]

Keyframe-based visual-inertial odometry using nonlinear optimization.IJRR, 2015

Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization.IJRR, 2015. 2

work page 2015

[49] [49]

Analyzing the mechanism of attention collapse in vggt from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

Huan Li, Longjun Luo, Yuling Shi, and Xiaodong Gu. Analyzing the mechanism of attention collapse in VGGT from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025. 3

work page arXiv 2025

[50] [50]

IGGT: Instance-grounded geometry transformer for semantic 3D reconstruction

Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, and Ziwei Liu. IGGT: Instance-grounded geometry transformer for semantic 3D reconstruction. InICLR, 2026. 3

work page 2026

[51] [51]

WinT3R: Window-based streaming reconstruction with camera token pool

Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. WinT3R: Window-based streaming reconstruction with camera token pool. InICLR, 2026. 3

work page 2026

[52] [52]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. InICLR, 2026. 1, 3

work page 2026

[53] [53]

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo, and Luca Ballan. Mem3R: Streaming 3D reconstruction with hybrid memory via test-time training.arXiv preprint arXiv:2604.07279, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[54] [54]

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Xuanyi Liu, Chunan Yu, Deyi Ji, Qi Zhu, Lingyun Sun, Xuanfu Li, Jin Ma, Tianrun Chen, and Lanyun Zhu. StreamCacheVGGT: Streaming visual geometry transformers with robust scoring and hybrid cache compression.arXiv preprint arXiv:2604.15237, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [55]

VGGT-X: When VGGT meets dense novel view synthesis.arXiv preprint arXiv:2509.25191, 2025

Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, and Zhaoxiang Zhang. VGGT-X: When VGGT meets dense novel view synthesis.arXiv preprint arXiv:2509.25191, 2025. 3

work page arXiv 2025

[56] [56]

Align3R: Aligned monocular depth estimation for dynamic videos

Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, and Yuan Liu. Align3R: Aligned monocular depth estimation for dynamic videos. In CVPR, 2025. 3

work page 2025

[57] [57]

OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, and Yung-Yao Chen. OVGGT: O(1) constant-cost streaming visual geometry transformer.arXiv preprint arXiv:2603.05959, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[58] [58]

UniScale: Unified scale-aware 3D reconstruction for multi-view understanding via prior injection for robotic perception.arXiv preprint arXiv:2602.23224, 2026

Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, and Bingbing Liu. UniScale: Unified scale-aware 3D reconstruction for multi-view understanding via prior injection for robotic perception.arXiv preprint arXiv:2602.23224, 2026. 3 12

work page arXiv 2026

[59] [59]

Evict3r: Training-free token eviction for memory-bounded streaming visual geometry trans- formers.arXiv preprint arXiv:2509.17650, 2025

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-free token eviction for memory-bounded streaming visual geometry transformers.arXiv preprint arXiv:2509.17650, 2025. 3

work page arXiv 2025

[60] [60]

TrajVG: 3D trajectory-coupled visual geometry learning.arXiv preprint arXiv:2602.04439, 2026

Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, and Junting Dong. TrajVG: 3D trajectory-coupled visual geometry learning.arXiv preprint arXiv:2602.04439, 2026. 3

work page arXiv 2026

[61] [61]

ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals

Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguère, and Cyrill Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. InIROS, 2019. 7, 8, 17, 23

work page 2019

[62] [62]

Hyvggt-vo: Tightly coupled hy- brid dense visual odometry with feed-forward models,

Junxiang Pan, Lipu Zhou, and Baojie Chen. HyVGGT-VO: Tightly coupled hybrid dense visual odometry with feed-forward models.arXiv preprint arXiv:2604.02107, 2026. 3

work page arXiv 2026

[63] [63]

Tail-aware post-training quantization for 3d geometry models.arXiv preprint arXiv:2602.01741,

Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang, Jiawei Li, Bin Chen, Shu-Tao Xia, and Zhi Wang. Tail-aware post-training quantization for 3D geometry models.arXiv preprint arXiv:2602.01741, 2026. 3

work page arXiv 2026

[64] [64]

OmniVGGT: Omni-modality driven visual geometry grounded transformer

Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, and Ziwei Liu. OmniVGGT: Omni-modality driven visual geometry grounded transformer. InCVPR, 2026. 3

work page 2026

[65] [65]

Flow4r: Unifying 4d reconstruction and tracking with scene flow

Shenhan Qian, Ganlin Zhang, Shangzhe Wu, and Daniel Cremers. Flow4R: Unifying 4D reconstruction and tracking with scene flow.arXiv preprint arXiv:2602.14021, 2026. 3

work page arXiv 2026

[66] [66]

SLARM: Streaming and language-aligned reconstruction model for dynamic scenes.arXiv preprint arXiv:2603.22893, 2026

Zhicheng Qiu, Jiarui Meng, Tong an Luo, Yican Huang, Xuan Feng, Xuanfu Li, and Zhan Xu. SLARM: Streaming and language-aligned reconstruction model for dynamic scenes.arXiv preprint arXiv:2603.22893, 2026. 3

work page arXiv 2026

[67] [67]

Speed3R: Sparse feed-forward 3D reconstruction models

Weining Ren, Xiao Tan, and Kai Han. Speed3R: Sparse feed-forward 3D reconstruction models. In CVPR Findings, 2026. 3, 7, 8

work page 2026

[68] [68]

Schönberger and Jan-Michael Frahm

Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 1, 3

work page 2016

[69] [69]

MUT3R: Motion-aware updating transformer for dynamic 3D reconstruction.arXiv preprint arXiv:2512.03939, 2025

Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, and Jingchuan Wang. MUT3R: Motion-aware updating transformer for dynamic 3D reconstruction.arXiv preprint arXiv:2512.03939, 2025. 3

work page arXiv 2025

[70] [70]

FastVGGT: Fast visual geometry transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Fast visual geometry transformer. InICLR, 2026. 2, 3, 4, 7, 8

work page 2026

[71] [71]

Scene coordinate regression forests for camera relocalization in RGB-D images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. InCVPR, 2013. 4, 5, 7, 18, 20, 21, 22, 23

work page 2013

[72] [72]

LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging

Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging. InCVPR, 2026. 2, 3, 5, 7, 8

work page 2026

[73] [73]

A benchmark for the evaluation of RGB-D SLAM systems

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. InIROS, 2012. 7, 23

work page 2012

[74] [74]

Dynamic point maps: A versatile representation for dynamic 3D reconstruction

Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3D reconstruction. InICCV, 2025. 3

work page 2025

[75] [75]

Lilienthaland, and Martin Magnusson

Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthaland, and Martin Magnusson. Dense dynamic scene reconstruction and camera pose estimation from multi-view videos.arXiv preprint arXiv:2603.12064, 2026. 3

work page arXiv 2026

[76] [76]

A VGGT: Rethinking global attention for accelerating VGGT

Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, and Jianfu Zhang. A VGGT: Rethinking global attention for accelerating VGGT. InCVPR, 2026. 3, 5, 6

work page 2026

[77] [77]

Learning to (learn at test time): RNNs with expressive hidden states

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InICML, 2025. 24

work page 2025

[78] [78]

MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 3 13

work page 2025

[79] [79]

tttLRM: Test-time training for long context and autoregressive 3D reconstruction

Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttLRM: Test-time training for long context and autoregressive 3D reconstruction. InCVPR,

work page

[80] [80]

Block-sparse global attention for efficient multi-view geometry transformers

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, and Bastian Leibe. Block-sparse global attention for efficient multi-view geometry transformers. InCVPR, 2026. 2, 7, 8

work page 2026