OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Pith reviewed 2026-05-15 15:36 UTC · model grok-4.3
The pith
A training-free method keeps visual geometry transformers at constant memory and compute for videos of any length while matching top accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OVGGT is a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. It achieves this by combining Self-Selective Caching, which uses FFN residual magnitudes to compress the KV cache while remaining compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Experiments on indoor, outdoor, and ultra-long benchmarks show that the method processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
What carries the argument
Self-Selective Caching (FFN residual magnitude selection of KV cache entries) combined with Dynamic Anchor Protection (shielding of coordinate-critical tokens)
If this is right
- Arbitrarily long video sequences can be processed in one pass without memory growth or accuracy loss.
- Real-time 3D mapping from live camera feeds becomes feasible on hardware with fixed VRAM.
- The same model weights work for short clips and hour-long trajectories with no fine-tuning.
- FlashAttention remains usable, so the speed gains of that kernel are preserved.
- State-of-the-art accuracy is reported on indoor, outdoor, and ultra-long sequence benchmarks.
Load-bearing premise
Selecting KV cache entries by FFN residual magnitudes and shielding coordinate-critical tokens will prevent geometric drift over extended trajectories without any additional training or fine-tuning.
What would settle it
Measure 3D reconstruction error and peak VRAM usage after feeding a single continuous video of 2000 frames; if error rises above prior state-of-the-art levels or memory exceeds the declared fixed budget, the central claim does not hold.
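The decision procedure above can be sketched as a small harness. Everything here is a hypothetical placeholder, not an interface the paper defines: `model.step` is assumed to return the current reconstruction error and the bytes of cache memory in use.

```python
def settle_claim(model, frames, vram_budget_bytes, sota_error):
    """Feed one continuous video; the claim holds only if the final
    reconstruction error stays at or below the prior state of the art
    AND peak memory never exceeds the declared fixed budget.

    Hypothetical harness: `model.step(frame)` is assumed to return
    (current_error, bytes_of_cache_memory_in_use).
    """
    peak_mem = 0
    final_error = None
    for frame in frames:
        final_error, mem = model.step(frame)
        peak_mem = max(peak_mem, mem)  # track the memory envelope
    return final_error <= sota_error and peak_mem <= vram_budget_bytes
```

The conjunction matters: passing on error alone while memory creeps upward (or vice versa) would still falsify the central O(1) claim.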
Original abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy. Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OVGGT, a training-free streaming framework for 3D geometry reconstruction from video that achieves constant (O(1)) memory and compute cost independent of sequence length. It combines Self-Selective Caching, which selects KV cache entries by FFN residual magnitudes and remains compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens to suppress geometric drift. Experiments on indoor, outdoor, and ultra-long sequence benchmarks are reported to show state-of-the-art accuracy within a fixed VRAM envelope.
Significance. If the constant-cost guarantee and drift suppression hold, the work would remove a fundamental barrier to long-horizon deployment of geometric foundation models, enabling real-time 3D reconstruction under bounded resources. The training-free design and explicit compatibility with FlashAttention are notable strengths that could facilitate adoption.
major comments (3)
- [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the number of protected anchors stays bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.
- [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) and no ablation isolating Dynamic Anchor Protection are presented, so the claim that the method “suppresses geometric drift over extended trajectories” rests only on aggregate benchmark scores rather than targeted verification.
- [§3.1] Self-Selective Caching: Compatibility with FlashAttention is asserted, but neither the required modification to the attention kernel (if any) nor the overhead of residual-magnitude selection is quantified; both are load-bearing for the “constant compute” half of the central claim.
minor comments (2)
- [Abstract] The abstract would benefit from one or two concrete numbers (e.g., “constant 12 GB VRAM up to 10k frames” or “<2% accuracy drop”) to make the O(1) claim immediately verifiable.
- [§3] Notation for the residual magnitude threshold and the anchor-protection flag is introduced without a consolidated table of symbols.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below. Where the manuscript requires clarification or additional evidence, we will revise accordingly to strengthen the presentation of the O(1) guarantees and empirical validation.
Point-by-point responses
- Referee: [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the number of protected anchors stays bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.
Authors: We thank the referee for highlighting this point. Section 3.2 specifies that Dynamic Anchor Protection operates under a fixed budget K of protected anchors; when a new coordinate-critical token is identified and the budget is reached, the oldest protected anchor is evicted. This rule, combined with the fixed cache size in Self-Selective Caching, keeps the total KV cache cardinality constant. We will add explicit pseudocode for the eviction logic, a short proof that the protected set size is bounded by K, and a statement confirming the overall O(1) memory bound in the revised §3.2. revision: yes
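The budget-K eviction rule described above is not given in code in the manuscript. Under the stated assumptions (a fixed anchor budget K, oldest-first eviction when full), a minimal sketch of the bounded protected set might look like this; the class name and metadata fields are hypothetical:

```python
from collections import OrderedDict

class AnchorProtector:
    """Bounded set of protected anchor tokens (hypothetical sketch).

    When the budget K is reached and a new coordinate-critical token
    arrives, the oldest protected anchor is evicted, so the protected
    set never exceeds K entries and the O(1) memory bound survives.
    """

    def __init__(self, budget_k: int):
        self.budget_k = budget_k
        self.anchors = OrderedDict()  # token_id -> anchor metadata

    def protect(self, token_id, metadata=None):
        if token_id in self.anchors:
            self.anchors.move_to_end(token_id)  # refresh recency
            self.anchors[token_id] = metadata
        else:
            if len(self.anchors) >= self.budget_k:
                self.anchors.popitem(last=False)  # evict oldest anchor
            self.anchors[token_id] = metadata

    def is_protected(self, token_id) -> bool:
        return token_id in self.anchors
```

By construction `len(self.anchors) <= K` at every step, which is exactly the boundedness property the referee asked to see proved.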
- Referee: [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) and no ablation isolating Dynamic Anchor Protection are presented, so the claim that the method “suppresses geometric drift over extended trajectories” rests only on aggregate benchmark scores rather than targeted verification.
Authors: We agree that direct measurements would provide stronger support. In the revision we will add (i) plots of cumulative pose and reconstruction error as functions of sequence length on the ultra-long benchmarks and (ii) an ablation comparing OVGGT with and without Dynamic Anchor Protection, reporting the resulting drift metrics. These additions will isolate the contribution of anchor protection to drift suppression. revision: yes
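The promised error-versus-length plots reduce to a simple running metric. The cumulative-sum definition below is an illustrative stand-in, not the paper's exact drift metric:

```python
def cumulative_drift(per_frame_errors):
    """Running cumulative error versus sequence length (illustrative).

    Plotting this curve for runs with and without Dynamic Anchor
    Protection is the ablation the referee asks for: a curve that
    flattens on long sequences indicates suppressed drift, while
    near-linear growth indicates accumulating error.
    """
    curve, total = [], 0.0
    for err in per_frame_errors:
        total += err
        curve.append(total)
    return curve
```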
- Referee: [§3.1] Self-Selective Caching: Compatibility with FlashAttention is asserted, but neither the required modification to the attention kernel (if any) nor the overhead of residual-magnitude selection is quantified; both are load-bearing for the “constant compute” half of the central claim.
Authors: Self-Selective Caching performs residual-magnitude selection in a lightweight preprocessing pass that produces a compressed KV cache of fixed size; FlashAttention is then invoked unchanged on this compressed cache. No kernel modification is required. Because the cache size is bounded, both selection and attention remain O(1) per frame. We will add a timing table quantifying the selection overhead and a diagram clarifying the data flow in the revised §3.1. revision: partial
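The preprocessing pass described above can be sketched in a few lines. The shapes, the top-k rule, and the function name are assumptions for illustration; only the core idea (rank cached tokens by FFN residual magnitude, keep a fixed number, leave the attention kernel untouched) comes from the text:

```python
import numpy as np

def compress_kv_cache(keys, values, ffn_residuals, budget):
    """Keep the `budget` KV entries with the largest FFN residual
    magnitudes (hypothetical sketch of Self-Selective Caching).

    keys, values  : (T, d) arrays, one row per cached token
    ffn_residuals : (T, d) array of FFN residual vectors
    Returns compressed (<= budget, d) keys/values in original token
    order, so a standard attention kernel such as FlashAttention can
    be invoked on them unchanged.
    """
    scores = np.linalg.norm(ffn_residuals, axis=-1)  # per-token magnitude
    if len(scores) <= budget:
        return keys, values
    keep = np.sort(np.argsort(scores)[-budget:])     # top-budget, original order
    return keys[keep], values[keep]
```

Because the compressed cache never has more than `budget` rows, both the selection pass and the subsequent attention call cost O(1) per frame, which is the point of the authors' reply.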
Circularity Check
No significant circularity in algorithmic construction
Full rationale
The paper presents OVGGT as a training-free algorithmic framework that combines Self-Selective Caching (FFN residual magnitude selection) with Dynamic Anchor Protection to enforce a fixed KV cache budget. No equations, derivations, or parameter fits are shown that reduce the O(1) claim to a self-definition or to a fitted input renamed as prediction. The constant-cost guarantee is an explicit design property of the eviction and shielding rules rather than an emergent result derived from the inputs themselves. All performance assertions rest on external benchmark comparisons, not internal consistency checks. This is a standard non-circular algorithmic proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: FFN residual magnitudes serve as a reliable proxy for token importance in preserving 3D geometric consistency.
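The assumption can be stated operationally: a token's importance score is the norm of the update its FFN block contributes. A minimal illustration follows; the two-matrix ReLU FFN is a generic stand-in, not the paper's architecture:

```python
import numpy as np

def ffn_residual_magnitude(x, w1, w2):
    """Per-token norm of the FFN residual update (illustrative).

    x  : (T, d) token representations
    w1 : (d, h), w2 : (h, d) weights of a generic ReLU FFN
    The residual is relu(x @ w1) @ w2 — the vector that gets added
    back to x by the residual connection. Its per-token norm is the
    score the axiom treats as a proxy for geometric relevance.
    """
    hidden = np.maximum(x @ w1, 0.0)   # ReLU activation
    residual = hidden @ w2             # what the FFN adds back to x
    return np.linalg.norm(residual, axis=-1)
```

Under the axiom, tokens with small scores are candidates for eviction; whether that proxy actually tracks geometric relevance is exactly what the ledger flags as unverified.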
Forward citations
Cited by 1 Pith paper
- StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression — improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
Reference graph
Works this paper leans on
- [1] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D Surface Reconstruction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.
- [2] Yohann Cabon, Vincent Leroy, Jérôme Revaud, and Shuzhe Wang. MUSt3R: Multi-View Network for Stereo 3D Reconstruction. arXiv preprint arXiv:2503.01661, 2025.
- [3] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
- [4] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D Reconstruction as Test-Time Training. arXiv preprint arXiv:2509.26645, 2025.
- [5] Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, and Hyunwoo J. Kim. Representation Shift: Unifying Token Compression with FlashAttention. In IEEE/CVF International Conference on Computer Vision.
- [6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
- [7] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations, 2024.
- [8] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, 2022.
- [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.
- [10] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. AdaKV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550, 2024.
- [11] Yasutaka Furukawa and Jean Ponce. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
- [12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 2013.
- [13] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [14] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, 2024.
- [15] Philipp Lindenberger, Paul-Erik Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In IEEE/CVF International Conference on Computer Vision.
- [16] Lahav Lipson, Zachary Teed, and Jia Deng. Deep Patch Visual SLAM. In European Conference on Computer Vision.
- [17] Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers. arXiv preprint arXiv:2509.17650, 2025.
- [18]
- [19] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
- [20] Riku Murai, Eric Orb, Lachlan Nicholson, Kenta Masuda, Keisuke Tateno, and Federico Tombari. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. arXiv preprint arXiv:2412.12392, 2024.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, P…
- [22] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.
- [23] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision.
- [24] Paul-Erik Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [25] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
- [26] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2016.
- [27] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [28] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2013.
- [29] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A Benchmark for the Evaluation of RGB-D SLAM Systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
- [30] Zachary Teed and Jia Deng. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Advances in Neural Information Processing Systems, 2021.
- [31] Zachary Teed, Lahav Lipson, and Jia Deng. Deep Patch Visual Odometry. In Advances in Neural Information Processing Systems, 2024.
- [32] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going Deeper with Image Transformers. In IEEE/CVF International Conference on Computer Vision, 2021.
- [33] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Specber, and Marc Pollefeys. PatchmatchNet: Learned Multi-View Patchmatch Stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [34] Hengyi Wang and Lourdes Agapito. Spann3R: 3D Reconstruction with Spatial Memory. In European Conference on Computer Vision, 2024.
- [35] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual Geometry Grounded Deep Structure From Motion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [36] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [37] Jianyuan Wang, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Continuous 3D Perception Model with Persistent State. arXiv preprint arXiv:2501.12387.
- [38] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Raber, and Jérôme Revaud. DUSt3R: Geometric 3D Vision Made Easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [39] Zhihao Wang, Jinglu Li, Lina Han, and Yan Lu. Point3R: Online Dense 3D Reconstruction with Spatial Pointer Memory. arXiv preprint arXiv:2507.05869, 2025.
- [40] Jianing Yang, Georgios Pavlakos, Neehar Desai, Nikita Karaev, and David Novotny. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [41] Zhenggang Yang, Delin Wang, Zhuohan Li, Jingkang Yan, Yuhao Ding, Baigui Yin, Ziwei Liu, and Cewu Lu. MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views in 2 Seconds. arXiv preprint arXiv:2412.06974, 2024.
- [42] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2018.
- [43] Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281, 2026.
- [44] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion. arXiv preprint arXiv:2410.03825, 2024.
- [45] Chuanxia Zheng and Andrea Vedaldi. StreamVGGT: Streaming Visual Geometry Grounded Transformer. arXiv preprint arXiv:2507.11116, 2025.
discussion (0)