Pith · machine review for the scientific record

arxiv: 2603.05959 · v3 · submitted 2026-03-06 · 💻 cs.CV

Recognition: no theorem link

OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords ovggt · streaming · geometric · cache · geometry · github · inference

The pith

A training-free method keeps visual geometry transformers at constant memory and compute for videos of any length while matching top accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent models that turn video into 3D geometry rely on attention that grows with every new frame, so memory use rises without limit and long sequences become impossible on fixed hardware. OVGGT removes this limit by holding both memory and computation to a fixed budget no matter how many frames arrive. It does this through Self-Selective Caching, which drops low-impact past tokens according to the size of their feed-forward residuals, plus Dynamic Anchor Protection, which keeps tokens that carry critical coordinate information from being removed. Because the changes require no retraining and still work with FlashAttention, the same model can run on arbitrarily long indoor, outdoor, or ultra-long sequences. If the approach works, continuous 3D reconstruction from live camera streams becomes practical on ordinary GPUs without ever running out of memory.
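To make the mechanism concrete, here is a minimal PyTorch-style sketch of how residual-magnitude scoring and budget-limited eviction with protected anchors could fit together. The function names, the use of an L2 norm over the FFN output, and the top-k selection are illustrative assumptions; the paper's exact scoring, smoothing, and anchor-detection rules are not reproduced here.

```python
import torch


def ffn_residual_scores(hidden: torch.Tensor, ffn: torch.nn.Module) -> torch.Tensor:
    """Score cached tokens by the magnitude of their feed-forward residual.

    hidden: (num_tokens, dim) activations entering the block's FFN.
    In a pre-norm transformer the FFN output is the residual added back to
    the stream, so its norm is a cheap proxy for how much a token still
    contributes (an assumption following the abstract's description).
    """
    with torch.no_grad():
        residual = ffn(hidden)            # the update in h_out = h + FFN(h)
        return residual.norm(dim=-1)      # one salience score per token


def compress_kv_cache(keys, values, scores, anchor_mask, budget: int):
    """Keep at most `budget` cached tokens: every protected anchor plus the
    highest-scoring remaining tokens, preserving temporal order.
    keys/values: (num_tokens, dim); anchor_mask: bool (num_tokens,)."""
    if scores.shape[0] <= budget:
        return keys, values
    boosted = scores.clone()
    boosted[anchor_mask] = float("inf")   # anchors are never evicted
    keep = torch.topk(boosted, k=budget).indices.sort().values
    return keys[keep], values[keep]
```

Indexing on the first (token) dimension assumes per-head caches are flattened or handled separately; a real implementation would apply the same kept-index set to every head and layer.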

Core claim

OVGGT is a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. It achieves this by combining Self-Selective Caching, which uses FFN residual magnitudes to compress the KV cache while remaining compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Experiments on indoor, outdoor, and ultra-long benchmarks show that the method processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.

What carries the argument

Self-Selective Caching (FFN residual magnitude selection of KV cache entries) combined with Dynamic Anchor Protection (shielding of coordinate-critical tokens)

If this is right

  • Arbitrarily long video sequences can be processed in one pass without memory growth or accuracy loss.
  • Real-time 3D mapping from live camera feeds becomes feasible on hardware with fixed VRAM.
  • The same model weights work for short clips and hour-long trajectories with no fine-tuning.
  • FlashAttention remains usable, so the speed gains of that kernel are preserved.
  • State-of-the-art accuracy is reported on indoor, outdoor, and ultra-long sequence benchmarks.

Load-bearing premise

Selecting KV cache entries by FFN residual magnitudes and shielding coordinate-critical tokens will prevent geometric drift over extended trajectories without any additional training or fine-tuning.

What would settle it

Measure 3D reconstruction error and peak VRAM usage after feeding a single continuous video of 2000 frames; if error rises above prior state-of-the-art levels or memory exceeds the declared fixed budget, the central claim does not hold.
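The test above can be scripted directly. The sketch below assumes a hypothetical streaming interface `model.step(frame)` standing in for the released OVGGT API; only the peak-VRAM bookkeeping uses real PyTorch calls, and reconstruction error against ground truth would be computed downstream.

```python
import torch


def constant_vram_check(model, frames, vram_budget_gb: float):
    """Stream one continuous video and record peak VRAM after every frame.

    `model.step(frame)` is a placeholder for the streaming interface.
    The constant-cost claim fails if any peak exceeds the declared budget;
    reconstruction error vs. ground truth is evaluated separately.
    """
    torch.cuda.reset_peak_memory_stats()
    peaks_gb = []
    for frame in frames:                            # e.g. 2000 consecutive frames
        with torch.no_grad():
            _pred = model.step(frame.cuda())        # hypothetical interface
        peaks_gb.append(torch.cuda.max_memory_allocated() / 1e9)
    assert max(peaks_gb) <= vram_budget_gb, "constant-VRAM claim violated"
    return peaks_gb
```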

Figures

Figures reproduced from arXiv: 2603.05959 by Hui-Che Hsu, Po-Ting Chen, Sin-Ye Jhong, Si-Yu Lu, Wen-Huang Cheng, Yung-Yao Chen.

Figure 1
Figure 1: Streaming 3D on a single 32 GB GPU. Left: On 200-frame sequences [28], OVGGT outperforms all baselines in reconstruction quality, speed, and VRAM usage. Right: From 50 to 500 frames, StreamVGGT runs out of memory; other methods survive but suffer notable quality degradation. OVGGT maintains high-fidelity reconstructions at lower cost.
Figure 2
Figure 2: Overview of OVGGT. At each time step, the input frame is encoded into tokens and processed by a spatial-temporal decoder that attends to a bounded KV cache. During inference, the Activation Value Rating module scores each token's geometric salience, and the KV Cache Compression (KVCC) module evicts low-scoring tokens to maintain a fixed cache budget. Dynamic Anchor Protection (DAP) shields coordinate-critical tokens from eviction.
Figure 3
Figure 3: Per-token FFN activation scores across layers, progressing from high-frequency textures (shallow) to geometric structures (mid) to semantic boundaries (deep).
Figure 4
Figure 4: Activation smoothing effectively improves reconstruction quality over vanilla token retention; directly selecting tokens by raw activation scores tends to produce spatially fragmented retention patterns, introducing discontinuous references that degrade reconstruction sharpness.
Figure 5
Figure 5: Qualitative comparison on indoor scene reconstruction (sequence length = 500). Each row shows a different scene with close-up insets. Note that StreamVGGT is limited to a maximum of 200 input frames due to memory constraints.
Figure 6
Figure 6: Efficiency comparison. FPS and VRAM vs. sequence length.
Original abstract

Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy. Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents OVGGT, a training-free streaming framework for 3D geometry reconstruction from video that achieves constant (O(1)) memory and compute cost independent of sequence length. It combines Self-Selective Caching, which selects KV cache entries by FFN residual magnitudes and remains compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens to suppress geometric drift. Experiments on indoor, outdoor, and ultra-long sequence benchmarks are reported to show state-of-the-art accuracy within a fixed VRAM envelope.

Significance. If the constant-cost guarantee and drift suppression hold, the work would remove a fundamental barrier to long-horizon deployment of geometric foundation models, enabling real-time 3D reconstruction under bounded resources. The training-free design and explicit compatibility with FlashAttention are notable strengths that could facilitate adoption.

major comments (3)
  1. [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the cardinality of protected anchors remains bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache size can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.
  2. [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) or ablation isolating Dynamic Anchor Protection are presented, leaving the claim that the method “suppresses geometric drift over extended trajectories” supported only by overall benchmark scores rather than targeted verification.
  3. [§3.1] Self-Selective Caching: The integration with FlashAttention is asserted but the precise modification to the attention kernel or the overhead of residual-magnitude selection is not quantified; this is load-bearing for the “constant compute” part of the central claim.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two concrete numbers (e.g., “constant 12 GB VRAM up to 10k frames” or “<2% accuracy drop”) to make the O(1) claim immediately verifiable.
  2. [§3] Notation for the residual magnitude threshold and the anchor-protection flag is introduced without a consolidated table of symbols.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below. Where the manuscript requires clarification or additional evidence, we will revise accordingly to strengthen the presentation of the O(1) guarantees and empirical validation.

Point-by-point responses
  1. Referee: [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the cardinality of protected anchors remains bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache size can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.

    Authors: We thank the referee for highlighting this point. Section 3.2 specifies that Dynamic Anchor Protection operates under a fixed budget K of protected anchors; when a new coordinate-critical token is identified and the budget is reached, the oldest protected anchor is evicted. This rule, combined with the fixed cache size in Self-Selective Caching, keeps the total KV cache cardinality constant. We will add explicit pseudocode for the eviction logic, a short proof that the protected set size is bounded by K, and a statement confirming the overall O(1) memory bound in the revised §3.2. revision: yes

  2. Referee: [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) or ablation isolating Dynamic Anchor Protection are presented, leaving the claim that the method “suppresses geometric drift over extended trajectories” supported only by overall benchmark scores rather than targeted verification.

    Authors: We agree that direct measurements would provide stronger support. In the revision we will add (i) plots of cumulative pose and reconstruction error as functions of sequence length on the ultra-long benchmarks and (ii) an ablation comparing OVGGT with and without Dynamic Anchor Protection, reporting the resulting drift metrics. These additions will isolate the contribution of anchor protection to drift suppression. revision: yes

  3. Referee: [§3.1] Self-Selective Caching: The integration with FlashAttention is asserted but the precise modification to the attention kernel or the overhead of residual-magnitude selection is not quantified; this is load-bearing for the “constant compute” part of the central claim.

    Authors: Self-Selective Caching performs residual-magnitude selection in a lightweight preprocessing pass that produces a compressed KV cache of fixed size; FlashAttention is then invoked unchanged on this compressed cache. No kernel modification is required. Because the cache size is bounded, both selection and attention remain O(1) per frame. We will add a timing table quantifying the selection overhead and a diagram clarifying the data flow in the revised §3.1. revision: partial
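Two of the responses above describe mechanisms concrete enough to sketch. First, the fixed-budget anchor set from response 1: at most K protected anchors, with the oldest losing protection when a new coordinate-critical token is flagged. This is a sketch of the stated rule only; how such tokens are detected is not shown, and the class name is hypothetical.

```python
from collections import deque


class AnchorProtection:
    """Bounded set of protected anchor indices, per the rebuttal's rule."""

    def __init__(self, budget_k: int):
        # A deque with maxlen drops the oldest anchor automatically once full.
        self.anchors = deque(maxlen=budget_k)

    def protect(self, token_index: int) -> None:
        if token_index not in self.anchors:
            self.anchors.append(token_index)

    def is_protected(self, token_index: int) -> bool:
        return token_index in self.anchors
```

Second, the FlashAttention-compatibility argument from response 3: selection happens before the kernel, which then runs unchanged on a bounded cache. The sketch below uses PyTorch's scaled_dot_product_attention as a stand-in for the fused kernel; the shapes and the top-k rule are assumptions.

```python
import torch
import torch.nn.functional as F


def attend_with_compressed_cache(query, keys, values, scores, budget: int):
    """query: (B, H, Q, D); keys/values: (B, H, T, D); scores: (T,).

    The cache is compressed outside the attention kernel, so the fused
    kernel itself needs no modification and its per-frame cost stays bounded.
    """
    if keys.shape[2] > budget:
        keep = torch.topk(scores, k=budget).indices.sort().values
        keys, values = keys[:, :, keep], values[:, :, keep]
    return F.scaled_dot_product_attention(query, keys, values)
```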

Circularity Check

0 steps flagged

No significant circularity in algorithmic construction

Full rationale

The paper presents OVGGT as a training-free algorithmic framework that combines Self-Selective Caching (FFN residual magnitude selection) with Dynamic Anchor Protection to enforce a fixed KV cache budget. No equations, derivations, or parameter fits are shown that reduce the O(1) claim to a self-definition or to a fitted input renamed as prediction. The constant-cost guarantee is an explicit design property of the eviction and shielding rules rather than an emergent result derived from the inputs themselves. All performance assertions rest on external benchmark comparisons, not internal consistency checks. This is a standard non-circular algorithmic proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that FFN residual magnitudes reliably indicate geometric importance and that protecting a small set of anchor tokens suffices to bound drift. No free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: FFN residual magnitudes serve as a reliable proxy for token importance in preserving 3D geometric consistency
    Invoked to justify Self-Selective Caching without additional training

pith-pipeline@v0.9.0 · 5520 in / 1209 out tokens · 31397 ms · 2026-05-15T15:36:47.482331+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. Dejan Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D Surface Reconstruction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.

  2. Yohann Cabon, Vincent Leroy, Jérôme Revaud, and Shuzhe Wang. MUSt3R: Multi-View Network for Stereo 3D Reconstruction. arXiv preprint arXiv:2503.01661, 2025.

  3. Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.

  4. Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D Reconstruction as Test-Time Training. arXiv preprint arXiv:2509.26645, 2025.

  5. Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, and Hyunwoo J. Kim. Representation Shift: Unifying Token Compression with FlashAttention. In IEEE/CVF International Conference on Computer Vision.

  6. Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.

  7. Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations, 2024.

  8. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, 2022.

  9. Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.

  10. Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. AdaKV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550, 2024.

  11. Yasutaka Furukawa and Jean Ponce. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.

  12. Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 2013.

  13. Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

  14. Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, 2024.

  15. Philipp Lindenberger, Paul-Erik Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In IEEE/CVF International Conference on Computer Vision.

  16. Lahav Lipson, Zachary Teed, and Jia Deng. Deep Patch Visual SLAM. In European Conference on Computer Vision.

  17. Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers. arXiv preprint arXiv:2509.17650, 2025.

  18. Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

  19. Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

  20. Riku Murai, Eric Orb, Lachlan Nicholson, Kenta Masuda, Keisuke Tateno, and Federico Tombari. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. arXiv preprint arXiv:2412.12392, 2024.

  21. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research, 2024.

  22. Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.

  23. Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision.

  24. Paul-Erik Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

  25. Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.

  26. Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2016.

  27. Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  28. Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2013.

  29. Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A Benchmark for the Evaluation of RGB-D SLAM Systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

  30. Zachary Teed and Jia Deng. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Advances in Neural Information Processing Systems, 2021.

  31. Zachary Teed, Lahav Lipson, and Jia Deng. Deep Patch Visual Odometry. In Advances in Neural Information Processing Systems, 2024.

  32. Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going Deeper with Image Transformers. In IEEE/CVF International Conference on Computer Vision, 2021.

  33. Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Specber, and Marc Pollefeys. PatchmatchNet: Learned Multi-View Patchmatch Stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

  34. Hengyi Wang and Lourdes Agapito. Spann3R: 3D Reconstruction with Spatial Memory. In European Conference on Computer Vision, 2024.

  35. Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual Geometry Grounded Deep Structure From Motion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  36. Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

  37. Jianyuan Wang, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Continuous 3D Perception Model with Persistent State. arXiv preprint arXiv:2501.12387, 2025.

  38. Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Raber, and Jérôme Revaud. DUSt3R: Geometric 3D Vision Made Easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  39. Zhihao Wang, Jinglu Li, Lina Han, and Yan Lu. Point3R: Online Dense 3D Reconstruction with Spatial Pointer Memory. arXiv preprint arXiv:2507.05869, 2025.

  40. Jianing Yang, Georgios Pavlakos, Neehar Desai, Nikita Karaev, and David Novotny. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

  41. Zhenggang Yang, Delin Wang, Zhuohan Li, Jingkang Yan, Yuhao Ding, Baigui Yin, Ziwei Liu, and Cewu Lu. MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views in 2 Seconds. arXiv preprint arXiv:2412.06974, 2024.

  42. Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2018.

  43. Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281, 2026.

  44. Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion. arXiv preprint arXiv:2410.03825, 2024.

  45. Chuanxia Zheng and Andrea Vedaldi. StreamVGGT: Streaming Visual Geometry Grounded Transformer. arXiv preprint arXiv:2507.11116, 2025.