OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Pith reviewed 2026-05-15 15:36 UTC · model grok-4.3
The pith
A training-free method keeps visual geometry transformers at constant memory and compute for videos of any length while matching top accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OVGGT is a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. It achieves this by combining Self-Selective Caching, which uses FFN residual magnitudes to compress the KV cache while remaining compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Experiments on indoor, outdoor, and ultra-long benchmarks show that the method processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
What carries the argument
Self-Selective Caching (FFN residual magnitude selection of KV cache entries) combined with Dynamic Anchor Protection (shielding of coordinate-critical tokens)
If this is right
- Arbitrarily long video sequences can be processed in one pass without memory growth or accuracy loss.
- Real-time 3D mapping from live camera feeds becomes feasible on hardware with fixed VRAM.
- The same model weights work for short clips and hour-long trajectories with no fine-tuning.
- FlashAttention remains usable, so the speed gains of that kernel are preserved.
- State-of-the-art accuracy is reported on indoor, outdoor, and ultra-long sequence benchmarks.
Load-bearing premise
Selecting KV cache entries by FFN residual magnitudes and shielding coordinate-critical tokens will prevent geometric drift over extended trajectories without any additional training or fine-tuning.
What would settle it
Measure 3D reconstruction error and peak VRAM usage after feeding a single continuous video of 2000 frames; if error rises above prior state-of-the-art levels or memory exceeds the declared fixed budget, the central claim does not hold.
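The decision procedure above can be sketched as a small harness. Everything here is a hypothetical placeholder, not an interface the paper defines: `model.step` is assumed to return the current reconstruction error and the bytes of cache memory in use.

```python
def settle_claim(model, frames, vram_budget_bytes, sota_error):
    """Feed one continuous video; the claim holds only if the final
    reconstruction error stays at or below the prior state of the art
    AND peak memory never exceeds the declared fixed budget.

    Hypothetical harness: `model.step(frame)` is assumed to return
    (current_error, bytes_of_cache_memory_in_use).
    """
    peak_mem = 0
    final_error = None
    for frame in frames:
        final_error, mem = model.step(frame)
        peak_mem = max(peak_mem, mem)  # track the memory envelope
    return final_error <= sota_error and peak_mem <= vram_budget_bytes
```

The conjunction matters: passing on error alone while memory creeps upward (or vice versa) would still falsify the central O(1) claim.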
Original abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy. Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OVGGT, a training-free streaming framework for 3D geometry reconstruction from video that achieves constant (O(1)) memory and compute cost independent of sequence length. It combines Self-Selective Caching, which selects KV cache entries by FFN residual magnitudes and remains compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens to suppress geometric drift. Experiments on indoor, outdoor, and ultra-long sequence benchmarks are reported to show state-of-the-art accuracy within a fixed VRAM envelope.
Significance. If the constant-cost guarantee and drift suppression hold, the work would remove a fundamental barrier to long-horizon deployment of geometric foundation models, enabling real-time 3D reconstruction under bounded resources. The training-free design and explicit compatibility with FlashAttention are notable strengths that could facilitate adoption.
major comments (3)
- [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the number of protected anchors stays bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.
- [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) and no ablation isolating Dynamic Anchor Protection are presented, so the claim that the method “suppresses geometric drift over extended trajectories” rests only on aggregate benchmark scores rather than targeted verification.
- [§3.1] Self-Selective Caching: Compatibility with FlashAttention is asserted, but neither the required modification to the attention kernel (if any) nor the overhead of residual-magnitude selection is quantified; both are load-bearing for the “constant compute” half of the central claim.
minor comments (2)
- [Abstract] The abstract would benefit from one or two concrete numbers (e.g., “constant 12 GB VRAM up to 10k frames” or “<2% accuracy drop”) to make the O(1) claim immediately verifiable.
- [§3] Notation for the residual magnitude threshold and the anchor-protection flag is introduced without a consolidated table of symbols.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below. Where the manuscript requires clarification or additional evidence, we will revise accordingly to strengthen the presentation of the O(1) guarantees and empirical validation.
Point-by-point responses
- Referee: [§3.2] Dynamic Anchor Protection: The description provides no eviction rule, fixed budget, or analysis showing that the number of protected anchors stays bounded as new coordinate-critical tokens (e.g., landmarks in extended scenes) are identified. Without such a mechanism the KV cache can grow linearly with trajectory length, directly contradicting the O(1) constant-VRAM claim in the abstract and §1.
Authors: We thank the referee for highlighting this point. Section 3.2 specifies that Dynamic Anchor Protection operates under a fixed budget K of protected anchors; when a new coordinate-critical token is identified and the budget is reached, the oldest protected anchor is evicted. This rule, combined with the fixed cache size in Self-Selective Caching, keeps the total KV cache cardinality constant. We will add explicit pseudocode for the eviction logic, a short proof that the protected set size is bounded by K, and a statement confirming the overall O(1) memory bound in the revised §3.2. revision: yes
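The budget-K eviction rule described above is not given in code in the manuscript. Under the stated assumptions (a fixed anchor budget K, oldest-first eviction when full), a minimal sketch of the bounded protected set might look like this; the class name and metadata fields are hypothetical:

```python
from collections import OrderedDict

class AnchorProtector:
    """Bounded set of protected anchor tokens (hypothetical sketch).

    When the budget K is reached and a new coordinate-critical token
    arrives, the oldest protected anchor is evicted, so the protected
    set never exceeds K entries and the O(1) memory bound survives.
    """

    def __init__(self, budget_k: int):
        self.budget_k = budget_k
        self.anchors = OrderedDict()  # token_id -> anchor metadata

    def protect(self, token_id, metadata=None):
        if token_id in self.anchors:
            self.anchors.move_to_end(token_id)  # refresh recency
            self.anchors[token_id] = metadata
        else:
            if len(self.anchors) >= self.budget_k:
                self.anchors.popitem(last=False)  # evict oldest anchor
            self.anchors[token_id] = metadata

    def is_protected(self, token_id) -> bool:
        return token_id in self.anchors
```

By construction `len(self.anchors) <= K` at every step, which is exactly the boundedness property the referee asked to see proved.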
- Referee: [§4] Experiments: No quantitative drift measurements (e.g., cumulative pose or reconstruction error versus sequence length) and no ablation isolating Dynamic Anchor Protection are presented, so the claim that the method “suppresses geometric drift over extended trajectories” rests only on aggregate benchmark scores rather than targeted verification.
Authors: We agree that direct measurements would provide stronger support. In the revision we will add (i) plots of cumulative pose and reconstruction error as functions of sequence length on the ultra-long benchmarks and (ii) an ablation comparing OVGGT with and without Dynamic Anchor Protection, reporting the resulting drift metrics. These additions will isolate the contribution of anchor protection to drift suppression. revision: yes
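The promised error-versus-length plots reduce to a simple running metric. The cumulative-sum definition below is an illustrative stand-in, not the paper's exact drift metric:

```python
def cumulative_drift(per_frame_errors):
    """Running cumulative error versus sequence length (illustrative).

    Plotting this curve for runs with and without Dynamic Anchor
    Protection is the ablation the referee asks for: a curve that
    flattens on long sequences indicates suppressed drift, while
    near-linear growth indicates accumulating error.
    """
    curve, total = [], 0.0
    for err in per_frame_errors:
        total += err
        curve.append(total)
    return curve
```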
- Referee: [§3.1] Self-Selective Caching: Compatibility with FlashAttention is asserted, but neither the required modification to the attention kernel (if any) nor the overhead of residual-magnitude selection is quantified; both are load-bearing for the “constant compute” half of the central claim.
Authors: Self-Selective Caching performs residual-magnitude selection in a lightweight preprocessing pass that produces a compressed KV cache of fixed size; FlashAttention is then invoked unchanged on this compressed cache. No kernel modification is required. Because the cache size is bounded, both selection and attention remain O(1) per frame. We will add a timing table quantifying the selection overhead and a diagram clarifying the data flow in the revised §3.1. revision: partial
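The preprocessing pass described above can be sketched in a few lines. The shapes, the top-k rule, and the function name are assumptions for illustration; only the core idea (rank cached tokens by FFN residual magnitude, keep a fixed number, leave the attention kernel untouched) comes from the text:

```python
import numpy as np

def compress_kv_cache(keys, values, ffn_residuals, budget):
    """Keep the `budget` KV entries with the largest FFN residual
    magnitudes (hypothetical sketch of Self-Selective Caching).

    keys, values  : (T, d) arrays, one row per cached token
    ffn_residuals : (T, d) array of FFN residual vectors
    Returns compressed (<= budget, d) keys/values in original token
    order, so a standard attention kernel such as FlashAttention can
    be invoked on them unchanged.
    """
    scores = np.linalg.norm(ffn_residuals, axis=-1)  # per-token magnitude
    if len(scores) <= budget:
        return keys, values
    keep = np.sort(np.argsort(scores)[-budget:])     # top-budget, original order
    return keys[keep], values[keep]
```

Because the compressed cache never has more than `budget` rows, both the selection pass and the subsequent attention call cost O(1) per frame, which is the point of the authors' reply.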
Circularity Check
No significant circularity in algorithmic construction
Full rationale
The paper presents OVGGT as a training-free algorithmic framework that combines Self-Selective Caching (FFN residual magnitude selection) with Dynamic Anchor Protection to enforce a fixed KV cache budget. No equations, derivations, or parameter fits are shown that reduce the O(1) claim to a self-definition or to a fitted input renamed as prediction. The constant-cost guarantee is an explicit design property of the eviction and shielding rules rather than an emergent result derived from the inputs themselves. All performance assertions rest on external benchmark comparisons, not internal consistency checks. This is a standard non-circular algorithmic proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: FFN residual magnitudes serve as a reliable proxy for token importance in preserving 3D geometric consistency.
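The assumption can be stated operationally: a token's importance score is the norm of the update its FFN block contributes. A minimal illustration follows; the two-matrix ReLU FFN is a generic stand-in, not the paper's architecture:

```python
import numpy as np

def ffn_residual_magnitude(x, w1, w2):
    """Per-token norm of the FFN residual update (illustrative).

    x  : (T, d) token representations
    w1 : (d, h), w2 : (h, d) weights of a generic ReLU FFN
    The residual is relu(x @ w1) @ w2 — the vector that gets added
    back to x by the residual connection. Its per-token norm is the
    score the axiom treats as a proxy for geometric relevance.
    """
    hidden = np.maximum(x @ w1, 0.0)   # ReLU activation
    residual = hidden @ w2             # what the FFN adds back to x
    return np.linalg.norm(residual, axis=-1)
```

Under the axiom, tokens with small scores are candidates for eviction; whether that proxy actually tracks geometric relevance is exactly what the ledger flags as unverified.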
Forward citations
Cited by 1 Pith paper
- StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression — improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
Reference graph
Works this paper leans on
- [1] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D Surface Reconstruction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.
- [2] Yohann Cabon, Vincent Leroy, Jérôme Revaud, and Shuzhe Wang. MUSt3R: Multi-View Network for Stereo 3D Reconstruction. arXiv preprint arXiv:2503.01661, 2025.
- [3] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
- [4] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D Reconstruction as Test-Time Training. arXiv preprint arXiv:2509.26645, 2025.
- [5] Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, and Hyunwoo J. Kim. Representation Shift: Unifying Token Compression with FlashAttention. In IEEE/CVF International Conference on Computer Vision.
- [6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
- [7] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations, 2024.
- [8] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, 2022.
- [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.
- [10] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. AdaKV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550, 2024.
- [11] Yasutaka Furukawa and Jean Ponce. Accurate, Dense, and Robust Multiview Stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.
- [12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 2013.
- [13] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [14] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, 2024.
- [15] Philipp Lindenberger, Paul-Erik Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In IEEE/CVF International Conference on Computer Vision.
- [16] Lahav Lipson, Zachary Teed, and Jia Deng. Deep Patch Visual SLAM. In European Conference on Computer Vision.
- [17] Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers. arXiv preprint arXiv:2509.17650, 2025.
- [18]
- [19] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
- [20] Riku Murai, Eric Orb, Lachlan Nicholson, Kenta Masuda, Keisuke Tateno, and Federico Tombari. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. arXiv preprint arXiv:2412.12392, 2024.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, P…
- [22] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.
- [23] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conference on Computer Vision.
- [24] Paul-Erik Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [25] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
- [26] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2016.
- [27] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [28] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2013.
- [29] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A Benchmark for the Evaluation of RGB-D SLAM Systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
- [30] Zachary Teed and Jia Deng. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Advances in Neural Information Processing Systems, 2021.
- [31] Zachary Teed, Lahav Lipson, and Jia Deng. Deep Patch Visual Odometry. In Advances in Neural Information Processing Systems, 2024.
- [32] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going Deeper with Image Transformers. In IEEE/CVF International Conference on Computer Vision, 2021.
- [33] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Specber, and Marc Pollefeys. PatchmatchNet: Learned Multi-View Patchmatch Stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [34] Hengyi Wang and Lourdes Agapito. Spann3R: 3D Reconstruction with Spatial Memory. In European Conference on Computer Vision, 2024.
- [35] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual Geometry Grounded Deep Structure From Motion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [36] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [37] Jianyuan Wang, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Continuous 3D Perception Model with Persistent State. arXiv preprint arXiv:2501.12387.
- [38] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Raber, and Jérôme Revaud. DUSt3R: Geometric 3D Vision Made Easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [39] Zhihao Wang, Jinglu Li, Lina Han, and Yan Lu. Point3R: Online Dense 3D Reconstruction with Spatial Pointer Memory. arXiv preprint arXiv:2507.05869, 2025.
- [40] Jianing Yang, Georgios Pavlakos, Neehar Desai, Nikita Karaev, and David Novotny. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [41] Zhenggang Yang, Delin Wang, Zhuohan Li, Jingkang Yan, Yuhao Ding, Baigui Yin, Ziwei Liu, and Cewu Lu. MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views in 2 Seconds. arXiv preprint arXiv:2412.06974, 2024.
- [42] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-View Stereo. In European Conference on Computer Vision, 2018.
- [43] Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281, 2026.
- [44] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion. arXiv preprint arXiv:2410.03825, 2024.
- [45] Chuanxia Zheng and Andrea Vedaldi. StreamVGGT: Streaming Visual Geometry Grounded Transformer. arXiv preprint arXiv:2507.11116, 2025.
discussion (0)