Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
Pith reviewed 2026-05-20 14:14 UTC · model grok-4.3
The pith
Mamba-VGGT adds an external sliding window Mamba memory module to VGGT so that long video sequences can maintain geometric consistency without drift from truncated windows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, an explicit external Sliding Window Mamba memory token, updated across temporal windows via selective state-space modeling, propagates global geometric priors that prevent catastrophic forgetting. Second, a zero-initialized spatial injector fuses these priors into the VGGT feature stream without misalignment or added drift. Together, the two components are claimed to deliver linear-complexity persistent geometry grounding for long video sequences.
What carries the argument
The Sliding Window Mamba (SWM) memory module, which maintains an explicit external memory token across temporal windows and applies selective state-space modeling to distill and propagate global geometric priors.
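As a rough illustration of this pattern, the NumPy sketch below processes a long feature sequence window by window and carries a single memory vector forward through an input-dependent (selective) gated update. The gating form, the names `update_memory` and `run_windows`, the use of non-overlapping windows, and all shapes are assumptions made for illustration; this summary does not give the paper's actual SWM update rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_memory(memory, window_feats, w_gate, w_in):
    """One hypothetical selective update of the external memory token.

    memory:       (d,) memory carried in from the previous window
    window_feats: (n, d) features of the tokens in the current window
    w_gate, w_in: (d, d) projection matrices (illustrative parameters)
    """
    summary = window_feats.mean(axis=0)   # pool the window to one vector
    gate = sigmoid(summary @ w_gate)      # input-dependent retain gate in (0, 1)
    candidate = np.tanh(summary @ w_in)   # new content proposed by this window
    return gate * memory + (1.0 - gate) * candidate

def run_windows(feats, window, w_gate, w_in):
    """Walk over feats (T, d) in non-overlapping windows, carrying memory forward."""
    memory = np.zeros(feats.shape[1])
    history = []
    for start in range(0, feats.shape[0], window):
        memory = update_memory(memory, feats[start:start + window], w_gate, w_in)
        history.append(memory.copy())
    return np.stack(history)

rng = np.random.default_rng(0)
T, d, window = 64, 8, 16
feats = rng.normal(size=(T, d))
w_gate = rng.normal(scale=0.1, size=(d, d))
w_in = rng.normal(scale=0.1, size=(d, d))

memories = run_windows(feats, window, w_gate, w_in)
print(memories.shape)  # (4, 8): one memory state per window
```

The memory has a fixed size regardless of sequence length, so the cost of carrying it grows with the number of windows rather than with the square of the number of tokens.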
If this is right
- Long video sequences can be processed without geometric forgetting or the need for aggressive window truncation.
- Global geometric priors remain available across successive temporal windows at linear rather than quadratic cost.
- Pre-trained VGGT spatial features stay intact while long-range cues are added through the zero-init injector.
- Trajectory errors in 3D reconstruction decrease as the model maintains consistent world modeling over extended environments.
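To make the linear-versus-quadratic point concrete, the following back-of-the-envelope count compares pairwise token interactions under global attention with those under windowed attention plus one memory token per window. The token counts, the window size, and the helper names are illustrative assumptions, not figures from the paper.

```python
def global_attention_pairs(num_frames, tokens_per_frame):
    """Pairwise interactions when every token attends to every token."""
    n = num_frames * tokens_per_frame
    return n * n

def windowed_pairs(num_frames, tokens_per_frame, window_frames):
    """Pairwise interactions when attention is confined to fixed-size windows.

    Each window also sees one extra memory token (hypothetical layout).
    A trailing partial window is counted as a full one, so this is an upper bound.
    """
    num_windows = -(-num_frames // window_frames)  # ceiling division
    n_win = window_frames * tokens_per_frame + 1
    return num_windows * n_win * n_win

for frames in (100, 200, 400):
    g = global_attention_pairs(frames, tokens_per_frame=256)
    w = windowed_pairs(frames, tokens_per_frame=256, window_frames=20)
    print(frames, g, w)
```

Doubling the number of frames quadruples the global count but only doubles the windowed count, which is the scaling behaviour the bullets above rely on.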
Where Pith is reading between the lines
- The same external memory pattern could be tested on other video transformers that currently truncate context, such as those used for action recognition or scene flow.
- Because the injector is zero-initialized, the method may allow plug-in upgrades to existing deployed VGGT models with minimal retraining.
- Linear scaling suggests the architecture could support hour-long video inputs where quadratic attention becomes impossible.
Load-bearing premise
Zero-convolutional layers can fuse the external memory into the existing patch-token stream without introducing drift or breaking alignment with the pre-trained VGGT weights.
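The part of this premise that holds by construction can be checked directly. The sketch below assumes the injector is a residual branch whose projection starts at zero (a 1x1 convolution acts on each token as a linear map of this kind). The class name `ZeroInitInjector` and the shapes are hypothetical; the paper's actual layer layout is not given in this summary.

```python
import numpy as np

class ZeroInitInjector:
    """Hypothetical residual injector: patch_tokens + (memory @ W + b), with W and b starting at zero."""

    def __init__(self, dim):
        self.weight = np.zeros((dim, dim))  # zero-initialized projection
        self.bias = np.zeros(dim)

    def __call__(self, patch_tokens, memory):
        injected = memory @ self.weight + self.bias  # shape (dim,)
        return patch_tokens + injected               # broadcast over all patch tokens

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(196, 32))
memory = rng.normal(size=32)

injector = ZeroInitInjector(dim=32)
out_init = injector(patch_tokens, memory)
print(np.array_equal(out_init, patch_tokens))  # True: features are untouched at initialization

# Simulate a small weight change, as training would produce.
injector.weight += 0.01 * rng.normal(size=(32, 32))
out_trained = injector(patch_tokens, memory)
print(np.abs(out_trained - patch_tokens).max())
```

This only demonstrates that the pre-trained features are unchanged at initialization. Once the weights move away from zero the output does change, so the example says nothing about drift after training, which is the part of the premise that remains open.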
What would settle it
Measure trajectory accumulation error and spatial consistency on video sequences longer than the original VGGT window size; if Mamba-VGGT shows no reduction in drift relative to a plain truncated VGGT baseline, the benefit of the external memory is falsified.
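A minimal sketch of such a measurement follows, assuming drift is summarized as the root-mean-square position error over growing prefixes of the trajectory. The metric choice, the function names, and the synthetic trajectories are assumptions for illustration; the paper's own metric is not specified in this summary, and real evaluations usually align the estimated trajectory to ground truth first, which is omitted here.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMS position error between estimated and ground-truth camera positions, both (T, 3)."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def drift_curve(est_xyz, gt_xyz, step):
    """Error over growing prefixes, to show how it accumulates with sequence length."""
    lengths = range(step, est_xyz.shape[0] + 1, step)
    return [(n, ate_rmse(est_xyz[:n], gt_xyz[:n])) for n in lengths]

# Synthetic stand-in data, not results from any model.
T = 300
t = np.linspace(0.0, 1.0, T)
gt = np.stack([10.0 * t, np.sin(6.0 * t), np.zeros(T)], axis=1)
rng = np.random.default_rng(0)
drifting = gt + np.cumsum(rng.normal(scale=0.01, size=(T, 3)), axis=0)  # random-walk error
bounded = gt + rng.normal(scale=0.01, size=(T, 3))                      # non-accumulating error

for name, est in (("drifting", drifting), ("bounded", bounded)):
    print(name, drift_curve(est, gt, step=100))
```

Running the same curve for Mamba-VGGT and for a truncated VGGT baseline on sequences longer than the window size would show whether the error keeps growing with length in one and not the other.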
Original abstract
Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mamba-VGGT, an extension of Visual Geometry Grounded Transformers (VGGT) for long-sequence video geometry reconstruction. It introduces a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows using selective state-space modeling to distill and propagate global geometric priors, bypassing quadratic attention limits. A Zero-Init Spatial Memory Injector with zero-convolutional layers fuses this persistent memory into patch tokens to preserve pre-trained spatial features. Experiments claim improved spatial consistency and reduced trajectory accumulation errors over prior VGGT methods, providing a linear-complexity approach for persistent 3D world modeling.
Significance. If the central claims on drift reduction hold under rigorous testing, the work offers a practical path to scalable long-horizon geometry reasoning in video-based 3D reconstruction, addressing a key limitation of transformer-based models. The explicit external memory via Mamba and zero-init fusion mechanism represent a targeted architectural contribution that could generalize to other geometry-grounded tasks, though its impact hinges on demonstrating that the added components do not introduce new error sources.
major comments (2)
- §3 (Method), description of Zero-Init Spatial Memory Injector: the claim that zero-convolutional layers ensure 'structural stability and seamless feature alignment' without new drift is load-bearing for the long-sequence consistency result, yet no equations or analysis show how the zero-init preserves orthogonality to the pre-trained VGGT manifold or prevents compounding misalignment across sliding windows.
- §4 (Experiments), trajectory error tables: reported reductions in accumulation errors lack ablations isolating the SWM module versus the injector alone, and no error bars or statistical significance tests are referenced, making it impossible to confirm the gains are attributable to the proposed memory mechanism rather than implementation details.
minor comments (2)
- Abstract and §1: the term 'catastrophic geometric forgetting' is used without a precise definition or reference to prior VGGT failure modes; a short formalization of the drift metric would improve clarity.
- Notation: the external memory token is described as 'distilling global geometric priors' but its dimensionality and update rule relative to patch tokens are not explicitly contrasted with standard Mamba state updates.
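For reference when making that contrast, the standard selective state-space recurrence from the Mamba paper can be written as below. The symbols follow the usual convention, with hidden state $h_t$, input $x_t$, output $y_t$, and step size $\Delta_t$; how the paper's memory token maps onto these quantities is not stated in this summary, so the comparison itself remains an open question.

```latex
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,\\
\bar{A}_t &= \exp(\Delta_t A), \qquad
\bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t .
\end{aligned}
```

Here $\Delta_t$, $B_t$, and $C_t$ are computed from the input $x_t$, which is what makes the model selective. A precise notation section would say whether the external memory token is the state $h_t$ itself, a readout such as $y_t$, or a separate vector with its own dimensionality and update.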
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the method description and experimental validation.
- Referee: §3 (Method), description of Zero-Init Spatial Memory Injector: the claim that zero-convolutional layers ensure 'structural stability and seamless feature alignment' without new drift is load-bearing for the long-sequence consistency result, yet no equations or analysis show how the zero-init preserves orthogonality to the pre-trained VGGT manifold or prevents compounding misalignment across sliding windows.
  Authors: We agree that the current description would benefit from explicit equations and analysis. In the revised manuscript we will add a formal definition of the zero-init convolutional layers, showing that zero initialization produces an initial output of exactly zero and therefore leaves the pre-trained VGGT patch features unchanged at the first fusion step. We will also include a short derivation illustrating that the subsequent adaptive fusion is a convex combination controlled by a learned scalar, which bounds the deviation from the original manifold and limits drift accumulation across successive sliding windows. These additions will directly support the stability claim.
  Revision: yes
- Referee: §4 (Experiments), trajectory error tables: reported reductions in accumulation errors lack ablations isolating the SWM module versus the injector alone, and no error bars or statistical significance tests are referenced, making it impossible to confirm the gains are attributable to the proposed memory mechanism rather than implementation details.
  Authors: We acknowledge the value of isolating the contributions of each component and of providing statistical support. We will add new ablation tables that evaluate the model with (i) SWM only, (ii) injector only, and (iii) both modules, using the same training protocol. In addition, we will rerun the main trajectory-error experiments with multiple random seeds, report mean and standard deviation, and include paired statistical significance tests against the VGGT baseline. These revisions will make the source of the observed improvements clearer.
  Revision: yes
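The reporting promised in the second response could look roughly like the sketch below, which computes mean and standard deviation per configuration and a paired t statistic across seeds using only the Python standard library. The function names are hypothetical and the numbers are placeholders chosen for illustration, not results from the paper.

```python
import math
import statistics

def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples, where a[i] and b[i] share a seed or sequence."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Placeholder trajectory errors per seed (NOT results from the paper).
baseline = [0.52, 0.49, 0.55, 0.51, 0.53]
variant = [0.47, 0.46, 0.50, 0.45, 0.49]

print("baseline mean, std:", summarize(baseline))
print("variant mean, std:", summarize(variant))
print("paired t (df = %d):" % (len(baseline) - 1), paired_t_statistic(baseline, variant))
```

Turning the t statistic into a p-value requires the t distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` performs the same paired test and returns the p-value directly.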
Circularity Check
No circularity in derivation chain
The paper introduces the Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector as architectural additions to VGGT for handling long sequences. No equations are shown that define a quantity in terms of itself or that fit a parameter on data and then rename the fit as an independent prediction. The central claims rest on the proposed modules leveraging selective state-space modeling and zero-initialized convolutions, with performance asserted via experiments rather than any self-referential reduction. No load-bearing self-citations or uniqueness theorems imported from the authors' prior work are evident in the provided text that would collapse the derivation to prior inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Selective state-space modeling can distill and propagate global geometric priors across temporal windows without loss of critical spatial information.
- domain assumption: Zero-initialized convolutional layers can adaptively fuse external memory into pre-trained patch tokens while preserving structural stability.
invented entities (2)
- Sliding Window Mamba (SWM) memory module: no independent evidence
- Zero-Init Spatial Memory Injector: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows... Zero-Init Spatial Memory Injector... zero-convolutional layers
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: Mamba-based temporal encoding... Selective State-Space Model (SSM)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.