Recognition: 2 theorem links · Lean Theorem
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
Pith reviewed 2026-05-15 14:38 UTC · model grok-4.3
The pith
FrameVGGT organizes each frame's KV contribution as a coherent segment summarized by key-space prototypes to bound memory while preserving multi-view geometric support in streaming VGGT.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FrameVGGT is a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. It summarizes each segment with a lightweight key-space prototype and maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier for difficult long-horizon intervals. This yields favorable accuracy-memory trade-offs across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation while maintaining more stable geometry over long streams.
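For concreteness, here is a minimal sketch of the streaming loop this claim describes. Every name (FrameSegment, summarize, update_memory, CAPACITY) is illustrative rather than the paper's API, and the eviction rule shown (drop the most redundant prototype) is a simple stand-in for the paper's coverage-based selection.

```python
# Minimal sketch of a bounded frame-level KV memory; all interfaces are
# hypothetical stand-ins, not the paper's implementation.
from dataclasses import dataclass
import numpy as np

CAPACITY = 8  # fixed number of frame-level segments kept in memory

@dataclass
class FrameSegment:
    kv: np.ndarray         # one frame's incremental KV contribution, (tokens, dim)
    prototype: np.ndarray  # lightweight key-space summary, (dim,)

def summarize(keys: np.ndarray) -> np.ndarray:
    """Mean-pool a frame's keys into an l2-normalized prototype."""
    proto = keys.mean(axis=0)
    return proto / (np.linalg.norm(proto) + 1e-8)

def update_memory(memory: list[FrameSegment], seg: FrameSegment) -> list[FrameSegment]:
    """Keep at most CAPACITY segments; when full, evict the segment whose
    prototype is most redundant (closest to its nearest neighbor)."""
    memory = memory + [seg]
    if len(memory) <= CAPACITY:
        return memory
    protos = np.stack([s.prototype for s in memory])
    sims = protos @ protos.T
    np.fill_diagonal(sims, -np.inf)    # ignore self-similarity
    redundancy = sims.max(axis=1)      # nearest-neighbor similarity per segment
    drop = int(np.argmax(redundancy))  # most redundant segment is evicted
    return [s for i, s in enumerate(memory) if i != drop]
```

Whole frames are inserted or evicted as units, which is the sense in which within-frame evidence is never fragmented by the budget.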
What carries the argument
frame-level KV segments summarized by lightweight key-space prototypes that preserve redundant multi-view geometric support within bounded memory
If this is right
- Frame-level organization avoids fragmenting within-frame evidence under fixed memory budgets.
- More stable geometry is maintained over long streams than with token-level retention.
- Favorable accuracy-memory trade-offs hold across 3D reconstruction, video depth estimation, and camera pose estimation.
- Optional sparse anchors extend support to difficult long-horizon intervals without exceeding capacity.
Where Pith is reading between the lines
- The design implies that geometric tasks gain more from preserving frame-level coherence than from token-level compression, unlike language-model caching.
- Adaptive segment selection based on geometric uncertainty could extend efficiency gains to scenes with varying complexity.
- The approach may transfer to other streaming multi-view problems such as visual SLAM where long-term consistency matters.
Load-bearing premise
Summarizing each frame's KV contribution into a key-space prototype retains the redundant multi-view information needed for coherent geometric reasoning.
What would settle it
A controlled experiment on long-sequence 3D reconstruction in which token-level retention, under identical memory limits, produces higher accuracy or greater long-term stability than FrameVGGT would falsify the central claim.
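That protocol is a budget-matched ablation. As a sketch of what "identical memory limits" means operationally, the toy below holds the token budget fixed across a token-level and a frame-level retention policy; the policies and synthetic keys are placeholders, not the paper's methods or data.

```python
# Toy budget-matched comparison: both retention policies see the same stream
# and the same token budget. Policies and data are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
stream = [rng.normal(size=(16, 32)) for _ in range(100)]  # fake per-frame keys
BUDGET = 128  # identical memory limit (in tokens) for both policies

def token_level(frames, budget):
    """Keep the highest-norm tokens, regardless of frame of origin."""
    tokens = np.concatenate(frames)
    order = np.argsort(-np.linalg.norm(tokens, axis=1))
    return tokens[order[:budget]]

def frame_level(frames, budget):
    """Keep whole recent frames until the budget is exhausted."""
    kept, used = [], 0
    for f in reversed(frames):
        if used + len(f) > budget:
            break
        kept.append(f)
        used += len(f)
    return np.concatenate(kept)

# The real test would compare reconstruction accuracy and pose drift under
# these matched budgets; here we only confirm both respect the same limit.
for policy in (token_level, frame_level):
    kept = policy(stream, BUDGET)
    print(policy.__name__, kept.shape)
```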
Original abstract
Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV-cache grows unbounded over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier for difficult long-horizon intervals. Across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation, FrameVGGT achieves favorable accuracy-memory trade-offs under bounded memory while maintaining more stable geometry over long streams.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FrameVGGT, a bounded explicit-memory framework for streaming Visual Geometry Transformers. It organizes each frame's incremental KV contribution as a coherent frame-level segment, summarizes segments via lightweight key-space prototypes, maintains a fixed-capacity memory of complementary segments, and adds an optional sparse anchor tier for difficult long-horizon intervals. The central claim is that this geometry-aligned design yields favorable accuracy-memory trade-offs under bounded memory while delivering more stable geometry than token-level retention across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation.
Significance. If the empirical claims hold, the work addresses a practical deployment barrier in online 3D perception by replacing unbounded KV growth with a fixed-budget, frame-coherent memory that better preserves multi-view geometric support. The design choice is motivated by domain observation rather than parameter fitting, and the optional anchor tier offers a concrete mechanism for handling long-horizon drift.
major comments (3)
- [Abstract] The claim of 'favorable accuracy-memory trade-offs' and 'more stable geometry' is stated without quantitative results, ablation tables, or error metrics, so the magnitude and reliability of the improvement cannot be assessed from the provided text.
- [Method] Frame-level segment and key-space prototype construction: the central assumption that lightweight prototypes on frame-level segments preserve the redundant multi-view evidence (epipolar consistency, depth coherence) required for stable long-horizon geometry is not directly tested; no measurement of information loss relative to the original token-level KV is reported, leaving the bounded-memory advantage unverified.
- [Experiments] Without reported numbers on memory usage, accuracy deltas, or stability metrics (e.g., pose drift over sequence length) for the three tasks, it is impossible to confirm that the prototype summarization does not discard the very geometric constraints that token-level retention is said to fragment.
minor comments (1)
- [Method] Notation for 'key-space prototype' and 'sparse anchor tier' should be defined with explicit equations or pseudocode on first use to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the quantitative support and clarity of our claims.
Point-by-point responses
- Referee: [Abstract] The claim of 'favorable accuracy-memory trade-offs' and 'more stable geometry' is stated without quantitative results, ablation tables, or error metrics, so the magnitude and reliability of the improvement cannot be assessed from the provided text.
Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will insert concise highlights of key results (e.g., memory reduction percentages, accuracy deltas, and stability metrics across the three tasks) drawn directly from the experimental tables. revision: yes
- Referee: [Method] Frame-level segment and key-space prototype construction: the central assumption that lightweight prototypes on frame-level segments preserve the redundant multi-view evidence (epipolar consistency, depth coherence) required for stable long-horizon geometry is not directly tested; no measurement of information loss relative to the original token-level KV is reported, leaving the bounded-memory advantage unverified.
Authors: This is a fair observation. While downstream task performance provides indirect evidence, we will add a targeted analysis (new subsection or appendix) that quantifies information preservation, for instance by reporting geometric consistency metrics or reconstruction fidelity differences between the full token-level KV and the prototype summaries (a minimal probe of this kind is sketched after these responses). revision: yes
- Referee: [Experiments] Without reported numbers on memory usage, accuracy deltas, or stability metrics (e.g., pose drift over sequence length) for the three tasks, it is impossible to confirm that the prototype summarization does not discard the very geometric constraints that token-level retention is said to fragment.
Authors: The manuscript already contains these metrics in the experimental section, including tables for memory usage, accuracy deltas, and long-sequence stability (pose drift) on all three tasks. We will revise the text so that these results are explicitly linked to the method claims and referenced from the abstract and method sections. revision: partial
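The analysis promised in the method response could take roughly this shape: a probe that compares a single query's attention output under the full token-level KV against a one-prototype-per-frame summary. Everything below (shapes, scaling, the mean-pooling summary rule) is an assumption for illustration, not the authors' code.

```python
# Hedged information-loss probe: attention output with full per-frame KV
# vs. a one-prototype summary per frame. All details are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, n_frames, n_tokens = 32, 20, 16
K = rng.normal(size=(n_frames, n_tokens, d))
V = rng.normal(size=(n_frames, n_tokens, d))
q = rng.normal(size=d)

def attend(q, K, V):
    """Standard softmax attention for one query over flat K/V."""
    logits = K @ q / np.sqrt(K.shape[-1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

full_out = attend(q, K.reshape(-1, d), V.reshape(-1, d))

# One mean-pooled key and value per frame (an assumed summary rule).
proto_out = attend(q, K.mean(axis=1), V.mean(axis=1))

rel_err = np.linalg.norm(full_out - proto_out) / np.linalg.norm(full_out)
print(f"relative attention-output error: {rel_err:.3f}")
```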
Circularity Check
No circularity: design motivated by domain observation, no equations or self-referential reductions
Full rationale
The paper introduces FrameVGGT as an architectural proposal for bounded-memory streaming visual geometry transformers. It is motivated by the stated observation that geometry depends on redundant multi-view support, which token-level KV retention can fragment. The framework organizes frame-level segments, applies lightweight key-space prototypes, and maintains a fixed-capacity memory with an optional sparse anchor tier. No equations, derivations, or fitted parameters are presented that reduce the central claims to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The claims rest on empirical accuracy-memory trade-offs across reconstruction, depth, and pose tasks rather than on any self-referential chain. This qualifies as a self-contained design choice evaluated against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- fixed memory capacity
axioms (1)
- domain assumption: Geometry-driven reasoning depends on redundant and mutually compatible multi-view support.
invented entities (3)
- frame-level segment: no independent evidence
- key-space prototype: no independent evidence
- sparse anchor tier: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage (sketched in code below): FrameVGGT summarizes each segment with a lightweight key-space prototype v_t^(l) = (1/(H·|T_t|)) Σ_{h,τ} K_{t,h,τ}^(l), followed by ℓ2 normalization; cosine distance d(S_i, S_j) = 1 − ⟨v̄_i, v̄_j⟩; selection uses the coverage score m(S) = min_{S'} d(S, S') with a greedy update.
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: We treat each frame’s incremental KV contribution as a coherent frame-level segment … maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier.
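The formulas quoted in the first link are concrete enough to sketch. The NumPy rendering below is one reading of them; the array shapes, the seeding choice, and the greedy loop structure are assumptions, not details taken from the paper.

```python
# Sketch of the quoted prototype/coverage formulas; shapes and the greedy
# loop are assumed for illustration, not taken from the paper.
import numpy as np

def prototype(keys: np.ndarray) -> np.ndarray:
    """v_t^(l): mean of keys over heads and tokens, then l2-normalized.

    keys: (H, T, d) array of one frame's keys at one layer.
    """
    v = keys.mean(axis=(0, 1))            # (1/(H|T|)) * sum over h, tau
    return v / (np.linalg.norm(v) + 1e-8)

def cosine_distance(vi: np.ndarray, vj: np.ndarray) -> float:
    """d(S_i, S_j) = 1 - <v_i, v_j> for l2-normalized prototypes."""
    return 1.0 - float(vi @ vj)

def greedy_coverage(protos: np.ndarray, capacity: int) -> list[int]:
    """Greedily retain segments maximizing the coverage score
    m(S) = min_{S'} d(S, S') against the already-selected set."""
    selected = [0]  # seed with the first segment (an assumption)
    while len(selected) < min(capacity, len(protos)):
        best_i, best_score = -1, -1.0
        for i in range(len(protos)):
            if i in selected:
                continue
            score = min(cosine_distance(protos[i], protos[j]) for j in selected)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

Under this reading, "complementary segments" are those that maximize the minimum pairwise cosine distance among retained prototypes, a farthest-point-style selection in key space.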
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Xingdong Sheng, Shijie Mao, Yichao Yan, and Xiaokang Yang. Review on SLAM algorithms for augmented reality. Displays, 84:102806, 2024.
- [2] Ioan Andrei Bârsan, Peidong Liu, Marc Pollefeys, and Andreas Geiger. Robust dense mapping for large-scale dynamic environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7510–7517. IEEE, 2018.
- [3] Mehmet Turan, Yusuf Yigit Pilavci, Ipek Ganiyusufoglu, Helder Araujo, Ender Konukoglu, and Metin Sitti. Sparse-then-dense alignment-based 3D map reconstruction method for endoscopic capsule robots. Machine Vision and Applications, 29(2):345–359, 2018.
- [4] Ruben Mascaro and Margarita Chli. Scene representations for robotic spatial perception. Annual Review of Control, Robotics, and Autonomous Systems, 8(1):351–377, 2025.
- [5] Sonia Raychaudhuri and Angel X Chang. Semantic mapping in indoor embodied AI: a comprehensive survey and future directions. arXiv preprint arXiv:2501.05750, 2025.
- [6] Richard A Andersen and David C Bradley. Perception of three-dimensional structure from motion. Trends in Cognitive Sciences, 2(6):222–228, 1998.
- [7] Muhamad Risqi U Saputra, Andrew Markham, and Niki Trigoni. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR), 51(2):1–36, 2018.
- [8] Tanveer Hussain, Khan Muhammad, Weiping Ding, Jaime Lloret, Sung Wook Baik, and Victor Hugo C De Albuquerque. A comprehensive survey of multi-view video summarization. Pattern Recognition, 109:107567, 2021.
- [9] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, June 2024.
- [10] Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In 2025 International Conference on 3D Vision (3DV), pages 1–10. IEEE, 2025.
- [11] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
- [12] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
- [13] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025.
- [14] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025.
- [15] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36:52342–52364, 2023.
- [16] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
- [17] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
- [18] Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026.
- [19] Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, and Ngai Wong. XStreamVGGT: Extremely memory-efficient streaming vision geometry grounded transformer with KV cache compression. arXiv preprint arXiv:2601.01204, 2026.
- [20] Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3R: Training-free token eviction for memory-bounded streaming visual geometry transformers. arXiv preprint arXiv:2509.17650, 2025.
- [21] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
- [22] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
- [23] Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.
- [24] Fan Zhu, Ziyu Chen, Chunmao Jiang, Liwei Xu, Shijin Zhang, Biao Yu, and Hui Zhu. SLM-SLAM: A visual SLAM system based on segmented large-scale model in dynamic scenes and zero-shot conditions. Measurement Science and Technology, 35(8):086315, 2024.
- [25] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
- [26] Takafumi Taketomi, Hideaki Uchiyama, and Sei Ikeda. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Transactions on Computer Vision and Applications, 9(1):16, 2017.
- [27] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025.
- [28] Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, and Xiaoyang Guo. SAIL-Recon: Large SfM by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972, 2025.
- [29] Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025.
- [30] Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025.
- [31] Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3R: Long sequence streaming 3D reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5273–5284, 2025.
- [32] Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. WinT3R: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025.
- [33] Hunter Blanton, Connor Greenwell, Scott Workman, and Nathan Jacobs. Extending absolute pose regression to multiple scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 38–39, 2020.
- [34] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.
- [35] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019.
- [36] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580, 2012.