pith. machine review for the scientific record

arxiv: 2507.16443 · v2 · submitted 2025-07-22 · 💻 cs.CV

Recognition: 2 Lean theorem links

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 08:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords: long RGB sequences · kilometer-scale reconstruction · chunk-based processing · loop closure · monocular 3D reconstruction · foundation models · outdoor environments · autonomous driving

The pith

By dividing long video sequences into chunks and aligning their overlaps with lightweight loop closure, a foundation 3D model can produce accurate monocular reconstructions and trajectories over kilometer-scale outdoor paths without camera calibration, depth supervision, or retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that memory-limited foundation models for 3D vision can be made to work on very long, unbounded RGB video streams from outdoor environments. It does so by breaking each sequence into shorter chunks that fit in memory, aligning the overlapping regions between consecutive chunks, and applying a lightweight optimization step to close loops and enforce global consistency. A reader would care because this approach lets powerful pre-trained models handle the kinds of extended driving or mapping videos that currently require custom multi-sensor rigs, calibrated cameras, or depth labels. The work shows concrete results on standard driving datasets where the unmodified model fails outright.

Core claim

VGGT-Long applies chunk-based processing together with overlapping alignment and lightweight loop-closure optimization to the base VGGT foundation model. This combination allows the model to reconstruct 3D geometry and estimate trajectories on kilometer-scale RGB sequences from KITTI, Waymo, and Virtual KITTI. The resulting performance is comparable to traditional methods while using only monocular RGB input and requiring no camera calibration, depth supervision, or retraining of the underlying model.

What carries the argument

Chunk-based processing with overlapping alignment and lightweight loop closure optimization, which divides long input sequences into memory-fitting segments and stitches them into a globally consistent reconstruction.
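
A minimal sketch of that pattern, assuming per-chunk point clouds whose overlap rows correspond one-to-one. The helper names (split_into_chunks, umeyama_sim3, stitch), the closed-form Umeyama fit, and the use of a full similarity rather than a rigid transform are illustrative choices for this sketch, not the paper's implementation:

```python
import numpy as np

def split_into_chunks(n_frames, chunk_len=60, overlap=10):
    """Cover [0, n_frames) with fixed-length windows sharing `overlap` frames."""
    step = chunk_len - overlap
    return [(s, min(s + chunk_len, n_frames))
            for s in range(0, max(n_frames - overlap, 1), step)]

def umeyama_sim3(src, dst):
    """Closed-form least-squares similarity (s, R, t) with s*R@src_i + t ~ dst_i.

    src, dst: (N, 3) corresponding points from the overlap of two chunks.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    U, D, Vt = np.linalg.svd(xd.T @ xs / len(src))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    return s, R, mu_d - s * R @ mu_s

def compose(a, b):
    """Sim(3) composition a after b: apply b first, then a."""
    (sa, Ra, ta), (sb, Rb, tb) = a, b
    return sa * sb, Ra @ Rb, sa * Ra @ tb + ta

def stitch(chunk_clouds, overlap):
    """Chain overlap alignments so every chunk lands in chunk 0's frame.

    chunk_clouds[i]: (n_i, 3) points in chunk i's own frame; the last
    `overlap` rows of chunk i are assumed to correspond row-by-row to the
    first `overlap` rows of chunk i+1 (a simplification of this sketch).
    """
    T = (1.0, np.eye(3), np.zeros(3))        # identity Sim(3)
    fused = [chunk_clouds[0]]
    for prev, cur in zip(chunk_clouds, chunk_clouds[1:]):
        rel = umeyama_sim3(cur[:overlap], prev[-overlap:])  # cur -> prev frame
        T = compose(T, rel)                  # cur -> chunk 0 frame
        s, R, t = T
        fused.append(cur @ (s * R).T + t)
    return np.vstack(fused)
```

The lightweight loop-closure step described by the paper would then add constraints between revisited chunks on top of this chained estimate; the sketch covers only the sequential stitching.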

If this is right

  • Foundation models become usable on long outdoor video streams that previously exhausted their memory limits.
  • Accurate monocular trajectories and geometry become available on KITTI, Waymo, and Virtual KITTI without calibration or depth data.
  • No model retraining is required, so the same approach can be applied to other pre-trained 3D vision models.
  • Consistent large-scale reconstructions are produced across varied lighting and scene conditions typical of real driving.
  • Scalable monocular 3D perception becomes practical for autonomous-driving applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-and-align pattern could be tested on other memory-constrained foundation models to extend their range in mapping or robotics tasks.
  • Removing the need for camera calibration might allow quicker deployment of 3D reconstruction in new environments where calibration data are unavailable.
  • Hybrid systems that occasionally inject sparse traditional constraints could further reduce residual drift on even longer sequences.

Load-bearing premise

Chunk-wise alignment plus lightweight loop closure suffices to keep global consistency and metric scale intact across kilometer distances without extra geometric constraints or supervision.

What would settle it

A clear increase in trajectory drift or scale inconsistency when the same long RGB sequence is reconstructed with and without the loop-closure step and then compared against ground-truth poses or LiDAR maps.
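
One way to run that comparison, sketched under the assumption of timestamp-matched (N, 3) position arrays. The ate_rmse helper below is hypothetical, mirroring the standard Umeyama-aligned ATE protocol rather than the paper's evaluation code; toggling with_scale separates metric-scale drift, which a similarity fit silently absorbs, from rotational and translational drift:

```python
import numpy as np

def ate_rmse(est, gt, with_scale=True):
    """RMSE of absolute trajectory error after Umeyama alignment.

    est, gt: (N, 3) timestamp-matched camera positions. with_scale=False
    uses a rigid fit, so residual metric-scale drift shows up in the error
    instead of being absorbed by the alignment.
    """
    mu_e, mu_g = est.mean(0), gt.mean(0)
    xe, xg = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(xg.T @ xe / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = (np.trace(np.diag(D) @ S) / ((xe ** 2).sum() / len(est))
         if with_scale else 1.0)
    aligned = s * xe @ R.T + mu_g
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

# The proposed ablation: same sequence, with vs. without loop closure.
# drift_from_loops = ate_rmse(est_no_lc, gt) - ate_rmse(est_with_lc, gt)
# scale_component  = ate_rmse(est_with_lc, gt, with_scale=False) \
#                    - ate_rmse(est_with_lc, gt, with_scale=True)
```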

read the original abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces VGGT-Long, a chunk-based extension of the VGGT foundation model for monocular 3D reconstruction on kilometer-scale RGB sequences. It processes input in overlapping chunks, performs rigid alignment on overlaps, and applies lightweight loop-closure optimization to enforce global consistency. The central claim is that this yields trajectory and reconstruction accuracy comparable to traditional calibrated or supervised methods on KITTI, Waymo, and Virtual KITTI without camera calibration, depth supervision, or retraining.

Significance. If the empirical claims hold under rigorous metric evaluation, the work would show that off-the-shelf monocular foundation models can be made practical for unbounded outdoor driving scenes, lowering the barrier to large-scale 3D perception. The absence of retraining or extra supervision is a notable practical strength.

major comments (1)
  1. [Abstract and method overview] The skeptic concern about residual scale ambiguity is load-bearing: because VGGT produces per-chunk outputs up to unknown scale and the method adds no explicit scale anchors or depth supervision, any mismatch between adjacent chunks can accumulate as metric drift over kilometer trajectories. The abstract and method description do not provide quantitative evidence (e.g., scale-error plots or absolute trajectory error breakdowns) that the lightweight loop-closure optimizer corrects slow scale drift rather than only rotational/positional drift.
minor comments (2)
  1. [Abstract] The abstract states 'comparable performance' but does not specify the exact metrics (ATE, RPE, reconstruction error) or the precise baselines used; a table comparing absolute numbers would strengthen the claim.
  2. [Method description] Clarify the exact number of variables and iteration budget of the 'lightweight loop closure' optimizer; this detail is needed to assess whether it can realistically resolve scale drift.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the scale ambiguity concern as a key point. We agree that clearer quantitative support for scale consistency is valuable and have revised the manuscript to strengthen this aspect while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract and method overview] The skeptic concern about residual scale ambiguity is load-bearing: because VGGT produces per-chunk outputs up to unknown scale and the method adds no explicit scale anchors or depth supervision, any mismatch between adjacent chunks can accumulate as metric drift over kilometer trajectories. The abstract and method description do not provide quantitative evidence (e.g., scale-error plots or absolute trajectory error breakdowns) that the lightweight loop-closure optimizer corrects slow scale drift rather than only rotational/positional drift.

    Authors: We acknowledge that the original abstract and method overview did not explicitly quantify scale handling. In the revised manuscript we clarify that overlap alignment estimates a similarity transform (including scale) between adjacent chunks rather than a purely rigid transform, and the subsequent pose-graph loop-closure optimization treats scale as an optimizable variable to enforce global metric consistency. To supply the requested evidence we have added (i) a new scale-error plot versus trajectory length and (ii) ATE breakdowns separating rotational, translational, and scale components on the KITTI and Waymo sequences. These results, now presented in Section 4.3 and Figure 5, show that scale drift remains below 2 % even on multi-kilometer trajectories, confirming that the optimizer corrects scale drift in addition to pose drift. We have also updated the abstract and method description to summarize this scale-alignment mechanism. revision: yes
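
For concreteness, here is a toy version of the mechanism the (simulated) rebuttal describes: treating scale as an optimizable variable in the loop-closure graph. Working in log-scale makes composition additive, so the scale component reduces to linear least squares; a real optimizer would couple rotation and translation as well. The function optimize_log_scales, its noise parameters, and the gauge choice are assumptions of this sketch, not the authors' code:

```python
import numpy as np
from scipy.optimize import least_squares

def optimize_log_scales(n_chunks, odo_log_s, loops,
                        sigma_odo=0.05, sigma_loop=0.01):
    """Per-chunk log-scales from chained measurements plus loop constraints.

    odo_log_s[i]: measured log scale-change between chunks i and i+1
                  (from overlap alignment), length n_chunks - 1.
    loops: list of (i, j, log_s_ij) relative scale measurements produced
           by loop detection (chunk j revisits chunk i's location).
    """
    def residuals(x):
        r = [(x[i + 1] - x[i] - m) / sigma_odo
             for i, m in enumerate(odo_log_s)]
        r += [(x[j] - x[i] - m) / sigma_loop for i, j, m in loops]
        r.append(x[0] * 1e3)                 # gauge: chunk 0 fixes the scale
        return np.asarray(r)

    sol = least_squares(residuals, x0=np.zeros(n_chunks))
    return np.exp(sol.x)                     # multiplicative scale per chunk

# Toy run: 5% multiplicative drift per chunk, one loop saying chunk 9
# revisits chunk 0 at equal scale. The loop pulls the chain back near 1.
scales = optimize_log_scales(10, np.full(9, np.log(1.05)), loops=[(0, 9, 0.0)])
```

Counting variables in this reduced form also speaks to the referee's second minor comment: the scale block adds only one unknown per chunk, which is what makes the loop-closure stage "lightweight".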

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard alignment on external foundation model outputs

full rationale

The paper's core approach—chunking long sequences, performing overlapping rigid alignment, and applying lightweight loop closure—operates on the outputs of the pre-existing VGGT model without redefining VGGT's internal quantities or fitting parameters to the target metrics. Evaluations on KITTI, Waymo, and Virtual KITTI supply external benchmarks, and no equations or claims reduce the reported trajectory/reconstruction accuracy to quantities defined by the method itself or to self-citations whose validity depends on the present work. The central claim therefore retains independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach inherits standard assumptions from visual SLAM (rigid scene, sufficient texture for alignment) and from the base VGGT model; no new free parameters or invented entities are declared in the abstract.

pith-pipeline@v0.9.0 · 5509 in / 997 out tokens · 89221 ms · 2026-05-17T08:41:36.975777+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/Atomicity atomic_tick · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization.

  • Foundation/LedgerCanonicality HasLocalComposition · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    To correct the accumulated drift inherent in sequential estimation, we perform loop closure detection across the entire sequence.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

    cs.RO 2026-05 unverdicted novelty 7.0

    LEXI-SG is the first monocular RGB system for dense open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.

  2. PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preservin...

  3. Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.

  4. Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

    cs.RO 2026-04 unverdicted novelty 7.0

    CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

  5. STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

    cs.CV 2026-03 unverdicted novelty 7.0

    STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...

  6. FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.

  7. FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    cs.CV 2025-09 conditional novelty 7.0

    FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.

  8. Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

    cs.CV 2026-05 unverdicted novelty 6.0

    RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

  9. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  10. Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.

  11. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  12. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  13. Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

    cs.CV 2026-05 unverdicted novelty 5.0

    A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.

  14. ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 5.0

    ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.

  15. MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

    cs.RO 2026-04 unverdicted novelty 5.0

    MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.

  16. MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

    cs.RO 2026-04 unverdicted novelty 5.0

    MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

  17. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  18. Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 4.0

    A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 17 Pith papers

  1. [1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.

  2. [2] Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6301–6310, 2021.

  3. [3] Igor Cvišić, Ivan Marković, and Ivan Petrović. SOFT2: Stereo visual odometry for road vehicles based on a point-to-epipolar-line metric. IEEE Transactions on Robotics, 39(1):273–288, 2022.

  4. [4] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.

  5. [5] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  6. [6] Kai Deng, Yigong Zhang, Jian Yang, and Jin Xie. GigaSLAM: Large-scale monocular SLAM with hierarchical Gaussian splats. arXiv preprint arXiv:2503.08071, 2025.

  7. [7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

  8. [8] Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building Rome on a cloudless day. In European Conference on Computer Vision, pages 368–381. Springer, 2010.

  9. [9] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

  10. [10] Xiang Gao, Rui Wang, Nikolaus Demmel, and Daniel Cremers. LDSO: Direct sparse odometry with loop closure. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204. IEEE, 2018.

  11. [11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.

  12. [12] Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free structure from motion. CVPR, 2024.

  13. [13] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17658–17668, 2024.

  14. [14] Krishna Murthy Jatavallabhula, Soroush Saryazdi, Ganesh Iyer, and Liam Paull. gradSLAM: Automagically differentiable SLAM. arXiv preprint arXiv:1910.10672, 2019.

  15. [15] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.

  16. [16] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, 2021.

  17. [17] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.

  18. [18] Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. In European Conference on Computer Vision, pages 424–440. Springer, 2025.

  19. [19] Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549, 2025.

  20. [20] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  21. [21] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

  22. [22] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

  23. [23] Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025.

  24. [24] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jégou, Julien Mairal, Patri...

  25. [25] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP algorithm. In Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling, pages 145–152. IEEE, 2001.

  26. [26] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  27. [27] Cameron Smith, David Charatan, Ayush Tewari, and Vincent Sitzmann. FlowMap: High-quality camera poses, intrinsics, and depth via gradient descent. arXiv preprint arXiv:2404.15259, 2024.

  28. [28] Hauke Strasdat, J. Montiel, and Andrew J. Davison. Scale drift-aware large scale monocular SLAM. Robotics: Science and Systems VI, 2(3):7, 2010.

  29. [29] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012.

  30. [30] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.

  31. [31] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

  32. [32] Zachary Teed and Jia Deng. DeepV2D: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.

  33. [33] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021.

  34. [34] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems, 36, 2024.

  35. [35] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment: a modern synthesis. In International Workshop on Vision Algorithms, pages 298–372. Springer, 1999.

  36. [36] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: Learning local features with policy gradient. Advances in Neural Information Processing Systems, 33:14254–14265, 2020.

  37. [37] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5038–5047, 2017.

  38. [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  39. [39] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024.

  40. [40] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  41. [41] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.

  42. [42] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  43. [43] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928, 2025.

  44. [44] Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In Proceedings of the European Conference on Computer Vision (ECCV), pages 817–833, 2018.

  45. [45] Huangying Zhan, Chamara Saroj Weerasekera, Jia-Wang Bian, Ravi Garg, and Ian Reid. DF-VO: What should be learnt for visual odometry? arXiv preprint arXiv:2103.00933, 2021.

  46. [46] Ji Zhang and Sanjiv Singh. Visual-lidar odometry and mapping: Low-drift, robust, and fast. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2174–2181. IEEE, 2015.

  47. [47] Ji Zhang, Sanjiv Singh, et al. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems, pages 1–9. Berkeley, CA, 2014.

[Figure 8 residue from the source PDF: trajectory visualizations for KITTI sequences 00–10 (sequence lengths 394 m to 5067 m); caption truncated in extraction.]