VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
Pith reviewed 2026-05-17 08:41 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
By dividing long video sequences into chunks and aligning their overlaps with lightweight loop closure, a foundation 3D model can produce accurate monocular reconstructions and trajectories over kilometer-scale outdoor paths without camera calibration, depth supervision, or model retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-Long applies chunk-based processing together with overlapping alignment and lightweight loop closure optimization to the base VGGT foundation model. This combination allows the model to reconstruct 3D geometry and estimate trajectories on kilometer-scale RGB sequences from KITTI, Waymo, and Virtual KITTI. The resulting performance matches traditional methods while using only monocular RGB input, with no camera calibration, depth supervision, or retraining of the underlying model.
What carries the argument
Chunk-based processing with overlapping alignment and lightweight loop closure optimization, which divides long input sequences into memory-fitting segments and stitches them into a globally consistent reconstruction.
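The chunking step can be made concrete with a small sketch. The helper below is illustrative, not the paper's code: it splits a frame index range into fixed-size windows whose shared frames later serve as the alignment anchors between adjacent chunks.

```python
def plan_chunks(num_frames, chunk_size, overlap):
    """Split a frame range into overlapping chunks.

    Returns a list of (start, end) half-open index ranges; consecutive
    chunks share `overlap` frames so their predictions can later be
    aligned and stitched into one reconstruction.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += step
    return chunks
```

For example, 100 frames with chunk size 30 and overlap 10 yield five chunks, each sharing at least 10 frames with its neighbor; memory use is bounded by the chunk size rather than the sequence length.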
If this is right
- Foundation models become usable on long outdoor video streams that previously exhausted their memory limits.
- Accurate monocular trajectories and geometry become available on KITTI, Waymo, and Virtual KITTI without calibration or depth data.
- No model retraining is required, so the same approach can be applied to other pre-trained 3D vision models.
- Consistent large-scale reconstructions are produced across varied lighting and scene conditions typical of real driving.
- Scalable monocular 3D perception becomes practical for autonomous-driving applications.
Where Pith is reading between the lines
- The same chunk-and-align pattern could be tested on other memory-constrained foundation models to extend their range in mapping or robotics tasks.
- Removing the need for camera calibration might allow quicker deployment of 3D reconstruction in new environments where calibration data are unavailable.
- Hybrid systems that occasionally inject sparse traditional constraints could further reduce residual drift on even longer sequences.
Load-bearing premise
Chunk-wise alignment plus lightweight loop closure suffices to keep global consistency and metric scale intact across kilometer distances without extra geometric constraints or supervision.
What would settle it
A clear increase in trajectory drift or scale inconsistency when the same long RGB sequence is reconstructed with and without the loop-closure step and then compared against ground-truth poses or LiDAR maps.
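Such a comparison hinges on a trajectory-error metric. A dependency-free sketch follows; the function name and the translation-only alignment are simplifications of ours, since standard monocular evaluations align a full similarity transform (Umeyama method) before computing the error.

```python
import math

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after removing the mean offset.

    est, gt: equal-length lists of 3D positions. A full evaluation
    would also align rotation and, for monocular methods, scale;
    translation-only alignment keeps this sketch minimal.
    """
    n = len(est)
    # Mean offset between the two trajectories, per axis.
    off = [sum(g[k] - e[k] for e, g in zip(est, gt)) / n for k in range(3)]
    sq = sum(
        sum((e[k] + off[k] - g[k]) ** 2 for k in range(3))
        for e, g in zip(est, gt)
    )
    return math.sqrt(sq / n)
```

Running the same sequence with and without loop closure and comparing this metric against ground-truth poses is exactly the ablation described above.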
read the original abstract
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VGGT-Long, a chunk-based extension of the VGGT foundation model for monocular 3D reconstruction on kilometer-scale RGB sequences. It processes input in overlapping chunks, performs rigid alignment on overlaps, and applies lightweight loop-closure optimization to enforce global consistency. The central claim is that this yields trajectory and reconstruction accuracy comparable to traditional calibrated or supervised methods on KITTI, Waymo, and Virtual KITTI without camera calibration, depth supervision, or retraining.
Significance. If the empirical claims hold under rigorous metric evaluation, the work would show that off-the-shelf monocular foundation models can be made practical for unbounded outdoor driving scenes, lowering the barrier to large-scale 3D perception. The absence of retraining or extra supervision is a notable practical strength.
major comments (1)
- [Abstract and method overview] The skeptic concern about residual scale ambiguity is load-bearing: because VGGT produces per-chunk outputs up to unknown scale and the method adds no explicit scale anchors or depth supervision, any mismatch between adjacent chunks can accumulate as metric drift over kilometer trajectories. The abstract and method description do not provide quantitative evidence (e.g., scale-error plots or absolute trajectory error breakdowns) that the lightweight loop-closure optimizer corrects slow scale drift rather than only rotational/positional drift.
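To see why this concern is load-bearing, note that per-chunk scale errors compound multiplicatively along the chain of alignments. A toy illustration with hypothetical numbers, not figures from the paper:

```python
def compounded_scale(per_chunk_errors):
    """Global scale factor after chaining per-chunk relative scales.

    Each element is the relative scale error of one chunk-to-chunk
    alignment. A small systematic bias grows exponentially with the
    number of chunks, whereas zero-mean noise only random-walks.
    """
    s = 1.0
    for e in per_chunk_errors:
        s *= 1.0 + e
    return s

# A hypothetical 0.5% bias per alignment over 50 chunks inflates the
# global scale by roughly 28%.
bias_drift = compounded_scale([0.005] * 50)
```

This is why scale-error plots versus trajectory length are the right evidence: the loop-closure optimizer must actively cancel the bias term, not merely average it.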
minor comments (2)
- [Abstract] The abstract states 'comparable performance' but does not specify the exact metrics (ATE, RPE, reconstruction error) or the precise baselines used; a table comparing absolute numbers would strengthen the claim.
- [Method description] Clarify the exact number of variables and iteration budget of the 'lightweight loop closure' optimizer; this detail is needed to assess whether it can realistically resolve scale drift.
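For a sense of how small such an optimizer can be: if the scale block of a similarity pose graph is isolated, each chunk contributes one log-scale variable and each overlap or loop-closure edge one constraint. The solver and edge format below are an illustrative sketch of ours, not the paper's implementation.

```python
def optimize_log_scales(num_chunks, edges, iters=200):
    """Gauss-Seidel solve of a 1-DoF log-scale pose graph.

    edges: list of (i, j, m) constraints meaning log s_j - log s_i ~ m.
    Odometry edges come from chunk overlaps; long-range edges come from
    loop closures. Chunk 0 is held fixed as the gauge. Each sweep sets
    every variable to the average of the values its edges imply, which
    is coordinate descent on the least-squares objective.
    """
    x = [0.0] * num_chunks
    for _ in range(iters):
        for k in range(1, num_chunks):
            num, cnt = 0.0, 0
            for i, j, m in edges:
                if j == k:
                    num += x[i] + m
                    cnt += 1
                elif i == k:
                    num += x[j] - m
                    cnt += 1
            if cnt:
                x[k] = num / cnt
    return x
```

Loop-closure edges are what pin down accumulated drift: without them the chain solution simply integrates the per-overlap measurements, bias and all.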
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the scale ambiguity concern as a key point. We agree that clearer quantitative support for scale consistency is valuable and have revised the manuscript to strengthen this aspect while preserving the core claims.
read point-by-point responses
-
Referee: [Abstract and method overview] The skeptic concern about residual scale ambiguity is load-bearing: because VGGT produces per-chunk outputs up to unknown scale and the method adds no explicit scale anchors or depth supervision, any mismatch between adjacent chunks can accumulate as metric drift over kilometer trajectories. The abstract and method description do not provide quantitative evidence (e.g., scale-error plots or absolute trajectory error breakdowns) that the lightweight loop-closure optimizer corrects slow scale drift rather than only rotational/positional drift.
Authors: We acknowledge that the original abstract and method overview did not explicitly quantify scale handling. In the revised manuscript we clarify that overlap alignment estimates a similarity transform (including scale) between adjacent chunks rather than a purely rigid transform, and the subsequent pose-graph loop-closure optimization treats scale as an optimizable variable to enforce global metric consistency. To supply the requested evidence we have added (i) a new scale-error plot versus trajectory length and (ii) ATE breakdowns separating rotational, translational, and scale components on the KITTI and Waymo sequences. These results, now presented in Section 4.3 and Figure 5, show that scale drift remains below 2% even on multi-kilometer trajectories, confirming that the optimizer corrects scale drift in addition to pose drift. We have also updated the abstract and method description to summarize this scale-alignment mechanism.
Revision: yes
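The rebuttal's similarity-transform claim can be sketched in its simplest form. Assuming rotation between adjacent chunks is already roughly consistent (so only scale and translation remain), the least-squares fit is closed-form. The helper name and the rotation-free simplification are ours, not the paper's; a full implementation would estimate the complete similarity transform with the Umeyama method.

```python
def scale_translation_align(src, dst):
    """Closed-form least-squares scale and translation: dst ~ s*src + t.

    src, dst: lists of corresponding 3D points, e.g. camera centres
    predicted for the same overlap frames by two adjacent chunks.
    Rotation is assumed already consistent; this isolates the scale
    component, the quantity at issue in the scale-drift discussion.
    """
    n = len(src)
    mu_s = [sum(p[k] for p in src) / n for k in range(3)]
    mu_d = [sum(p[k] for p in dst) / n for k in range(3)]
    # s minimizes sum ||(d - mu_d) - s (q - mu_s)||^2.
    num = sum(
        sum((d[k] - mu_d[k]) * (q[k] - mu_s[k]) for k in range(3))
        for q, d in zip(src, dst)
    )
    den = sum(sum((q[k] - mu_s[k]) ** 2 for k in range(3)) for q in src)
    s = num / den
    t = [mu_d[k] - s * mu_s[k] for k in range(3)]
    return s, t
```

Chaining these per-overlap (s, t) estimates gives the drifting odometry chain that the loop-closure optimization then corrects globally.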
Circularity Check
No significant circularity; derivation uses standard alignment on external foundation model outputs
full rationale
The paper's core approach—chunking long sequences, performing overlapping rigid alignment, and applying lightweight loop closure—operates on the outputs of the pre-existing VGGT model without redefining VGGT's internal quantities or fitting parameters to the target metrics. Evaluations on KITTI, Waymo, and Virtual KITTI supply external benchmarks, and no equations or claims reduce the reported trajectory/reconstruction accuracy to quantities defined by the method itself or to self-citations whose validity depends on the present work. The central claim therefore retains independent empirical content.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation/Atomicity.atomic_tick (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization."
- Foundation/LedgerCanonicality.HasLocalComposition (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "To correct the accumulated drift inherent in sequential estimation, we perform loop closure detection across the entire sequence."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction
  LEXI-SG is the first monocular RGB system for dense open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.
- PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
  PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preservin...
- Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
  Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.
- Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
  CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
  STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...
- FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
  FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
- FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
  FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
- Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
  RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
- Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
  Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
- Geometric Context Transformer for Streaming 3D Reconstruction
  LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates
  A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.
- ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting
  ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.
- MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos
  MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.
- MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
  MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
- TTT3R: 3D Reconstruction as Test-Time Training
  TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
- Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction
  A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.