pith. machine review for the scientific record.

arxiv: 2408.16061 · v1 · pith:LRH4IMVRnew · submitted 2024-08-28 · 💻 cs.CV

3D Reconstruction with Spatial Memory

Pith reviewed 2026-05-17 07:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · pointmaps · spatial memory · transformer · global alignment · dense reconstruction · DUSt3R

The pith

Spann3R maintains an external spatial memory to regress per-image pointmaps directly in a global coordinate system from ordered or unordered image collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spann3R as a transformer-based model for dense 3D reconstruction that extends the DUSt3R approach. Instead of predicting pairwise pointmaps in local frames and then aligning them with optimization, Spann3R keeps an external spatial memory of all prior 3D information and queries it to output the next pointmap already expressed in global coordinates. This removes the post-processing alignment step while still leveraging DUSt3R's pre-trained weights. The method is tested on various datasets and reports competitive accuracy with real-time capability for sequential inputs.

Core claim

Spann3R directly regresses per-image pointmaps expressed in a global coordinate system by maintaining an external spatial memory that retains and retrieves all previous relevant 3D information, thereby eliminating any need for optimization-based global alignment after local predictions.

What carries the argument

An external spatial memory module that stores prior 3D information and is queried by the transformer to produce the next frame's pointmap in global coordinates.
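
As a rough illustration of what querying such a memory could look like, the sketch below assumes a key-value store that is read with scaled dot-product attention. The source only states that the memory is queried by the transformer; the class name `SpatialMemory`, the `write` and `read` methods, and the toy feature vectors are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SpatialMemory:
    """Toy key-value memory (hypothetical; not the paper's implementation)."""

    def __init__(self):
        self.keys = []    # one feature vector per stored entry
        self.values = []  # the information to retrieve for that entry

    def write(self, key, value):
        """Append one key-value pair, e.g. after a frame has been processed."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        """Return an attention-weighted average of the stored values."""
        if not self.keys:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in self.keys
        ]
        weights = softmax(scores)
        dim = len(self.values[0])
        return [
            sum(w * value[d] for w, value in zip(weights, self.values))
            for d in range(dim)
        ]


if __name__ == "__main__":
    memory = SpatialMemory()
    memory.write([10.0, 0.0], [1.0, 2.0, 3.0])
    memory.write([0.0, 10.0], [4.0, 5.0, 6.0])
    # A query aligned with the first key retrieves mostly the first value.
    print(memory.read([10.0, 0.0]))
```

In this toy setup the readout is a soft lookup: the closer the query is to a stored key, the more that entry's value dominates the result. How keys, values, and writes are actually defined in Spann3R is not specified in this review.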

Load-bearing premise

The external spatial memory can reliably retain and retrieve all relevant prior 3D information across arbitrary ordered or unordered image collections without drift or loss of consistency.

What would settle it

A clear accumulation of drift or inconsistent geometry when the model processes a long unordered image sequence would show that the spatial memory fails to maintain global consistency.
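
One way such a check could be set up, assuming ground-truth 3D points are available for every frame, is to compute the mean point error per frame and fit a line to those errors against the frame index. A clearly positive slope would be one sign of accumulating drift. The function names and the least-squares slope are illustrative choices, not an evaluation protocol from the paper.

```python
import math


def mean_point_error(pred_points, gt_points):
    """Mean Euclidean distance between corresponding 3D points of one frame."""
    if len(pred_points) != len(gt_points) or not pred_points:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_points, gt_points))
    return total / len(pred_points)


def drift_slope(per_frame_errors):
    """Least-squares slope of error versus frame index (error units per frame)."""
    n = len(per_frame_errors)
    if n < 2:
        raise ValueError("need at least two frames to fit a slope")
    mean_index = (n - 1) / 2
    mean_error = sum(per_frame_errors) / n
    covariance = sum(
        (i - mean_index) * (e - mean_error)
        for i, e in enumerate(per_frame_errors)
    )
    variance = sum((i - mean_index) ** 2 for i in range(n))
    return covariance / variance


if __name__ == "__main__":
    # Hypothetical per-frame errors in processing order.
    errors = [0.0, 0.1, 0.2, 0.3]
    print(drift_slope(errors))
```

A slope near zero over a long sequence would be consistent with the memory holding global consistency; repeating the fit over several random processing orders would probe the unordered case.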

Original abstract

We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time. Project page: https://hengyiwang.github.io/projects/spanner

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Spann3R, an extension of the DUSt3R framework for dense 3D reconstruction from ordered or unordered image collections. It introduces an external spatial memory queried via a transformer to regress per-image pointmaps directly in a shared global coordinate system, thereby claiming to eliminate the need for optimization-based global alignment. The method reuses DUSt3R pre-trained weights, performs additional fine-tuning on dataset subsets, and asserts competitive performance, generalization to unseen data, and real-time processing for ordered sequences.

Significance. If the core claim holds—that the learned spatial memory reliably maintains global consistency without drift or explicit optimization—this would constitute a practical advance for real-time dense reconstruction pipelines in robotics and AR, where post-processing bundle adjustment is often a bottleneck. The reuse of pre-trained weights and focus on direct global regression could reduce computational overhead compared to traditional two-stage approaches.

major comments (2)
  1. [Abstract] The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.
  2. [Method, spatial memory description] The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.
minor comments (1)
  1. [Abstract] The project page URL is referenced but no supplementary material or code release is mentioned to support reproducibility of the fine-tuning procedure.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will strengthen the presentation. The experiments in the full paper support the core claims regarding global consistency via learned spatial memory.

  1. Referee: [Abstract] The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.

    Authors: The abstract is intentionally concise and high-level, as is standard. The full manuscript provides quantitative support in the Experiments section, including direct comparisons to DUSt3R followed by optimization-based alignment, ablation studies on the memory module, and metrics demonstrating competitive accuracy, generalization to unseen datasets, and real-time inference on ordered sequences. To address the concern, we will revise the abstract to briefly reference key quantitative results (e.g., improved or comparable reconstruction metrics without post-processing). revision: yes

  2. Referee: [Method, spatial memory description] The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.

    Authors: The spatial memory is trained end-to-end on large-scale data with diverse view orders, allowing the transformer to learn implicit consistency and drift correction without hand-crafted losses or regularization terms. Experiments on both ordered and unordered collections (including arbitrary processing orders) show that global alignment is maintained directly, outperforming or matching DUSt3R plus optimization in several metrics. We acknowledge the value of explicit discussion on error propagation and will add a clarifying paragraph in the Method section on how training mitigates this risk for unordered inputs. revision: partial

Circularity Check

0 steps flagged

Minor self-citation present but central architectural claim remains independent

The paper extends DUSt3R by introducing an external spatial memory queried via transformer to regress global pointmaps directly. This is framed as a new learned mechanism rather than a mathematical re-expression or fit of DUSt3R's local per-pair outputs. Pre-trained weights are leveraged with additional fine-tuning on external data subsets, which is standard transfer learning and does not reduce the global prediction to prior fitted parameters by construction. No equations or self-citation chains are shown that force the no-optimization claim; the memory module is presented as an independent addition for consistency.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unproven capacity of the learned spatial memory to maintain global consistency; the paper supplies no independent evidence or formal guarantee for this capacity beyond the architecture description.

axioms (1)
  • [domain assumption] A transformer equipped with an external memory module can learn to maintain 3D consistency across image sequences without explicit pose estimation.
    Invoked when the paper states that querying the spatial memory suffices to place new pointmaps in the global frame.
invented entities (1)
  • External spatial memory (no independent evidence)
    purpose: Stores and retrieves prior 3D information so that each new image's pointmap is predicted directly in global coordinates.
    New component introduced to replace post-hoc global alignment; no independent falsifiable prediction (e.g., specific memory capacity or retrieval accuracy) is supplied.

pith-pipeline@v0.9.0 · 5489 in / 1446 out tokens · 74704 ms · 2026-05-17T07:25:46.136038+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

    cs.CV 2026-04 unverdicted novelty 7.0

    Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.

  2. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

  3. STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

    cs.CV 2026-03 unverdicted novelty 7.0

    STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...

  4. ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

    cs.CV 2026-03 unverdicted novelty 7.0

    ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

  5. 3AM: 3egment Anything with Geometric Consistency in Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  6. $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    cs.CV 2025-07 conditional novelty 7.0

    π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...

  7. Long-tail Internet photo reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.

  8. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  9. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  10. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  11. HD-VGGT: High-Resolution Visual Geometry Transformer

    cs.CV 2026-03 unverdicted novelty 6.0

    HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while mo...

  12. Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

    cs.CV 2025-11 unverdicted novelty 6.0

    Co-Me distills a confidence predictor to selectively merge low-confidence tokens in visual geometric transformers, delivering up to 21.5x speedup on VGGT and 20.4x on Pi3 while preserving spatial coverage and performance.

  13. Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration

    cs.CV 2025-11 unverdicted novelty 6.0

    Lumos3D enables pose-free single-forward restoration of low-light 3D scenes via cross-illumination distillation from a teacher network and a custom Lumos loss on 3D Gaussians.

  14. PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

    cs.CV 2025-10 unverdicted novelty 6.0

    PAGE-4D is a feedforward extension of VGGT that uses a dynamics-aware aggregator and mask to disentangle pose estimation from geometry reconstruction in videos with moving objects.

  15. Streaming 4D Visual Geometry Transformer

    cs.CV 2025-07 unverdicted novelty 6.0

    A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.

  16. ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting

    cs.RO 2026-04 unverdicted novelty 5.0

    ReefMapGS closes the loop between multimodal SLAM and 3D Gaussian Splatting to deliver COLMAP-free incremental 3D reconstruction and improved AUV trajectory estimates for underwater reef surveys up to 700 m.

  17. FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

    cs.CV 2026-04 unverdicted novelty 5.0

    FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.

  18. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  19. ViPE: Video Pose Engine for 3D Geometric Perception

    cs.CV 2025-08 unverdicted novelty 5.0

    ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

  20. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 20 Pith papers · 1 internal anchor

  1. [1]

    Large-scale data for multiple-view stereopsis

    Henrik Aanæs, Rasmus Ramsbøl Jensen, George V ogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. IJCV, 120:153–168, 2016. 5, 6

  2. [2]

    Building rome in a day

    Sameer Agarwal, Noah Snavely, Ian Simon, Steven M Seitz, and Richard Szeliski. Building rome in a day. InICCV, pages 72–79, 2009. 1, 2

  3. [3]

    Bundle adjustment in the large

    Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In ECCV, pages 29–42, 2010. 1, 2

  4. [4]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, ´Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brach- mann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022. 8

  5. [5]

    Human mem- ory: A proposed system and its control processes

    Richard C Atkinson and Richard M Shiffrin. Human mem- ory: A proposed system and its control processes. In Psy- chology of learning and motivation, pages 89–195. Elsevier,

  6. [6]

    Neural rgb-d surface reconstruction

    Dejan Azinovi ´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In CVPR, pages 6290–6301, 2022. 5, 6

  7. [7]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, pages 5470– 5479, 2022. 3, 8

  8. [8]

    Zip-nerf: Anti-aliased grid- based neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid- based neural radiance fields. In ICCV, pages 19697–19705,

  9. [9]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Fei- gin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS Datasets and Benchmarks ,

  10. [10]

    Speeded-up robust features (surf)

    Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vi- sion and image understanding, 110(3):346–359, 2008. 1, 2

  11. [11]

    Nope-nerf: Optimising neural ra- diance field with no pose prior

    Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural ra- diance field with no pose prior. In CVPR, pages 4160–4169,

  12. [12]

    Codeslam—learning a compact, optimisable representation for dense visual slam

    Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J Davison. Codeslam—learning a compact, optimisable representation for dense visual slam. In CVPR, pages 2560–2568, 2018. 1

  13. [13]

    Dsac-differentiable ransac for camera localization

    Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. In CVPR, pages 6684–6692, 2017. 2

  14. [14]

    Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses

    Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses. In CVPR, pages 5044–5053, 2023

  15. [15]

    Scene coordinate reconstruction: Pos- ing of image collections via incremental learning of a relo- calizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cav- allari, ´Aron Monszpart, Daniyar Turmukhambetov, and Vic- tor Adrian Prisacariu. Scene coordinate reconstruction: Pos- ing of image collections via incremental learning of a relo- calizer. arXiv preprint arXiv:2404.14351, 2024. 2

  16. [16]

    Xmem: Long- term video object segmentation with an atkinson-shiffrin memory model

    Ho Kei Cheng and Alexander G Schwing. Xmem: Long- term video object segmentation with an atkinson-shiffrin memory model. In ECCV, pages 640–658. Springer, 2022. 2, 3, 4

  17. [17]

    Re- thinking space-time networks with improved memory cov- erage for efficient video object segmentation

    Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Re- thinking space-time networks with improved memory cov- erage for efficient video object segmentation. InNIPS, pages 11781–11794, 2021

  18. [18]

    Putting the object back into video object segmentation

    Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In CVPR, pages 3151–3161,

  19. [19]

    Discrete-continuous optimization for large- scale structure from motion

    David Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large- scale structure from motion. In CVPR, pages 3001–3008,

  20. [20]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 5

  21. [21]

    Monoslam: Real-time single camera slam

    Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. TPAMI, 29(6):1052–1067, 2007. 1, 2

  22. [22]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. In CVPRW, pages 224–236, 2018. 1, 2

  23. [23]

    Learning a depth covariance function

    Eric Dexheimer and Andrew J Davison. Learning a depth covariance function. In CVPR, pages 13122–13131, 2023. 2

  24. [24]

    Tapir: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. In ICCV, pages 10061–10072, 2023. 3

  25. [25]

    An image is worth 16x16 words: Trans- formers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 3, 6

  26. [26]

    Deep- videomvs: Multi-view stereo on video with recurrent spatio- temporal fusion

    Arda Duzceker, Silvano Galliani, Christoph V ogel, Pablo Speciale, Mihai Dusmanu, and Marc Pollefeys. Deep- videomvs: Multi-view stereo on video with recurrent spatio- temporal fusion. In CVPR, pages 15324–15333, 2021. 2, 6

  27. [27]

    Lsd- slam: Large-scale direct monocular slam

    Jakob Engel, Thomas Sch ¨ops, and Daniel Cremers. Lsd- slam: Large-scale direct monocular slam. In ECCV, pages 834–849, 2014. 2

  28. [28]

    Direct sparse odometry

    Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. TPAMI, 40(3):611–625, 2017. 2

  29. [29]

    Accurate, dense, and ro- bust multiview stereopsis

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and ro- bust multiview stereopsis. TPAMI, 32(8):1362–1376, 2009. 1, 2

  30. [30]

    Massively parallel multiview stereopsis by surface normal diffusion

    Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, pages 873–881, 2015. 1, 2 11

  31. [31]

    Multiple view ge- ometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view ge- ometry in computer vision . Cambridge university press,

  32. [32]

    Detector-free struc- ture from motion

    Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free struc- ture from motion. CVPR, 2024. 1

  33. [33]

    2d gaussian splatting for geometrically accu- rate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accu- rate radiance fields. In ACM SIGGRAPH, pages 1–11, 2024. 3

  34. [34]

    Photo-slam: Real-time simultaneous localization and photo- realistic mapping for monocular stereo and rgb-d cameras

    Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-slam: Real-time simultaneous localization and photo- realistic mapping for monocular stereo and rgb-d cameras. In CVPR, pages 21584–21593, 2024. 3

  35. [35]

    Codenerf: Disentan- gled neural radiance fields for object categories

    Wonbong Jang and Lourdes Agapito. Codenerf: Disentan- gled neural radiance fields for object categories. In ICCV, pages 12949–12958, 2021. 3

  36. [36]

    Nvist: In the wild new view synthesis from a single image with transformers

    Wonbong Jang and Lourdes Agapito. Nvist: In the wild new view synthesis from a single image with transformers. In CVPR, pages 10181–10193, 2024. 3

  37. [37]

    Co- tracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023. 3

  38. [38]

    Repurpos- ing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. In CVPR, pages 9492–9502, 2024. 2

  39. [39]

    Splatam: Splat track & map 3d gaussians for dense rgb-d slam

    Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In CVPR, pages 21357–21366, 2024. 3

  40. [40]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 42(4):139–1, 2023. 3

  41. [41]

    Parallel tracking and map- ping for small ar workspaces

    Georg Klein and David Murray. Parallel tracking and map- ping for small ar workspaces. In ISMAR, pages 1–10, 2007. 1, 2

  42. [42]

    vmap: Vectorised object mapping for neural field slam

    Xin Kong, Shikun Liu, Marwan Taher, and Andrew J Davi- son. vmap: Vectorised object mapping for neural field slam. In CVPR, pages 952–961, 2023. 3

  43. [43]

    Megadepth: Learning single- view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. In CVPR, pages 2041–2050, 2018. 5

  44. [44]

    Pixel-perfect structure-from-motion with featuremetric refinement

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, pages 5987–5997,

  45. [45]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Polle- feys. Lightglue: Local feature matching at light speed. In ICCV, pages 17627–17638, 2023. 2

  46. [46]

    Object recognition from local scale-invariant features

    David G Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999. 1, 2

  47. [47]

    Distinctive image features from scale- invariant keypoints

    David G Lowe. Distinctive image features from scale- invariant keypoints. IJCV, 60:91–110, 2004. 1, 2

  48. [48]

    Gaussian splatting slam

    Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting slam. In CVPR, pages 18039– 18048, 2024. 3

  49. [49]

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

    Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016. 5

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. In ECCV, pages 405–421, 2020. 3, 8

  51. [51]

    Key-value mem- ory networks for directly reading documents

    Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value mem- ory networks for directly reading documents. In EMNLP, pages 1400–1409, 2016. 2, 3

  52. [52]

    Orb-slam: a versatile and accurate monocular slam system

    Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics , 31(5):1147–1163,

  53. [53]

    Kinectfusion: Real-time dense surface mapping and track- ing

    Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and track- ing. In ISMAR, pages 127–136, 2011. 2

  54. [54]

    Dtam: Dense tracking and mapping in real-time

    Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In ICCV, pages 2320–2327, 2011. 1, 2

  55. [55]

    Video object segmentation using space-time memory networks

    Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, pages 9226–9235, 2019. 3

  56. [56]

    Global structure-from-motion re- visited

    Linfei Pan, D ´aniel Bar ´ath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global structure-from-motion re- visited. In ECCV, 2024. 2

  57. [57]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021. 6

  58. [58]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, pages 10901–10911, 2021. 5

  59. [59]

    Sacreg: Scene-agnostic coordinate regression for visual localization

    Jerome Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, and Philippe Weinzaepfel. Sacreg: Scene-agnostic coordinate regression for visual localization. In CVPRW, pages 688–698, 2024. 2

  60. [60]

    Orb: An efficient alternative to sift or surf

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In ICCV, pages 2564–2571, 2011. 1, 2

  61. [61]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, pages 4938–4947, 2020. 1, 2

  62. [62]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In ICCV, pages 9339–9347,

  63. [63]

    Simplerecon: 3d reconstruction without 3d convolutions

    Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions. In ECCV, pages 1–19, 2022. 2, 6

  64. [64]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016. 1, 2, 3

  65. [65]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, pages 501–518,

  66. [66]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, pages 3260–3269, 2017. 8

  67. [67]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, pages 2930–2937, 2013. 5, 6

  68. [68]

    Photo tourism: exploring photo collections in 3d

    Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. TOG, 25(3):835–846, 2006. 1, 2

  69. [69]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, pages 18221–18232, 2024. 3

  70. [70]

    A benchmark for the evaluation of rgb-d slam systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, pages 573–580, 2012. 8

  71. [71]

    imap: Implicit mapping and positioning in real-time

    Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. imap: Implicit mapping and positioning in real-time. In ICCV, pages 6229–6238, 2021. 3

  72. [72]

    End-to-end memory networks

    Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS, pages 2440–2448, 2015. 2, 3

  73. [73]

    Loftr: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In CVPR, pages 8922–8931, 2021. 1

  74. [74]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020. 5

  75. [75]

    Optimizing the viewing graph for structure-from-motion

    Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, pages 801–809, 2015. 1, 2

  76. [76]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419, 2020. 3

  77. [77]

    Bundle adjustment—a modern synthesis

    Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In ICCVW, pages 298–372, 2000. 1, 2

  78. [78]

    Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam

    Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In CVPR, pages 13293–13302, 2023. 3, 5

  79. [79]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In CVPR, pages 21686–21697, 2024. 1

  80. [80]

    Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NIPS, pages 27171–27183, 2021. 3

Showing first 80 references.