arxiv: 2408.16061 · v1 · pith:LRH4IMVRnew · submitted 2024-08-28 · 💻 cs.CV

3D Reconstruction with Spatial Memory

Hengyi Wang, Lourdes Agapito

Pith reviewed 2026-05-17 07:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstruction · pointmaps · spatial memory · transformer · global alignment · dense reconstruction · DUSt3R

The pith

Spann3R maintains an external spatial memory to regress per-image pointmaps directly in a global coordinate system from ordered or unordered image collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spann3R as a transformer-based model for dense 3D reconstruction that extends the DUSt3R approach. Instead of predicting pairwise pointmaps in local frames and then aligning them with optimization, Spann3R keeps an external spatial memory of all prior 3D information and queries it to output the next pointmap already expressed in global coordinates. This removes the post-processing alignment step while still leveraging DUSt3R's pre-trained weights. The method is tested on various datasets and reports competitive accuracy with real-time capability for sequential inputs.
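To make the removed step concrete: aligning a locally expressed pointmap to a global frame amounts to solving for a transform that maps one point set onto another. The sketch below solves a deliberately simplified version of that problem in plain Python, a least-squares scale and translation with no rotation. A full alignment would also have to account for rotation and for many views jointly, so this is only an illustration of the kind of fitting involved; the function name `fit_scale_translation` and the toy points are hypothetical and do not come from the paper.

```python
def fit_scale_translation(src, dst):
    """Least-squares scale s and translation t such that s * src + t ~= dst.

    Simplified illustration: rotation is assumed to be identity.
    Hypothetical helper, not taken from DUSt3R or Spann3R.
    """
    n = len(src)
    if n == 0 or n != len(dst):
        raise ValueError("src and dst must be non-empty and the same length")
    # Centroids of both point sets.
    src_mean = [sum(p[d] for p in src) / n for d in range(3)]
    dst_mean = [sum(q[d] for q in dst) / n for d in range(3)]
    # Closed-form scale from the centred point sets.
    num = sum((p[d] - src_mean[d]) * (q[d] - dst_mean[d])
              for p, q in zip(src, dst) for d in range(3))
    den = sum((p[d] - src_mean[d]) ** 2 for p in src for d in range(3))
    if den == 0:
        raise ValueError("source points are degenerate")
    scale = num / den
    # Translation that maps the scaled source centroid onto the target centroid.
    translation = [dst_mean[d] - scale * src_mean[d] for d in range(3)]
    return scale, translation


# Toy data: the "global" points are the local ones scaled by 2 and shifted.
local_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
global_points = [(2 * x + 0.5, 2 * y - 1.0, 2 * z + 3.0) for x, y, z in local_points]
scale, translation = fit_scale_translation(local_points, global_points)
```

On the toy data the fit recovers the scale and shift used to generate the target points. The point of the summary above is that Spann3R is claimed to avoid running any such fitting stage after prediction.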

Core claim

Spann3R directly regresses per-image pointmaps expressed in a global coordinate system by maintaining an external spatial memory that retains and retrieves all previous relevant 3D information, thereby eliminating any need for optimization-based global alignment after local predictions.

What carries the argument

An external spatial memory module that stores prior 3D information and is queried by the transformer to produce the next frame's pointmap in global coordinates.
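As a rough illustration of what querying such a memory could look like, the sketch below implements a generic key-value attention readout in plain Python. It is not the paper's architecture: the function name `read_memory`, the scaled dot-product scoring, and the toy keys and values are all assumptions chosen for illustration.

```python
import math


def read_memory(query, keys, values):
    """Softmax-weighted readout of `values`, scored by scaled dot product.

    Hypothetical helper for illustration only; not Spann3R's memory module.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and the same length")
    scale = 1.0 / math.sqrt(len(query))
    # Similarity of the query to every stored key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the stored values.
    dim = len(values[0])
    readout = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return readout, weights


# Toy memory with two stored entries; the query is closest to the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
readout, weights = read_memory([4.0, 0.0], keys, values)
```

Here the query matches the first key much more strongly than the second, so the readout is dominated by the first stored value. Whether a learned version of this retrieval stays consistent over long sequences is exactly the premise examined below.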

Load-bearing premise

The external spatial memory can reliably retain and retrieve all relevant prior 3D information across arbitrary ordered or unordered image collections without drift or loss of consistency.

What would settle it

A clear accumulation of drift or inconsistent geometry when the model processes a long unordered image sequence would show that the spatial memory fails to maintain global consistency.
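One way to make this test concrete is to track per-frame error against reference geometry and check whether it grows with frame index. The sketch below does that in plain Python. The function names, the use of a least-squares slope as a drift indicator, and the toy data are assumptions for illustration, not an evaluation protocol taken from the paper.

```python
import math


def mean_point_error(pred, ref):
    """Mean Euclidean distance between corresponding 3D points."""
    if not pred or len(pred) != len(ref):
        raise ValueError("pred and ref must be non-empty and the same length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


def drift_slope(errors):
    """Least-squares slope of error against frame index (error units per frame)."""
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two frames to estimate a slope")
    mean_i = (n - 1) / 2.0
    mean_e = sum(errors) / n
    num = sum((i - mean_i) * (e - mean_e) for i, e in enumerate(errors))
    den = sum((i - mean_i) ** 2 for i in range(n))
    return num / den


# Toy sequence: each frame's prediction is shifted by 0.01 * frame index along x.
ref_frames = [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)] for _ in range(5)]
pred_frames = [[(x + 0.01 * i, y, z) for (x, y, z) in frame]
               for i, frame in enumerate(ref_frames)]
errors = [mean_point_error(p, r) for p, r in zip(pred_frames, ref_frames)]
slope = drift_slope(errors)
```

A slope near zero over a long sequence would be consistent with the memory holding global consistency; a clearly positive slope would indicate accumulating drift. For unordered collections the same check could be repeated over several processing orders.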

Original abstract

We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time. Project page: https://hengyiwang.github.io/projects/spanner

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Spann3R adds an external spatial memory to DUSt3R so it can regress pointmaps straight into a global frame and skip optimization, but the abstract gives no numbers to show whether the memory actually prevents drift or delivers the claimed real-time gains.

The key point here is that Spann3R modifies the DUSt3R approach by adding an external spatial memory that the transformer can query to regress pointmaps directly in a global frame, skipping the usual optimization step for alignment. What stands out as new is this memory mechanism for handling consistency across images, whether ordered or not. It builds directly on DUSt3R's pre-trained model and fine-tunes it on some datasets to claim real-time processing for ordered collections and competitive results on unseen ones.

The paper does a decent job of describing the architecture in the abstract and positioning it as a simplification for applications like robotics and AR. The idea of managing prior 3D info in memory to predict the next frame globally is a clear architectural shift.

On the soft spots, the abstract makes claims about competitive performance and real-time capability but doesn't include any numbers, ablations, or comparisons to show how well the memory maintains consistency. This makes it tough to assess if the no-optimization claim holds up, especially for unordered image sets where processing order is arbitrary and errors might build up without correction. The stress test note about potential drift seems relevant until the full paper shows drift curves or alignment errors. If the full manuscript has quantitative results and failure mode analysis, that would strengthen it a lot. Right now, the soundness feels limited by the lack of evidence in the summary.

This paper is aimed at researchers in computer vision working on dense 3D reconstruction pipelines. A reader interested in learned alternatives to bundle adjustment or global optimization would get some value from the architectural idea, even if they need to see the experiments to be convinced. I'd recommend sending it to peer review because the core idea is distinct and could be useful if the results check out, though it will likely need revisions for more data.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Spann3R, an extension of the DUSt3R framework for dense 3D reconstruction from ordered or unordered image collections. It introduces an external spatial memory queried via a transformer to regress per-image pointmaps directly in a shared global coordinate system, thereby claiming to eliminate the need for optimization-based global alignment. The method reuses DUSt3R pre-trained weights, performs additional fine-tuning on dataset subsets, and asserts competitive performance, generalization to unseen data, and real-time processing for ordered sequences.

Significance. If the core claim holds—that the learned spatial memory reliably maintains global consistency without drift or explicit optimization—this would constitute a practical advance for real-time dense reconstruction pipelines in robotics and AR, where post-processing bundle adjustment is often a bottleneck. The reuse of pre-trained weights and focus on direct global regression could reduce computational overhead compared to traditional two-stage approaches.

major comments (2)

[Abstract] Abstract: The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.
[Method] Method (spatial memory description): The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.

minor comments (1)

[Abstract] The project page URL is referenced but no supplementary material or code release is mentioned to support reproducibility of the fine-tuning procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will strengthen the presentation. The experiments in the full paper support the core claims regarding global consistency via learned spatial memory.

Referee: [Abstract] Abstract: The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.

Authors: The abstract is intentionally concise and high-level, as is standard. The full manuscript provides quantitative support in the Experiments section, including direct comparisons to DUSt3R followed by optimization-based alignment, ablation studies on the memory module, and metrics demonstrating competitive accuracy, generalization to unseen datasets, and real-time inference on ordered sequences. To address the concern, we will revise the abstract to briefly reference key quantitative results (e.g., improved or comparable reconstruction metrics without post-processing). revision: yes
Referee: [Method] Method (spatial memory description): The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.

Authors: The spatial memory is trained end-to-end on large-scale data with diverse view orders, allowing the transformer to learn implicit consistency and drift correction without hand-crafted losses or regularization terms. Experiments on both ordered and unordered collections (including arbitrary processing orders) show that global alignment is maintained directly, outperforming or matching DUSt3R plus optimization in several metrics. We acknowledge the value of explicit discussion on error propagation and will add a clarifying paragraph in the Method section on how training mitigates this risk for unordered inputs. revision: partial

Circularity Check

0 steps flagged

Minor self-citation present but central architectural claim remains independent

The paper extends DUSt3R by introducing an external spatial memory queried via transformer to regress global pointmaps directly. This is framed as a new learned mechanism rather than a mathematical re-expression or fit of DUSt3R's local per-pair outputs. Pre-trained weights are leveraged with additional fine-tuning on external data subsets, which is standard transfer learning and does not reduce the global prediction to prior fitted parameters by construction. No equations or self-citation chains are shown that force the no-optimization claim; the memory module is presented as an independent addition for consistency.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unproven capacity of the learned spatial memory to maintain global consistency; the paper supplies no independent evidence or formal guarantee for this capacity beyond the architecture description.

axioms (1)

domain assumption: A transformer equipped with an external memory module can learn to maintain 3D consistency across image sequences without explicit pose estimation.
Invoked when the paper states that querying the spatial memory suffices to place new pointmaps in the global frame.

invented entities (1)

External spatial memory (no independent evidence)
purpose: Stores and retrieves prior 3D information so that each new image's pointmap is predicted directly in global coordinates.
New component introduced to replace post-hoc global alignment; no independent falsifiable prediction (e.g., specific memory capacity or retrieval accuracy) is supplied.

pith-pipeline@v0.9.0 · 5489 in / 1446 out tokens · 74704 ms · 2026-05-17T07:25:46.136038+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/SimplicialLedger and Foundation/LedgerCanonicality SimplicialLedger.eight_tick_uniqueness and LedgerCanonicality.HasMultilevelComposition echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system.
Foundation/DimensionForcing alexander_duality_circle_linking and dimension_forced echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment.
Foundation/DiscretenessForcing discreteness_forcing_principle refines

Relation between the paper passage and the cited Recognition theorem.

Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
cs.CV 2026-04 unverdicted novelty 7.0
Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
cs.CV 2026-04 unverdicted novelty 7.0
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
cs.CV 2026-03 unverdicted novelty 7.0
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
cs.CV 2026-03 unverdicted novelty 7.0
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

3AM: 3egment Anything with Geometric Consistency in Videos
cs.CV 2026-01 unverdicted novelty 7.0
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

$\pi^3$: Permutation-Equivariant Visual Geometry Learning
cs.CV 2025-07 conditional novelty 7.0
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...

Long-tail Internet photo reconstruction
cs.CV 2026-04 unverdicted novelty 6.0
Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

HD-VGGT: High-Resolution Visual Geometry Transformer
cs.CV 2026-03 unverdicted novelty 6.0
HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while mo...

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
cs.CV 2025-11 unverdicted novelty 6.0
Co-Me distills a confidence predictor to selectively merge low-confidence tokens in visual geometric transformers, delivering up to 21.5x speedup on VGGT and 20.4x on Pi3 while preserving spatial coverage and performance.

Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration
cs.CV 2025-11 unverdicted novelty 6.0
Lumos3D enables pose-free single-forward restoration of low-light 3D scenes via cross-illumination distillation from a teacher network and a custom Lumos loss on 3D Gaussians.

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation
cs.CV 2025-10 unverdicted novelty 6.0
PAGE-4D is a feedforward extension of VGGT that uses a dynamics-aware aggregator and mask to disentangle pose estimation from geometry reconstruction in videos with moving objects.

Streaming 4D Visual Geometry Transformer
cs.CV 2025-07 unverdicted novelty 6.0
A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.

ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting
cs.RO 2026-04 unverdicted novelty 5.0
ReefMapGS closes the loop between multimodal SLAM and 3D Gaussian Splatting to deliver COLMAP-free incremental 3D reconstruction and improved AUV trajectory estimates for underwater reef surveys up to 700 m.

FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
cs.CV 2026-04 unverdicted novelty 5.0
FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.

TTT3R: 3D Reconstruction as Test-Time Training
cs.CV 2025-09 unverdicted novelty 5.0
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV 2025-08 unverdicted novelty 5.0
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

VGGT-SLAM++
cs.CV 2026-04 unverdicted novelty 4.0
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 20 Pith papers · 1 internal anchor

[1]

Large-scale data for multiple-view stereopsis

Henrik Aanæs, Rasmus Ramsbøl Jensen, George V ogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. IJCV, 120:153–168, 2016. 5, 6

work page 2016
[2]

Building rome in a day

Sameer Agarwal, Noah Snavely, Ian Simon, Steven M Seitz, and Richard Szeliski. Building rome in a day. InICCV, pages 72–79, 2009. 1, 2

work page 2009
[3]

Bundle adjustment in the large

Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In ECCV, pages 29–42, 2010. 1, 2

work page 2010
[4]

Map-free visual relocalization: Metric pose relative to a single image

Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, ´Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brach- mann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022. 8

work page 2022
[5]

Human mem- ory: A proposed system and its control processes

Richard C Atkinson and Richard M Shiffrin. Human mem- ory: A proposed system and its control processes. In Psy- chology of learning and motivation, pages 89–195. Elsevier,

work page
[6]

Neural rgb-d surface reconstruction

Dejan Azinovi ´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In CVPR, pages 6290–6301, 2022. 5, 6

work page 2022
[7]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, pages 5470– 5479, 2022. 3, 8

work page 2022
[8]

Zip-nerf: Anti-aliased grid- based neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid- based neural radiance fields. In ICCV, pages 19697–19705,

work page
[9]

Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Fei- gin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS Datasets and Benchmarks ,

work page
[10]

Speeded-up robust features (surf)

Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vi- sion and image understanding, 110(3):346–359, 2008. 1, 2

work page 2008
[11]

Nope-nerf: Optimising neural ra- diance field with no pose prior

Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural ra- diance field with no pose prior. In CVPR, pages 4160–4169,

work page
[12]

Codeslam—learning a compact, optimisable representation for dense visual slam

Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J Davison. Codeslam—learning a compact, optimisable representation for dense visual slam. In CVPR, pages 2560–2568, 2018. 1

work page 2018
[13]

Dsac-differentiable ransac for camera localization

Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. In CVPR, pages 6684–6692, 2017. 2

work page 2017
[14]

Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses

Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses. In CVPR, pages 5044–5053, 2023

work page 2023
[15]

Scene coordinate reconstruction: Pos- ing of image collections via incremental learning of a relo- calizer

Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cav- allari, ´Aron Monszpart, Daniyar Turmukhambetov, and Vic- tor Adrian Prisacariu. Scene coordinate reconstruction: Pos- ing of image collections via incremental learning of a relo- calizer. arXiv preprint arXiv:2404.14351, 2024. 2

work page arXiv 2024
[16]

Xmem: Long- term video object segmentation with an atkinson-shiffrin memory model

Ho Kei Cheng and Alexander G Schwing. Xmem: Long- term video object segmentation with an atkinson-shiffrin memory model. In ECCV, pages 640–658. Springer, 2022. 2, 3, 4

work page 2022
[17]

Re- thinking space-time networks with improved memory cov- erage for efficient video object segmentation

Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Re- thinking space-time networks with improved memory cov- erage for efficient video object segmentation. InNIPS, pages 11781–11794, 2021

work page 2021
[18]

Putting the object back into video object segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In CVPR, pages 3151–3161,

work page
[19]

Discrete-continuous optimization for large- scale structure from motion

David Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large- scale structure from motion. In CVPR, pages 3001–3008,

work page
[20]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 5

work page 2017
[21]

Monoslam: Real-time single camera slam

Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. TPAMI, 29(6):1052–1067, 2007. 1, 2

work page 2007
[22]

Superpoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. In CVPRW, pages 224–236, 2018. 1, 2

work page 2018
[23]

Learning a depth covariance function

Eric Dexheimer and Andrew J Davison. Learning a depth covariance function. In CVPR, pages 13122–13131, 2023. 2

work page 2023
[24]

Tapir: Tracking any point with per-frame initialization and temporal refinement

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. In ICCV, pages 10061–10072, 2023. 3

work page 2023
[25]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 3, 6

work page 2020
[26]

Deep- videomvs: Multi-view stereo on video with recurrent spatio- temporal fusion

Arda Duzceker, Silvano Galliani, Christoph V ogel, Pablo Speciale, Mihai Dusmanu, and Marc Pollefeys. Deep- videomvs: Multi-view stereo on video with recurrent spatio- temporal fusion. In CVPR, pages 15324–15333, 2021. 2, 6

work page 2021
[27]

Lsd- slam: Large-scale direct monocular slam

Jakob Engel, Thomas Sch ¨ops, and Daniel Cremers. Lsd- slam: Large-scale direct monocular slam. In ECCV, pages 834–849, 2014. 2

work page 2014
[28]

Direct sparse odometry

Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. TPAMI, 40(3):611–625, 2017. 2

work page 2017
[29]

Accurate, dense, and ro- bust multiview stereopsis

Yasutaka Furukawa and Jean Ponce. Accurate, dense, and ro- bust multiview stereopsis. TPAMI, 32(8):1362–1376, 2009. 1, 2

work page 2009
[30]

Massively parallel multiview stereopsis by surface normal diffusion

Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, pages 873–881, 2015. 1, 2 11

work page 2015
[31]

Multiple view ge- ometry in computer vision

Richard Hartley and Andrew Zisserman. Multiple view ge- ometry in computer vision . Cambridge university press,

work page
[32]

Detector-free struc- ture from motion

Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free struc- ture from motion. CVPR, 2024. 1

work page 2024
[33]

2d gaussian splatting for geometrically accu- rate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accu- rate radiance fields. In ACM SIGGRAPH, pages 1–11, 2024. 3

work page 2024
[34]

Photo-slam: Real-time simultaneous localization and photo- realistic mapping for monocular stereo and rgb-d cameras

Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-slam: Real-time simultaneous localization and photo- realistic mapping for monocular stereo and rgb-d cameras. In CVPR, pages 21584–21593, 2024. 3

work page 2024
[35]

Codenerf: Disentan- gled neural radiance fields for object categories

Wonbong Jang and Lourdes Agapito. Codenerf: Disentan- gled neural radiance fields for object categories. In ICCV, pages 12949–12958, 2021. 3

work page 2021
[36]

Nvist: In the wild new view synthesis from a single image with transformers

Wonbong Jang and Lourdes Agapito. Nvist: In the wild new view synthesis from a single image with transformers. In CVPR, pages 10181–10193, 2024. 3

work page 2024
[37]

Co- tracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023. 3

work page arXiv 2023
[38]

Repurpos- ing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. In CVPR, pages 9492–9502, 2024. 2

work page 2024
[39]

Splatam: Splat track & map 3d gaussians for dense rgb-d slam

Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In CVPR, pages 21357–21366, 2024. 3

work page 2024
[40]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 42(4):139–1, 2023. 3

work page 2023
[41]

Parallel tracking and map- ping for small ar workspaces

Georg Klein and David Murray. Parallel tracking and map- ping for small ar workspaces. In ISMAR, pages 1–10, 2007. 1, 2

work page 2007
[42]

vmap: Vectorised object mapping for neural field slam

Xin Kong, Shikun Liu, Marwan Taher, and Andrew J Davi- son. vmap: Vectorised object mapping for neural field slam. In CVPR, pages 952–961, 2023. 3

work page 2023
[43]

Megadepth: Learning single- view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. In CVPR, pages 2041–2050, 2018. 5

[44] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In ICCV, pages 5987–5997,

[45] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In ICCV, pages 17627–17638, 2023. 2

[46] David G Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999. 1, 2

[47] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004. 1, 2

[48] Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting slam. In CVPR, pages 18039–18048, 2024. 3

[49] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pages 4040–4048, 2016. 5

[50] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421, 2020. 3, 8

[51] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In EMNLP, pages 1400–1409, 2016. 2, 3

[52] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163,

[53] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011. 2

[54] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In ICCV, pages 2320–2327, 2011. 1, 2

[55] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, pages 9226–9235, 2019. 3

[56] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In ECCV, 2024. 2

[57] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021. 6

[58] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, pages 10901–10911, 2021. 5

[59] Jerome Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, and Philippe Weinzaepfel. Sacreg: Scene-agnostic coordinate regression for visual localization. In CVPRW, pages 688–698, 2024. 2

[60] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In ICCV, pages 2564–2571, 2011. 1, 2

[61] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, pages 4938–4947, 2020. 1, 2

[62] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In ICCV, pages 9339–9347,

[63] Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. Simplerecon: 3d reconstruction without 3d convolutions. In ECCV, pages 1–19, 2022. 2, 6

[64] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016. 1, 2, 3

[65] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, pages 501–518,

[66] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, pages 3260–3269, 2017. 8

[67] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, pages 2930–2937, 2013. 5, 6

[68] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. TOG, 25(3):835–846, 2006. 1, 2

[69] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, pages 18221–18232, 2024. 3

[70] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, pages 573–580, 2012. 8

[71] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. imap: Implicit mapping and positioning in real-time. In ICCV, pages 6229–6238, 2021. 3

[72] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In NIPS, pages 2440–2448, 2015. 2, 3

[73] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In CVPR, pages 8922–8931, 2021. 1

[74] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020. 5

[75] Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, pages 801–809, 2015. 1, 2

[76] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419, 2020. 3

[77] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In ICCVW, pages 298–372, 2000. 1, 2

[78] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In CVPR, pages 13293–13302, 2023. 3, 5

[79] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In CVPR, pages 21686–21697, 2024. 1

[80] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NIPS, pages 27171–27183, 2021. 3


Showing first 80 references.