3D Reconstruction with Spatial Memory
Pith reviewed 2026-05-17 07:25 UTC · model grok-4.3
The pith
Spann3R maintains an external spatial memory to regress per-image pointmaps directly in a global coordinate system from ordered or unordered image collections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Spann3R directly regresses per-image pointmaps expressed in a global coordinate system by maintaining an external spatial memory that retains and retrieves all previous relevant 3D information, thereby eliminating any need for optimization-based global alignment after local predictions.
What carries the argument
An external spatial memory module that stores prior 3D information and is queried by the transformer to produce the next frame's pointmap in global coordinates.
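To make the mechanism concrete, here is a minimal sketch of a key-value memory that is written once per frame and read with scaled dot-product attention. It is an illustration only: the class name `SpatialMemory`, its `write` and `read` methods, and the toy vectors are hypothetical, and nothing here is taken from the Spann3R implementation, whose memory encoding and readout are learned.

```python
import math


class SpatialMemory:
    """Toy key-value memory (hypothetical, not the paper's module)."""

    def __init__(self):
        self.keys = []    # one feature vector per stored entry
        self.values = []  # one value vector per stored entry

    def write(self, key, value):
        # Append one (key, value) pair; no pruning or compression is modeled.
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        # Scaled dot-product attention of a single query over all stored keys.
        if not self.keys:
            raise ValueError("memory is empty")
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)  # subtract the max score for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        # Return the attention-weighted average of the stored values.
        return [
            sum(w * v[i] for w, v in zip(weights, self.values)) / total
            for i in range(dim)
        ]


memory = SpatialMemory()
memory.write(key=[10.0, 0.0], value=[1.0, 2.0, 3.0])
memory.write(key=[0.0, 10.0], value=[-1.0, -2.0, -3.0])
fused = memory.read(query=[10.0, 0.0])  # dominated by the first entry
```

In this toy setup the query matches the first key far more strongly than the second, so the readout is essentially the first stored value. The load-bearing question for the paper is whether a learned version of such a readout stays consistent as the memory grows.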
Load-bearing premise
The external spatial memory can reliably retain and retrieve all relevant prior 3D information across arbitrary ordered or unordered image collections without drift or loss of consistency.
What would settle it
A clear accumulation of drift or inconsistent geometry when the model processes a long unordered image sequence would show that the spatial memory fails to maintain global consistency.
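One way to operationalize this check, sketched under stated assumptions: if ground-truth pointmaps are available in the same global frame and scale as the predictions, the mean point error per frame can be regressed against frame index, and a clearly positive slope would indicate accumulating drift. The helper names `frame_error` and `drift_slope` and the synthetic data are hypothetical. Real evaluations typically align prediction and ground truth first, which this sketch omits.

```python
import math


def frame_error(pred, gt):
    # Mean Euclidean distance between corresponding 3D points of one frame.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def drift_slope(errors):
    # Least-squares slope of per-frame error against frame index (needs >= 2 frames).
    n = len(errors)
    mean_x = (n - 1) / 2
    mean_y = sum(errors) / n
    cov = sum((i - mean_x) * (e - mean_y) for i, e in enumerate(errors))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Synthetic data: frame t is shifted by 0.01 * t along x to mimic linear drift.
gt_frames = [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)] for _ in range(5)]
pred_frames = [
    [(x + 0.01 * t, y, z) for (x, y, z) in frame]
    for t, frame in enumerate(gt_frames)
]
errors = [frame_error(p, g) for p, g in zip(pred_frames, gt_frames)]
slope = drift_slope(errors)  # error growth per frame
```

For unordered collections the frame index would be the processing order, so repeating the measurement over several random orderings would separate order sensitivity from drift.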
Original abstract
We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time. Project page: https://hengyiwang.github.io/projects/spanner
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Spann3R, an extension of the DUSt3R framework for dense 3D reconstruction from ordered or unordered image collections. It introduces an external spatial memory queried via a transformer to regress per-image pointmaps directly in a shared global coordinate system, thereby claiming to eliminate the need for optimization-based global alignment. The method reuses DUSt3R pre-trained weights, performs additional fine-tuning on dataset subsets, and asserts competitive performance, generalization to unseen data, and real-time processing for ordered sequences.
Significance. If the core claim holds—that the learned spatial memory reliably maintains global consistency without drift or explicit optimization—this would constitute a practical advance for real-time dense reconstruction pipelines in robotics and AR, where post-processing bundle adjustment is often a bottleneck. The reuse of pre-trained weights and focus on direct global regression could reduce computational overhead compared to traditional two-stage approaches.
major comments (2)
- [Abstract] Abstract: The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.
- [Method] Method (spatial memory description): The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.
minor comments (1)
- [Abstract] The project page URL is referenced but no supplementary material or code release is mentioned to support reproducibility of the fine-tuning procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will strengthen the presentation. The experiments in the full paper support the core claims regarding global consistency via learned spatial memory.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion of 'competitive performance and generalization ability' together with 'real-time' capability after fine-tuning is presented without any quantitative metrics, ablation studies, error analysis, or direct comparisons to DUSt3R plus optimization baselines; this absence leaves the central no-optimization claim without verifiable support.
  Authors: The abstract is intentionally concise and high-level, as is standard. The full manuscript provides quantitative support in the Experiments section, including direct comparisons to DUSt3R followed by optimization-based alignment, ablation studies on the memory module, and metrics demonstrating competitive accuracy, generalization to unseen datasets, and real-time inference on ordered sequences. To address the concern, we will revise the abstract to briefly reference key quantitative results (e.g., improved or comparable reconstruction metrics without post-processing).
  Revision: yes
- Referee: [Method] Method (spatial memory description): The architecture is described as a transformer query over an external memory without explicit cycle-consistency losses, global bundle-adjustment terms, or scale/pose regularization; for unordered inputs where processing order is arbitrary, this leaves open the risk that local regression errors propagate, directly undermining the claim that global alignment is eliminated.
  Authors: The spatial memory is trained end-to-end on large-scale data with diverse view orders, allowing the transformer to learn implicit consistency and drift correction without hand-crafted losses or regularization terms. Experiments on both ordered and unordered collections (including arbitrary processing orders) show that global alignment is maintained directly, outperforming or matching DUSt3R plus optimization in several metrics. We acknowledge the value of explicit discussion on error propagation and will add a clarifying paragraph in the Method section on how training mitigates this risk for unordered inputs.
  Revision: partial
Circularity Check
Minor self-citation present but central architectural claim remains independent
Full rationale
The paper extends DUSt3R by introducing an external spatial memory queried via transformer to regress global pointmaps directly. This is framed as a new learned mechanism rather than a mathematical re-expression or fit of DUSt3R's local per-pair outputs. Pre-trained weights are leveraged with additional fine-tuning on external data subsets, which is standard transfer learning and does not reduce the global prediction to prior fitted parameters by construction. No equations or self-citation chains are shown that force the no-optimization claim; the memory module is presented as an independent addition for consistency.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A transformer equipped with an external memory module can learn to maintain 3D consistency across image sequences without explicit pose estimation.
invented entities (1)
- External spatial memory: no independent evidence
Lean theorems connected to this paper
- Foundation/SimplicialLedger and Foundation/LedgerCanonicality: SimplicialLedger.eight_tick_uniqueness and LedgerCanonicality.HasMultilevelComposition (echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system."
- Foundation/DimensionForcing: alexander_duality_circle_linking and dimension_forced (echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment."
- Foundation/DiscretenessForcing: discreteness_forcing_principle (refines)
  refines: Relation between the paper passage and the cited Recognition theorem.
  "Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
  Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.
- GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
  GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
  STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...
- ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
  ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
- 3AM: 3egment Anything with Geometric Consistency in Videos
  3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
- π³: Permutation-Equivariant Visual Geometry Learning
  π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...
- Long-tail Internet photo reconstruction
  Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- HD-VGGT: High-Resolution Visual Geometry Transformer
  HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while mo...
- Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
  Co-Me distills a confidence predictor to selectively merge low-confidence tokens in visual geometric transformers, delivering up to 21.5x speedup on VGGT and 20.4x on Pi3 while preserving spatial coverage and performance.
- Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration
  Lumos3D enables pose-free single-forward restoration of low-light 3D scenes via cross-illumination distillation from a teacher network and a custom Lumos loss on 3D Gaussians.
- PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation
  PAGE-4D is a feedforward extension of VGGT that uses a dynamics-aware aggregator and mask to disentangle pose estimation from geometry reconstruction in videos with moving objects.
- Streaming 4D Visual Geometry Transformer
  A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.
- ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting
  ReefMapGS closes the loop between multimodal SLAM and 3D Gaussian Splatting to deliver COLMAP-free incremental 3D reconstruction and improved AUV trajectory estimates for underwater reef surveys up to 700 m.
- FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
  FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.
- TTT3R: 3D Reconstruction as Test-Time Training
  TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
- ViPE: Video Pose Engine for 3D Geometric Perception
  ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
- VGGT-SLAM++
  VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.