Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Dong Wu; Fei Xue; Hongbin Zha; Shunkai Zhou; Yuchen Deng; Zike Yan

arxiv: 2604.09480 · v1 · submitted 2026-04-10 · 💻 cs.CV

Shunkai Zhou, Zike Yan, Fei Xue, Dong Wu, Yuchen Deng, Hongbin Zha

Pith reviewed 2026-05-10 17:46 UTC · model grok-4.3

keywords: online learning · sequential 3D reconstruction · visual prompts · self-supervised learning · consistency constraints · geometry foundation model · test-time adaptation · 3D reconstruction

The pith

Injecting learnable visual prompts into a frozen geometry foundation model and training them online with local-global consistency constraints enables adaptive sequential 3D reconstruction without ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Online3R, a sequential reconstruction framework that adapts to new scenes by online learning of lightweight visual prompts added to a pretrained frozen geometry foundation model. This setup resolves inconsistency issues by using a local-global self-supervised learning strategy that enforces consistency constraints on predictions. Local constraints operate on intermediate and fused results to provide high-quality pseudo-ground-truth signals, while global constraints on sparse keyframes allow efficient learning over long trajectories. By keeping the foundation model frozen, the approach preserves its general geometry prediction capabilities while capturing new environment knowledge. Experiments indicate that this method outperforms previous state-of-the-art approaches on various benchmarks.

Core claim

Online3R's core claim is that lightweight learnable visual prompts, inserted into a pretrained and frozen geometry foundation model, can capture knowledge of new environments. The prompts are updated at test time with a local-global self-supervised strategy: local consistency is enforced on intermediate and previously fused results, and global consistency on sparse keyframes spanning long distances. According to the paper, this yields consistent sequential reconstruction that adapts to new scenes without any ground truth.

What carries the argument

Learnable lightweight visual prompts inserted into the frozen geometry foundation model and trained via local-global self-supervised consistency constraints.
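
To make the mechanism concrete, here is a minimal toy sketch of prompt tuning in plain Python. It is not the Online3R implementation: `FrozenBackbone`, `adapt_prompts`, the scalar tokens, the mean-pooling model, and the squared-error objective are all illustrative assumptions. It only shows the division of labor described above: the backbone's weights never change, and gradient steps touch only the extra prompt tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen=True: attribute assignment raises, mimicking a frozen model
class FrozenBackbone:
    """Hypothetical stand-in for a pretrained model (scalar tokens, affine head)."""
    weight: float = 2.0
    bias: float = 0.5

    def forward(self, tokens):
        # Mean-pool every token (prompts included), then apply the fixed affine map.
        pooled = sum(tokens) / len(tokens)
        return self.weight * pooled + self.bias


def adapt_prompts(backbone, image_tokens, pseudo_target, n_prompts=2, lr=0.1, steps=200):
    """Gradient descent on the prompt tokens only; the backbone is never modified."""
    prompts = [0.0] * n_prompts
    n_total = n_prompts + len(image_tokens)
    for _ in range(steps):
        residual = backbone.forward(prompts + image_tokens) - pseudo_target
        # d(residual^2)/d(prompt_i) = 2 * residual * weight / n_total under mean pooling.
        grad = 2.0 * residual * backbone.weight / n_total
        prompts = [p - lr * grad for p in prompts]
    return prompts


backbone = FrozenBackbone()
image_tokens = [1.0, 2.0, 3.0]          # toy input, illustrative values only
prompts = adapt_prompts(backbone, image_tokens, pseudo_target=3.5)
adapted = backbone.forward(prompts + image_tokens)
```

In this toy setup the prediction moves toward the pseudo-target while `backbone.weight` and `backbone.bias` stay at their initial values. That is the property the paper relies on to preserve the foundation model's general capability.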

If this is right

The framework can adapt to new scenes at test time without retraining the entire foundation model.
Local consistency constraints supply pseudo-ground-truth signals for effective prompt training.
Global consistency on sparse keyframes enables efficient learning over long trajectories.
The fundamental geometry prediction capability of the foundation model is preserved during adaptation.
Sequential reconstruction achieves better consistency than prior methods across multiple benchmarks.
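
The efficiency argument for sparse keyframes in the list above can be illustrated with a short sketch. The paper summary does not say how keyframes are chosen, so the distance-based rule below, the function names `select_keyframes` and `pair_count`, and the availability of camera positions are all assumptions made for illustration.

```python
import math


def select_keyframes(positions, min_gap):
    """Keep a frame once the camera has moved at least min_gap from the last keyframe."""
    keyframes = [0]
    for i in range(1, len(positions)):
        if math.dist(positions[i], positions[keyframes[-1]]) >= min_gap:
            keyframes.append(i)
    return keyframes


def pair_count(n):
    """Number of unordered frame pairs a pairwise consistency term would visit."""
    return n * (n - 1) // 2


# Toy camera track along the x axis (metres); values are illustrative only.
track = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.6, 0.0, 0.0),
         (1.1, 0.0, 0.0), (1.4, 0.0, 0.0), (2.2, 0.0, 0.0)]
keyframes = select_keyframes(track, min_gap=1.0)
dense_pairs = pair_count(len(track))
sparse_pairs = pair_count(len(keyframes))
```

Because the number of frame pairs grows quadratically with the number of frames, restricting a pairwise constraint to a handful of keyframes shrinks the work considerably, which is one plausible reading of why the global term is applied sparsely rather than per frame.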

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This prompt-based adaptation could extend to other frozen foundation models for test-time tuning in vision tasks.
Reliance on self-supervised consistency signals suggests potential to lower the need for labeled 3D datasets in reconstruction.
The local-global split might generalize to other sequential learning problems where full supervision is unavailable.
Real-time applications such as robotics navigation could benefit if the prompt updates prove fast enough on-device.

Load-bearing premise

The local consistency constraints on intermediate and previously fused results together with global constraints on sparse keyframes provide high-quality pseudo-ground-truth signals sufficient to train the visual prompts effectively in the absence of ground truth.
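
A small sketch may help show what "pseudo-ground-truth from fused results" can mean in practice. It assumes a confidence-weighted running average as the fusion rule and a squared-error loss; neither is confirmed by the summary above, and `fuse` and `local_consistency_loss` are hypothetical names.

```python
def fuse(fused_value, fused_weight, new_value, new_conf):
    """Confidence-weighted running average of one quantity (e.g. a point's depth).

    Assumes new_conf > 0 so the total weight is never zero.
    """
    total = fused_weight + new_conf
    return (fused_value * fused_weight + new_value * new_conf) / total, total


def local_consistency_loss(predictions, pseudo_targets):
    """Mean squared gap between fresh predictions and the fused pseudo-targets."""
    gaps = [(p - t) ** 2 for p, t in zip(predictions, pseudo_targets)]
    return sum(gaps) / len(gaps)


# Three (depth, confidence) observations of the same point; toy values only.
observations = [(2.0, 1.0), (2.2, 1.0), (1.8, 2.0)]
value, weight = 0.0, 0.0
for depth, conf in observations:
    value, weight = fuse(value, weight, depth, conf)

# A new prediction of 2.05 is penalised by its distance to the fused estimate.
loss = local_consistency_loss([2.05], [value])
```

Averaging of this kind suppresses noise that differs from frame to frame, but an error shared by every observation passes straight through the fusion. That is why the premise above is load-bearing.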

What would settle it

If online prompt updates produce no improvement in consistency metrics or reconstruction accuracy over the frozen model alone on a held-out sequential benchmark, the effectiveness of the learning strategy would be falsified.
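
The check described above amounts to a paired comparison between the frozen and the adapted model. The sketch below is one minimal way to phrase it, assuming a single scalar error per sequence where lower is better; the function names and the `min_gain` threshold are hypothetical, and a real evaluation would likely add a significance test.

```python
from statistics import fmean


def adaptation_gain(frozen_errors, adapted_errors):
    """Per-sequence error reduction (positive means the adapted model is better)."""
    if len(frozen_errors) != len(adapted_errors):
        raise ValueError("need one error per sequence for both models")
    return [f - a for f, a in zip(frozen_errors, adapted_errors)]


def is_falsified(frozen_errors, adapted_errors, min_gain=0.0):
    """True when the mean gain does not exceed min_gain, i.e. no measurable benefit."""
    return fmean(adaptation_gain(frozen_errors, adapted_errors)) <= min_gain
```

With identical error lists the mean gain is zero and `is_falsified` returns `True`; with adapted errors uniformly lower than frozen errors it returns `False`.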

Figures

Figures reproduced from arXiv: 2604.09480 by Dong Wu, Fei Xue, Hongbin Zha, Shunkai Zhou, Yuchen Deng, Zike Yan.

**Figure 1.** Overview of our proposed Online3R. The core of our Online3R lies in constructing self-supervised methods and online prompt tuning, enabling the model to adapt to the current scene and ensuring consistent reconstruction results. We leverage a local consistency loss derived from temporally fused geometry to enhance the accuracy of subsequent predictions, and a global consistency loss that enforces geometric …

**Figure 2.** Qualitative Comparison on 3D Reconstruction Consistency. We present reconstruction results on two sequences, with 7-Scenes-heads on the left and NRGBD-staircase on the right, separated by dashed lines. The first row shows the global point cloud reconstruction from a far viewpoint. The second row zooms in to view the near viewpoint. The third row highlights details using bounding boxes. It is evident that o…

**Figure 3.** Qualitative results for reconstruction of non-overlapping …

Original abstract

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Online3R adds lightweight prompts and local-global consistency to adapt a frozen geometry model online, but the approach risks locking in base-model biases without external checks.

Online3R adds learnable visual prompts to a frozen geometry foundation model and trains them online using local-global self-supervised consistency for sequential reconstruction. The new part is the specific setup: lightweight prompts that adapt to new scenes while keeping the base model intact, plus local consistency on intermediate fusions and global consistency on sparse keyframes to create pseudo-ground-truth without labels. This targets the inconsistency problem in online 3D reconstruction efficiently.

The paper's strength is its focus on test-time adaptation that is lightweight and self-supervised, which fits real-time robotics or AR settings where full retraining isn't feasible.

The main soft spot is the reliance on consistency alone. If the foundation model has systematic errors such as depth bias or scale drift in a new scene, enforcing consistency on its own predictions can amplify those errors rather than fix them. The abstract claims outperformance on benchmarks but gives no numbers, baselines, or ablations, so it's hard to tell whether the prompts improve absolute geometry or just reduce variance. The stress-test concern about locking in biases seems plausible based on what's described.

This paper is for computer vision people working on online mapping and reconstruction who already use foundation models. A reader interested in practical adaptations would get value from the framework, even if they need to check the experiments themselves. It deserves a serious referee because the idea is grounded in a real problem and the method is clearly laid out, though revisions would likely be needed for the evidence. I would recommend sending it to peer review with requests for quantitative results and an analysis showing that the adaptation corrects rather than reinforces errors.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Online3R, a sequential 3D reconstruction framework that performs online adaptation to new scenes by inserting a small set of learnable visual prompts into a frozen pretrained geometry foundation model. Adaptation is driven by a local-global self-supervised strategy: local consistency losses are applied to intermediate predictions and previously fused results to supply pseudo-ground-truth signals, while global consistency losses are applied only to sparse keyframes spanning long trajectories for efficiency. The central empirical claim is that this procedure yields more consistent reconstructions than prior state-of-the-art methods on standard benchmarks.

Significance. If the empirical claims are substantiated, the work would demonstrate a practical route for test-time adaptation of large geometry foundation models without ground-truth supervision, which is valuable for online robotics and AR applications. The explicit separation of local high-quality pseudo-labels from sparse global constraints is a clear design choice that addresses both signal quality and computational cost. No machine-checked proofs or fully reproducible code artifacts are described, but the project page reference indicates an intent to release implementation details.

major comments (2)

[Abstract / §4] Abstract and experimental evaluation section: the claim that “Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks” is presented without any quantitative metrics, listed baselines, ablation tables, or error analysis. Because the central contribution is an empirical improvement in consistency and accuracy, the absence of these data prevents verification of the data-to-claim link.
[§3.2] §3.2 (local-global self-supervised learning): the method assumes that enforcing consistency on the model’s own intermediate and fused predictions, together with sparse-keyframe global constraints, produces pseudo-ground-truth of sufficient quality to improve absolute geometric fidelity. No external anchor (e.g., comparison against available ground-truth depth or pose on held-out sequences) is reported to test whether systematic biases in the frozen foundation model are corrected rather than reinforced.
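
The second major comment can be made concrete with the standard decomposition of mean squared error into squared bias plus variance. The sketch below uses synthetic numbers (not results from the paper) and a hypothetical helper `error_decomposition` to show that predictions can become perfectly consistent across frames while the systematic offset from ground truth is untouched.

```python
from statistics import fmean, pvariance


def error_decomposition(predictions, truth):
    """Split mean squared error into squared bias plus variance across frames."""
    mean_pred = fmean(predictions)
    bias = mean_pred - truth
    variance = pvariance(predictions, mu=mean_pred)
    mse = fmean([(p - truth) ** 2 for p in predictions])
    return bias, variance, mse


truth = 2.0                  # synthetic ground-truth depth
before = [2.1, 2.5, 2.3]     # inconsistent and biased (synthetic)
after = [2.3, 2.3, 2.3]      # perfectly consistent, same bias (synthetic)

bias_b, var_b, mse_b = error_decomposition(before, truth)
bias_a, var_a, mse_a = error_decomposition(after, truth)
```

In this constructed case the variance drops to essentially zero after "adaptation" while the bias stays at about 0.3. A consistency metric alone would report success, which is why an external ground-truth anchor is needed to separate the two effects.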

minor comments (1)

[§3] The notation for the visual prompts and the exact form of the local and global consistency losses could be stated more explicitly (e.g., as equations) to aid reproducibility.
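
For illustration only, the kind of explicit statement this comment asks for might look like the following. Every symbol here is a hypothetical placeholder and the L1 form is an assumption; none of it is taken from the paper, whose actual losses may differ.

```latex
% Hypothetical notation, not taken from the paper:
%   \phi               learnable prompt parameters (foundation-model weights stay fixed)
%   X_t(p;\phi)        3D point predicted for pixel p in frame t
%   \bar{X}_{t-1}(p)   previously fused estimate of the same surface point
%   \Omega_t           pixels of frame t that overlap the fused geometry
%   \mathcal{K}        set of sparse keyframes
%   T_{ij}, \pi_{ij}   relative transform and pixel correspondence between keyframes i and j
%   \Omega_{ij}        pixels of keyframe i with a valid correspondence in keyframe j
%   \lambda            weight balancing the two terms
\mathcal{L}_{\text{local}}(\phi)
  = \frac{1}{|\Omega_t|} \sum_{p \in \Omega_t}
    \bigl\| X_t(p;\phi) - \bar{X}_{t-1}(p) \bigr\|_1

\mathcal{L}_{\text{global}}(\phi)
  = \sum_{\substack{i,j \in \mathcal{K} \\ i < j}}
    \frac{1}{|\Omega_{ij}|} \sum_{p \in \Omega_{ij}}
    \bigl\| X_i(p;\phi) - T_{ij}\, X_j(\pi_{ij}(p);\phi) \bigr\|_1

\mathcal{L}(\phi)
  = \mathcal{L}_{\text{local}}(\phi) + \lambda\, \mathcal{L}_{\text{global}}(\phi)
```

Writing the losses at this level of detail would let a reader see which quantities are detached as pseudo-targets and which carry gradients back to the prompts.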

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to better substantiate our claims.

Referee: [Abstract / §4] Abstract and experimental evaluation section: the claim that “Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks” is presented without any quantitative metrics, listed baselines, ablation tables, or error analysis. Because the central contribution is an empirical improvement in consistency and accuracy, the absence of these data prevents verification of the data-to-claim link.

Authors: We agree that the abstract would benefit from a more direct link to the quantitative evidence. Section 4 already contains the full set of metrics, baselines, ablation tables, and error analysis on standard benchmarks. In the revised manuscript we have updated the abstract to include a concise summary of key quantitative results (e.g., consistency and accuracy gains over listed baselines) while preserving brevity. This makes the empirical claim immediately verifiable from the abstract.
Revision: yes

Referee: [§3.2] §3.2 (local-global self-supervised learning): the method assumes that enforcing consistency on the model’s own intermediate and fused predictions, together with sparse-keyframe global constraints, produces pseudo-ground-truth of sufficient quality to improve absolute geometric fidelity. No external anchor (e.g., comparison against available ground-truth depth or pose on held-out sequences) is reported to test whether systematic biases in the frozen foundation model are corrected rather than reinforced.

Authors: This is a valid concern about validating the pseudo-ground-truth signals. Our final benchmark evaluations (which use ground-truth depth and poses) already show net improvements in geometric fidelity, but we did not report an explicit side-by-side comparison of the intermediate pseudo-labels against held-out ground truth to isolate bias correction. We will add this analysis in the revision, using sequences with available ground truth, to confirm that the local-global consistency constraints reduce rather than reinforce systematic errors from the frozen model.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper introduces learnable visual prompts into a frozen foundation model and trains them at test time via local-global consistency losses that generate pseudo-GT from agreement among the model's own intermediate and keyframe predictions. This self-supervised mechanism does not reduce any claimed result to a fitted quantity by construction, nor does it invoke self-citations, uniqueness theorems, or ansatzes from prior author work as load-bearing justification. The central assertion of improved sequential reconstruction is supported by external benchmark comparisons rather than tautological redefinition of inputs as outputs. The approach is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a frozen foundation model retains usable geometry prediction capability while lightweight prompts can be adapted via self-supervised consistency signals; no free parameters beyond the prompts themselves are stated, and no new physical entities are postulated.

free parameters (1)

learnable visual prompts
Lightweight parameters inserted into the frozen model and updated at test time to capture new-scene knowledge.

axioms (1)

Domain assumption: The pretrained geometry foundation model possesses fundamental geometry prediction capability that remains intact when frozen.
Invoked when the method freezes the base model and relies on it to generate pseudo-ground-truth signals.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "We introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model... local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction · relation: unclear
Paper passage: "The local consistency constraints are conducted on intermediate and previously local fused results... global consistency constraints are operated on sparse keyframes spanning long distances"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Neural rgb-d surface reconstruction

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InIEEE Conf. Comput. Vis. Pattern Recog.,

work page

[2] [2]

G ´omez Rodr´ıguez, Jos´e M

Carlos Campos, Richard Elvira, Juan J. G ´omez Rodr´ıguez, Jos´e M. M. Montiel, and Juan D. Tard´os. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM.IEEE Trans. Robot., 2021. 6

work page 2021

[3] [3]

Jan Czarnowski, Tristan Laidlow, Ronald Clark, and An- drew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM.IEEE Robot. Autom. Lett., 2020. 6

work page 2020

[4] [4]

Mast3r-sfm: A fully-integrated solution for unconstrained structure-from-motion

Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinza- epfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: A fully-integrated solution for unconstrained structure-from-motion. InIEEE Int. Conf. 3D Vision, 2025. 1, 2

work page 2025

[5] [5]

Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015

Yasutaka Furukawa, Carlos Hern ´andez, et al. Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015. 2

work page 2015

[6] [6]

Mas- sively parallel multiview stereopsis by surface normal diffu- sion

Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Mas- sively parallel multiview stereopsis by surface normal diffu- sion. InProceedings of the IEEE international conference on computer vision, pages 873–881, 2015. 2

work page 2015

[7] [7]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 2

work page 2022

[8] [8]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEur. Conf. Comput. Vis., 2022. 1, 2, 3

work page 2022

[9] [9]

Mapanything: Universal feed-forward metric 3d re- construction.International Conference on 3D Vision (3DV),

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d re- construction.International Conference on 3D Vision (3DV),

work page

[10] [10]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEur. Conf. Comput. Vis., 2024. 1, 2, 3, 4, 6, 7, 8

work page 2024

[11] [11]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. 2

work page 2021

[12] [12]

Deep patch visual SLAM

Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. InEur. Conf. Comput. Vis., 2024. 6

work page 2024

[13] [13]

Slam3r: Real- time dense scene reconstruction from monocular rgb videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yan- chao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real- time dense scene reconstruction from monocular rgb videos. InCVPR, 2025. 1

work page 2025

[14] [14]

Lora3d: Low-rank self-calibration of 3d geo- metric foundation models.ICLR, 2025

Ziqi Lu et al. Lora3d: Low-rank self-calibration of 3d geo- metric foundation models.ICLR, 2025. 2

work page 2025

[15] [15]

Vggt- slam: Dense rgb slam optimized on the sl (4) manifold.Adv

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt- slam: Dense rgb slam optimized on the sl (4) manifold.Adv. Neural Inform. Process. Syst., 2025. 1, 2

work page 2025

[16] [16]

Riku Murai, Eric Dexheimer, and Andrew J. Davison. Mast3r- slam: Real-time dense slam with 3d reconstruction priors. In IEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2, 3, 4, 6, 7, 8

work page 2025

[17] [17]

A critique of structure-from-motion algorithms

John Oliensis. A critique of structure-from-motion algorithms. Computer Vision and Image Understanding, 80(2):172–214,

work page

[18] [18]

A survey of structure from motion*.Acta Numerica, 26:305–364, 2017

Onur ¨Ozyes ¸il, Vladislav V oroninski, Ronen Basri, and Amit Singer. A survey of structure from motion*.Acta Numerica, 26:305–364, 2017. 2

work page 2017

[19] [19]

AdapterHub: A framework for adapting transformers

Jonas Pfeiffer, Andreas R¨uckl´e, Clifton Poth, Aishwarya Ka- math, Ivan Vuli ´c, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. AdapterHub: A framework for adapting transformers. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. 2

work page 2020

[20] [20]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InIEEE Conf. Comput. Vis. Pattern Recog., 2016. 1, 2

work page 2016

[21]

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 2

[22]

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conf. Comput. Vis. Pattern Recog., 2013. 6, 7, 8

[23]

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. 6

[24]

Zachary Teed and Jia Deng. DeepV2D: Video to depth with differentiable structure from motion. In Int. Conf. Learn. Represent., 2020. 6

[25]

Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In Adv. Neural Inform. Process. Syst., 2021. 6

[26]

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Learning single-image calibration with geometric optimization. In Eur. Conf. Comput. Vis., 2024. 6

[27]

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In IEEE Int. Conf. 3D Vision, 2025. 1, 2, 6, 7

[28]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In IEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2

[29]

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In IEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2, 6, 7

[30]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In IEEE Conf. Comput. Vis. Pattern Recog.,

[31]

Yuesong Wang, Zhaojie Zeng, Tao Guan, Wei Yang, Zhuo Chen, Wenkai Liu, Luoyuan Xu, and Yawei Luo. Adaptive patch deformation for textureless-resilient multi-view stereo. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1621–1630,

[32]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning. ICLR, 2026. 1, 2

[33]

Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. Adv. Neural Inform. Process. Syst., 2025. 1, 6, 7

[34]

Yuheng Yuan, Qiuhong Shen, Shizun Wang, Xingyi Yang, and Xinchao Wang. Test3r: Learning to reconstruct 3d at test time. Adv. Neural Inform. Process. Syst., 2025. 1, 2, 3, 4, 7

[35]

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. Int. Conf. Learn. Represent., 2025. 7

[36]

Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. GO-SLAM: Global optimization for consistent 3D instant reconstruction. In Int. Conf. Comput. Vis., 2023. 6