
arxiv: 2604.09480 · v1 · submitted 2026-04-10 · 💻 cs.CV

Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

Pith reviewed 2026-05-10 17:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords online learning · sequential 3D reconstruction · visual prompts · self-supervised learning · consistency constraints · geometry foundation model · test-time adaptation · 3D reconstruction

The pith

Injecting learnable visual prompts into a frozen geometry foundation model and training them online with local-global consistency constraints enables adaptive sequential 3D reconstruction without ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Online3R, a sequential reconstruction framework that adapts to new scenes by online learning of lightweight visual prompts added to a pretrained frozen geometry foundation model. This setup resolves inconsistency issues by using a local-global self-supervised learning strategy that enforces consistency constraints on predictions. Local constraints operate on intermediate and fused results to provide high-quality pseudo-ground-truth signals, while global constraints on sparse keyframes allow efficient learning over long trajectories. By keeping the foundation model frozen, the approach preserves its general geometry prediction capabilities while capturing new environment knowledge. Experiments indicate that this method outperforms previous state-of-the-art approaches on various benchmarks.

Core claim

The paper claims that lightweight learnable visual prompts, inserted into a pretrained and frozen geometry foundation model, can capture knowledge of new environments. The prompts are updated at test time through a local-global self-supervised strategy: local consistency is enforced on intermediate and previously fused results, and global consistency is enforced on sparse keyframes spanning long distances. The result is consistent sequential reconstruction that adapts to new scenes without any ground truth.

What carries the argument

Learnable lightweight visual prompts inserted into the frozen geometry foundation model and trained via local-global self-supervised consistency constraints.
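
The mechanism can be illustrated with a toy sketch, assuming a tiny linear "backbone" in place of the real geometry foundation model. Every name here (`FROZEN_W`, `forward`, `prompt_step`) is hypothetical and the target is a stand-in for a pseudo-label; this is not the paper's architecture. It only shows the defining property of prompt tuning: gradients reach the prompt tokens while the backbone weights never change.

```python
# Toy sketch of prompt tuning (hypothetical names, not the paper's model):
# the backbone weights stay frozen and only the prompt tokens are updated.

FROZEN_W = (0.5, -1.0, 2.0)  # frozen "backbone" weights (a tuple, so immutable)


def forward(tokens, prompts):
    """Mean-pool frame tokens together with prompt tokens, then project."""
    seq = list(tokens) + list(prompts)
    dim = len(FROZEN_W)
    pooled = [sum(t[d] for t in seq) / len(seq) for d in range(dim)]
    return sum(w * x for w, x in zip(FROZEN_W, pooled))


def prompt_step(tokens, prompts, target, lr=0.5):
    """One gradient step on the squared error, applied to the prompts only."""
    n = len(tokens) + len(prompts)
    err = forward(tokens, prompts) - target
    # d(loss)/d(prompt[k][d]) = 2 * err * FROZEN_W[d] / n
    return [[p[d] - lr * 2.0 * err * FROZEN_W[d] / n for d in range(len(p))]
            for p in prompts]


tokens = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # stand-in for frame features
prompts = [[0.0, 0.0, 0.0]]                  # one learnable prompt token
target = 1.0                                 # stand-in for a pseudo-label

loss_before = (forward(tokens, prompts) - target) ** 2
for _ in range(50):
    prompts = prompt_step(tokens, prompts, target)
loss_after = (forward(tokens, prompts) - target) ** 2
```

In this sketch the loss shrinks because the prompt token shifts the pooled feature, which is the only degree of freedom left once the backbone is fixed.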

If this is right

  • The framework can adapt to new scenes at test time without retraining the entire foundation model.
  • Local consistency constraints supply pseudo-ground-truth signals for effective prompt training.
  • Global consistency on sparse keyframes enables efficient learning over long trajectories.
  • The fundamental geometry prediction capability of the foundation model is preserved during adaptation.
  • Sequential reconstruction achieves better consistency than prior methods across multiple benchmarks.
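
A rough sketch of the two signals named in the list above, under stated assumptions: the fusion rule (a confidence-weighted running average of per-pixel depth), the L1 loss, and the fixed-stride keyframe selection are illustrative choices, not details taken from the paper. The function names are hypothetical.

```python
def fuse(prev_depth, prev_weight, new_depth, new_conf):
    """Confidence-weighted running average of per-pixel depth (assumed rule)."""
    fused, weights = [], []
    for d0, w0, d1, w1 in zip(prev_depth, prev_weight, new_depth, new_conf):
        w = w0 + w1
        fused.append((w0 * d0 + w1 * d1) / w if w > 0 else d1)
        weights.append(w)
    return fused, weights


def local_consistency_loss(pred_depth, fused_depth):
    """Mean absolute difference between a new prediction and the fused target."""
    return sum(abs(p - f) for p, f in zip(pred_depth, fused_depth)) / len(pred_depth)


def select_keyframes(num_frames, stride):
    """Pick every `stride`-th frame, always including the last one."""
    keys = list(range(0, num_frames, stride))
    if num_frames > 0 and keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


# Two pixels: the fused map acts as the pseudo-ground-truth for the new frame.
fused, weights = fuse([2.0, 4.0], [1.0, 1.0], [3.0, 4.0], [1.0, 3.0])
loss = local_consistency_loss([3.0, 4.0], fused)  # fused is [2.5, 4.0], loss is 0.25
keys = select_keyframes(10, 4)                    # [0, 4, 8, 9]
```

Applying the global term only to the selected keyframes keeps its cost proportional to the number of keyframes rather than the number of frames, which is the efficiency argument the list makes.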

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This prompt-based adaptation could extend to other frozen foundation models for test-time tuning in vision tasks.
  • Reliance on self-supervised consistency signals suggests potential to lower the need for labeled 3D datasets in reconstruction.
  • The local-global split might generalize to other sequential learning problems where full supervision is unavailable.
  • Real-time applications such as robotics navigation could benefit if the prompt updates prove fast enough on-device.

Load-bearing premise

The local consistency constraints on intermediate and previously fused results together with global constraints on sparse keyframes provide high-quality pseudo-ground-truth signals sufficient to train the visual prompts effectively in the absence of ground truth.
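
One way to write this premise as an objective, purely as an illustration: the symbols below are assumptions and do not come from the paper. Here $\theta$ denotes the frozen model parameters, $p$ the learnable prompts, $f_{\theta}(I_t; p)$ the geometry predicted for frame $I_t$, $\tilde{X}_t$ the locally fused pseudo-target, $\mathcal{K}$ a set of keyframe pairs, $T_{ij}$ an alignment between keyframes $i$ and $j$, and $\lambda$ a weighting factor.

```latex
\begin{aligned}
\mathcal{L}(p) &= \mathcal{L}_{\text{local}}(p) + \lambda\,\mathcal{L}_{\text{global}}(p),\\
\mathcal{L}_{\text{local}}(p) &= \sum_{t} \bigl\lVert f_{\theta}(I_t; p) - \tilde{X}_{t} \bigr\rVert_{1},\\
\mathcal{L}_{\text{global}}(p) &= \sum_{(i,j)\in\mathcal{K}} \bigl\lVert f_{\theta}(I_i; p) - T_{ij}\, f_{\theta}(I_j; p) \bigr\rVert_{1},\\
p^{\star} &= \arg\min_{p}\,\mathcal{L}(p) \quad \text{with } \theta \text{ held fixed}.
\end{aligned}
```

Written this way, the premise amounts to saying that $\tilde{X}_t$ and the keyframe agreement term are close enough to the true geometry that minimizing $\mathcal{L}$ over $p$ also reduces the true error.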

What would settle it

If the online prompt updates yield no improvement in consistency metrics or reconstruction accuracy over the frozen model alone on a held-out sequential benchmark, the learning strategy's effectiveness would be falsified.
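
The test described here can be sketched as a paired comparison of per-sequence errors. The function name `adaptation_helps`, the `min_gain` threshold, and the numbers in the example are all hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def adaptation_helps(frozen_errors, adapted_errors, min_gain=0.0):
    """Return (helps, mean_gain) for paired per-sequence errors (lower is better)."""
    if len(frozen_errors) != len(adapted_errors) or not frozen_errors:
        raise ValueError("need paired, non-empty error lists")
    gains = [f - a for f, a in zip(frozen_errors, adapted_errors)]
    gain = mean(gains)
    return gain > min_gain, gain


# Placeholder numbers for three held-out sequences (illustration only).
frozen = [0.10, 0.20, 0.30]
adapted = [0.08, 0.20, 0.25]
helps, gain = adaptation_helps(frozen, adapted)
```

A result of `helps == False` on a held-out benchmark, with `min_gain` set to whatever margin counts as meaningful, would correspond to the falsifying outcome.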

Figures

Figures reproduced from arXiv: 2604.09480 by Dong Wu, Fei Xue, Hongbin Zha, Shunkai Zhou, Yuchen Deng, Zike Yan.

Figure 1: Overview of our proposed Online3R. The core of our Online3R lies in constructing self-supervised methods and online prompt tuning, enabling the model to adapt to the current scene and ensuring consistent reconstruction results. We leverage a local consistency loss derived from temporally fused geometry to enhance the accuracy of subsequent predictions, and a global consistency loss that enforces geometric …
Figure 2: Qualitative Comparison on 3D Reconstruction Consistency. We present reconstruction results on two sequences, with 7-Scenes-heads on the left and NRGBD-staircase on the right, separated by dashed lines. The first row shows the global point cloud reconstruction from a far viewpoint. The second row zooms in to view the near viewpoint. The third row highlights details using bounding boxes. It is evident that o…
Figure 3: Qualitative results for reconstruction of non-overlapping …

Original abstract

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: https://shunkaizhou.github.io/online3r-1.0/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Online3R, a sequential 3D reconstruction framework that performs online adaptation to new scenes by inserting a small set of learnable visual prompts into a frozen pretrained geometry foundation model. Adaptation is driven by a local-global self-supervised strategy: local consistency losses are applied to intermediate predictions and previously fused results to supply pseudo-ground-truth signals, while global consistency losses are applied only to sparse keyframes spanning long trajectories for efficiency. The central empirical claim is that this procedure yields more consistent reconstructions than prior state-of-the-art methods on standard benchmarks.

Significance. If the empirical claims are substantiated, the work would demonstrate a practical route for test-time adaptation of large geometry foundation models without ground-truth supervision, which is valuable for online robotics and AR applications. The explicit separation of local high-quality pseudo-labels from sparse global constraints is a clear design choice that addresses both signal quality and computational cost. No machine-checked proofs or fully reproducible code artifacts are described, but the project page reference indicates an intent to release implementation details.

major comments (2)
  1. [Abstract / §4] Abstract and experimental evaluation section: the claim that “Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks” is presented without any quantitative metrics, listed baselines, ablation tables, or error analysis. Because the central contribution is an empirical improvement in consistency and accuracy, the absence of these data prevents verification of the data-to-claim link.
  2. [§3.2] §3.2 (local-global self-supervised learning): the method assumes that enforcing consistency on the model’s own intermediate and fused predictions, together with sparse-keyframe global constraints, produces pseudo-ground-truth of sufficient quality to improve absolute geometric fidelity. No external anchor (e.g., comparison against available ground-truth depth or pose on held-out sequences) is reported to test whether systematic biases in the frozen foundation model are corrected rather than reinforced.
minor comments (1)
  1. [§3] The notation for the visual prompts and the exact form of the local and global consistency losses could be stated more explicitly (e.g., as equations) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to better substantiate our claims.

  1. Referee: [Abstract / §4] Abstract and experimental evaluation section: the claim that “Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks” is presented without any quantitative metrics, listed baselines, ablation tables, or error analysis. Because the central contribution is an empirical improvement in consistency and accuracy, the absence of these data prevents verification of the data-to-claim link.

    Authors: We agree that the abstract would benefit from a more direct link to the quantitative evidence. Section 4 already contains the full set of metrics, baselines, ablation tables, and error analysis on standard benchmarks. In the revised manuscript we have updated the abstract to include a concise summary of key quantitative results (e.g., consistency and accuracy gains over listed baselines) while preserving brevity. This makes the empirical claim immediately verifiable from the abstract.
    revision: yes

  2. Referee: [§3.2] §3.2 (local-global self-supervised learning): the method assumes that enforcing consistency on the model’s own intermediate and fused predictions, together with sparse-keyframe global constraints, produces pseudo-ground-truth of sufficient quality to improve absolute geometric fidelity. No external anchor (e.g., comparison against available ground-truth depth or pose on held-out sequences) is reported to test whether systematic biases in the frozen foundation model are corrected rather than reinforced.

    Authors: This is a valid concern about validating the pseudo-ground-truth signals. Our final benchmark evaluations (which use ground-truth depth and poses) already show net improvements in geometric fidelity, but we did not report an explicit side-by-side comparison of the intermediate pseudo-labels against held-out ground truth to isolate bias correction. We will add this analysis in the revision, using sequences with available ground truth, to confirm that the local-global consistency constraints reduce rather than reinforce systematic errors from the frozen model.
    revision: yes
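
    The side-by-side analysis promised in this response could take roughly the following shape. This is a minimal sketch assuming per-pixel depth lists and using mean signed error as the bias measure; the function names and the example values are hypothetical and not taken from the paper.

```python
def signed_bias(pred, gt):
    """Mean signed error; a nonzero value indicates a systematic offset."""
    return sum(p - g for p, g in zip(pred, gt)) / len(gt)


def bias_reinforced(frozen_pred, pseudo_label, gt):
    """True if the pseudo-labels are more biased than the frozen predictions."""
    return abs(signed_bias(pseudo_label, gt)) > abs(signed_bias(frozen_pred, gt))


# Placeholder depths (illustration only): the frozen model overestimates by 0.5,
# the pseudo-labels by 0.25, so the bias is reduced rather than reinforced.
gt = [1.0, 2.0, 3.0]
frozen_pred = [1.5, 2.5, 3.5]
pseudo_label = [1.25, 2.25, 3.25]
reinforced = bias_reinforced(frozen_pred, pseudo_label, gt)
```

    A signed measure is used here because an unsigned error can fall while a consistent over- or under-estimate stays in place, which is the failure mode the referee raises.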

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper introduces learnable visual prompts into a frozen foundation model and trains them at test time via local-global consistency losses that generate pseudo-ground-truth from agreement among the model's own intermediate and keyframe predictions. This self-supervised mechanism does not reduce any claimed result to a fitted quantity by construction, nor does it invoke self-citations, uniqueness theorems, or ansatzes from prior author work as load-bearing justification. The central assertion of improved sequential reconstruction is supported by external benchmark comparisons rather than tautological redefinition of inputs as outputs. The approach is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a frozen foundation model retains usable geometry prediction capability while lightweight prompts can be adapted via self-supervised consistency signals; no free parameters beyond the prompts themselves are stated, and no new physical entities are postulated.

free parameters (1)
  • learnable visual prompts
    Lightweight parameters inserted into the frozen model and updated at test time to capture new-scene knowledge.
axioms (1)
  • domain assumption: The pretrained geometry foundation model possesses fundamental geometry prediction capability that remains intact when frozen.
    Invoked when the method freezes the base model and relies on it to generate pseudo-ground-truth signals.

pith-pipeline@v0.9.0 · 5497 in / 1318 out tokens · 43739 ms · 2026-05-10T17:46:25.055232+00:00 · methodology


