pith. sign in

arxiv: 2606.30047 · v1 · pith:OPSDY4WMnew · submitted 2026-06-29 · 💻 cs.CV

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

Pith reviewed 2026-06-30 06:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords panoramic 3D reconstructionmetric reconstructionindoor scenescamera pose estimationdepth estimationpoint cloud reconstructionfeed-forward networkcovisibility
0
0 comments X

The pith

Argus reconstructs metric 3D indoor scenes from sparse unordered panoramic images by anchoring to an optimal reference view and decomposing pixel-to-world mappings under joint constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Realsee3D, a hybrid dataset of 10K indoor scenes with precise metric labels, to train feed-forward models on panoramic data. Argus uses a learned covisibility module to pick the best reference view and avoid global pose drift, while breaking the 3D mapping into supervised sub-steps that share cross-coordinate constraints. These choices produce state-of-the-art results on camera pose, depth, and point-cloud tasks within the benchmark's sparse capture setting. A sympathetic reader cares because reliable metric output from casual panoramic photos could support indoor mapping and navigation without dense sensor rigs.

Core claim

Argus is a feed-forward network for metric panoramic 3D reconstruction trained on the Realsee3D dataset. In the sparse unordered capture setting, a learned covisibility module selects the geometrically optimal reference view to anchor the metric world frame. The bidirectional pixel-to-world mapping is decomposed into interpretable sub-steps that receive per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction.

What carries the argument

The learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame, together with the decomposition of pixel-to-world mapping into per-step supervised sub-tasks under joint constraints.

If this is right

  • Metric camera pose estimation becomes feasible for unordered panoramic sequences without external anchors.
  • Depth maps and point clouds inherit improved global consistency from the shared coordinate constraints.
  • Feed-forward inference replaces iterative optimization for reconstruction in the sparse panoramic regime.
  • Multi-task training benefits from explicit geometric consistency across pose, depth, and reconstruction heads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The covisibility selection mechanism could be tested on other sparse multi-view inputs such as smartphone photo bursts to check transfer beyond panoramas.
  • The hybrid real-synthetic construction of Realsee3D points to a scalable route for obtaining metric ground truth when purely real labeled data remains limited.
  • If the decomposed mapping generalizes, similar step-wise supervision might stabilize training for other geometric prediction tasks that mix 2D and 3D outputs.

Load-bearing premise

The Realsee3D dataset supplies precise metric annotations that are representative of real-world sparse unordered panoramic captures without systematic bias from its synthetic scenes.

What would settle it

Running Argus without retraining on a fresh collection of real panoramic sequences captured in indoor environments outside the Realsee3D distribution and measuring whether the reported gains in metric pose accuracy and depth error persist.

Figures

Figures reproduced from arXiv: 2606.30047 by Cihui Pan, Kai Zhang, Linyuan Li, Tong Rao, Xi Li, Xinchen Hui, Yan Wu.

Figure 1
Figure 1. Figure 1: Argus is a feed-forward 3D reconstruction network trained on our large-scale indoor panoramic 3D dataset Realsee3D. Given acquired sparse multi-view panoramic images, it rapidly reconstructs complete, consistent, and metric-scale 3D scenes. Abstract. Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGB￾D training data. We present Reals… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Argus. We first select the optimal reference frame via a Covis￾ibility Transformer, then aggregate multi-view geometric features through a reference￾based Geometry Transformer, followed by multiple prediction branches for intermediate geometric decomposition representations that facilitate multi-task learning. lay the foundation for indoor scene understanding with high-quality real-world panora… view at source ↗
Figure 3
Figure 3. Figure 3: Pixel-to-World Bidirectional Transformations for a Single View. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative Comparison. (IRLS) alignment [47], median alignment, and absolute metric evaluation with￾out any alignment. Results are given in Tab. 3. Argus outperforms all baselines in all metrics. We further evaluate Argus on monocular depth estimation for Realsee3D, and test its zero-shot generalization on two standard benchmarks: Matterport3D [12] and Stanford2D3D [5]. Although not specifically trained f… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of the Effectiveness of Reference View Learning. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visual Samples with Generalization Ability. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Examples of Realsee3D. A More Dataset Details A.1 Dataset Characteristics Realsee3D is distinguished by several key features: – Panoramic Format & Validity Masks: Visual data is provided as high￾resolution 1600 × 800 ERP panoramic images. While synthetic scenes offer a complete 360◦ × 180◦ field of view, real-world captures inherently contain blind spots at the zenith and nadir due to hardware constraints.… view at source ↗
Figure 8
Figure 8. Figure 8: Depth Distribution of Realsee3D [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of Masks Computed from Depth and Pose for Covis [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Zero-Shot Reconstruction on Unseen Datasets. [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of the Covisibility Adjacency Matrix, Ground-Truth [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Verification of Covisibility Score Permutation Equivariance. [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Perspective and Panoramic 3D Reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Failure Cases of Argus. extremely narrow field of view, leading to poor computational efficiency. Miss￾ing point clouds from these cropped regions can be easily complemented using adjacent views or simple plane fitting. E.5 Limitations Due to the scarcity of outdoor datasets, our model exhibits suboptimal perfor￾mance in outdoor and aerial scenarios, and this limitation will be progressively mitigated wit… view at source ↗
read the original abstract

Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGB-D training data. We present Realsee3D, a hybrid dataset of 10K indoor scenes (1K real, 9K synthetic) with 299K panoramic viewpoints and precise metric annotations, and Argus, a feed-forward network trained on it for metric panoramic 3D reconstruction. In the sparse unordered capture setting of Realsee3D, a poorly chosen coordinate anchor can cause global pose drift. Argus addresses this with a learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame. To further improve multi-task learning, we decompose the bidirectional pixel-to-world mapping into interpretable sub-steps with per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction. Project page: https://argus-paper.realsee.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Realsee3D, a hybrid dataset of 10K indoor scenes (9K synthetic + 1K real) containing 299K panoramic viewpoints with precise metric annotations, and proposes Argus, a feed-forward network for metric panoramic 3D reconstruction from sparse unordered captures. Argus uses a learned covisibility module to select an optimal reference view as coordinate anchor and decomposes the pixel-to-world mapping into supervised sub-steps with cross-coordinate constraints. The central claim is that Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction on the Realsee3D benchmark.

Significance. If the empirical claims hold under independent real-only evaluation, the work would provide a useful large-scale hybrid dataset and a practical feed-forward approach to metric reconstruction from unordered panoramas, addressing the noted lack of training data and coordinate drift issues in this setting. The interpretable decomposition and per-step supervision could aid multi-task consistency, though the significance is tempered by the absence of quantitative numbers, error bars, or ablation details in the abstract.

major comments (2)
  1. [Dataset and Experiments sections] Dataset section (and benchmark evaluation): The Realsee3D dataset uses a 9:1 synthetic-to-real ratio. The SOTA claims for pose, depth, and reconstruction on the full benchmark are load-bearing for the paper's contribution, yet no real-only metrics, balanced test splits, or explicit comparison of synthetic vs. real performance are described. This leaves open whether gains exploit synthetic regularities (perfect geometry, calibration) rather than generalizing to real sparse captures with sensor noise and drift, directly undermining the claim of applicability to real-world unordered panoramic data.
  2. [Abstract and Experiments] Abstract and Experiments: The abstract states SOTA results without any quantitative numbers, error bars, ablation studies, or baseline comparisons. Without these details, it is impossible to assess whether the reported gains are robust or sensitive to post-hoc choices in the covisibility module or reference-view selection.
minor comments (1)
  1. [Abstract] The project page URL is given but no quantitative results or supplementary material links are referenced in the abstract; adding these would improve verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of evaluation and presentation. We address each major point below and will revise the manuscript accordingly to strengthen the claims regarding real-world applicability and clarity of results.

read point-by-point responses
  1. Referee: [Dataset and Experiments sections] Dataset section (and benchmark evaluation): The Realsee3D dataset uses a 9:1 synthetic-to-real ratio. The SOTA claims for pose, depth, and reconstruction on the full benchmark are load-bearing for the paper's contribution, yet no real-only metrics, balanced test splits, or explicit comparison of synthetic vs. real performance are described. This leaves open whether gains exploit synthetic regularities (perfect geometry, calibration) rather than generalizing to real sparse captures with sensor noise and drift, directly undermining the claim of applicability to real-world unordered panoramic data.

    Authors: We agree that explicit real-only evaluation is necessary to substantiate generalization claims for real sparse captures. The hybrid design of Realsee3D prioritizes scale while incorporating 1K real scenes, but the current manuscript reports only aggregate benchmark results. In the revision, we will add a new subsection with real-only metrics on the 1K real scenes (including pose, depth, and reconstruction errors), balanced test splits separating synthetic and real, and direct synthetic-vs-real performance comparisons. This will clarify whether the covisibility module and decomposed supervision generalize beyond synthetic regularities. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and Experiments: The abstract states SOTA results without any quantitative numbers, error bars, ablation studies, or baseline comparisons. Without these details, it is impossible to assess whether the reported gains are robust or sensitive to post-hoc choices in the covisibility module or reference-view selection.

    Authors: We acknowledge that the abstract's brevity makes it difficult to gauge the magnitude and robustness of the SOTA claims. While full quantitative results, error bars, and ablations appear in the Experiments section, we will revise the abstract to include key numerical results (e.g., relative improvements in camera pose error and depth accuracy on Realsee3D) and a brief note on the covisibility module's contribution. Full ablation tables and sensitivity analysis to reference-view selection will remain in the main text but will be cross-referenced more explicitly from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims on new benchmark are self-contained

full rationale

The paper introduces Realsee3D (hybrid synthetic/real dataset) and trains Argus on it, reporting direct empirical metrics for pose, depth, and reconstruction. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted parameters or self-citations; the covisibility module and per-step supervision are architectural choices evaluated externally on the benchmark splits. The central claims rest on held-out test performance rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence and quality of the Realsee3D dataset annotations plus the effectiveness of the learned covisibility and decomposed supervision modules. No explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.1-grok · 5744 in / 1305 out tokens · 22584 ms · 2026-06-30T06:08:00.157671+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

107 extracted references · 23 canonical work pages · 14 internal anchors

  1. [1]

    In: ICCV (2009)

    Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: ICCV (2009)

  2. [2]

    ai et al

    Ai,H.,Cao,Z.,Wang,L.:Asurveyofrepresentationlearning,optimizationstrate- gies, and applications for omnidirectional vision: H. ai et al. IJCV133(8), 4973– 5012 (2025)

  3. [3]

    In: IROS

    Alama, O., Bhattacharya, A., He, H., Kim, S., Qiu, Y., Wang, W., Ho, C., Keetha, N., Scherer, S.: Rayfronts: Open-set semantic ray frontiers for online scene under- standing and exploration. In: IROS. pp. 5930–5937 (2025)

  4. [4]

    In: ECCV

    Antequera, M.L., Gargallo, P., Hofinger, M., Bulo, S.R., Kuang, Y., Kontschieder, P.: Mapillary planet-scale depth dataset. In: ECCV. pp. 589–604 (2020)

  5. [5]

    Joint 2D-3D-Semantic Data for Indoor Scene Understanding

    Armeni,I.,Sax,S.,Zamir,A.R.,Savarese,S.:Joint2d-3d-semanticdataforindoor scene understanding. arXiv preprint arXiv:1702.01105 (2017)

  6. [6]

    In: CVPR

    Barath, D., Matas, J.: Graph-cut ransac. In: CVPR. pp. 6733–6741 (2018)

  7. [7]

    In: Sensor fusion IV: control paradigms and data structures

    Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992)

  8. [8]

    In: CVPR

    Bian, J., Lin, W.Y., Matsushita, Y., Yeung, S.K., Nguyen, T.D., Cheng, M.M.: Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence. In: CVPR. pp. 4181–4190 (2017)

  9. [9]

    Virtual KITTI 2

    Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

  10. [10]

    IEEE Transactions on Robotics (2021)

    Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb- slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics (2021)

  11. [11]

    In: CVPR (2025)

    Cao, Z., Zhu, J., Zhang, W., Ai, H., Bai, H., Zhao, H., Wang, L.: Panda: To- wards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. In: CVPR (2025)

  12. [12]

    In: 3DV (2017)

    Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor envi- ronments. In: 3DV (2017)

  13. [13]

    TTT3R: 3D Reconstruction as Test-Time Training

    Chen, X., Chen, Y., Xiu, Y., Geiger, A., Chen, A.: Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645 (2025)

  14. [14]

    In: CVPR

    Cruz, S., Hutchcroft, W., Li, Y., Khosravan, N., Boyadzhiev, I., Kang, S.B.: Zil- low indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts. In: CVPR. pp. 2133–2143 (2021)

  15. [15]

    In: CVPR (2017)

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan- net: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017)

  16. [16]

    ToG36(4), 1 (2017)

    Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., Theobalt, C.: Bundlefusion: Real- time globally consistent 3d reconstruction using on-the-fly surface reintegration. ToG36(4), 1 (2017)

  17. [17]

    In: NeurIPS (2022)

    Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory- efficient exact attention with io-awareness. In: NeurIPS (2022)

  18. [18]

    Vision Transformers Need Registers

    Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need reg- isters. arXiv preprint arXiv:2309.16588 (2023)

  19. [19]

    In: CVPR

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: CVPR. pp. 13142–13153 (2023)

  20. [20]

    arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

    Deng, J., Li, H., Xie, T., Ren, W., Zhang, Q., Tan, P., Guo, X.: Sail-recon: Large sfm by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

  21. [21]

    In: CVPRW (2018)

    DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: CVPRW (2018)

  22. [22]

    Numerische Math- ematik1(1), 269–271 (1959)

    Dijkstra, E.: A note on two problems in connexion with graphs. Numerische Math- ematik1(1), 269–271 (1959)

  23. [23]

    In: 2025 International Conference on 3D Vision (3DV)

    Duisterhof, B.P., Zust, L., Weinzaepfel, P., Leroy, V., Cabon, Y., Revaud, J.: Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In: 2025 International Conference on 3D Vision (3DV). pp. 1–10 (2025)

  24. [24]

    In: CVPR (2023)

    Edstedt,J.,Athanasiadis,I.,Wadenbäck,M.,Felsberg,M.:Dkm:Densekernelized feature matching for geometry estimation. In: CVPR (2023)

  25. [25]

    In: CVPR (2024)

    Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: Roma: Robust dense feature matching. In: CVPR (2024)

  26. [26]

    Driv3R: Learn- ing dense 4d reconstruction for autonomous driving.arXiv preprint arXiv:2412.06777, 2024

    Fei, X., Zheng, W., Duan, Y., Zhan, W., Tomizuka, M., Keutzer, K., Lu, J.: Driv3r: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777 (2024)

  27. [27]

    arXiv preprint arXiv:2509.21302 (2025)

    Feng, W., Qin, H., Wu, M., Yang, C., Li, Y., Li, X., An, Z., Huang, L., Zhang, Y., Magno, M., et al.: Quantized visual geometry grounded transformer. arXiv preprint arXiv:2509.21302 (2025)

  28. [28]

    Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

    Furukawa, Y., Hernández, C.: Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

  29. [29]

    2402–2409 (2006)

    Goesele,M.,Curless,B.,Seitz,S.M.:Multi-viewstereorevisited.In:CVPR.vol.2, pp. 2402–2409 (2006)

  30. [30]

    In: CVPR

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: CVPR. pp. 3749–3761 (2022)

  31. [31]

    In: CVPR (2025)

    Guo, Y., Garg, S., Miangoleh, S.M.H., Huang, X., Ren, L.: Depth any camera: Zero-shot metric depth estimation from any camera. In: CVPR (2025)

  32. [32]

    Hartley,R.,Zisserman,A.:Multipleviewgeometryincomputervision.Cambridge university press (2003)

  33. [33]

    In: CVPR (2021)

    Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T.: Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In: CVPR (2021)

  34. [34]

    In: CVPR (2024)

    He, X., Sun, J., Wang, Y., Peng, S., Huang, Q., Bao, H., Zhou, X.: Detector-free structure from motion. In: CVPR (2024)

  35. [35]

    arXiv preprint arXiv:2312.08782 (2023)

    Hu, Y., Xie, Q., Jain, V., Francis, J., Patrikar, J., Keetha, N., Kim, S., Xie, Y., Zhang, T., Fang, H.S., et al.: Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)

  36. [36]

    In: CVPR

    Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learning multi-view stereopsis. In: CVPR. pp. 2821–2830 (2018)

  37. [37]

    In: CVPR

    Jang, W., Weinzaepfel, P., Leroy, V., Agapito, L., Revaud, J.: Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In: CVPR. pp. 1071–1081 (2025)

  38. [38]

    arXiv preprint arXiv:2512.22819 (2025)

    Jiang, H., Song, Z., Lou, Z., Xu, R., Tan, M.: Depth anything in360circ: Towards scale invariance in the wild. arXiv preprint arXiv:2512.22819 (2025)

  39. [39]

    TOG44(6), 1–16 (2025)

    Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. TOG44(6), 1–16 (2025)

  40. [40]

    In: ACM SIGGRAPH (2024)

    Jiang, Y., Yu, C., Xie, T., Li, X., Feng, Y., Wang, H., Li, M., Lau, H., Gao, F., Yang, Y., Jiang, C.: Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality. In: ACM SIGGRAPH (2024)

  41. [41]

    In: ICCV (2025) 18 Xi Li et al

    Jung, D., Choi, J., Lee, Y., Manocha, D.: Im360: Large-scale indoor mapping with 360 cameras. In: ICCV (2025) 18 Xi Li et al

  42. [42]

    In: ICCV

    Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In: ICCV. pp. 6013–6022 (2025)

  43. [43]

    In: ECCV

    Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: ECCV. pp. 18–35 (2024)

  44. [44]

    In: CVPR

    Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J.: Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In: CVPR. pp. 21357–21366 (2024)

  45. [45]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

  46. [46]

    TOG (2023)

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG (2023)

  47. [47]

    In: NeurIPS

    Kümmerle, C., Verdun, C.M., Stöger, D.: Iteratively reweighted least squares for basis pursuit with global linear convergence rate. In: NeurIPS. pp. 2873–2886 (2021)

  48. [48]

    arXiv preprint arXiv:2509.26618 (2025)

    Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025)

  49. [49]

    In: ECCV

    Li, X.: Multi-view canonical pose 3d human body reconstruction based on volu- metric tsdf. In: ECCV. pp. 392–397 (2022)

  50. [50]

    In: ICCV

    Li, X., Rao, T., Pan, C.: Edm: Efficient deep feature matching. In: ICCV. pp. 26198–26208 (2025)

  51. [51]

    In: CVPR (2018)

    Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from inter- net photos. In: CVPR (2018)

  52. [52]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  53. [53]

    arXiv preprint arXiv:2512.16913 (2025)

    Lin, X., Song, M., Zhang, D., Lu, W., Li, H., Du, B., Yang, M.H., Nguyen, T., Qi, L.: Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913 (2025)

  54. [54]

    In: ICCV (2023)

    Lindenberger, P., Sarlin, P.E., Pollefeys, M.: Lightglue: Local feature matching at light speed. In: ICCV (2023)

  55. [55]

    In: CVPR

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: CVPR. pp. 22160–22169 (2024)

  56. [56]

    arXiv preprint arXiv:2510.10726 (2025)

    Liu, Y., Min, Z., Wang, Z., Wu, J., Wang, T., Yuan, Y., Luo, Y., Guo, C.: Worldmirror: Universal 3d world reconstruction with any-prior prompting. arXiv preprint arXiv:2510.10726 (2025)

  57. [57]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  58. [58]

    IJCV (2004)

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)

  59. [59]

    In: ICCV

    Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X.: Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In: ICCV. pp. 6851–6860 (2019)

  60. [60]

    Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

  61. [61]

    In: International Workshop on Reproducible Research in Pattern Recog- nition

    Moulon,P.,Monasse,P.,Perrot,R.,Marlet,R.: Openmvg: Openmultipleviewge- ometry. In: International Workshop on Reproducible Research in Pattern Recog- nition. pp. 60–74. Springer (2016)

  62. [62]

    IEEE Transactions on Robotics (2015)

    Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics (2015)

  63. [63]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,etal.:Dinov2:Learningrobust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  64. [64]

    In: ICCV

    Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for ego- centric 3d machine perception. In: ICCV. pp. 20133–20143 (2023)

  65. [65]

    In: CVPR (2025)

    Piccinelli, L., Sakaridis, C., Segu, M., Yang, Y.H., Li, S., Abbeloos, W., Van Gool, L.: Unik3d: Universal camera monocular 3d estimation. In: CVPR (2025)

  66. [66]

    In: ICCV (2021)

    Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

  67. [67]

    In: ICCV

    Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In: ICCV. pp. 10901–10911 (2021)

  68. [68]

    In: ICCV

    Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV. pp. 10912–10922 (2021)

  69. [69]

    In: ICCV (2011)

    Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: ICCV (2011)

  70. [70]

    In: CVPR (2020)

    Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. In: CVPR (2020)

  71. [71]

    In: CVPR (2016)

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

  72. [72]

    In: ECCV

    Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518 (2016)

  73. [73]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

  74. [74]

    In: CVPR

    Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR. vol. 1, pp. 519–528 (2006)

  75. [75]

    FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    Shen, Y., Zhang, Z., Qu, Y., Cao, L.: Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560 (2025)

  76. [76]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)

  77. [77]

    In: CVPR (2021)

    Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: CVPR (2021)

  78. [78]

    In: CVPR

    Sun, W., Jiang, W., Trulls, E., Tagliasacchi, A., Yi, K.M.: Acne: Attentive context normalization for robust permutation-equivariant learning. In: CVPR. pp. 11286– 11295 (2020)

  79. [79]

    NeurIPS34, 251–266 (2021)

    Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. NeurIPS34, 251–266 (2021)

  80. [80]

    In: CVPR (2018) 20 Xi Li et al

    Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A.: Inloc: Indoor visual localization with dense matching and view synthesis. In: CVPR (2018) 20 Xi Li et al

Showing first 80 references.