Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

Cihui Pan; Kai Zhang; Linyuan Li; Tong Rao; Xi Li; Xinchen Hui; Yan Wu

arxiv: 2606.30047 · v1 · pith:OPSDY4WMnew · submitted 2026-06-29 · 💻 cs.CV

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

Xi Li , Linyuan Li , Yan Wu , Tong Rao , Kai Zhang , Xinchen Hui , Cihui Pan This is my paper

Pith reviewed 2026-06-30 06:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords panoramic 3D reconstructionmetric reconstructionindoor scenescamera pose estimationdepth estimationpoint cloud reconstructionfeed-forward networkcovisibility

0 comments

The pith

Argus reconstructs metric 3D indoor scenes from sparse unordered panoramic images by anchoring to an optimal reference view and decomposing pixel-to-world mappings under joint constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Realsee3D, a hybrid dataset of 10K indoor scenes with precise metric labels, to train feed-forward models on panoramic data. Argus uses a learned covisibility module to pick the best reference view and avoid global pose drift, while breaking the 3D mapping into supervised sub-steps that share cross-coordinate constraints. These choices produce state-of-the-art results on camera pose, depth, and point-cloud tasks within the benchmark's sparse capture setting. A sympathetic reader cares because reliable metric output from casual panoramic photos could support indoor mapping and navigation without dense sensor rigs.

Core claim

Argus is a feed-forward network for metric panoramic 3D reconstruction trained on the Realsee3D dataset. In the sparse unordered capture setting, a learned covisibility module selects the geometrically optimal reference view to anchor the metric world frame. The bidirectional pixel-to-world mapping is decomposed into interpretable sub-steps that receive per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction.

What carries the argument

The learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame, together with the decomposition of pixel-to-world mapping into per-step supervised sub-tasks under joint constraints.

If this is right

Metric camera pose estimation becomes feasible for unordered panoramic sequences without external anchors.
Depth maps and point clouds inherit improved global consistency from the shared coordinate constraints.
Feed-forward inference replaces iterative optimization for reconstruction in the sparse panoramic regime.
Multi-task training benefits from explicit geometric consistency across pose, depth, and reconstruction heads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The covisibility selection mechanism could be tested on other sparse multi-view inputs such as smartphone photo bursts to check transfer beyond panoramas.
The hybrid real-synthetic construction of Realsee3D points to a scalable route for obtaining metric ground truth when purely real labeled data remains limited.
If the decomposed mapping generalizes, similar step-wise supervision might stabilize training for other geometric prediction tasks that mix 2D and 3D outputs.

Load-bearing premise

The Realsee3D dataset supplies precise metric annotations that are representative of real-world sparse unordered panoramic captures without systematic bias from its synthetic scenes.

What would settle it

Running Argus without retraining on a fresh collection of real panoramic sequences captured in indoor environments outside the Realsee3D distribution and measuring whether the reported gains in metric pose accuracy and depth error persist.

Figures

Figures reproduced from arXiv: 2606.30047 by Cihui Pan, Kai Zhang, Linyuan Li, Tong Rao, Xi Li, Xinchen Hui, Yan Wu.

**Figure 1.** Figure 1: Argus is a feed-forward 3D reconstruction network trained on our large-scale indoor panoramic 3D dataset Realsee3D. Given acquired sparse multi-view panoramic images, it rapidly reconstructs complete, consistent, and metric-scale 3D scenes. Abstract. Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGBD training data. We present Reals… view at source ↗

**Figure 2.** Figure 2: Overview of Argus. We first select the optimal reference frame via a Covisibility Transformer, then aggregate multi-view geometric features through a referencebased Geometry Transformer, followed by multiple prediction branches for intermediate geometric decomposition representations that facilitate multi-task learning. lay the foundation for indoor scene understanding with high-quality real-world panora… view at source ↗

**Figure 3.** Figure 3: Pixel-to-World Bidirectional Transformations for a Single View. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative Comparison. (IRLS) alignment [47], median alignment, and absolute metric evaluation without any alignment. Results are given in Tab. 3. Argus outperforms all baselines in all metrics. We further evaluate Argus on monocular depth estimation for Realsee3D, and test its zero-shot generalization on two standard benchmarks: Matterport3D [12] and Stanford2D3D [5]. Although not specifically trained f… view at source ↗

**Figure 5.** Figure 5: Visualization of the Effectiveness of Reference View Learning. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Visual Samples with Generalization Ability. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of Realsee3D. A More Dataset Details A.1 Dataset Characteristics Realsee3D is distinguished by several key features: – Panoramic Format & Validity Masks: Visual data is provided as highresolution 1600 × 800 ERP panoramic images. While synthetic scenes offer a complete 360◦ × 180◦ field of view, real-world captures inherently contain blind spots at the zenith and nadir due to hardware constraints.… view at source ↗

**Figure 8.** Figure 8: Depth Distribution of Realsee3D [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization of Masks Computed from Depth and Pose for Covis [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Zero-Shot Reconstruction on Unseen Datasets. [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

**Figure 11.** Figure 11: Visualization of the Covisibility Adjacency Matrix, Ground-Truth [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗

**Figure 12.** Figure 12: Verification of Covisibility Score Permutation Equivariance. [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗

**Figure 13.** Figure 13: Perspective and Panoramic 3D Reconstruction. [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 14.** Figure 14: Failure Cases of Argus. extremely narrow field of view, leading to poor computational efficiency. Missing point clouds from these cropped regions can be easily complemented using adjacent views or simple plane fitting. E.5 Limitations Due to the scarcity of outdoor datasets, our model exhibits suboptimal performance in outdoor and aerial scenarios, and this limitation will be progressively mitigated wit… view at source ↗

read the original abstract

Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGB-D training data. We present Realsee3D, a hybrid dataset of 10K indoor scenes (1K real, 9K synthetic) with 299K panoramic viewpoints and precise metric annotations, and Argus, a feed-forward network trained on it for metric panoramic 3D reconstruction. In the sparse unordered capture setting of Realsee3D, a poorly chosen coordinate anchor can cause global pose drift. Argus addresses this with a learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame. To further improve multi-task learning, we decompose the bidirectional pixel-to-world mapping into interpretable sub-steps with per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction. Project page: https://argus-paper.realsee.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Argus adds a new hybrid panoramic dataset and a covisibility anchor trick, but the 9:1 synthetic split leaves the real-world metric gains unproven.

read the letter

Argus and Realsee3D supply a new 10K-scene hybrid dataset (1K real, 9K synthetic) plus a feed-forward network that picks a good reference view via learned covisibility and breaks the pixel-to-world mapping into supervised sub-steps with cross-coordinate constraints.

The anchor selection piece directly targets the drift problem the authors flag for sparse unordered panoramas. Decomposing the bidirectional mapping and adding joint constraints is a straightforward way to keep pose, depth, and reconstruction outputs consistent during training. Those choices are concrete and tied to the stated limitations of prior monocular or multi-view work.

The soft spot is the data mix. With nine synthetic scenes per real one, any SOTA numbers on the combined benchmark could reflect how cleanly the model fits rendered geometry and perfect labels rather than handling real sensor noise, calibration drift, or sparse capture patterns. The abstract claims state-of-the-art results on pose, depth, and point clouds but gives no numbers, error bars, or real-only splits, so the transfer story stays untested until the full experiments are checked.

This paper is for people already working on indoor panoramic or 360-degree reconstruction who need more metric training data and a practical handle on coordinate anchoring. It is a narrow but usable increment rather than a broad shift.

Send it for peer review. The dataset is new and the technical modules address a real operational issue, so referees can verify the real-data results and ask for clearer ablations on the synthetic versus real portions.

Referee Report

2 major / 1 minor

Summary. The paper introduces Realsee3D, a hybrid dataset of 10K indoor scenes (9K synthetic + 1K real) containing 299K panoramic viewpoints with precise metric annotations, and proposes Argus, a feed-forward network for metric panoramic 3D reconstruction from sparse unordered captures. Argus uses a learned covisibility module to select an optimal reference view as coordinate anchor and decomposes the pixel-to-world mapping into supervised sub-steps with cross-coordinate constraints. The central claim is that Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction on the Realsee3D benchmark.

Significance. If the empirical claims hold under independent real-only evaluation, the work would provide a useful large-scale hybrid dataset and a practical feed-forward approach to metric reconstruction from unordered panoramas, addressing the noted lack of training data and coordinate drift issues in this setting. The interpretable decomposition and per-step supervision could aid multi-task consistency, though the significance is tempered by the absence of quantitative numbers, error bars, or ablation details in the abstract.

major comments (2)

[Dataset and Experiments sections] Dataset section (and benchmark evaluation): The Realsee3D dataset uses a 9:1 synthetic-to-real ratio. The SOTA claims for pose, depth, and reconstruction on the full benchmark are load-bearing for the paper's contribution, yet no real-only metrics, balanced test splits, or explicit comparison of synthetic vs. real performance are described. This leaves open whether gains exploit synthetic regularities (perfect geometry, calibration) rather than generalizing to real sparse captures with sensor noise and drift, directly undermining the claim of applicability to real-world unordered panoramic data.
[Abstract and Experiments] Abstract and Experiments: The abstract states SOTA results without any quantitative numbers, error bars, ablation studies, or baseline comparisons. Without these details, it is impossible to assess whether the reported gains are robust or sensitive to post-hoc choices in the covisibility module or reference-view selection.

minor comments (1)

[Abstract] The project page URL is given but no quantitative results or supplementary material links are referenced in the abstract; adding these would improve verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of evaluation and presentation. We address each major point below and will revise the manuscript accordingly to strengthen the claims regarding real-world applicability and clarity of results.

read point-by-point responses

Referee: [Dataset and Experiments sections] Dataset section (and benchmark evaluation): The Realsee3D dataset uses a 9:1 synthetic-to-real ratio. The SOTA claims for pose, depth, and reconstruction on the full benchmark are load-bearing for the paper's contribution, yet no real-only metrics, balanced test splits, or explicit comparison of synthetic vs. real performance are described. This leaves open whether gains exploit synthetic regularities (perfect geometry, calibration) rather than generalizing to real sparse captures with sensor noise and drift, directly undermining the claim of applicability to real-world unordered panoramic data.

Authors: We agree that explicit real-only evaluation is necessary to substantiate generalization claims for real sparse captures. The hybrid design of Realsee3D prioritizes scale while incorporating 1K real scenes, but the current manuscript reports only aggregate benchmark results. In the revision, we will add a new subsection with real-only metrics on the 1K real scenes (including pose, depth, and reconstruction errors), balanced test splits separating synthetic and real, and direct synthetic-vs-real performance comparisons. This will clarify whether the covisibility module and decomposed supervision generalize beyond synthetic regularities. revision: yes
Referee: [Abstract and Experiments] Abstract and Experiments: The abstract states SOTA results without any quantitative numbers, error bars, ablation studies, or baseline comparisons. Without these details, it is impossible to assess whether the reported gains are robust or sensitive to post-hoc choices in the covisibility module or reference-view selection.

Authors: We acknowledge that the abstract's brevity makes it difficult to gauge the magnitude and robustness of the SOTA claims. While full quantitative results, error bars, and ablations appear in the Experiments section, we will revise the abstract to include key numerical results (e.g., relative improvements in camera pose error and depth accuracy on Realsee3D) and a brief note on the covisibility module's contribution. Full ablation tables and sensitivity analysis to reference-view selection will remain in the main text but will be cross-referenced more explicitly from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims on new benchmark are self-contained

full rationale

The paper introduces Realsee3D (hybrid synthetic/real dataset) and trains Argus on it, reporting direct empirical metrics for pose, depth, and reconstruction. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted parameters or self-citations; the covisibility module and per-step supervision are architectural choices evaluated externally on the benchmark splits. The central claims rest on held-out test performance rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence and quality of the Realsee3D dataset annotations plus the effectiveness of the learned covisibility and decomposed supervision modules. No explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.1-grok · 5744 in / 1305 out tokens · 22584 ms · 2026-06-30T06:08:00.157671+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

107 extracted references · 23 canonical work pages · 14 internal anchors

[1]

In: ICCV (2009)

Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: ICCV (2009)

2009
[2]

ai et al

Ai,H.,Cao,Z.,Wang,L.:Asurveyofrepresentationlearning,optimizationstrate- gies, and applications for omnidirectional vision: H. ai et al. IJCV133(8), 4973– 5012 (2025)

2025
[3]

In: IROS

Alama, O., Bhattacharya, A., He, H., Kim, S., Qiu, Y., Wang, W., Ho, C., Keetha, N., Scherer, S.: Rayfronts: Open-set semantic ray frontiers for online scene under- standing and exploration. In: IROS. pp. 5930–5937 (2025)

2025
[4]

In: ECCV

Antequera, M.L., Gargallo, P., Hofinger, M., Bulo, S.R., Kuang, Y., Kontschieder, P.: Mapillary planet-scale depth dataset. In: ECCV. pp. 589–604 (2020)

2020
[5]

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Armeni,I.,Sax,S.,Zamir,A.R.,Savarese,S.:Joint2d-3d-semanticdataforindoor scene understanding. arXiv preprint arXiv:1702.01105 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

In: CVPR

Barath, D., Matas, J.: Graph-cut ransac. In: CVPR. pp. 6733–6741 (2018)

2018
[7]

In: Sensor fusion IV: control paradigms and data structures

Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992)

1992
[8]

In: CVPR

Bian, J., Lin, W.Y., Matsushita, Y., Yeung, S.K., Nguyen, T.D., Cheng, M.M.: Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence. In: CVPR. pp. 4181–4190 (2017)

2017
[9]

Virtual KITTI 2

Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[10]

IEEE Transactions on Robotics (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb- slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics (2021)

2021
[11]

In: CVPR (2025)

Cao, Z., Zhu, J., Zhang, W., Ai, H., Bai, H., Zhao, H., Wang, L.: Panda: To- wards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. In: CVPR (2025)

2025
[12]

In: 3DV (2017)

Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor envi- ronments. In: 3DV (2017)

2017
[13]

TTT3R: 3D Reconstruction as Test-Time Training

Chen, X., Chen, Y., Xiu, Y., Geiger, A., Chen, A.: Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

In: CVPR

Cruz, S., Hutchcroft, W., Li, Y., Khosravan, N., Boyadzhiev, I., Kang, S.B.: Zil- low indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts. In: CVPR. pp. 2133–2143 (2021)

2021
[15]

In: CVPR (2017)

Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan- net: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017)

2017
[16]

ToG36(4), 1 (2017)

Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., Theobalt, C.: Bundlefusion: Real- time globally consistent 3d reconstruction using on-the-fly surface reintegration. ToG36(4), 1 (2017)

2017
[17]

In: NeurIPS (2022)

Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory- efficient exact attention with io-awareness. In: NeurIPS (2022)

2022
[18]

Vision Transformers Need Registers

Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need reg- isters. arXiv preprint arXiv:2309.16588 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

In: CVPR

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: CVPR. pp. 13142–13153 (2023)

2023
[20]

arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

Deng, J., Li, H., Xie, T., Ren, W., Zhang, Q., Tan, P., Guo, X.: Sail-recon: Large sfm by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

work page arXiv 2025
[21]

In: CVPRW (2018)

DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: CVPRW (2018)

2018
[22]

Numerische Math- ematik1(1), 269–271 (1959)

Dijkstra, E.: A note on two problems in connexion with graphs. Numerische Math- ematik1(1), 269–271 (1959)

1959
[23]

In: 2025 International Conference on 3D Vision (3DV)

Duisterhof, B.P., Zust, L., Weinzaepfel, P., Leroy, V., Cabon, Y., Revaud, J.: Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In: 2025 International Conference on 3D Vision (3DV). pp. 1–10 (2025)

2025
[24]

In: CVPR (2023)

Edstedt,J.,Athanasiadis,I.,Wadenbäck,M.,Felsberg,M.:Dkm:Densekernelized feature matching for geometry estimation. In: CVPR (2023)

2023
[25]

In: CVPR (2024)

Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: Roma: Robust dense feature matching. In: CVPR (2024)

2024
[26]

Driv3R: Learn- ing dense 4d reconstruction for autonomous driving.arXiv preprint arXiv:2412.06777, 2024

Fei, X., Zheng, W., Duan, Y., Zhan, W., Tomizuka, M., Keutzer, K., Lu, J.: Driv3r: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777 (2024)

work page arXiv 2024
[27]

arXiv preprint arXiv:2509.21302 (2025)

Feng, W., Qin, H., Wu, M., Yang, C., Li, Y., Li, X., An, Z., Huang, L., Zhang, Y., Magno, M., et al.: Quantized visual geometry grounded transformer. arXiv preprint arXiv:2509.21302 (2025)

work page arXiv 2025
[28]

Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

Furukawa, Y., Hernández, C.: Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

2015
[29]

2402–2409 (2006)

Goesele,M.,Curless,B.,Seitz,S.M.:Multi-viewstereorevisited.In:CVPR.vol.2, pp. 2402–2409 (2006)

2006
[30]

In: CVPR

Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: CVPR. pp. 3749–3761 (2022)

2022
[31]

In: CVPR (2025)

Guo, Y., Garg, S., Miangoleh, S.M.H., Huang, X., Ren, L.: Depth any camera: Zero-shot metric depth estimation from any camera. In: CVPR (2025)

2025
[32]

Hartley,R.,Zisserman,A.:Multipleviewgeometryincomputervision.Cambridge university press (2003)

2003
[33]

In: CVPR (2021)

Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T.: Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In: CVPR (2021)

2021
[34]

In: CVPR (2024)

He, X., Sun, J., Wang, Y., Peng, S., Huang, Q., Bao, H., Zhou, X.: Detector-free structure from motion. In: CVPR (2024)

2024
[35]

Toward general-purpose robots via foundation models: A survey and meta-analysis, 2024

Hu, Y., Xie, Q., Jain, V., Francis, J., Patrikar, J., Keetha, N., Kim, S., Xie, Y., Zhang, T., Fang, H.S., et al.: Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)

work page arXiv 2023
[36]

In: CVPR

Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learning multi-view stereopsis. In: CVPR. pp. 2821–2830 (2018)

2018
[37]

In: CVPR

Jang, W., Weinzaepfel, P., Leroy, V., Agapito, L., Revaud, J.: Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In: CVPR. pp. 1071–1081 (2025)

2025
[38]

arXiv preprint arXiv:2512.22819 (2025)

Jiang, H., Song, Z., Lou, Z., Xu, R., Tan, M.: Depth anything in360circ: Towards scale invariance in the wild. arXiv preprint arXiv:2512.22819 (2025)

work page arXiv 2025
[39]

TOG44(6), 1–16 (2025)

Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. TOG44(6), 1–16 (2025)

2025
[40]

In: ACM SIGGRAPH (2024)

Jiang, Y., Yu, C., Xie, T., Li, X., Feng, Y., Wang, H., Li, M., Lau, H., Gao, F., Yang, Y., Jiang, C.: Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality. In: ACM SIGGRAPH (2024)

2024
[41]

In: ICCV (2025) 18 Xi Li et al

Jung, D., Choi, J., Lee, Y., Manocha, D.: Im360: Large-scale indoor mapping with 360 cameras. In: ICCV (2025) 18 Xi Li et al

2025
[42]

In: ICCV

Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In: ICCV. pp. 6013–6022 (2025)

2025
[43]

In: ECCV

Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: ECCV. pp. 18–35 (2024)

2024
[44]

In: CVPR

Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J.: Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In: CVPR. pp. 21357–21366 (2024)

2024
[45]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

TOG (2023)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG (2023)

2023
[47]

In: NeurIPS

Kümmerle, C., Verdun, C.M., Stöger, D.: Iteratively reweighted least squares for basis pursuit with global linear convergence rate. In: NeurIPS. pp. 2873–2886 (2021)

2021
[48]

arXiv preprint arXiv:2509.26618 (2025)

Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025)

work page arXiv 2025
[49]

In: ECCV

Li, X.: Multi-view canonical pose 3d human body reconstruction based on volu- metric tsdf. In: ECCV. pp. 392–397 (2022)

2022
[50]

In: ICCV

Li, X., Rao, T., Pan, C.: Edm: Efficient deep feature matching. In: ICCV. pp. 26198–26208 (2025)

2025
[51]

In: CVPR (2018)

Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from inter- net photos. In: CVPR (2018)

2018
[52]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

arXiv preprint arXiv:2512.16913 (2025)

Lin, X., Song, M., Zhang, D., Lu, W., Li, H., Du, B., Yang, M.H., Nguyen, T., Qi, L.: Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913 (2025)

work page arXiv 2025
[54]

In: ICCV (2023)

Lindenberger, P., Sarlin, P.E., Pollefeys, M.: Lightglue: Local feature matching at light speed. In: ICCV (2023)

2023
[55]

In: CVPR

Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: CVPR. pp. 22160–22169 (2024)

2024
[56]

arXiv preprint arXiv:2510.10726 (2025)

Liu, Y., Min, Z., Wang, Z., Wu, J., Wang, T., Yuan, Y., Luo, Y., Guo, C.: Worldmirror: Universal 3d world reconstruction with any-prior prompting. arXiv preprint arXiv:2510.10726 (2025)

work page arXiv 2025
[57]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[58]

IJCV (2004)

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)

2004
[59]

In: ICCV

Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X.: Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In: ICCV. pp. 6851–6860 (2019)

2019
[60]

Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

2021
[61]

In: International Workshop on Reproducible Research in Pattern Recog- nition

Moulon,P.,Monasse,P.,Perrot,R.,Marlet,R.: Openmvg: Openmultipleviewge- ometry. In: International Workshop on Reproducible Research in Pattern Recog- nition. pp. 60–74. Springer (2016)

2016
[62]

IEEE Transactions on Robotics (2015)

Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics (2015)

2015
[63]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,etal.:Dinov2:Learningrobust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

In: ICCV

Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for ego- centric 3d machine perception. In: ICCV. pp. 20133–20143 (2023)

2023
[65]

In: CVPR (2025)

Piccinelli, L., Sakaridis, C., Segu, M., Yang, Y.H., Li, S., Abbeloos, W., Van Gool, L.: Unik3d: Universal camera monocular 3d estimation. In: CVPR (2025)

2025
[66]

In: ICCV (2021)

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

2021
[67]

In: ICCV

Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In: ICCV. pp. 10901–10911 (2021)

2021
[68]

In: ICCV

Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV. pp. 10912–10922 (2021)

2021
[69]

In: ICCV (2011)

Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: ICCV (2011)

2011
[70]

In: CVPR (2020)

Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. In: CVPR (2020)

2020
[71]

In: CVPR (2016)

Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

2016
[72]

In: ECCV

Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518 (2016)

2016
[73]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

In: CVPR

Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR. vol. 1, pp. 519–528 (2006)

2006
[75]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

Shen, Y., Zhang, Z., Qu, Y., Cao, L.: Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906
[77]

In: CVPR (2021)

Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: CVPR (2021)

2021
[78]

In: CVPR

Sun, W., Jiang, W., Trulls, E., Tagliasacchi, A., Yi, K.M.: Acne: Attentive context normalization for robust permutation-equivariant learning. In: CVPR. pp. 11286– 11295 (2020)

2020
[79]

NeurIPS34, 251–266 (2021)

Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. NeurIPS34, 251–266 (2021)

2021
[80]

In: CVPR (2018) 20 Xi Li et al

Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A.: Inloc: Indoor visual localization with dense matching and view synthesis. In: CVPR (2018) 20 Xi Li et al

2018

Showing first 80 references.

[1] [1]

In: ICCV (2009)

Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building rome in a day. In: ICCV (2009)

2009

[2] [2]

ai et al

Ai,H.,Cao,Z.,Wang,L.:Asurveyofrepresentationlearning,optimizationstrate- gies, and applications for omnidirectional vision: H. ai et al. IJCV133(8), 4973– 5012 (2025)

2025

[3] [3]

In: IROS

Alama, O., Bhattacharya, A., He, H., Kim, S., Qiu, Y., Wang, W., Ho, C., Keetha, N., Scherer, S.: Rayfronts: Open-set semantic ray frontiers for online scene under- standing and exploration. In: IROS. pp. 5930–5937 (2025)

2025

[4] [4]

In: ECCV

Antequera, M.L., Gargallo, P., Hofinger, M., Bulo, S.R., Kuang, Y., Kontschieder, P.: Mapillary planet-scale depth dataset. In: ECCV. pp. 589–604 (2020)

2020

[5] [5]

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

Armeni,I.,Sax,S.,Zamir,A.R.,Savarese,S.:Joint2d-3d-semanticdataforindoor scene understanding. arXiv preprint arXiv:1702.01105 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

In: CVPR

Barath, D., Matas, J.: Graph-cut ransac. In: CVPR. pp. 6733–6741 (2018)

2018

[7] [7]

In: Sensor fusion IV: control paradigms and data structures

Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992)

1992

[8] [8]

In: CVPR

Bian, J., Lin, W.Y., Matsushita, Y., Yeung, S.K., Nguyen, T.D., Cheng, M.M.: Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence. In: CVPR. pp. 4181–4190 (2017)

2017

[9] [9]

Virtual KITTI 2

Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[10] [10]

IEEE Transactions on Robotics (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb- slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics (2021)

2021

[11] [11]

In: CVPR (2025)

Cao, Z., Zhu, J., Zhang, W., Ai, H., Bai, H., Zhao, H., Wang, L.: Panda: To- wards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. In: CVPR (2025)

2025

[12] [12]

In: 3DV (2017)

Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor envi- ronments. In: 3DV (2017)

2017

[13] [13]

TTT3R: 3D Reconstruction as Test-Time Training

Chen, X., Chen, Y., Xiu, Y., Geiger, A., Chen, A.: Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

In: CVPR

Cruz, S., Hutchcroft, W., Li, Y., Khosravan, N., Boyadzhiev, I., Kang, S.B.: Zil- low indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts. In: CVPR. pp. 2133–2143 (2021)

2021

[15] [15]

In: CVPR (2017)

Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan- net: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017)

2017

[16] [16]

ToG36(4), 1 (2017)

Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., Theobalt, C.: Bundlefusion: Real- time globally consistent 3d reconstruction using on-the-fly surface reintegration. ToG36(4), 1 (2017)

2017

[17] [17]

In: NeurIPS (2022)

Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory- efficient exact attention with io-awareness. In: NeurIPS (2022)

2022

[18] [18]

Vision Transformers Need Registers

Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need reg- isters. arXiv preprint arXiv:2309.16588 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

In: CVPR

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: CVPR. pp. 13142–13153 (2023)

2023

[20] [20]

arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

Deng, J., Li, H., Xie, T., Ren, W., Zhang, Q., Tan, P., Guo, X.: Sail-recon: Large sfm by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972 (2025) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 17

work page arXiv 2025

[21] [21]

In: CVPRW (2018)

DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: CVPRW (2018)

2018

[22] [22]

Numerische Math- ematik1(1), 269–271 (1959)

Dijkstra, E.: A note on two problems in connexion with graphs. Numerische Math- ematik1(1), 269–271 (1959)

1959

[23] [23]

In: 2025 International Conference on 3D Vision (3DV)

Duisterhof, B.P., Zust, L., Weinzaepfel, P., Leroy, V., Cabon, Y., Revaud, J.: Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In: 2025 International Conference on 3D Vision (3DV). pp. 1–10 (2025)

2025

[24] [24]

In: CVPR (2023)

Edstedt,J.,Athanasiadis,I.,Wadenbäck,M.,Felsberg,M.:Dkm:Densekernelized feature matching for geometry estimation. In: CVPR (2023)

2023

[25] [25]

In: CVPR (2024)

Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: Roma: Robust dense feature matching. In: CVPR (2024)

2024

[26] [26]

Driv3R: Learn- ing dense 4d reconstruction for autonomous driving.arXiv preprint arXiv:2412.06777, 2024

Fei, X., Zheng, W., Duan, Y., Zhan, W., Tomizuka, M., Keutzer, K., Lu, J.: Driv3r: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777 (2024)

work page arXiv 2024

[27] [27]

arXiv preprint arXiv:2509.21302 (2025)

Feng, W., Qin, H., Wu, M., Yang, C., Li, Y., Li, X., An, Z., Huang, L., Zhang, Y., Magno, M., et al.: Quantized visual geometry grounded transformer. arXiv preprint arXiv:2509.21302 (2025)

work page arXiv 2025

[28] [28]

Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

Furukawa, Y., Hernández, C.: Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision9(1-2), 1–148 (2015)

2015

[29] [29]

2402–2409 (2006)

Goesele,M.,Curless,B.,Seitz,S.M.:Multi-viewstereorevisited.In:CVPR.vol.2, pp. 2402–2409 (2006)

2006

[30] [30]

In: CVPR

Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: CVPR. pp. 3749–3761 (2022)

2022

[31] [31]

In: CVPR (2025)

Guo, Y., Garg, S., Miangoleh, S.M.H., Huang, X., Ren, L.: Depth any camera: Zero-shot metric depth estimation from any camera. In: CVPR (2025)

2025

[32] [32]

Hartley,R.,Zisserman,A.:Multipleviewgeometryincomputervision.Cambridge university press (2003)

2003

[33] [33]

In: CVPR (2021)

Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T.: Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In: CVPR (2021)

2021

[34] [34]

In: CVPR (2024)

He, X., Sun, J., Wang, Y., Peng, S., Huang, Q., Bao, H., Zhou, X.: Detector-free structure from motion. In: CVPR (2024)

2024

[35] [35]

Toward general-purpose robots via foundation models: A survey and meta-analysis, 2024

Hu, Y., Xie, Q., Jain, V., Francis, J., Patrikar, J., Keetha, N., Kim, S., Xie, Y., Zhang, T., Fang, H.S., et al.: Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782 (2023)

work page arXiv 2023

[36] [36]

In: CVPR

Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learning multi-view stereopsis. In: CVPR. pp. 2821–2830 (2018)

2018

[37] [37]

In: CVPR

Jang, W., Weinzaepfel, P., Leroy, V., Agapito, L., Revaud, J.: Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In: CVPR. pp. 1071–1081 (2025)

2025

[38] [38]

arXiv preprint arXiv:2512.22819 (2025)

Jiang, H., Song, Z., Lou, Z., Xu, R., Tan, M.: Depth anything in360circ: Towards scale invariance in the wild. arXiv preprint arXiv:2512.22819 (2025)

work page arXiv 2025

[39] [39]

TOG44(6), 1–16 (2025)

Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. TOG44(6), 1–16 (2025)

2025

[40] [40]

In: ACM SIGGRAPH (2024)

Jiang, Y., Yu, C., Xie, T., Li, X., Feng, Y., Wang, H., Li, M., Lau, H., Gao, F., Yang, Y., Jiang, C.: Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality. In: ACM SIGGRAPH (2024)

2024

[41] [41]

In: ICCV (2025) 18 Xi Li et al

Jung, D., Choi, J., Lee, Y., Manocha, D.: Im360: Large-scale indoor mapping with 360 cameras. In: ICCV (2025) 18 Xi Li et al

2025

[42] [42]

In: ICCV

Karaev, N., Makarov, Y., Wang, J., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In: ICCV. pp. 6013–6022 (2025)

2025

[43] [43]

In: ECCV

Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: ECCV. pp. 18–35 (2024)

2024

[44] [44]

In: CVPR

Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J.: Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In: CVPR. pp. 21357–21366 (2024)

2024

[45] [45]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

TOG (2023)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG (2023)

2023

[47] [47]

In: NeurIPS

Kümmerle, C., Verdun, C.M., Stöger, D.: Iteratively reweighted least squares for basis pursuit with global linear convergence rate. In: NeurIPS. pp. 2873–2886 (2021)

2021

[48] [48]

arXiv preprint arXiv:2509.26618 (2025)

Li, H., Zheng, W., He, J., Liu, Y., Lin, X., Yang, X., Chen, Y.C., Guo, C.: Da2: Depth anything in any direction. arXiv preprint arXiv:2509.26618 (2025)

work page arXiv 2025

[49] [49]

In: ECCV

Li, X.: Multi-view canonical pose 3d human body reconstruction based on volu- metric tsdf. In: ECCV. pp. 392–397 (2022)

2022

[50] [50]

In: ICCV

Li, X., Rao, T., Pan, C.: Edm: Efficient deep feature matching. In: ICCV. pp. 26198–26208 (2025)

2025

[51] [51]

In: CVPR (2018)

Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from inter- net photos. In: CVPR (2018)

2018

[52] [52]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

arXiv preprint arXiv:2512.16913 (2025)

Lin, X., Song, M., Zhang, D., Lu, W., Li, H., Du, B., Yang, M.H., Nguyen, T., Qi, L.: Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913 (2025)

work page arXiv 2025

[54] [54]

In: ICCV (2023)

Lindenberger, P., Sarlin, P.E., Pollefeys, M.: Lightglue: Local feature matching at light speed. In: ICCV (2023)

2023

[55] [55]

In: CVPR

Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: CVPR. pp. 22160–22169 (2024)

2024

[56] [56]

arXiv preprint arXiv:2510.10726 (2025)

Liu, Y., Min, Z., Wang, Z., Wu, J., Wang, T., Yuan, Y., Luo, Y., Guo, C.: Worldmirror: Universal 3d world reconstruction with any-prior prompting. arXiv preprint arXiv:2510.10726 (2025)

work page arXiv 2025

[57] [57]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[58] [58]

IJCV (2004)

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)

2004

[59] [59]

In: ICCV

Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X.: Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In: ICCV. pp. 6851–6860 (2019)

2019

[60] [60]

Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Com- munications of the ACM65(1), 99–106 (2021) Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes 19

2021

[61] [61]

In: International Workshop on Reproducible Research in Pattern Recog- nition

Moulon,P.,Monasse,P.,Perrot,R.,Marlet,R.: Openmvg: Openmultipleviewge- ometry. In: International Workshop on Reproducible Research in Pattern Recog- nition. pp. 60–74. Springer (2016)

2016

[62] [62]

IEEE Transactions on Robotics (2015)

Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics (2015)

2015

[63] [63]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,etal.:Dinov2:Learningrobust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

In: ICCV

Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for ego- centric 3d machine perception. In: ICCV. pp. 20133–20143 (2023)

2023

[65] [65]

In: CVPR (2025)

Piccinelli, L., Sakaridis, C., Segu, M., Yang, Y.H., Li, S., Abbeloos, W., Van Gool, L.: Unik3d: Universal camera monocular 3d estimation. In: CVPR (2025)

2025

[66] [66]

In: ICCV (2021)

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

2021

[67] [67]

In: ICCV

Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In: ICCV. pp. 10901–10911 (2021)

2021

[68] [68]

In: ICCV

Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV. pp. 10912–10922 (2021)

2021

[69] [69]

In: ICCV (2011)

Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: ICCV (2011)

2011

[70] [70]

In: CVPR (2020)

Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: Learning feature matching with graph neural networks. In: CVPR (2020)

2020

[71] [71]

In: CVPR (2016)

Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

2016

[72] [72]

In: ECCV

Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518 (2016)

2016

[73] [73]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

In: CVPR

Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR. vol. 1, pp. 519–528 (2006)

2006

[75] [75]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

Shen, Y., Zhang, Z., Qu, Y., Cao, L.: Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[76] [76]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906

[77] [77]

In: CVPR (2021)

Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: CVPR (2021)

2021

[78] [78]

In: CVPR

Sun, W., Jiang, W., Trulls, E., Tagliasacchi, A., Yi, K.M.: Acne: Attentive context normalization for robust permutation-equivariant learning. In: CVPR. pp. 11286– 11295 (2020)

2020

[79] [79]

NeurIPS34, 251–266 (2021)

Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. NeurIPS34, 251–266 (2021)

2021

[80] [80]

In: CVPR (2018) 20 Xi Li et al

Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A.: Inloc: Indoor visual localization with dense matching and view synthesis. In: CVPR (2018) 20 Xi Li et al

2018