Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Chunhua Shen; Guangkai Xu; Hao Chen; Hua Geng; Huanyi Zheng; Songyi Yin; Yanlong Sun

arxiv: 2604.21713 · v1 · submitted 2026-04-23 · 💻 cs.CV

Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen

Pith reviewed 2026-05-09 21:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D visual geometry estimation · feed-forward models · ablation studies · consistency loss · point cloud reconstruction · video depth estimation · camera pose estimation · data scaling

The pith

Ablation studies show that scaling data diversity, dropping confidence-aware and gradient-based losses, and jointly aligning per sequence and per frame improve feed-forward 3D visual geometry estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the performance gap where multi-frame models for estimating 3D geometry from images achieve better consistency across frames but lower single-frame accuracy than per-frame approaches. Rigorous ablations isolate three main factors: increasing the variety and quality of training data produces gains even in advanced models; common confidence-aware and gradient-based losses tend to reduce overall results; and supervising both entire sequences and individual frames together works better than restricting alignment to small local regions. From these findings the authors build a consistency loss that ties depth maps, camera parameters, and point maps together, plus an efficient way to use high-resolution inputs, and package them in a model called CARVE that records strong results on point cloud reconstruction, video depth, and camera parameter tasks.

Core claim

Systematic ablations demonstrate that performance in feed-forward visual geometry estimation rises when data diversity and quality are increased, when confidence-aware and gradient-based losses are not used, and when supervision combines per-sequence and per-frame alignment while avoiding local region alignment. Adding a consistency loss that enforces agreement among depth maps, camera parameters, and point maps together with an architectural change for high-resolution processing produces CARVE, which delivers strong and robust accuracy across point cloud reconstruction, video depth estimation, and camera pose and intrinsic estimation benchmarks.

What carries the argument

The consistency loss function that enforces alignment between estimated depth maps, camera parameters, and point maps, combined with the ablation-driven insights on data scaling and alignment strategies.
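
The summary does not give the loss formulation, so the following is only a minimal sketch of what a depth, camera, and point-map consistency term could look like. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a camera-to-world pose `(R, t)`, and a mean L1 penalty; all function names and the choice of L1 are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the paper's actual loss is not specified in this summary.
# Assumptions (hypothetical): pinhole intrinsics, camera-to-world pose, mean L1 penalty.

def unproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points in row-major order."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points

def to_world(points, R, t):
    """Apply a rigid camera-to-world transform: X_world = R @ X_cam + t."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def consistency_loss(depth, intrinsics, R, t, point_map):
    """Mean L1 distance between depth-derived points and the predicted point map."""
    derived = to_world(unproject(depth, *intrinsics), R, t)
    total = sum(abs(a - b) for p, q in zip(derived, point_map) for a, b in zip(p, q))
    return total / (3 * len(derived))

# Usage: a point map that agrees exactly with depth and camera gives zero loss.
depth = [[2.0, 2.0], [2.0, 2.0]]
intrinsics = (1.0, 1.0, 0.5, 0.5)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin = [0.0, 0.0, 0.0]
agreeing_points = to_world(unproject(depth, *intrinsics), identity, origin)
print(consistency_loss(depth, intrinsics, identity, origin, agreeing_points))  # 0.0
```

The point of such a term is that the three outputs are penalized whenever they describe different geometry, even if each one is individually close to its own ground truth.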

If this is right

State-of-the-art visual geometry models continue to improve when trained on larger and more varied datasets.
Removing confidence-aware and gradient-based losses does not harm final accuracy and sometimes improves it.
Joint sequence-level and frame-level supervision produces better cross-frame consistency than local alignment alone.
High-resolution inputs can be incorporated efficiently once the consistency loss links depths, poses, and points.
The resulting CARVE model attains competitive numbers on point cloud, depth, and camera estimation tasks across multiple benchmarks.
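
To make the joint-supervision point in the list above concrete, here is a small sketch under a stated assumption: that "alignment" means fitting a least-squares scale factor between predicted and ground-truth depths before measuring error. The summary does not describe the paper's actual alignment procedure, so the functions below are hypothetical illustrations.

```python
# Illustrative sketch only; the alignment used in the paper is not specified here.
# Assumption (hypothetical): alignment is a least-squares scale fit before an L1 error.

def fit_scale(pred, gt):
    """Closed-form scale s minimizing sum((s * pred - gt) ** 2)."""
    return sum(p * g for p, g in zip(pred, gt)) / sum(p * p for p in pred)

def aligned_l1(pred, gt):
    """Mean absolute error after scale alignment."""
    s = fit_scale(pred, gt)
    return sum(abs(s * p - g) for p, g in zip(pred, gt)) / len(pred)

def joint_alignment_loss(pred_frames, gt_frames):
    """Return (total, per_sequence, per_frame) for a sequence of depth frames."""
    flat_pred = [p for frame in pred_frames for p in frame]
    flat_gt = [g for frame in gt_frames for g in frame]
    per_sequence = aligned_l1(flat_pred, flat_gt)  # one scale shared by all frames
    per_frame = sum(
        aligned_l1(p, g) for p, g in zip(pred_frames, gt_frames)
    ) / len(pred_frames)  # an independent scale for every frame
    return per_sequence + per_frame, per_sequence, per_frame

# Usage: each frame is correct up to its own scale, but the scales disagree.
pred_frames = [[1.0, 2.0], [1.0, 2.0]]
gt_frames = [[2.0, 4.0], [4.0, 8.0]]
total, per_sequence, per_frame = joint_alignment_loss(pred_frames, gt_frames)
print(per_frame)     # 0.0
print(per_sequence)  # 1.5
```

In this toy case the per-frame term sees no error because every frame is right up to its own scale, while the per-sequence term exposes the scale drift between frames. That is one plausible reading of why combining both terms could help with cross-frame consistency without giving up single-frame accuracy.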

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same data and alignment choices could be tested in related tasks such as object-level 3D reconstruction to check transfer.
Models that embed these factors during initial design rather than through later ablation may reach the observed gains with less trial and error.
If local alignment continues to underperform, future work might focus on global geometric constraints instead of fine-grained local matching.

Load-bearing premise

The patterns observed in the ablation studies on the chosen models and datasets will continue to hold when the same choices are applied to different architectures or to new real-world data.

What would settle it

A controlled experiment in which adding the confidence-aware loss or switching to local region alignment raises accuracy on a held-out benchmark, or in which CARVE fails to match or exceed strong per-frame baselines on a new diverse test set.

Figures

Figures reproduced from arXiv: 2604.21713 by Chunhua Shen, Guangkai Xu, Hao Chen, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun.

**Figure 1.** Network architecture of our proposed CARVE model. We extract the high-resolution feature and fuse it into the low-resolution …

**Figure 2.** Qualitative results of point cloud estimation on in-the-wild images. The red arrows highlight instances of failed estimations, …

**Figure 3.** Removing the spatial gradient loss and confidence loss …

**Figure 4.** More quantitative results of our CARVE model.

Original abstract

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CARVE adds a consistency loss and high-res design to feed-forward geometry estimation with some ablation flags on common losses, but the broader claims about critical factors rest on limited setups.

The main takeaway is that CARVE improves results on point cloud reconstruction, video depth, and pose estimation by adding a consistency loss that aligns depth maps, camera parameters, and point maps, plus an efficient high-resolution architecture. The ablations also flag that confidence-aware and gradient-based losses can hold performance back while joint per-sequence and per-frame supervision works better than local alignment, and scaling data diversity helps even existing models.

Referee Report

2 major / 2 minor

Summary. The paper identifies a performance gap in feed-forward visual geometry estimation: multi-frame models offer better cross-frame consistency but trail strong per-frame methods on single-frame accuracy. Through ablation studies it claims that scaling data diversity and quality improves results, that confidence-aware and gradient-based losses hinder performance, and that joint per-sequence and per-frame supervision helps while local alignment degrades it. The authors introduce CARVE, which adds a consistency loss aligning depth maps, camera parameters, and point maps plus an efficient high-resolution design, and report strong results on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation benchmarks.

Significance. If the ablation insights prove robust beyond the tested models and the CARVE gains hold under broader conditions, the work supplies practical guidance on supervision choices and architectural scaling for 3D geometry models. The manuscript's strength lies in its systematic empirical ablations and multi-task benchmark evaluation, which provide concrete, falsifiable observations that could inform subsequent feed-forward pipelines.

major comments (2)

[Abstract] The assertion that confidence-aware and gradient-based losses 'may unintentionally hinder performance' and that local alignment 'surprisingly degrades performance' is presented as a general critical factor, yet the ablations are conducted within fixed architectures and training regimes; no architecture-swap or distribution-shift experiments are described to test whether the directional effects persist, weakening the claim that these findings unlock broader progress.
[Method / CARVE description] The consistency loss and high-resolution design are central to CARVE's claimed advantages, but the manuscript provides only a high-level description ('enforces alignment between depth maps, camera parameters, and point maps') without the explicit loss formulation, weighting schedule, or architectural diagram, preventing independent verification of how these components produce the reported benchmark gains.

minor comments (2)

[Experiments] Add a table or section listing all datasets, benchmarks, and baseline implementations with exact references and training details to support reproducibility of the ablation and final results.
[Abstract] The acronym CARVE is used without expansion on first appearance; provide the full name or definition at its introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important points regarding the scope of our claims and the need for greater methodological detail. We address each major comment below, clarifying our experimental scope and committing to revisions that enhance reproducibility and precision.

Referee: [Abstract] The assertion that confidence-aware and gradient-based losses 'may unintentionally hinder performance' and that local alignment 'surprisingly degrades performance' is presented as a general critical factor, yet the ablations are conducted within fixed architectures and training regimes; no architecture-swap or distribution-shift experiments are described to test whether the directional effects persist, weakening the claim that these findings unlock broader progress.

Authors: We appreciate this point on the generality of our observations. Our ablation studies were systematically conducted within established feed-forward visual geometry estimation architectures and training regimes to isolate critical factors in the current state-of-the-art setting. The abstract employs cautious phrasing ('may unintentionally hinder' and 'surprisingly degrades') to reflect these as empirical findings rather than universal claims. To address the concern, we will revise the abstract and the discussion section to explicitly note that the directional effects are observed within the tested models and regimes, and we will add a statement encouraging validation across additional architectures and distributions. This clarification strengthens the manuscript without altering the core empirical contributions.

Revision: partial

Referee: [Method / CARVE description] The consistency loss and high-resolution design are central to CARVE's claimed advantages, but the manuscript provides only a high-level description ('enforces alignment between depth maps, camera parameters, and point maps') without the explicit loss formulation, weighting schedule, or architectural diagram, preventing independent verification of how these components produce the reported benchmark gains.

Authors: We agree that additional details are essential for reproducibility and independent verification. In the revised manuscript, we will provide the explicit mathematical formulation of the consistency loss, including the precise terms for aligning depth maps, camera parameters, and point maps, along with the weighting schedule used during training. We will also include a detailed architectural diagram of the high-resolution design and its integration with the base feed-forward model. These additions will directly enable readers to understand and replicate how these components contribute to the benchmark improvements.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ablation chain is self-contained

The paper's central claims rest on systematic ablation experiments that identify performance factors (data scaling, loss choices, alignment strategies) and then integrate two new design elements (consistency loss and high-resolution architecture) into CARVE. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. No load-bearing self-citations or uniqueness theorems are invoked to justify the core argument. Performance is reported on external benchmarks (point cloud reconstruction, video depth, pose estimation), keeping the derivation independent of its own inputs. This is the expected non-finding for an empirical computer-vision study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the named model CARVE; all details on training assumptions or data processing are absent.

pith-pipeline@v0.9.0 · 5517 in / 1106 out tokens · 36344 ms · 2026-05-09T21:44:23.057622+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages

[1]

ARK- itscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARK- itscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InAdv. Neural In- form. Process. Syst., 2021. 4, 1, 2, 3

work page 2021
[2]

Depth pro: Sharp monocular metric depth in less than a second.Int

Aleksei Bochkovskii, Ama ˜AG ¸ l Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.Int. Conf. Learn. Represent., 2024. 1, 2, 3

work page 2024
[3]

Vir- tual kitti 2.arXiv: Comp

Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- tual kitti 2.arXiv: Comp. Res. Repository, 2020. 4, 1, 2, 3

work page 2020
[4]

Pyramid stereo matching network

Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. InIEEE Conf. Comput. Vis. Pattern Recog., pages 5410–5418, 2018. 2

work page 2018
[5]

3d point cloud processing and learning for autonomous driving: Impacting map cre- ation, localization, and perception.IEEE Trans

Siheng Chen, Baoan Liu, Chen Feng, Carlos Vallespi- Gonzalez, and Carl Wellington. 3d point cloud processing and learning for autonomous driving: Impacting map cre- ation, localization, and perception.IEEE Trans. Signal Pro- cess., 38(1):68–86, 2020. 1

work page 2020
[6]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zi- long Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. InIEEE Conf. Comput. Vis. Pattern Recog., pages 22831– 22840, 2025. 3

work page 2025
[7]

Depth estimation for colonoscopy images with self- supervised learning from videos

Kai Cheng, Yiting Ma, Bin Sun, Yang Li, and Xuejin Chen. Depth estimation for colonoscopy images with self- supervised learning from videos. InInt. Conf. Med. Image Comput. Comput. Assist. Interv., pages 119–128. Springer,

work page
[8]

Hierarchical neural architecture search for deep stereo matching.Adv

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching.Adv. Neural Inform. Process. Syst., 33:22158–22169, 2020. 2

work page 2020
[9]

Hsfm: Hybrid structure-from-motion

Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. InIEEE Conf. Com- put. Vis. Pattern Recog., pages 1212–1221, 2017. 1, 2

work page 2017
[10]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5828–5839,

work page
[11]

FlashAttention-2: Faster attention with better paral- lelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better paral- lelism and work partitioning. InInt. Conf. Learn. Represent.,

work page
[12]

Transmvsnet: Global context-aware multi-view stereo network with trans- formers

Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with trans- formers. InIEEE Conf. Comput. Vis. Pattern Recog., pages 8585–8594, 2022. 2

work page 2022
[13]

Lsd- slam: Large-scale direct monocular slam

Jakob Engel, Thomas Sch ¨ops, and Daniel Cremers. Lsd- slam: Large-scale direct monocular slam. InEur. Conf. Com- put. Vis., pages 834–849. Springer, 2014. 2

work page 2014
[14]

Direct sparse odometry.IEEE Trans

Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry.IEEE Trans. Pattern Anal. Mach. Intell., 40(3):611–625, 2017. 2

work page 2017
[15]

Graspnet-1billion: A large-scale benchmark for general ob- ject grasping

Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general ob- ject grasping. InIEEE Conf. Comput. Vis. Pattern Recog., pages 11444–11453, 2020. 4, 1, 2, 3

work page 2020
[16]

Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image

Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image. InEur. Conf. Comput. Vis., pages 241–258. Springer, 2024. 2

work page 2024
[17]

Accurate, dense, and robust multi-view stereopsis.Int

Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis.Int. J. Comput. Vis., 85(1):1– 15, 2009. 2

work page 2009
[18]

Vision meets robotics: The kitti dataset.Int

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset.Int. J. Robot. Res., 32(11):1231–1237, 2013. 2, 8, 3

work page 2013
[19]

Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for com- munity photo collections. InInt. Conf. Comput. Vis., pages 1–8, 2007. 2

work page 2007
[20]

Honnotate: A method for 3d annotation of hand and object poses

Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vin- cent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. InIEEE Conf. Comput. Vis. Pattern Recog., pages 3196–3206, 2020. 2, 8, 3

work page 2020
[21]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InIEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 5

work page 2016
[22]

Diffcalib: Reformulating monocu- lar camera calibration as diffusion-based dense incident map generation

Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, and Dongyan Guo. Diffcalib: Reformulating monocu- lar camera calibration as diffusion-based dense incident map generation. InProc. AAAI Conf. Artif. Intell., pages 3428– 3436, 2025. 2

work page 2025
[23]

Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE Trans

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE Trans. Pattern Anal. Mach. Intell.,

work page
[24]

Deepmvs: Learning multi-view stereopsis

Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. InIEEE Conf. Comput. Vis. Pattern Recog., pages 2821–2830, 2018. 2, 4, 1, 3

work page 2018
[25]

Pow3r: Empowering un- constrained 3d reconstruction with camera and scene priors

Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lour- des Agapito, and Jerome Revaud. Pow3r: Empowering un- constrained 3d reconstruction with camera and scene priors. InIEEE Conf. Comput. Vis. Pattern Recog., pages 1071– 1081, 2025. 2

work page 2025
[26]

On the importance of accurate geometry data for dense 3d vision tasks

HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. On the importance of accurate geometry data for dense 3d vision tasks. InIEEE Conf. Comput. Vis. Pattern Recog., pages 780–791, 2023. 2, 8, 3

work page 2023
[27]

Repurpos- ing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. InIEEE Conf. Comput. Vis. Pattern Recog., pages 9492–9502, 2024. 2

work page 2024
[28]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEur. Conf. Comput. Vis., pages 71–91. Springer, 2024. 1, 2, 3

work page 2024
[29]

Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holyn- ski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. InIEEE Conf. Comput. Vis. Pattern Recog., pages 10486–10496,

work page
[30]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth any- thing 3: Recovering the visual space from any views. InInt. Conf. Learn. Represent., 2026. 1, 2

work page 2026
[31]

Local similarity pattern and cost self-reassembling for deep stereo matching networks

Biyang Liu, Huimin Yu, and Yangqi Long. Local similarity pattern and cost self-reassembling for deep stereo matching networks. InProc. AAAI Conf. Artif. Intell., pages 1647– 1655, 2022. 2

work page 2022
[32]

Robotic online path planning on point cloud

Ming Liu. Robotic online path planning on point cloud. IEEE Trans. Cybern., 46(5):1217–1228, 2015. 1

work page 2015
[33]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInt. Conf. Learn. Represent., 2019. 1

work page 2019
[34]

Bim- based registration and localization of 3d point clouds of in- door scenes using geometric features for augmented reality

Bilawal Mahmood, SangUk Han, and Dong-Eun Lee. Bim- based registration and localization of 3d point clouds of in- door scenes using geometric features for augmented reality. Remote Sens., 12(14):2302, 2020. 1

work page 2020
[35]

Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nali- vayko, and Andr ´es Bruhn. Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo. InIEEE Conf. Comput. Vis. Pattern Recog., pages 4981–4991, 2023. 4, 1, 2, 3

work page 2023
[36]

Global fusion of relative motions for robust, accurate and scalable structure from motion

Pierre Moulon, Pascal Monasse, and Renaud Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. InInt. Conf. Comput. Vis., pages 3248–3255, 2013. 1, 2

work page 2013
[37]

Orb-slam: A versatile and accurate monocular slam system.IEEE Trans

Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system.IEEE Trans. Robot., 31(5):1147–1163, 2015. 2

work page 2015
[38]

DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024. 2

work page 2024
[39]

Palazzolo, J

E. Palazzolo, J. Behley, P. Lottes, P. Gigu `ere, and C. Stach- niss. ReFusion: 3D Reconstruction in Dynamic Envi- ronments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019. 2, 8, 3

work page 2019
[40]

Global structure-from-motion revisited

Linfei Pan, D ´aniel Bar´ath, Marc Pollefeys, and Johannes L Sch¨onberger. Global structure-from-motion revisited. InEur. Conf. Comput. Vis., pages 58–77. Springer, 2024. 1, 2

work page 2024
[41]

Tartan- ground: A large-scale dataset for ground robot perception and navigation

Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Se- bastian Scherer, Marco Hutter, and Wenshan Wang. Tartan- ground: A large-scale dataset for ground robot perception and navigation. InIEEE/RSJ Int. Conf. Intell. Robots Syst., pages 20524–20531, 2025. 4, 1, 2, 3

work page 2025
[42]

Low-cost aug- mented reality systems via 3d point cloud sensors

Alessio Pierluigi Placitelli and Luigi Gallo. Low-cost aug- mented reality systems via 3d point cloud sensors. InInt. Conf. Signal Image Technol. Internet-Based Syst., pages 188–192. IEEE, 2011. 1

work page 2011
[43]

Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding. InInt. Conf. Comput. Vis., pages 10912–10922, 2021. 4, 1, 2, 3

work page 2021
[44]

Superglue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InIEEE Conf. Com- put. Vis. Pattern Recog., pages 4938–4947, 2020. 2

work page 2020
[45]

Structure- from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InIEEE Conf. Comput. Vis. Pattern Recog., pages 4104–4113, 2016. 1, 2

work page 2016
[46]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch ¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InEur. Conf. Comput. Vis., pages 501–518. Springer, 2016. 2

work page 2016
[47]

A multi-view stereo benchmark with high- resolution images and multi-camera videos

Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- dreas Geiger. A multi-view stereo benchmark with high- resolution images and multi-camera videos. InIEEE Conf. Comput. Vis. Pattern Recog., pages 3260–3269, 2017. 2, 8, 3

work page 2017
[48]

Scene co- ordinate regression forests for camera relocalization in rgb-d images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene co- ordinate regression forests for camera relocalization in rgb-d images. InIEEE Conf. Comput. Vis. Pattern Recog., pages 2930–2937, 2013. 2, 8, 3

work page 2013
[49]

Super-convergence: Very fast training of neural networks using large learn- ing rates

Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learn- ing rates. InArtificial intelligence and machine learning for multi-domain operations applications, pages 369–386. SPIE, 2019. 1

work page 2019
[50]

Three-dimensional reconstruction for medical-cad modeling

B Starly, Z Fang, W Sun, A Shokoufandeh, and W Regli. Three-dimensional reconstruction for medical-cad modeling. Comput. Aided Des. Appl., 2(1-4):431–438, 2005. 1

work page 2005
[51]

A benchmark for the evalua- tion of RGB-D slam systems

J ¨urgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evalua- tion of RGB-D slam systems. InIEEE/RSJ Int. Conf. Intell. Robots Syst., 2012. 2, 8, 3

work page 2012
[52]

Loftr: Detector-free local feature match- ing with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature match- ing with transformers. InIEEE Conf. Comput. Vis. Pattern Recog., pages 8922–8931, 2021. 2, 8

work page 2021
[53]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Adv

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Adv. Neural In- form. Process. Syst., 34:16558–16569, 2021. 2

work page 2021
[54]

Open synthetic dataset for improving cyclist detection, 2021

Phillip Thomas, Lars Pandikow, Alex Kim, Michael Stan- ley, and James Grieve. Open synthetic dataset for improving cyclist detection, 2021. 4, 1, 2, 3

work page 2021
[55]

Smd-nets: Stereo mixture density networks

Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. Smd-nets: Stereo mixture density networks. InIEEE Conf. Comput. Vis. Pattern Recog., pages 8942–8952, 2021. 4, 1, 2, 3

work page 2021
[56]

3d reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. InInt. Conf. 3D Vision, 2025. 1, 2, 3, 8

work page 2025
[57]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InIEEE Conf. Comput. Vis. Pattern Recog., pages 5294–5306, 2025. 1, 2, 3, 4, 5, 8

work page 2025
[58]

Continuous 3d per- ception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d per- ception model with persistent state. InIEEE Conf. Comput. Vis. Pattern Recog., pages 10510–10522, 2025. 1, 2, 3

work page 2025
[59]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InIEEE Conf. Comput. Vis. Pattern Recog., pages 5261–5271, 2025. 1, 2, 3

work page 2025
[60]

Moge-2: Accurate monocular geometry with metric scale and sharp details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. InAdv. Neural Inform. Process. Syst.,

[61] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In IEEE Conf. Comput. Vis. Pattern Recog., pages 20697–20709, 2024.

[62] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IEEE/RSJ Int. Conf. Intell. Robots Syst., pages 4909–4916. IEEE, 2020.

[63] Xixun Wang, Yoshiki Mizukami, Makoto Tada, and Fumitoshi Matsuno. Navigation of a mobile robot in a dynamic environment using a point cloud map. Artif. Life Robot., 26(1):10–20, 2021.

[64] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. In Int. Conf. Learn. Represent., 2026.

[65] Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, and Feng Zhao. Frozenrecon: Pose-free 3d scene reconstruction with frozen depth models. In Int. Conf. Comput. Vis., pages 9276–9286. IEEE, 2023.

[66] Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. What matters when repurposing diffusion models for general dense perception tasks? In Int. Conf. Learn. Represent.,

[67] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21924–21935, 2025.

[68] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10371–10381, 2024.

[69] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Adv. Neural Inform. Process. Syst., 37:21875–21911, 2024.

[70] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., pages 767–783, 2018.

[71] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1790–1799,

[72] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Int. Conf. Comput. Vis., pages 12–22, 2023.

[73] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In IEEE Conf. Comput. Vis. Pattern Recog., pages 204–213, 2021.

[74] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Int. Conf. Comput. Vis., pages 9043–9053, 2023.

[75] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21936–21947, 2025.

[76] Quanwen Zhu, Long Chen, Qingquan Li, Ming Li, Andreas Nüchter, and Jian Wang. 3d lidar point cloud based intersection recognition for autonomous driving. In IEEE Intell. Veh. Symp., pages 456–461. IEEE, 2012.

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Supplementary Material

[77] More Analysis

Comparison with VGGT Fine-tuned on the Same Data. For a fair comparison, we further fine-tune VGGT on “Data3” under the same final training setting as CARVE. The model is initialized from the official VGGT pretrained weights and fully fine-tuned for 30K iterations. Other settings, including the optimizer, data preprocessing, and evaluati...

[78] Experimental Setting Details

In the supplementary material, we provide additional details and quantitative results. 1) We present more training and evaluation details for the ablation study and main experiments; 2) We include extended visualization results in Figure 4.

Common Training Details. The experiments were conducted on a server running Ubuntu...

[79] $L_{\mathrm{reg}}(W_{\mathrm{inv}}) + L_F + L_{\mathrm{consis}}$

with $\beta_1 = 0.9$, $\beta_2 = 0.99$, and a weight decay of 0.01. The learning rate is scheduled using the OneCycleLR policy [49]. The longer side of the low-resolution input image is resized to 518 pixels, and the shorter side is then randomly ...

Figure 3. Removing the spatial gradient loss and confidence loss has minimal impact on qualitative results when continui...
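The one-cycle learning-rate policy named above can be illustrated with a minimal, dependency-free sketch. It follows the general shape of a cosine one-cycle schedule (warm up to a peak, then anneal to a small floor) and is not guaranteed to match any particular library's step-for-step behaviour. The peak rate `max_lr` and the shape parameters `pct_start`, `div_factor`, and `final_div_factor` are assumptions chosen for illustration; the excerpt does not state them.

```python
import math


def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Cosine one-cycle schedule (illustrative sketch, hypothetical defaults).

    The rate rises from max_lr / div_factor to max_lr during the first
    pct_start fraction of training, then decays to
    (max_lr / div_factor) / final_div_factor.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))

    if step < warmup_steps:
        # Warm-up phase: initial_lr -> max_lr.
        t = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Annealing phase: max_lr -> min_lr.
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        start, end = max_lr, min_lr

    # Cosine interpolation: returns `start` at t = 0 and `end` at t = 1.
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))


# Optimizer settings quoted in the text; the learning rate itself is not given.
OPTIMIZER_SETTINGS = {"betas": (0.9, 0.99), "weight_decay": 0.01}
```

With a hypothetical peak of `1e-4` over 1000 steps, `one_cycle_lr(300, 1000, 1e-4)` returns the peak value, and the first and last steps sit at the small warm-up and floor values respectively.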

[80] For KITTI, we use the sequences of 2011 09 26 0001, 2011 09 26 0009, 2011 09 26 0091, 2011 09 28 0001, 2011 09 29 0004, and 2011 09 29 0071. For point cloud estimation, we aggregate the predictions of each sequence in world coordinates by stacking the individual estimations. To assess both per-view accuracy and cross-view consistency, we align the stack...
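The stacking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes each frame's points are given in that frame's camera coordinates together with a 4x4 camera-to-world matrix, and the helper names `camera_to_world` and `stack_sequence` are hypothetical. The subsequent alignment of the stacked cloud is truncated in the excerpt and is not shown.

```python
def camera_to_world(points_cam, cam_to_world):
    """Map (x, y, z) points from camera to world coordinates.

    cam_to_world is a 4x4 rigid transform given as nested row-major lists;
    only its top three rows are used.
    """
    world = []
    for x, y, z in points_cam:
        world.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in cam_to_world[:3]
        ))
    return world


def stack_sequence(per_frame_points, per_frame_poses):
    """Concatenate all frames of one sequence into a single world-space cloud."""
    stacked = []
    for points, pose in zip(per_frame_points, per_frame_poses):
        stacked.extend(camera_to_world(points, pose))
    return stacked


# Two frames observing a point one unit ahead; the second camera is
# shifted two units along the world x-axis.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shifted = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
cloud = stack_sequence([[(0.0, 0.0, 1.0)], [(0.0, 0.0, 1.0)]],
                       [identity, shifted])
```

Because every frame is expressed in one shared frame before concatenation, errors in either per-frame geometry or cross-frame pose show up in the stacked cloud, which is why a single stacked evaluation can probe both.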

[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Adv. Neural Inform. Process. Syst., 2021.

[2] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. Int. Conf. Learn. Represent., 2024.

[3] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv: Comp. Res. Repository, 2020.

[4] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5410–5418, 2018.

[5] Siheng Chen, Baoan Liu, Chen Feng, Carlos Vallespi-Gonzalez, and Carl Wellington. 3d point cloud processing and learning for autonomous driving: Impacting map creation, localization, and perception. IEEE Trans. Signal Process., 38(1):68–86, 2020.

[6] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 22831–22840, 2025.

[7] Kai Cheng, Yiting Ma, Bin Sun, Yang Li, and Xuejin Chen. Depth estimation for colonoscopy images with self-supervised learning from videos. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 119–128. Springer,

[8] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching. Adv. Neural Inform. Process. Syst., 33:22158–22169, 2020.

[9] Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1212–1221, 2017.

[10] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5828–5839,

[11] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In Int. Conf. Learn. Represent.,

[12] Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8585–8594, 2022.

[13] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In Eur. Conf. Comput. Vis., pages 834–849. Springer, 2014.

[14] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):611–625, 2017.

[15] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11444–11453, 2020.

[16] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In Eur. Conf. Comput. Vis., pages 241–258. Springer, 2024.

[17] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. Int. J. Comput. Vis., 85(1):1–15, 2009.

[18] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. Int. J. Robot. Res., 32(11):1231–1237, 2013.

[19] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for community photo collections. In Int. Conf. Comput. Vis., pages 1–8, 2007.

[20] Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3196–3206, 2020.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016.

[22] Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, and Dongyan Guo. Diffcalib: Reformulating monocular camera calibration as diffusion-based dense incident map generation. In Proc. AAAI Conf. Artif. Intell., pages 3428–3436, 2025.

[23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Trans. Pattern Anal. Mach. Intell.,

[24] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2821–2830, 2018.

[25] Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1071–1081, 2025.

[26] HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. On the importance of accurate geometry data for dense 3d vision tasks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 780–791, 2023.

[27] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9492–9502, 2024.

[28] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In Eur. Conf. Comput. Vis., pages 71–91. Springer, 2024.

[29] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10486–10496,

[30] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. In Int. Conf. Learn. Represent., 2026.

[31] Biyang Liu, Huimin Yu, and Yangqi Long. Local similarity pattern and cost self-reassembling for deep stereo matching networks. In Proc. AAAI Conf. Artif. Intell., pages 1647–1655, 2022.

[32] Ming Liu. Robotic online path planning on point cloud. IEEE Trans. Cybern., 46(5):1217–1228, 2015.

[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Int. Conf. Learn. Represent., 2019.

[34] Bilawal Mahmood, SangUk Han, and Dong-Eun Lee. Bim-based registration and localization of 3d point clouds of indoor scenes using geometric features for augmented reality. Remote Sens., 12(14):2302, 2020.

[35] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4981–4991, 2023.

[36] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Int. Conf. Comput. Vis., pages 3248–3255, 2013.

[37] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE Trans. Robot., 31(5):1147–1163, 2015.

[38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

[39] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019.

[40] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In Eur. Conf. Comput. Vis., pages 58–77. Springer, 2024.

[41] Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, and Wenshan Wang. Tartanground: A large-scale dataset for ground robot perception and navigation. In IEEE/RSJ Int. Conf. Intell. Robots Syst., pages 20524–20531, 2025.

[42] Alessio Pierluigi Placitelli and Luigi Gallo. Low-cost augmented reality systems via 3d point cloud sensors. In Int. Conf. Signal Image Technol. Internet-Based Syst., pages 188–192. IEEE, 2011.

[43] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Int. Conf. Comput. Vis., pages 10912–10922, 2021.

[44] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4938–4947, 2020.

[45] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4104–4113, 2016.

[46] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., pages 501–518. Springer, 2016.

[47] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3260–3269, 2017.

[48] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2930–2937, 2013.

[49] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, pages 369–386. SPIE, 2019.

[50] B Starly, Z Fang, W Sun, A Shokoufandeh, and W Regli. Three-dimensional reconstruction for medical-cad modeling. Comput. Aided Des. Appl., 2(1-4):431–438, 2005.

[51] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D slam systems. In IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012.

[52] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8922–8931, 2021.
