Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Pith reviewed 2026-05-09 21:44 UTC · model grok-4.3
The pith
Ablation studies show that scaling data diversity, dropping confidence-aware and gradient-based losses, and combining per-sequence with per-frame alignment improve feed-forward 3D visual geometry estimation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic ablations demonstrate that performance in feed-forward visual geometry estimation rises when data diversity and quality are increased, when confidence-aware and gradient-based losses are not used, and when supervision combines per-sequence and per-frame alignment while avoiding local region alignment. Adding a consistency loss that enforces agreement among depth maps, camera parameters, and point maps together with an architectural change for high-resolution processing produces CARVE, which delivers strong and robust accuracy across point cloud reconstruction, video depth estimation, and camera pose and intrinsic estimation benchmarks.
What carries the argument
The consistency loss function that enforces alignment between estimated depth maps, camera parameters, and point maps, combined with the ablation-driven insights on data scaling and alignment strategies.
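The exact loss is not given in the text reviewed here (the referee report raises the same gap), so the following is only a minimal sketch of what such a consistency term could look like. It assumes a pinhole camera, a camera-to-world pose, and a world-space point map stored as a flat row-major list. The names `unproject` and `consistency_loss`, the L1 penalty, and the toy values are illustrative assumptions, not the authors' formulation.

```python
def unproject(depth, fx, fy, cx, cy, R, t):
    """Lift a depth map (list of rows) to world-space points.

    R (3x3 nested list) and t (length 3) are an assumed camera-to-world pose.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Pinhole back-projection into the camera frame.
            cam = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Rigid transform into the world frame.
            world = tuple(
                sum(R[i][j] * cam[j] for j in range(3)) + t[i]
                for i in range(3)
            )
            points.append(world)
    return points


def consistency_loss(depth, intrinsics, R, t, point_map):
    """Mean L1 gap between unprojected depth and a predicted point map."""
    fx, fy, cx, cy = intrinsics
    lifted = unproject(depth, fx, fy, cx, cy, R, t)
    total = sum(
        abs(a - b) for p, q in zip(lifted, point_map) for a, b in zip(p, q)
    )
    return total / (3 * len(lifted))


# Toy 2x2 example with an identity pose (hypothetical values).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin = [0.0, 0.0, 0.0]
depth = [[2.0, 2.0], [2.0, 2.0]]
intrinsics = (1.0, 1.0, 0.5, 0.5)
consistent_points = unproject(depth, 1.0, 1.0, 0.5, 0.5, identity, origin)
loss_when_consistent = consistency_loss(
    depth, intrinsics, identity, origin, consistent_points
)
```

When the three predictions agree, the term is zero; any disagreement between depth, camera, and point outputs shows up as a positive penalty that a training loop could add to its other losses.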
If this is right
- State-of-the-art visual geometry models continue to improve when trained on larger and more varied datasets.
- Confidence-aware and gradient-based losses can be removed without harming final accuracy, and removing them sometimes improves it.
- Joint sequence-level and frame-level supervision produces better cross-frame consistency than local alignment alone.
- High-resolution inputs can be incorporated efficiently once the consistency loss links depths, poses, and points.
- The resulting CARVE model attains competitive numbers on point cloud, depth, and camera estimation tasks across multiple benchmarks.
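The difference between sequence-level and frame-level alignment in the list above can be illustrated with a small sketch. It assumes that "alignment" means fitting a least-squares scale between prediction and ground truth before an L1 comparison; the text reviewed here does not specify the paper's actual alignment procedure, and the function names and toy values are hypothetical.

```python
def ls_scale(pred, target):
    """Least-squares scale s minimizing sum((s * pred - target) ** 2)."""
    denom = sum(p * p for p in pred)
    return sum(p * g for p, g in zip(pred, target)) / denom if denom else 1.0


def l1(pred, target, scale):
    """Mean absolute error after scaling the prediction."""
    return sum(abs(scale * p - g) for p, g in zip(pred, target)) / len(pred)


def joint_alignment_loss(pred_frames, gt_frames):
    """Return (total, per-sequence term, per-frame term)."""
    flat_pred = [p for frame in pred_frames for p in frame]
    flat_gt = [g for frame in gt_frames for g in frame]
    # One scale for the whole sequence: sensitive to cross-frame scale drift.
    seq_term = l1(flat_pred, flat_gt, ls_scale(flat_pred, flat_gt))
    # One scale per frame: isolates per-frame shape accuracy.
    frame_term = sum(
        l1(p, g, ls_scale(p, g)) for p, g in zip(pred_frames, gt_frames)
    ) / len(pred_frames)
    return seq_term + frame_term, seq_term, frame_term


# Toy depths: the second predicted frame has the right shape but twice the scale.
gt_frames = [[1.0, 2.0], [1.0, 2.0]]
pred_frames = [[1.0, 2.0], [2.0, 4.0]]
total, seq_term, frame_term = joint_alignment_loss(pred_frames, gt_frames)
```

In this toy case the per-frame term vanishes because each frame is correct up to its own scale, while the per-sequence term stays positive because a single scale cannot fit both frames. Summing the two is one plausible reading of "joint" supervision.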
Where Pith is reading between the lines
- The same data and alignment choices could be tested in related tasks such as object-level 3D reconstruction to check transfer.
- Models that embed these factors during initial design rather than through later ablation may reach the observed gains with less trial and error.
- If local alignment continues to underperform, future work might focus on global geometric constraints instead of fine-grained local matching.
Load-bearing premise
The patterns observed in the ablation studies on the chosen models and datasets will continue to hold when the same choices are applied to different architectures or to new real-world data.
What would settle it
A controlled experiment in which adding the confidence-aware loss or switching to local region alignment raises accuracy on a held-out benchmark, or in which CARVE fails to match or exceed strong per-frame baselines on a new diverse test set.
Original abstract
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a performance gap in feed-forward visual geometry estimation: multi-frame models offer better cross-frame consistency but lag behind strong per-frame methods in single-frame accuracy. Through ablation studies it claims that scaling data diversity and quality improves results, that confidence-aware and gradient-based losses hinder performance, and that joint per-sequence and per-frame supervision helps while local region alignment degrades it. The authors introduce CARVE, which adds a consistency loss aligning depth maps, camera parameters, and point maps plus an efficient high-resolution design, and report strong results on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation benchmarks.
Significance. If the ablation insights prove robust beyond the tested models and the CARVE gains hold under broader conditions, the work supplies practical guidance on supervision choices and architectural scaling for 3D geometry models. The manuscript's strength lies in its systematic empirical ablations and multi-task benchmark evaluation, which provide concrete, falsifiable observations that could inform subsequent feed-forward pipelines.
major comments (2)
- [Abstract] Abstract: the assertion that confidence-aware and gradient-based losses 'may unintentionally hinder performance' and that local alignment 'surprisingly degrades performance' is presented as a general critical factor, yet the ablations are conducted within fixed architectures and training regimes; no architecture-swap or distribution-shift experiments are described to test whether the directional effects persist, weakening the claim that these findings unlock broader progress.
- [Method / CARVE description] The consistency loss and high-resolution design are central to CARVE's claimed advantages, but the manuscript provides only a high-level description ('enforces alignment between depth maps, camera parameters, and point maps') without the explicit loss formulation, weighting schedule, or architectural diagram, preventing independent verification of how these components produce the reported benchmark gains.
minor comments (2)
- [Experiments] Add a table or section listing all datasets, benchmarks, and baseline implementations with exact references and training details to support reproducibility of the ablation and final results.
- [Abstract] The acronym CARVE is used without expansion on first appearance; provide the full name or definition at its introduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important points regarding the scope of our claims and the need for greater methodological detail. We address each major comment below, clarifying our experimental scope and committing to revisions that enhance reproducibility and precision.
-
Referee: [Abstract] Abstract: the assertion that confidence-aware and gradient-based losses 'may unintentionally hinder performance' and that local alignment 'surprisingly degrades performance' is presented as a general critical factor, yet the ablations are conducted within fixed architectures and training regimes; no architecture-swap or distribution-shift experiments are described to test whether the directional effects persist, weakening the claim that these findings unlock broader progress.
Authors: We appreciate this point on the generality of our observations. Our ablation studies were systematically conducted within established feed-forward visual geometry estimation architectures and training regimes to isolate critical factors in the current state-of-the-art setting. The abstract employs cautious phrasing ('may unintentionally hinder' and 'surprisingly degrades') to reflect these as empirical findings rather than universal claims. To address the concern, we will revise the abstract and the discussion section to explicitly note that the directional effects are observed within the tested models and regimes, and we will add a statement encouraging validation across additional architectures and distributions. This clarification strengthens the manuscript without altering the core empirical contributions.
Revision: partial
-
Referee: [Method / CARVE description] The consistency loss and high-resolution design are central to CARVE's claimed advantages, but the manuscript provides only a high-level description ('enforces alignment between depth maps, camera parameters, and point maps') without the explicit loss formulation, weighting schedule, or architectural diagram, preventing independent verification of how these components produce the reported benchmark gains.
Authors: We agree that additional details are essential for reproducibility and independent verification. In the revised manuscript, we will provide the explicit mathematical formulation of the consistency loss, including the precise terms for aligning depth maps, camera parameters, and point maps, along with the weighting schedule used during training. We will also include a detailed architectural diagram of the high-resolution design and its integration with the base feed-forward model. These additions will directly enable readers to understand and replicate how these components contribute to the benchmark improvements.
Revision: yes
Circularity Check
No circularity: empirical ablation chain is self-contained
The paper's central claims rest on systematic ablation experiments that identify performance factors (data scaling, loss choices, alignment strategies) and then integrate two new design elements (consistency loss and high-resolution architecture) into CARVE. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. No load-bearing self-citations or uniqueness theorems are invoked to justify the core argument. Performance is reported on external benchmarks (point cloud reconstruction, video depth, pose estimation), keeping the derivation independent of its own inputs. This is the expected non-finding for an empirical computer-vision study.
Reference graph
Works this paper leans on
[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Adv. Neural Inform. Process. Syst., 2021.
[2] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. Int. Conf. Learn. Represent., 2024.
[3] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv: Comp. Res. Repository, 2020.
[4] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5410–5418, 2018.
[5] Siheng Chen, Baoan Liu, Chen Feng, Carlos Vallespi-Gonzalez, and Carl Wellington. 3d point cloud processing and learning for autonomous driving: Impacting map creation, localization, and perception. IEEE Trans. Signal Process., 38(1):68–86, 2020.
[6] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 22831–22840, 2025.
[7] Kai Cheng, Yiting Ma, Bin Sun, Yang Li, and Xuejin Chen. Depth estimation for colonoscopy images with self-supervised learning from videos. In Int. Conf. Med. Image Comput. Comput. Assist. Interv., pages 119–128. Springer,
[8] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching. Adv. Neural Inform. Process. Syst., 33:22158–22169, 2020.
[9] Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1212–1221, 2017.
[10] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5828–5839,
[11] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In Int. Conf. Learn. Represent.,
[12] Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8585–8594, 2022.
[13] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In Eur. Conf. Comput. Vis., pages 834–849. Springer, 2014.
[14] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):611–625, 2017.
[15] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11444–11453, 2020.
[16] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In Eur. Conf. Comput. Vis., pages 241–258. Springer, 2024.
[17] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. Int. J. Comput. Vis., 85(1):1–15, 2009.
[18] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. Int. J. Robot. Res., 32(11):1231–1237, 2013.
[19] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. Multi-view stereo for community photo collections. In Int. Conf. Comput. Vis., pages 1–8, 2007.
[20] Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3196–3206, 2020.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016.
[22] Xiankang He, Guangkai Xu, Bo Zhang, Hao Chen, Ying Cui, and Dongyan Guo. Diffcalib: Reformulating monocular camera calibration as diffusion-based dense incident map generation. In Proc. AAAI Conf. Artif. Intell., pages 3428–3436, 2025.
[23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Trans. Pattern Anal. Mach. Intell.,
[24] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2821–2830, 2018.
[25] Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1071–1081, 2025.
[26] HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. On the importance of accurate geometry data for dense 3d vision tasks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 780–791, 2023.
[27] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9492–9502, 2024.
[28] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In Eur. Conf. Comput. Vis., pages 71–91. Springer, 2024.
[29] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10486–10496,
[30] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. In Int. Conf. Learn. Represent., 2026.
[31] Biyang Liu, Huimin Yu, and Yangqi Long. Local similarity pattern and cost self-reassembling for deep stereo matching networks. In Proc. AAAI Conf. Artif. Intell., pages 1647–1655, 2022.
[32] Ming Liu. Robotic online path planning on point cloud. IEEE Trans. Cybern., 46(5):1217–1228, 2015.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Int. Conf. Learn. Represent., 2019.
[34] Bilawal Mahmood, SangUk Han, and Dong-Eun Lee. Bim-based registration and localization of 3d point clouds of indoor scenes using geometric features for augmented reality. Remote Sens., 12(14):2302, 2020.
[35] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4981–4991, 2023.
[36] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Int. Conf. Comput. Vis., pages 3248–3255, 2013.
[37] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE Trans. Robot., 31(5):1147–1163, 2015.
[38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
[39] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. In IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019.
[40] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In Eur. Conf. Comput. Vis., pages 58–77. Springer, 2024.
[41] Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, and Wenshan Wang. Tartanground: A large-scale dataset for ground robot perception and navigation. In IEEE/RSJ Int. Conf. Intell. Robots Syst., pages 20524–20531, 2025.
[42] Alessio Pierluigi Placitelli and Luigi Gallo. Low-cost augmented reality systems via 3d point cloud sensors. In Int. Conf. Signal Image Technol. Internet-Based Syst., pages 188–192. IEEE, 2011.
[43] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Int. Conf. Comput. Vis., pages 10912–10922, 2021.
[44] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4938–4947, 2020.
[45] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4104–4113, 2016.
[46] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., pages 501–518. Springer, 2016.
[47] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3260–3269, 2017.
[48] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2930–2937, 2013.
[49] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, pages 369–386. SPIE, 2019.
[50] B Starly, Z Fang, W Sun, A Shokoufandeh, and W Regli. Three-dimensional reconstruction for medical-cad modeling. Comput. Aided Des. Appl., 2(1-4):431–438, 2005.
[51] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D slam systems. In IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012.
[52] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8922–8931, 2021.
[53] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Adv. Neural Inform. Process. Syst., 34:16558–16569, 2021.
[54] Phillip Thomas, Lars Pandikow, Alex Kim, Michael Stanley, and James Grieve. Open synthetic dataset for improving cyclist detection, 2021.
[55] Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. Smd-nets: Stereo mixture density networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8942–8952, 2021.
[56] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In Int. Conf. 3D Vision, 2025.
[57] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5294–5306, 2025.
[58] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10510–10522, 2025.
[59] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5261–5271, 2025.
[60] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. In Adv. Neural Inform. Process. Syst.,
[61] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In IEEE Conf. Comput. Vis. Pattern Recog., pages 20697–20709, 2024.
[62] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IEEE/RSJ Int. Conf. Intell. Robots Syst., pages 4909–4916. IEEE, 2020.
[63] Xixun Wang, Yoshiki Mizukami, Makoto Tada, and Fumitoshi Matsuno. Navigation of a mobile robot in a dynamic environment using a point cloud map. Artif. Life Robot., 26(1):10–20, 2021.
[64] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. In Int. Conf. Learn. Represent., 2026.
[65] Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, and Feng Zhao. Frozenrecon: Pose-free 3d scene reconstruction with frozen depth models. In Int. Conf. Comput. Vis., pages 9276–9286. IEEE, 2023.
[66] Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. What matters when repurposing diffusion models for general dense perception tasks? In Int. Conf. Learn. Represent.,
[67] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21924–21935, 2025.
[68] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10371–10381, 2024.
[69] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Adv. Neural Inform. Process. Syst., 37:21875–21911, 2024.
[70] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., pages 767–783, 2018.
[71] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1790–1799,
[72] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Int. Conf. Comput. Vis., pages 12–22, 2023.
[73] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In IEEE Conf. Comput. Vis. Pattern Recog., pages 204–213, 2021.
[74] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Int. Conf. Comput. Vis., pages 9043–9053, 2023.
[75] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21936–21947, 2025.
[76] Quanwen Zhu, Long Chen, Qingquan Li, Ming Li, Andreas Nüchter, and Jian Wang. 3d lidar point cloud based intersection recognition for autonomous driving. In IEEE Intell. Veh. Symp., pages 456–461. IEEE, 2012.
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation: Supplementary Material
[77] More Analysis. Comparison with VGGT Fine-tuned on the Same Data. For a fair comparison, we further fine-tune VGGT on “Data3” under the same final training setting as CARVE. The model is initialized from the official VGGT pretrained weights and fully fine-tuned for 30K iterations. Other settings, including the optimizer, data preprocessing, and evaluati...
[78] Experimental Setting Details. In the supplementary material, we provide additional details and quantitative results. 1) We present more training and evaluation details for the ablation study and main experiments; 2) We include extended visualization results in Figure 4. Common Training Details. The experiments were conducted on a server running Ubuntu...
[79] with β₁ = 0.9, β₂ = 0.99, and a weight decay of 0.01. The learning rate is scheduled using the OneCycleLR policy [49]. The longer side of the low-resolution input image is resized to 518 pixels, and the shorter side is then randomly
Figure 3. Removing the spatial gradient loss and confidence loss has minimal impact on qualitative results when continui...
[80] For KITTI, we use the sequences of 2011 09 26 0001, 2011 09 26 0009, 2011 09 26 0091, 2011 09 28 0001, 2011 09 29 0004, and 2011 09 29 0071. For point cloud estimation, we aggregate the predictions of each sequence in world coordinates by stacking the individual estimations. To assess both per-view accuracy and cross-view consistency, we align the stack...