Scene Grounding In the Wild
Pith reviewed 2026-05-14 23:24 UTC · model grok-4.3
The pith
Partial 3D reconstructions from sparse in-the-wild images can be globally aligned to a complete reference scene model derived from Google Earth Studio renderings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We represent the reference model using 3D Gaussian Splatting augmented with semantic features and formulate alignment as an inverse feature-based optimization that estimates a global 6DoF pose and scale while keeping the reference fixed. This grounds each partial reconstruction to the complete reference, producing globally consistent results even without visual overlap between input views. We also introduce the WikiEarth dataset that registers existing partial reconstructions with the pseudo-synthetic reference models.
What carries the argument
Augmented 3D Gaussian Splatting features used in inverse feature-based optimization to recover global 6DoF pose and scale for each partial reconstruction.
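As a rough illustration of what a "global 6DoF pose and scale" parameterizes, the sketch below applies a similarity transform x -> s * R * x + t to the points of a partial reconstruction. It is a minimal example under stated assumptions, not the paper's implementation: `rotation_z`, `apply_similarity`, and the sample values are hypothetical names and numbers introduced here, and a rotation about a single axis stands in for a general 3D rotation.

```python
import math


def rotation_z(angle_rad):
    """Return a 3x3 rotation about the z-axis (stand-in for a general rotation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_similarity(points, scale, rotation, translation):
    """Map every 3D point x to scale * (rotation @ x) + translation."""
    aligned = []
    for x in points:
        rotated = [sum(rotation[r][c] * x[c] for c in range(3)) for r in range(3)]
        aligned.append([scale * rotated[r] + translation[r] for r in range(3)])
    return aligned


# Hypothetical partial reconstruction expressed in its own local frame.
partial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical alignment: scale 2, quarter turn about z, shift of 10 along x.
aligned = apply_similarity(partial, 2.0, rotation_z(math.pi / 2), [10.0, 0.0, 0.0])
print(aligned)
```

The reference model stays untouched in this picture; only the seven numbers (one scale, three for rotation, three for translation) of each partial reconstruction are free.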
Load-bearing premise
Real-world photographs and pseudo-synthetic renderings share the same underlying scene semantics that can be captured by augmented Gaussian features despite large appearance differences.
What would settle it
Run the alignment on WikiEarth scenes that have ground-truth registrations, and check whether the estimated poses remain accurate once the semantic feature augmentation is removed from the Gaussians.
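Checking whether estimated poses "remain accurate" needs an error measure. The sketch below uses two common choices, the geodesic angle between rotations and the Euclidean distance between translations. These definitions are an assumption made here; the paper's exact metrics are not stated on this page, and `rotation_error_deg` and `translation_error` are hypothetical helper names.

```python
import math


def rotation_error_deg(r_est, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_est^T @ r_gt) equals the sum of elementwise products.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(rotation_error_deg(identity, quarter_turn))
print(translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0]))
```

Running the same measurement with and without the semantic features would give the comparison this paragraph asks for.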
Original abstract
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.
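The alignment described in the abstract can be written down as an optimization problem. The form below is a sketch under assumptions, not the paper's equation: every symbol is introduced here for illustration. $\mathcal{G}$ is the fixed feature-augmented Gaussian reference model, $C_k$ is the camera of the $k$-th image in the partial reconstruction's local frame, $T_{s,R,t}$ is the similarity transform with scale $s$, rotation $R$, and translation $t$, $\Phi(\mathcal{G}; C)$ is the feature map rendered from $\mathcal{G}$ at camera $C$, $F_k$ is the feature map extracted from the real photograph $k$, and $\rho$ is some robust penalty.

```latex
(s^{*}, R^{*}, t^{*})
  = \operatorname*{arg\,min}_{s > 0,\; R \in SO(3),\; t \in \mathbb{R}^{3}}
    \sum_{k=1}^{K} \sum_{p}
    \rho\!\left(
      \bigl\lVert \Phi\bigl(\mathcal{G};\, T_{s,R,t}(C_k)\bigr)(p) - F_k(p) \bigr\rVert
    \right)
```

Only $s$, $R$, and $t$ are optimized; $\mathcal{G}$ does not change, which matches the statement that the reference model is kept fixed.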
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that partial 3D reconstructions from unstructured in-the-wild imagery can be globally aligned to a complete reference model derived from Google Earth Studio pseudo-synthetic renderings by representing the reference with 3D Gaussian Splatting augmented by semantic features and solving an inverse feature-based optimization for 6DoF pose and scale. The approach is reported to improve alignment when initialized from classical or learning-based pipelines and to mitigate failure modes of end-to-end models. It is supported by the new WikiEarth dataset, which registers partial reconstructions to the reference models.
Significance. If the central claim holds, the work would provide a practical route to consistent large-scale scene reconstruction under minimal overlap, leveraging domain-invariant semantics to connect real imagery with geospatial references. This could benefit downstream tasks such as city-scale mapping, AR/VR content creation, and change detection. The WikiEarth dataset itself would be a useful benchmark resource.
major comments (2)
- [Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.
- [Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.
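One simple form the requested cross-domain matching measurement could take is nearest-neighbor accuracy under cosine similarity. The sketch below is illustrative only: `cosine` and `matching_accuracy` are hypothetical helpers, the descriptors are toy values, and it assumes that entry i of the real-image features and entry i of the rendered features describe the same scene point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def matching_accuracy(real_feats, rendered_feats):
    """Fraction of real features whose most similar rendered feature shares their index."""
    correct = 0
    for i, feat in enumerate(real_feats):
        best = max(range(len(rendered_feats)),
                   key=lambda j: cosine(feat, rendered_feats[j]))
        correct += int(best == i)
    return correct / len(real_feats)


# Toy descriptors: the first pair agrees across domains, the other two are swapped.
real = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rendered = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.1, 0.9, 0.0]]
accuracy = matching_accuracy(real, rendered)
print(accuracy)
```

Reporting such a number per feature type would help separate the contribution of the semantic features from that of the initialization.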
minor comments (1)
- [Abstract] The abstract and introduction use the term 'augmented Gaussian features' without an early pointer to the precise definition or equation that introduces the feature vector attached to each Gaussian.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the method and experiments.
Point-by-point responses
-
Referee: [Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.
Authors: We agree that the Method section currently describes the semantic feature augmentation at a high level. In the revised manuscript we will expand this subsection to specify the feature extractor architecture, its pre-training data, and the invariance mechanism used to support cross-domain correspondences.
Revision: yes
-
Referee: [Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.
Authors: We acknowledge that the current Experiments section does not contain ablations isolating the semantic features or quantitative cross-domain matching accuracy. In the revision we will add these studies together with error analysis to better attribute the observed gains.
Revision: yes
Circularity Check
No significant circularity; derivation relies on external optimization and stated insight
The paper formulates alignment as an inverse feature-based optimization that estimates 6DoF pose and scale while holding the reference 3D Gaussian Splatting model fixed. The key insight that real photographs and Google Earth Studio renderings share underlying scene semantics is asserted directly rather than derived from any equation or prior result within the paper. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The approach is therefore evaluated against inputs external to it, namely the WikiEarth dataset and the reference models.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real-world photographs and Google Earth pseudo-synthetic renderings share the same underlying scene semantics
Reference graph
Works this paper leans on
- [1] P.J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
- [2] Hana Bezalel, Dotan Ankri, Ruojin Cai, and Hadar Averbuch-Elor. Extreme rotation estimation in the wild,
- [3] Ruojin Cai, Bharath Hariharan, Noah Snavely, and Hadar Averbuch-Elor. Extreme rotation estimation using dense correlation volumes, 2021.
- [4] Jiahao Chang, Yinglin Xu, Yihao Li, Yuantao Chen, and Xiaoguang Han. Gaussreg: Fast 3d registration with gaussian splatting, 2024.
- [5] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pages 202–221. Springer, 2020.
- [6] Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning, 2021.
- [7] Yu Chen and Gim Hee Lee. Dreg-nerf: Deep registration for neural radiance fields, 2023.
- [8] Christopher Choy, Wei Dong, and Vladlen Koltun. Deep global registration, 2020.
- [9] Andrea Cohen, Johannes L Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-outdoor 3d reconstruction alignment. In European Conference on Computer Vision, pages 285–300. Springer,
- [10] Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan Gunturk. Iteratively re-weighted least squares minimization for sparse recovery, 2008.
- [11] Shay Dekel, Yosi Keller, and Martin Cadik. Estimating extreme 3d image rotations using cascaded attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2588–2598, 2024.
- [12] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description, 2018.
- [13] Bertram Drost and Slobodan Ilic. 3d object detection and localization using multimodal point pair features. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 9–16. IEEE, 2012.
- [14] Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, and Hadar Averbuch-Elor. Halo-nerf: Learning geometry-guided semantics for exploring unconstrained photo collections. In Computer Graphics Forum, page e15006. Wiley Online Library, 2024.
- [15] Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference. arXiv preprint arXiv:2305.15727, 2023.
- [16] Lily Goli, Daniel Rebain, Sara Sabour, Animesh Garg, and Andrea Tagliasacchi. nerf2nerf: Pairwise registration of neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9354–9361,
- [17] Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, and Jianwei Wan. 3d object recognition in cluttered scenes with local surface features: A survey. IEEE transactions on pattern analysis and machine intelligence, 36(11):2270–2287, 2014.
- [18] Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, Jianwei Wan, and Ngai Kwok. A comprehensive performance evaluation of 3d local feature descriptors. International Journal of Computer Vision, 116, 2015.
- [19] Gerd Häusler and D Ritter. Feature-based object recognition and localization in 3d-space, using a single video image. Computer Vision and Image Understanding, 73(1):64–81, 1999.
- [20] Itan Hezroni, Amnon Drory, Raja Giryes, and Shai Avidan. Deepbbs: Deep best buddies for point cloud registration,
- [21] Benran Hu, Junkai Huang, Yichen Liu, Yu-Wing Tai, and Chi-Keung Tang. Nerf-rpn: A general framework for object detection in nerfs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23528–23538, 2023.
- [22] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547,
- [23] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
- [24] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
- [25] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields, 2023.
- [26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
- [27] Shai Krakovsky, Gal Fiebelman, Sagie Benaim, and Hadar Averbuch-Elor. Lang3d-xl: Language embedded 3d gaussians for large-scale scenes. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.
- [28] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024.
- [29] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation, 2022.
- [30] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
- [31] Amy Lin, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023.
- [32] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement, 2021.
- [33] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023.
- [34] Jianlin Liu, Qiang Nie, Yong Liu, and Chengjie Wang. Nerf-loc: Visual localization with conditional neural radiance field. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9385–9392. IEEE, 2023.
- [35] Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3d jigsaw puzzle: Mapping large indoor spaces. In European Conference on Computer Vision, pages 1–16. Springer, 2014.
- [36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [37] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. Lens: Localization enhanced by nerf synthesis. In Conference on Robot Learning, pages 1347–1356. PMLR, 2022.
- [38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.
- [39] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Meshloc: Mesh-based visual localization, 2022.
- [40] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Visual localization using imperfect 3d models from the internet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13175–13186,
- [41] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting,
- [42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [43]
- [44] Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the feature: Learning robust camera localization from pixels to pose, 2021.
- [45] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
- [46] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
- [47] Leo Segre and Shai Avidan. Vf-nerf: Viewshed fields for rigid nerf registration, 2024.
- [48] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding, 2023.
- [49] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. 2006.
- [50] Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. In ACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022.
- [51] Chris Sweeney, Victor Fragoso, Tobias Hollerer, and Matthew Turk. Large scale sfm with the distributed camera model, 2016.
- [52] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David Mcallister, Justin Kerr, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Confe...
- [53] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. arXiv preprint arXiv:2406.11819, 2024.
- [54] Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S. M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes,
- [55] Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.
- [56] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory, 2024.
- [57] Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9773–9783,
- [58] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [59] Ruibo Wang, Song Zhang, Ping Huang, Donghai Zhang, and Wei Yan. Semantic is enough: Only semantic information for nerf reconstruction. In 2023 IEEE International Conference on Unmanned Systems (ICUS), pages 906–912. IEEE, 2023.
- [60] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024.
- [61] Yue Wang and Justin M. Solomon. Deep closest point: Learning representations for point cloud registration, 2019.
- [62] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning, 2025.
- [63] Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of babel: Combining images, language, and 3d geometry for learning multimodal vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 428–437, 2021.
- [64] Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers,
- [65] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass, 2025.
- [66] Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.
- [67] Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In European Conference on Computer Vision, pages 592–611. Springer, 2022.
- [68] Xiyu Zhang, Jiaqi Yang, Shikun Zhang, and Yanning Zhang. 3d registration with maximal cliques, 2023.
- [69] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In ICCV, 2021.
- [70] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. 2016.
- [71] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.
- [72] Additional Results and Comparisons. 7.1. Additional Quantitative Results. In addition to the averaged ∆R, ∆T reported in the main paper (Table 1), in Tables 4, 5, and 6 we report a per meta-image performance breakdown for all the initializations. From this breakdown, we observe that our method successfully registers meta-images where the baseline exhibits large ∆...
- [73] Limitations. While our method is not specifically designed for single-shot scenarios, we evaluate its reliability with fewer images per meta-image in Fig. 15 (left). We evaluate performance by randomly sub-sampling subsets of varying sizes from each meta-image, reporting the average error across five independent samples. Performance drops over very small m...
- [74] Implementation Details. 9.1. The reference model. First, we extract DINOv2 [38] dense features per rendered landmark image from Google Earth Studio. We resize each image to 1400×1400 and then use the pretrained backbone dinov2_vits14, which outputs a dense 100×100 feature map. We chose DINOv2 with an embedding size of 384. We use the DINO implementation facebookresea...
- [75] The WikiEarth Benchmark. We rendered images around each landmark using Google Earth Studio; the camera trajectories for each landmark will be published with the benchmark. The Google Earth Studio rendering UI is presented in Fig. 16. After rendering the images, we create a COLMAP using the rendered images of the landmark from Google Earth Studio. We use CO...
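The feature-map size quoted in the implementation details follows from the backbone's patch size. The short check below reproduces that arithmetic. The 1400-pixel input and the 384-dimensional embedding come from the text above, the patch size of 14 is the one implied by the ViT-S/14 backbone name, and `patch_grid` is a hypothetical helper introduced here.

```python
def patch_grid(image_size, patch_size):
    """Number of non-overlapping patches along one image side."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    return image_size // patch_size


IMAGE_SIZE = 1400  # resized input resolution, from the text
PATCH_SIZE = 14    # patch size of the ViT-S/14 backbone
EMBED_DIM = 384    # embedding size, from the text

side = patch_grid(IMAGE_SIZE, PATCH_SIZE)
feature_map_shape = (side, side, EMBED_DIM)
print(feature_map_shape)  # (100, 100, 384)
```

Choosing an input size that is a multiple of 14 keeps the patch grid exact, which is consistent with the 100×100 map reported.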