arxiv: 2603.26584 · v2 · submitted 2026-03-27 · 💻 cs.CV

Scene Grounding In the Wild

Pith reviewed 2026-05-14 23:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene reconstruction · global alignment · Gaussian splatting · in-the-wild imagery · reference model · pose estimation · semantic features · large-scale scenes

The pith

Partial 3D reconstructions from sparse in-the-wild images can be globally aligned to a complete reference scene model derived from Google Earth renderings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that takes disconnected or misaligned partial 3D models built from real photographs and anchors each one inside a single, full-coverage reference model. The references come from dense pseudo-synthetic renderings of the entire scene, which provide complete geometry even when the input photos have almost no visual overlap. The method works by representing the reference with 3D Gaussians that carry semantic features and then solving an inverse optimization to recover the correct 6DoF pose and scale for each partial piece. A reader should care because standard reconstruction pipelines commonly produce fragmented outputs or wrongly fused geometry on large outdoor scenes, and this grounding step corrects those failures across many different starting methods.

Core claim

We represent the reference model using 3D Gaussian Splatting augmented with semantic features and formulate alignment as an inverse feature-based optimization that estimates a global 6DoF pose and scale while keeping the reference fixed. This grounds each partial reconstruction to the complete reference, producing globally consistent results even without visual overlap between input views. We also introduce the WikiEarth dataset that registers existing partial reconstructions with the pseudo-synthetic reference models.
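As a rough illustration of what a "global 6DoF pose and scale" means for one partial reconstruction, the sketch below builds a 7-parameter similarity transform and applies it to camera centres. The axis-angle parameterization, the function names, and the sample values are assumptions made for this example; the page does not say how the paper parameterizes the transform.

```python
import numpy as np

def rodrigues(rotvec):
    """Convert an axis-angle vector (radians) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def similarity_matrix(scale, rotvec, translation):
    """Build a 4x4 similarity transform x -> scale * R @ x + t."""
    T = np.eye(4)
    T[:3, :3] = scale * rodrigues(np.asarray(rotvec, dtype=float))
    T[:3, 3] = translation
    return T

def apply_similarity(T, points):
    """Apply a 4x4 similarity transform to an (N, 3) array of points."""
    points = np.asarray(points, dtype=float)
    return points @ T[:3, :3].T + T[:3, 3]

# Hypothetical camera centres of one partial reconstruction (local frame).
centres = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
# Scale by 2, rotate 90 degrees about z, then shift 10 units along x.
T = similarity_matrix(2.0, [0.0, 0.0, np.pi / 2], [10.0, 0.0, 0.0])
grounded = apply_similarity(T, centres)
print(grounded)
```

Because a single transform is shared by every camera in the partial reconstruction, only seven numbers are optimized per piece, while the reference model stays untouched.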

What carries the argument

Augmented 3D Gaussian Splatting features used in inverse feature-based optimization to recover global 6DoF pose and scale for each partial reconstruction.
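One plausible way to write this optimization down, using the symbols that appear in the Figure 2 caption (reference model M, image set I, alignment T, semantic loss L_sem), is sketched here. The feature map F_i of image i, the rendered feature map \hat{F}, and the initial camera pose P_i are notation assumed for this sketch, and the robust weighting the paper applies is omitted.

```latex
T^{*} \;=\; \arg\min_{T \in \mathrm{Sim}(3)} \;
\sum_{i \in I} \mathcal{L}_{\mathrm{sem}}\!\left( F_i,\; \hat{F}\!\left(M,\; T \circ P_i\right) \right)
```

Only T is optimized; the Gaussians of M and their attached features stay fixed, which is what makes the scheme an inverse problem rather than a reconstruction.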

Load-bearing premise

Real-world photographs and pseudo-synthetic renderings share the same underlying scene semantics that can be captured by augmented Gaussian features despite large appearance differences.

What would settle it

Apply the alignment to the WikiEarth dataset with ground-truth registrations available and check whether estimated poses remain accurate when semantic feature augmentation is removed from the Gaussians.
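A check of this kind needs a pose error measure. The Figure 15 caption reports errors ∆R and ∆T, but this page does not define them, so the sketch below assumes two common choices: the geodesic angle between rotations and the Euclidean distance between positions. The function names and sample values are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions."""
    diff = np.asarray(t_est, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(np.linalg.norm(diff))

def rot_z(degrees):
    """Rotation about the z axis, used here only to build a test case."""
    a = np.radians(degrees)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

R_gt = np.eye(3)
R_est = rot_z(10.0)  # an estimate that is off by 10 degrees
print(rotation_error_deg(R_est, R_gt))
print(translation_error([1.0, 2.0, 2.0], [0.0, 0.0, 0.0]))
```

Running the same metrics with and without the semantic features on the Gaussians would give the comparison the paragraph above asks for.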

Figures

Figures reproduced from arXiv: 2603.26584 by Hadar Averbuch-Elor, Leo Segre, Shai Avidan, Shay Shomer-Chai, Tamir Cohen.

Figure 1
Given a partial 3D reconstruction produced by running structure from motion on Internet images capturing large-scale landmarks, …
Figure 2
Scene Grounding via Semantic Feature-based Robust Optimization. Given a 3DGS reference model M (left) and a set of Internet images I (right), we propose an inverse optimization scheme that predicts a global 6DoF+scale alignment T while keeping the parameters of M fixed. We obtain an initial transformation T (in red) using a traditional SfM technique. During optimization, we calculate a semantic feature los…
Figure 3
Challenges of aligning Internet photos to the reference model. Visualization of input Internet images (first and third columns) and views rendered from the reference model at the ground-truth locations (second and fourth columns). As illustrated above, high Lsem values (bottom row) often indicate outlier images, which our approach overcomes via a robust optimization scheme, as further detailed in Section…
Figure 4
The WikiEarth Benchmark. Reconstruction of four landmarks from WikiEarth. The blue frustums depict the rendered images from Google Earth Studio, and the red frustums the images from WikiScenes. … the partial reconstructions from each meta-image into a unified whole, overcoming limitations seen in traditional SfM methods that produce disjoint or incomplete reconstructions (as illustrated in …
Figure 5
Qualitative Comparison. A visualization of the alignment results for our method compared to the three baselines. Each image shows the ground truth in the lower half and the rendered image from the reference model M after alignment in the top half. As demonstrated, our inverse optimization-based approach predicts precise transformations, even in the presence of challenging, inaccurate initializations.
Figure 6
Grounding Multiple Meta Images. Above we show three scenes containing two meta-images per scene (visualized in green and purple), both grounded to the scene’s global reference model. We show both the COLMAP initialization and our final result. Ground truth reconstructions are provided at the bottom. …marks, comparing our method and the COLMAP baseline to the ground truth. As can be observed from these visu…
Figure 7
Aligning a meta-image to a reference model with π³. As illustrated above, while π³ successfully registers the Google Earth images, it struggles to correctly align the Internet images in this model; see, for instance, the ghost structure in the center of the red box on the bottom row. The top result, reconstructed from Google Earth images only, is shown for reference.
Figure 8
Generalization to Drone-Based Reference Models. We evaluate our method using a reference model reconstructed from drone video frames sourced from YouTube. As illustrated above, our approach significantly improves the alignment over the COLMAP baseline, which serves as our initialization.
Figure 9
Reconstructions of Feed-Forward Models. We visualize three reconstructions obtained by running DUSt3R [60], MASt3R [28], and VGGT [58] over images sampled from two meta-images (visualized in green and purple). As illustrated above, all methods fail to reconstruct the Milan and Lincoln Cathedrals, showing either broken or overlapping meta-images, while these cameras capture non-overlapping regions, as seen…
Figure 10
Aligning a meta-image to a reference model with VGGT. Meta-image cameras are visualized in purple, while Google Earth images from the reference model are in blue. As illustrated above, VGGT failed to register the Murcia meta-image (i.e., the output contains two distinct regions, one for the Internet images and one for the Google Earth images) and also failed to reconstruct the reference model in the Freib…
Figure 11
Drone Reference Model: Additional Examples. Results using a reference model from drone video frames are depicted above. The drone videos of the Freiburg Cathedral are taken from YouTube. As illustrated, our approach significantly improves the alignment in comparison to the COLMAP baseline, which serves as our initialization. 9. Implementation Details. 9.1. The reference model. First we extract DINOv2 [38] …
Figure 13
Additional Qualitative Comparison. A visualization of the alignment results for our method and the COLMAP baseline. Each image shows the ground truth in the lower half and the rendered image from the reference model M after alignment in the top half. As demonstrated, our inverse optimization-based approach predicts precise transformations, even in the presence of challenging, inaccurate initializations.
Figure 14
Additional Analysis of our Robust Optimization Framework. Our method uses LTS to ignore the images with loss values that are higher than the median image loss. In the figure, we show a sample of images in several bins. The images below the diagonal are the real-world Internet images, and those above are the rendered images from the reference model (rendered at the end of the optimization). As illustrated by the ri…
Figure 15
Robustness Analysis. Average errors ∆R and ∆T across the benchmark as a function of the meta-image size and the initialization noise (rotation). The graphs indicate that the error increases with smaller meta-images, reaching a plateau at a size of approximately 6. Furthermore, the initialization graph demonstrates that the method aligns the images successfully once the noise is below a specific threshold…
Figure 16
Google Earth Studio UI. Screenshot of Google Earth Studio, showing the 3D model of the Geneva Cathedral. For each landmark we created a camera trajectory and rendered the images on the trajectory using the program nore watermarks”. We aligned this model images from the WikiScenes dataset, we chose only images in the exterior category for each landmark. The images are mostly not registered correctly with t…
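The Figure 14 caption describes a trimming rule: images whose loss exceeds the median image loss are ignored. A minimal sketch of that rule follows, assuming LTS stands for least trimmed squares and that the kept losses are averaged; the function name, the averaging, and the sample values are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def trimmed_loss(per_image_losses):
    """Average only the losses at or below the median; return value and mask."""
    losses = np.asarray(per_image_losses, dtype=float)
    keep = losses <= np.median(losses)  # images above the median are ignored
    return float(losses[keep].mean()), keep

# Hypothetical per-image semantic losses; the fourth image is an outlier.
losses = [0.2, 0.3, 0.25, 5.0, 0.28]
value, keep = trimmed_loss(losses)
print(value, keep)
```

In an iterative optimizer the mask would presumably be recomputed at every step, so an image that starts as an outlier can re-enter once the alignment improves.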
Original abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that partial 3D reconstructions from unstructured in-the-wild imagery can be globally aligned to a complete reference model derived from Google Earth Studio pseudo-synthetic renderings by representing the reference with 3D Gaussian Splatting augmented by semantic features and solving an inverse feature-based optimization for 6DoF pose and scale. The approach is shown to improve alignment when initialized from classical or learning-based pipelines, mitigate end-to-end model failures, and is supported by the new WikiEarth dataset that registers partial reconstructions to the reference models.

Significance. If the central claim holds, the work would provide a practical route to consistent large-scale scene reconstruction under minimal overlap, leveraging domain-invariant semantics to connect real imagery with geospatial references. This could benefit downstream tasks such as city-scale mapping, AR/VR content creation, and change detection. The WikiEarth dataset itself would be a useful benchmark resource.

major comments (2)
  1. [Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.
  2. [Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.
minor comments (1)
  1. [Abstract] The abstract and introduction use the term 'augmented Gaussian features' without an early pointer to the precise definition or equation that introduces the feature vector attached to each Gaussian.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the method and experiments.

Point-by-point responses
  1. Referee: [Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.

    Authors: We agree that the Method section currently describes the semantic feature augmentation at a high level. In the revised manuscript we will expand this subsection to specify the feature extractor architecture, its pre-training data, and the invariance mechanism used to support cross-domain correspondences. revision: yes

  2. Referee: [Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.

    Authors: We acknowledge that the current Experiments section does not contain ablations isolating the semantic features or quantitative cross-domain matching accuracy. In the revision we will add these studies together with error analysis to better attribute the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external optimization and stated insight

full rationale

The paper formulates alignment as an inverse feature-based optimization that estimates 6DoF pose and scale while holding the reference 3D Gaussian Splatting model fixed. The key insight that real photographs and Google Earth Studio renderings share underlying scene semantics is asserted directly rather than derived from any equation or prior result within the paper. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The approach is therefore self-contained against the external WikiEarth dataset and reference models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that semantic features transfer across the real-to-pseudo-synthetic gap; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Real-world photographs and Google Earth pseudo-synthetic renderings share the same underlying scene semantics
    Stated as the key insight enabling feature-based alignment despite appearance differences.

pith-pipeline@v0.9.0 · 5546 in / 1124 out tokens · 29897 ms · 2026-05-14T23:24:54.558357+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

  1. [1]

    A method for registration of 3-d shapes

    P.J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992. 3

  2. [2]

    Extreme rotation estimation in the wild,

    Hana Bezalel, Dotan Ankri, Ruojin Cai, and Hadar Averbuch-Elor. Extreme rotation estimation in the wild,

  3. [3]

    Extreme rotation estimation using dense correlation volumes, 2021

    Ruojin Cai, Bharath Hariharan, Noah Snavely, and Hadar Averbuch-Elor. Extreme rotation estimation using dense correlation volumes, 2021. 2

  4. [4]

    Gaussreg: Fast 3d registration with gaussian splatting, 2024

    Jiahao Chang, Yinglin Xu, Yihao Li, Yuantao Chen, and Xiaoguang Han. Gaussreg: Fast 3d registration with gaussian splatting, 2024. 3

  5. [5]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pages 202–221. Springer, 2020. 3

  6. [6]

    Wide-baseline relative camera pose estimation with directional learning, 2021

    Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning, 2021. 2

  7. [7]

    Dreg-nerf: Deep registration for neural radiance fields, 2023

    Yu Chen and Gim Hee Lee. Dreg-nerf: Deep registration for neural radiance fields, 2023. 3

  8. [8]

    Deep global registration, 2020

    Christopher Choy, Wei Dong, and Vladlen Koltun. Deep global registration, 2020. 3

  9. [9]

    Indoor-outdoor 3d reconstruction alignment

    Andrea Cohen, Johannes L Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-outdoor 3d reconstruction alignment. In European Conference on Computer Vision, pages 285–300. Springer,

  10. [10]

    Iteratively re-weighted least squares minimization for sparse recovery, 2008

    Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan Gunturk. Iteratively re-weighted least squares minimization for sparse recovery, 2008. 8

  11. [11]

    Estimating extreme 3d image rotations using cascaded attention

    Shay Dekel, Yosi Keller, and Martin Cadik. Estimating extreme 3d image rotations using cascaded attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2588–2598, 2024. 2

  12. [12]

    Superpoint: Self-supervised interest point detection and description, 2018

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description, 2018. 4, 5

  13. [13]

    3d object detection and localization using multimodal point pair features

    Bertram Drost and Slobodan Ilic. 3d object detection and localization using multimodal point pair features. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 9–16. IEEE, 2012. 3

  14. [14]

    Halo-nerf: Learning geometry-guided semantics for exploring unconstrained photo collections

    Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, and Hadar Averbuch-Elor. Halo-nerf: Learning geometry-guided semantics for exploring unconstrained photo collections. In Computer Graphics Forum, page e15006. Wiley Online Library, 2024. 4

  15. [15]

    Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference

    Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference. arXiv preprint arXiv:2305.15727, 2023. 2

  16. [16]

    nerf2nerf: Pairwise registration of neural radiance fields

    Lily Goli, Daniel Rebain, Sara Sabour, Animesh Garg, and Andrea Tagliasacchi. nerf2nerf: Pairwise registration of neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9354–9361,

  17. [17]

    3d object recognition in cluttered scenes with local surface features: A survey. IEEE transactions on pattern analysis and machine intelligence, 36(11):2270–2287, 2014

    Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, and Jianwei Wan. 3d object recognition in cluttered scenes with local surface features: A survey. IEEE transactions on pattern analysis and machine intelligence, 36(11):2270–2287, 2014. 3

  18. [18]

    A comprehensive performance evaluation of 3d local feature descriptors. International Journal of Computer Vision, 116, 2015

    Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, Jianwei Wan, and Ngai Kwok. A comprehensive performance evaluation of 3d local feature descriptors. International Journal of Computer Vision, 116, 2015. 3

  19. [19]

    Feature-based object recognition and localization in 3d-space, using a single video image. Computer Vision and Image Understanding, 73(1):64–81, 1999

    Gerd Häusler and D Ritter. Feature-based object recognition and localization in 3d-space, using a single video image. Computer Vision and Image Understanding, 73(1):64–81, 1999. 3

  20. [20]

    Deepbbs: Deep best buddies for point cloud registration,

    Itan Hezroni, Amnon Drory, Raja Giryes, and Shai Avidan. Deepbbs: Deep best buddies for point cloud registration,

  21. [21]

    Nerf-rpn: A general framework for object detection in nerfs

    Benran Hu, Junkai Huang, Yichen Liu, Yu-Wing Tai, and Chi-Keung Tang. Nerf-rpn: A general framework for object detection in nerfs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23528–23538, 2023. 3

  22. [22]

    Image matching across wide baselines: From paper to practice

    Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547,

  23. [23]

    3d gaussian splatting for real-time radiance field rendering. ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  24. [24]

    3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023. 4

  25. [25]

    Lerf: Language embedded radiance fields, 2023

    Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields, 2023. 3

  26. [26]

    Segment anything, 2023

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. 3

  27. [27]

    Lang3d-xl: Language embedded 3d gaussians for large-scale scenes

    Shai Krakovsky, Gal Fiebelman, Sagie Benaim, and Hadar Averbuch-Elor. Lang3d-xl: Language embedded 3d gaussians for large-scale scenes. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025. 4

  28. [28]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. 1, 2, 3, 5, 6

  29. [29]

    Language-driven semantic segmentation, 2022

    Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation, 2022. 3, 7, 8

  30. [30]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018. 5

  31. [31]

    Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023

    Amy Lin, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023. 2

  32. [32]

    Pixel-perfect structure-from-motion with featuremetric refinement, 2021

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement, 2021. 4

  33. [33]

    Lightglue: Local feature matching at light speed, 2023

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023. 4, 5

  34. [34]

    Nerf-loc: Visual localization with conditional neural radiance field

    Jianlin Liu, Qiang Nie, Yong Liu, and Chengjie Wang. Nerf-loc: Visual localization with conditional neural radiance field. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9385–9392. IEEE, 2023. 3

  35. [35]

    The 3d jigsaw puzzle: Mapping large indoor spaces

    Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3d jigsaw puzzle: Mapping large indoor spaces. In European Conference on Computer Vision, pages 1–16. Springer, 2014. 2

  36. [36]

    Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 3

  37. [37]

    Lens: Localization enhanced by nerf synthesis

    Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. Lens: Localization enhanced by nerf synthesis. In Conference on Robot Learning, pages 1347–1356. PMLR, 2022. 3

  38. [38]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

  39. [39]

    Meshloc: Mesh-based visual localization, 2022

    Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Meshloc: Mesh-based visual localization, 2022. 3

  40. [40]

    Visual localization using imperfect 3d models from the internet

    Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Visual localization using imperfect 3d models from the internet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13175–13186,

  41. [41]

    Langsplat: 3d language gaussian splatting,

    Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting,

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3

  43. [43]

    Least median of squares regression

    Peter J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388): 871–880, 1984. 4, 3

  44. [44]

    Back to the feature: Learning robust camera localization from pixels to pose, 2021

    Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the feature: Learning robust camera localization from pixels to pose, 2021. 3

  45. [45]

    Structure-from-motion revisited

    Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 1

  46. [46]

    Structure-from-motion revisited

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016. 4, 5

  47. [47]

    Vf-nerf: Viewshed fields for rigid nerf registration, 2024

    Leo Segre and Shai Avidan. Vf-nerf: Viewshed fields for rigid nerf registration, 2024. 2, 3, 4, 7

  48. [48]

    Language embedded 3d gaussians for open-vocabulary scene understanding, 2023

    Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding, 2023. 3

  49. [49]

    Photo tourism: Exploring photo collections in 3D

    Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. 2006. 1

  50. [50]

    Neural 3d reconstruction in the wild

    Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. In ACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022. 1

  51. [51]

    Large scale sfm with the distributed camera model, 2016

    Chris Sweeney, Victor Fragoso, Tobias Hollerer, and Matthew Turk. Large scale sfm with the distributed camera model, 2016. 4, 5

  52. [52]

    Nerfstudio: A modular framework for neural radiance field development

    Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David Mcallister, Justin Kerr, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Confe...

  53. [53]

    Megascenes: Scene-level view synthesis at scale

    Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. arXiv preprint arXiv:2406.11819, 2024. 5

  54. [54]

    Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S. M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes,

  55. [55]

    Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

    Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025. 2

  56. [56]

    3d reconstruction with spatial memory, 2024

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory, 2024. 3

  57. [57]

    Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment

    Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9773–9783,

  58. [58]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 3, 5, 6

  59. [59]

    Semantic is enough: Only semantic information for nerf reconstruction

    Ruibo Wang, Song Zhang, Ping Huang, Donghai Zhang, and Wei Yan. Semantic is enough: Only semantic information for nerf reconstruction. In 2023 IEEE International Conference on Unmanned Systems (ICUS), pages 906–912. IEEE, 2023. 3

  60. [60]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. 1, 2, 3, 5, 6

  61. [61]

    Yue Wang and Justin M. Solomon. Deep closest point: Learning representations for point cloud registration, 2019. 3

  62. [62]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning, 2025. 2, 5, 6

  63. [63]

    Towers of babel: Combining images, language, and 3d geometry for learning multimodal vision

    Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of babel: Combining images, language, and 3d geometry for learning multimodal vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 428–437, 2021. 2, 5, 7

  64. [64]

    Denoising vision transformers

    Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers,

  65. [65]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass, 2025. 3

  66. [66]

    iNeRF: Inverting neural radiance fields for pose estimation

    Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. 2, 3, 4, 6, 7

  67. [67]

    Relpose: Predicting probabilistic relative rotation for single objects in the wild

    Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In European Conference on Computer Vision, pages 592–611. Springer, 2022. 2

  68. [68]

    3d registration with maximal cliques, 2023

    Xiyu Zhang, Jiaqi Yang, Shikun Zhang, and Yanning Zhang. 3d registration with maximal cliques, 2023. 3

  69. [69]

    Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In ICCV, 2021. 3

  70. [70]

    Fast global registration

    Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. 2016. 3

  71. [71]

    Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

    Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 3, 4, 5

  72. [72]

    Additional Results and Comparisons. 7.1. Additional Quantitative Results. In addition to the averaged ∆R and ∆T reported in the main paper (Table 1), Tables 4, 5, and 6 report a per-meta-image performance breakdown for all initializations. From this breakdown, we observe that our method successfully registers meta-images where the baseline exhibits large ∆...

  73. [73]

    Limitations. While our method is not specifically designed for single-shot scenarios, we evaluate its reliability with fewer images per meta-image in Fig. 15 (left). We randomly sub-sample subsets of varying sizes from each meta-image and report the average error across five independent samples. Performance drops over very small m...
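The sub-sampling protocol described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function `subsample_error`, the seed handling, and the placeholder metric `toy_error` are all hypothetical names introduced here, and the real registration error (∆R, ∆T) would replace the toy metric.

```python
import random
from statistics import mean


def subsample_error(images, subset_size, error_fn, num_trials=5, seed=0):
    """Average error_fn over num_trials random subsets of a meta-image.

    images      : list of image identifiers belonging to one meta-image
    subset_size : number of images to keep in each random subset
    error_fn    : callable mapping a subset to a scalar error (hypothetical)
    num_trials  : independent samples to average (five in the text)
    """
    rng = random.Random(seed)              # fixed seed for repeatability
    k = min(subset_size, len(images))      # never ask for more than exist
    errors = [error_fn(rng.sample(images, k)) for _ in range(num_trials)]
    return mean(errors)


def toy_error(subset):
    """Placeholder metric (assumption): error shrinks as the subset grows."""
    return 1.0 / len(subset)


# Hypothetical meta-image with 40 images, evaluated at several subset sizes.
meta_image = [f"img_{i:03d}.jpg" for i in range(40)]
curve = {k: subsample_error(meta_image, k, toy_error) for k in (1, 5, 10, 20)}
```

Plotting `curve` against subset size would give the kind of reliability trend the text attributes to Fig. 15 (left), with the caveat that the numbers here come from the toy metric only.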

  74. [74]

    Implementation Details. 9.1. The reference model. First, we extract dense DINOv2 [38] features for each rendered landmark image from Google Earth Studio. We resize each image to 1400×1400 and then use the pretrained backbone dinov2_vits14, which outputs a dense 100×100 feature map. We chose DINOv2 with an embedding size of 384. We use the DINO implementation facebookresea...
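The stated sizes are consistent with a patch size of 14, which is assumed here from the "s14" suffix of the backbone name: 1400 / 14 = 100 patches per side, each carrying a 384-dimensional embedding. The sketch below only checks that arithmetic and reshapes placeholder tokens into a dense map. It does not load the model, and the helper names and the row-major token order are assumptions made for illustration.

```python
PATCH_SIZE = 14   # assumption: implied by the "vits14" backbone name
EMBED_DIM = 384   # embedding size stated in the text


def patch_grid(image_size, patch_size=PATCH_SIZE):
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def tokens_to_feature_map(tokens, grid):
    """Reshape a flat list of patch tokens into a grid x grid map.

    Assumes row-major token order (hypothetical; check the model's output).
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid * grid patch tokens")
    return [tokens[r * grid:(r + 1) * grid] for r in range(grid)]


GRID = patch_grid(1400)                      # 1400 / 14 = 100

# Placeholder tokens standing in for real DINOv2 patch embeddings.
placeholder = [0.0] * EMBED_DIM
tokens = [placeholder] * (GRID * GRID)       # 10,000 tokens in total
feature_map = tokens_to_feature_map(tokens, GRID)
```

The resulting `feature_map` has shape 100 × 100 × 384, matching the dense feature map size reported in the text.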

  75. [75]

    The WikiEarth Benchmark. We rendered images around each landmark using Google Earth Studio; the camera trajectories for each landmark will be published with the benchmark. The Google Earth Studio rendering UI is shown in Fig. 16. After rendering the images, we create a COLMAP reconstruction using the rendered images of the landmark from Google Earth Studio. We use CO...