Scene Grounding In the Wild

Hadar Averbuch-Elor; Leo Segre; Shai Avidan; Shay Shomer-Chai; Tamir Cohen

arxiv: 2603.26584 · v2 · submitted 2026-03-27 · 💻 cs.CV

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

Pith reviewed 2026-05-14 23:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene reconstruction · global alignment · Gaussian splatting · in-the-wild imagery · reference model · pose estimation · semantic features · large-scale scenes

The pith

Partial 3D reconstructions from sparse in-the-wild images can be globally aligned to a complete reference scene model derived from Google Earth renderings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that takes disconnected or misaligned partial 3D models built from real photographs and anchors each one inside a single, full-coverage reference model. The references come from dense pseudo-synthetic renderings of the entire scene, which provide complete geometry even when the input photos have almost no visual overlap. The method works by representing the reference with 3D Gaussians that carry semantic features and then solving an inverse optimization to recover the correct 6DoF pose and scale for each partial piece. A reader should care because standard reconstruction pipelines commonly produce fragmented outputs or wrongly fused geometry on large outdoor scenes, and this grounding step corrects those failures across many different starting methods.

Core claim

We represent the reference model using 3D Gaussian Splatting augmented with semantic features and formulate alignment as an inverse feature-based optimization that estimates a global 6DoF pose and scale while keeping the reference fixed. This grounds each partial reconstruction to the complete reference, producing globally consistent results even without visual overlap between input views. We also introduce the WikiEarth dataset that registers existing partial reconstructions with the pseudo-synthetic reference models.
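To make "a global 6DoF pose and scale" concrete, here is a minimal sketch of a 7-parameter similarity transform applied to one camera of a partial reconstruction. It is an illustration under stated assumptions, not the authors' implementation: the function names `make_similarity` and `ground_camera`, the 4x4 camera-to-world convention, and the example numbers are all hypothetical.

```python
import numpy as np

def make_similarity(scale, rotation, translation):
    """Pack scale s, rotation R (3x3) and translation t (3,) into [[sR, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = scale * np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def ground_camera(T, cam_to_world):
    """Map a camera-to-world pose from the partial model's frame into the reference frame.

    The camera centre is moved by the full similarity; the orientation is only
    rotated, so it stays orthonormal. Assumes a positive scale.
    """
    s = np.cbrt(np.linalg.det(T[:3, :3]))  # det(sR) = s^3 for a proper rotation R
    R = T[:3, :3] / s
    out = np.eye(4)
    out[:3, :3] = R @ cam_to_world[:3, :3]
    out[:3, 3] = T[:3, :3] @ cam_to_world[:3, 3] + T[:3, 3]
    return out

# Hypothetical example: scale 2, a 90 degree turn about z, and a shift along x.
theta = np.pi / 2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
T = make_similarity(2.0, Rz, [1.0, 0.0, 0.0])

cam = np.eye(4)
cam[:3, 3] = [1.0, 0.0, 0.0]   # camera centre in the partial model's frame
grounded = ground_camera(T, cam)
```

Because one transform is shared by every camera in a partial reconstruction, the relative geometry inside that reconstruction is preserved; only its placement and size in the reference frame change.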

What carries the argument

Augmented 3D Gaussian Splatting features used in inverse feature-based optimization to recover global 6DoF pose and scale for each partial reconstruction.

Load-bearing premise

Real-world photographs and pseudo-synthetic renderings share the same underlying scene semantics that can be captured by augmented Gaussian features despite large appearance differences.
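The premise can be illustrated with a small sketch of a per-image semantic loss that compares feature directions rather than raw pixel values. The page does not state the form of the paper's loss, so the mean cosine distance used here is an assumption, as are the name `semantic_feature_loss` and the random arrays standing in for real feature maps.

```python
import numpy as np

def semantic_feature_loss(feat_a, feat_b, eps=1e-8):
    """Mean cosine distance between two (H, W, C) feature maps (assumed loss form)."""
    a = np.asarray(feat_a, dtype=float)
    b = np.asarray(feat_b, dtype=float)
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

# Stand-in feature maps; real ones would come from a feature extractor and
# from rendering the feature-augmented reference model.
rng = np.random.default_rng(0)
photo_feat = rng.normal(size=(4, 4, 8))

loss_same_direction = semantic_feature_loss(photo_feat, 3.0 * photo_feat)
loss_opposite = semantic_feature_loss(photo_feat, -photo_feat)
```

A loss of this kind ignores feature magnitude, so it only helps if the two domains already agree on feature direction for the same scene content, which is exactly what the premise asserts.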

What would settle it

Run the alignment on the WikiEarth dataset, where ground-truth registrations are available, and check whether the estimated poses remain accurate once the semantic feature augmentation is removed from the Gaussians.
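Checking whether poses "remain accurate" needs an error measure. The sketch below uses a common convention: the geodesic angle between rotations and the Euclidean distance between camera centres. The Figure 15 caption reports errors ∆R and ∆T, but this page does not give their definitions, so treating them as these two quantities is an assumption, and both function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle, in degrees, of the relative rotation R_est^T R_gt."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error(c_est, c_gt):
    """Euclidean distance between estimated and ground-truth camera centres."""
    diff = np.asarray(c_est, dtype=float) - np.asarray(c_gt, dtype=float)
    return float(np.linalg.norm(diff))

# Hypothetical check: an estimate that is off by 10 degrees about the z axis.
angle = np.radians(10.0)
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
delta_R = rotation_error_deg(R_est, np.eye(3))
delta_T = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

Running the same two measures with and without the semantic features would give the comparison this section asks for.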

Figures

Figures reproduced from arXiv: 2603.26584 by Hadar Averbuch-Elor, Leo Segre, Shai Avidan, Shay Shomer-Chai, Tamir Cohen.

**Figure 1.** Given a partial 3D reconstruction produced by running structure from motion on Internet images capturing large-scale landmarks, …

**Figure 2.** Scene Grounding via Semantic Feature-based Robust Optimization. Given a 3DGS reference model M (left) and a set of Internet images I (right), we propose an inverse optimization scheme that predicts a global 6DoF+scale alignment T while keeping the parameters of M fixed. We obtain an initial transformation T (in red) using a traditional SfM technique. During optimization, we calculate a semantic feature los…

**Figure 3.** Challenges of aligning Internet photos to the reference model. Visualization of input Internet images (first and third columns) and views rendered from the reference model at the ground-truth locations (second and fourth columns). As illustrated above, high Lsem values (bottom row) often indicate outlier images, which our approach overcomes via a robust optimization scheme, as further detailed in Section…

**Figure 4.** The WikiEarth Benchmark. Reconstruction of four landmarks from WikiEarth. The blue frustums depict the rendered images from Google Earth Studio, and the red frustums the images from WikiScenes.

**Figure 5.** Qualitative Comparison. A visualization of the alignment results for our method compared to the three baselines. Each image shows the ground truth in the lower half and the rendered image from the reference model M after alignment in the top half. As demonstrated, our inverse optimization-based approach predicts precise transformations, even in the presence of challenging, inaccurate initializations.

**Figure 6.** Grounding Multiple Meta Images. Above we show three scenes containing two meta-images per scene (visualized in green and purple), both grounded to the scene's global reference model. We show both the COLMAP initialization and our final result. Ground truth reconstructions are provided at the bottom.

**Figure 7.** Aligning a meta-image to a reference model with π³. As illustrated above, while π³ successfully registers the Google Earth images, it struggles to correctly align the Internet images in this model; see, for instance, the ghost structure in the center of the red box on the bottom row. The top result, reconstructed from Google Earth images only, is shown for reference.

**Figure 8.** Generalization to Drone-Based Reference Models. We evaluate our method using a reference model reconstructed from drone video frames sourced from YouTube. As illustrated above, our approach significantly improves the alignment over the COLMAP baseline, which serves as our initialization.

**Figure 9.** Reconstructions of Feed-Forward Models. We visualize three reconstructions obtained by running DUSt3R [60], MASt3R [28], and VGGT [58] over images sampled from two meta-images (visualized in green and purple). As illustrated above, all methods fail to reconstruct the Milan and Lincoln Cathedrals, showing either broken or overlapping meta-images, while these cameras capture non-overlapping regions, as seen…

**Figure 10.** Aligning a meta-image to a reference model with VGGT. Meta-image cameras are visualized in purple, while Google Earth images from the reference model are in blue. As illustrated above, VGGT failed to register the Murcia meta-image (i.e., the output contains two distinct regions, one for the Internet images and one for the Google Earth images) and also failed to reconstruct the reference model in the Freib…

**Figure 11.** Drone Reference Model: Additional Examples. Results using a reference model from drone video frames are depicted above. The drone videos of the Freiburg Cathedral are taken from YouTube. As illustrated, our approach significantly improves the alignment in comparison to the COLMAP baseline, which serves as our initialization. 9. Implementation Details. 9.1. The reference model. First we extract DINOv2 [38] …

**Figure 13.** Additional Qualitative Comparison. A visualization of the alignment results for our method and the COLMAP baseline. Each image shows the ground truth in the lower half and the rendered image from the reference model M after alignment in the top half. As demonstrated, our inverse optimization-based approach predicts precise transformations, even in the presence of challenging, inaccurate initializations.

**Figure 14.** Additional Analysis of our Robust Optimization Framework. Our method uses LTS to ignore the images with loss values that are higher than the median image loss. In the figure, we show a sample of images in several bins. The images below the diagonal are the real-world Internet images, and those above are rendered from the reference model (at the end of the optimization). As illustrated by the ri…

**Figure 15.** Robustness Analysis. Average errors ∆R and ∆T across the benchmark as a function of the meta-image size and the initialization noise (rotation). The graphs indicate that the error increases with smaller meta-images, reaching a plateau at a size of approximately 6. Furthermore, the initialization graph demonstrates that the method aligns the images successfully once the noise is below a specific threshold …

**Figure 16.** Google Earth Studio UI. Screenshot of Google Earth Studio, showing the 3D model of the Geneva Cathedral. For each landmark we create a camera trajectory and render the images along the trajectory using the program. We aligned this model with images from the WikiScenes dataset, choosing only images in the exterior category for each landmark. The images are mostly not registered correctly with t…
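The Figure 14 caption says the method uses LTS to ignore images whose loss is above the median image loss. Reading LTS as least trimmed squares, that rule can be sketched as follows. The function name `trimmed_mean_loss`, the `keep_fraction` parameter, and the sample loss values are hypothetical, and how the paper integrates trimming into its optimizer is not described on this page.

```python
import numpy as np

def trimmed_mean_loss(per_image_losses, keep_fraction=0.5):
    """Average only the smallest losses; the rest are treated as outliers.

    keep_fraction=0.5 keeps roughly the images at or below the median loss.
    Returns the trimmed mean and the sorted indices of the kept images.
    """
    losses = np.asarray(per_image_losses, dtype=float)
    k = max(1, int(np.ceil(keep_fraction * losses.size)))
    keep_idx = np.argsort(losses)[:k]
    return float(losses[keep_idx].mean()), np.sort(keep_idx)

# Hypothetical per-image semantic losses; image 3 is an obvious outlier.
losses = [0.2, 0.9, 0.1, 5.0, 0.3, 0.4]
objective, kept = trimmed_mean_loss(losses)
```

In an iterative optimizer the kept set would normally be recomputed at every step, so an image rejected early can re-enter once the alignment improves.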

Original abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.
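One way to write the alignment problem described in the abstract is given below. This is a hedged reading, not the paper's own equation. M, I and T follow the Figure 2 caption, and the semantic loss is the one the Figure 3 caption calls Lsem. The remaining notation is assumed here: φ is a 2D feature extractor, R_M renders a feature map from the fixed reference model, P_i is the camera pose of image I_i in the partial reconstruction, and S is the subset of images kept by the robust scheme.

```latex
T^{*} \;=\; \arg\min_{T \in \mathrm{Sim}(3)} \;
\frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}}
L_{\mathrm{sem}}\!\left( \phi(I_i),\; \mathcal{R}_{M}\!\left( T \circ P_i \right) \right)
```

Only the seven parameters of T (rotation, translation, scale) are free in this formulation; the Gaussians in M receive no updates, matching the abstract's statement that the reference model is kept fixed.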

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable way to anchor partial 3D reconstructions to a fixed Google Earth reference via semantic-augmented Gaussians, but the domain-invariance claim rests on thin evidence.

The main takeaway is that the authors tackle disconnected partial reconstructions by matching them to a complete reference model built from Google Earth Studio renderings. They represent the reference with 3D Gaussian Splatting, add semantic features to each Gaussian, and solve for a global 6DoF pose and scale while leaving the reference fixed. They also release the WikiEarth dataset of registered partial reconstructions and pseudo-synthetic references. This setup is meant to improve global consistency even when input views have no overlap and to reduce some failure cases seen in end-to-end pipelines.

The combination of semantic augmentation and the new dataset is not in the prior work they cite, so that part is genuinely new. The optimization framing is clean and directly addresses the merging errors that plague standard SfM or learning-based methods on large scenes. The motivation for using pseudo-synthetic references for full coverage is sensible.

The soft spot is the handling of the appearance gap. The abstract states that real photos and the renders share underlying semantics that the augmented features can capture, yet it gives no description of the feature network, its training data, or any invariance mechanism. There are also no ablations on cross-domain matching accuracy or sensitivity to initialization. Without those, the reported gains on WikiEarth cannot be cleanly attributed to the semantic component rather than other factors. The stress-test concern about domain invariance holds up on the given text.

This is for people working on large-scale 3D reconstruction, urban modeling, or robotics registration pipelines. A reader who needs practical alignment fixes for low-overlap data would get concrete ideas from the framework and dataset. It deserves a serious referee because the problem is central and the approach is distinct enough to warrant checking the missing implementation details and experiments.

Referee Report

2 major / 1 minor

Summary. The paper claims that partial 3D reconstructions from unstructured in-the-wild imagery can be globally aligned to a complete reference model derived from Google Earth Studio pseudo-synthetic renderings by representing the reference with 3D Gaussian Splatting augmented by semantic features and solving an inverse feature-based optimization for 6DoF pose and scale. The approach is shown to improve alignment when initialized from classical or learning-based pipelines, mitigate end-to-end model failures, and is supported by the new WikiEarth dataset that registers partial reconstructions to the reference models.

Significance. If the central claim holds, the work would provide a practical route to consistent large-scale scene reconstruction under minimal overlap, leveraging domain-invariant semantics to connect real imagery with geospatial references. This could benefit downstream tasks such as city-scale mapping, AR/VR content creation, and change detection. The WikiEarth dataset itself would be a useful benchmark resource.

major comments (2)

[Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.
[Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.

minor comments (1)

[Abstract] The abstract and introduction use the term 'augmented Gaussian features' without an early pointer to the precise definition or equation that introduces the feature vector attached to each Gaussian.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the method and experiments.

Referee: [Method] Method section: the semantic feature augmentation of the 3D Gaussians is described only at a high level; no details are given on the feature extractor (network architecture, pre-training data, or invariance mechanism), which is load-bearing for the claim that the features produce reliable cross-domain correspondences despite the stated appearance gap between real photos and Google Earth Studio renderings.

Authors: We agree that the Method section currently describes the semantic feature augmentation at a high level. In the revised manuscript we will expand this subsection to specify the feature extractor architecture, its pre-training data, and the invariance mechanism used to support cross-domain correspondences. revision: yes
Referee: [Experiments] Experiments section: the reported improvements on WikiEarth lack ablation studies isolating the contribution of the semantic features versus initialization or optimization details, and no quantitative cross-domain matching accuracy or error analysis is presented, so the source of the gains over baselines cannot be isolated from possible artifacts.

Authors: We acknowledge that the current Experiments section does not contain ablations isolating the semantic features or quantitative cross-domain matching accuracy. In the revision we will add these studies together with error analysis to better attribute the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external optimization and stated insight

The paper formulates alignment as an inverse feature-based optimization that estimates 6DoF pose and scale while holding the reference 3D Gaussian Splatting model fixed. The key insight that real photographs and Google Earth Studio renderings share underlying scene semantics is asserted directly rather than derived from any equation or prior result within the paper. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The approach is therefore self-contained against the external WikiEarth dataset and reference models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that semantic features transfer across the real-to-pseudo-synthetic gap; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Real-world photographs and Google Earth pseudo-synthetic renderings share the same underlying scene semantics.
Stated as the key insight enabling feature-based alignment despite appearance differences.

pith-pipeline@v0.9.0 · 5546 in / 1124 out tokens · 29897 ms · 2026-05-14T23:24:54.558357+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

[1] P.J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
[2] Hana Bezalel, Dotan Ankri, Ruojin Cai, and Hadar Averbuch-Elor. Extreme rotation estimation in the wild,
[3] Ruojin Cai, Bharath Hariharan, Noah Snavely, and Hadar Averbuch-Elor. Extreme rotation estimation using dense correlation volumes, 2021.
[4] Jiahao Chang, Yinglin Xu, Yihao Li, Yuantao Chen, and Xiaoguang Han. Gaussreg: Fast 3d registration with gaussian splatting, 2024.
[5] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pages 202–221. Springer, 2020.
[6] Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning, 2021.
[7] Yu Chen and Gim Hee Lee. Dreg-nerf: Deep registration for neural radiance fields, 2023.
[8] Christopher Choy, Wei Dong, and Vladlen Koltun. Deep global registration, 2020.
[9] Andrea Cohen, Johannes L Schönberger, Pablo Speciale, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. Indoor-outdoor 3d reconstruction alignment. In European Conference on Computer Vision, pages 285–300. Springer,
[10] Ingrid Daubechies, Ronald DeVore, Massimo Fornasier, and C. Sinan Gunturk. Iteratively re-weighted least squares minimization for sparse recovery, 2008.
[11] Shay Dekel, Yosi Keller, and Martin Cadik. Estimating extreme 3d image rotations using cascaded attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2588–2598, 2024.
[12] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description, 2018.
[13] Bertram Drost and Slobodan Ilic. 3d object detection and localization using multimodal point pair features. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pages 9–16. IEEE, 2012.
[14] Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, and Hadar Averbuch-Elor. Halo-nerf: Learning geometry-guided semantics for exploring unconstrained photo collections. In Computer Graphics Forum, page e15006. Wiley Online Library, 2024.
[15] Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, and Zhangyang Wang. Pope: 6-dof promptable pose estimation of any object, in any scene, with one reference. arXiv preprint arXiv:2305.15727, 2023.
[16] Lily Goli, Daniel Rebain, Sara Sabour, Animesh Garg, and Andrea Tagliasacchi. nerf2nerf: Pairwise registration of neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9354–9361,
[17] Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, and Jianwei Wan. 3d object recognition in cluttered scenes with local surface features: A survey. IEEE transactions on pattern analysis and machine intelligence, 36(11):2270–2287, 2014.
[18] Yulan Guo, Mohammed Bennamoun, Ferdous Sohel, Min Lu, Jianwei Wan, and Ngai Kwok. A comprehensive performance evaluation of 3d local feature descriptors. International Journal of Computer Vision, 116, 2015.
[19] Gerd Häusler and D Ritter. Feature-based object recognition and localization in 3d-space, using a single video image. Computer Vision and Image Understanding, 73(1):64–81, 1999.
[20] Itan Hezroni, Amnon Drory, Raja Giryes, and Shai Avidan. Deepbbs: Deep best buddies for point cloud registration,
[21] Benran Hu, Junkai Huang, Yichen Liu, Yu-Wing Tai, and Chi-Keung Tang. Nerf-rpn: A general framework for object detection in nerfs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23528–23538, 2023.
[22] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547,
[23] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[24] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
[25] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields, 2023.
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
[27] Shai Krakovsky, Gal Fiebelman, Sagie Benaim, and Hadar Averbuch-Elor. Lang3d-xl: Language embedded 3d gaussians for large-scale scenes. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.
[28] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024.
[29] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation, 2022.
[30] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
[31] Amy Lin, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023.
[32] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement, 2021.
[33] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023.
[34] Jianlin Liu, Qiang Nie, Yong Liu, and Chengjie Wang. Nerf-loc: Visual localization with conditional neural radiance field. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9385–9392. IEEE, 2023.
[35] Ricardo Martin-Brualla, Yanling He, Bryan C Russell, and Steven M Seitz. The 3d jigsaw puzzle: Mapping large indoor spaces. In European Conference on Computer Vision, pages 1–16. Springer, 2014.
[36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[37] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. Lens: Localization enhanced by nerf synthesis. In Conference on Robot Learning, pages 1347–1356. PMLR, 2022.
[38] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.
[39] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Meshloc: Mesh-based visual localization, 2022.
[40] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. Visual localization using imperfect 3d models from the internet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13175–13186,
[41] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting,
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[43] Peter J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.
[44] Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the feature: Learning robust camera localization from pixels to pose, 2021.
[45] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[46] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[47] Leo Segre and Shai Avidan. Vf-nerf: Viewshed fields for rigid nerf registration, 2024.
[48] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding, 2023.
[49] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. 2006.
[50]

Neural 3d reconstruction in the wild

Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. InACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022. 1

[51]

Chris Sweeney, Victor Fragoso, Tobias Hollerer, and Matthew Turk. Large scale sfm with the distributed camera model, 2016.

[52]

Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David Mcallister, Justin Kerr, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Confe...

[53]

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. arXiv preprint arXiv:2406.11819, 2024.

[54]

Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S. M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes,

[55]

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.

[56]

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory, 2024.

[57]

Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9773–9783,

[58]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[59]

Ruibo Wang, Song Zhang, Ping Huang, Donghai Zhang, and Wei Yan. Semantic is enough: Only semantic information for nerf reconstruction. In 2023 IEEE International Conference on Unmanned Systems (ICUS), pages 906–912. IEEE, 2023.

[60]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024.

[61]

Yue Wang and Justin M. Solomon. Deep closest point: Learning representations for point cloud registration, 2019.

[62]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning, 2025.

[63]

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, and Noah Snavely. Towers of babel: Combining images, language, and 3d geometry for learning multimodal vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 428–437, 2021.

[64]

Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers,

[65]

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass, 2025.

[66]

Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

[67]

Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In European Conference on Computer Vision, pages 592–611. Springer, 2022.

[68]

Xiyu Zhang, Jiaqi Yang, Shikun Zhang, and Yanning Zhang. 3d registration with maximal cliques, 2023.

[69]

Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In ICCV, 2021.

[70]

Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. 2016.

[71]

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.

Additional Results and Comparisons

7.1. Additional Quantitative Results

In addition to the averaged ∆R, ∆T reported in the main paper (Table 1), Tables 4, 5, and 6 report a per-meta-image performance breakdown for all the initializations. From this breakdown, we observe that our method successfully registers meta-images where the baseline exhibits large ∆...
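The exact definitions of ∆R and ∆T are given in the main paper, not in this excerpt. As a hedged illustration only, the sketch below assumes the common convention in which ∆R is the geodesic angle between the estimated and ground-truth rotations and ∆T is the Euclidean distance between the translations. The function names are hypothetical and are not taken from the authors' code.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_est^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between two 3-vectors (assumed to share units and scale)."""
    return math.dist(t_est, t_gt)


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


delta_r = rotation_error_deg(rot_z(0.0), rot_z(30.0))
delta_t = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

Under this assumed convention, a per-meta-image breakdown is simply these two numbers computed once per registered meta-image instead of averaged over the scene.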

Limitations

While our method is not specifically designed for single-shot scenarios, we evaluate its reliability with fewer images per meta-image in Fig. 15 (left). We randomly sub-sample subsets of varying sizes from each meta-image and report the average error across five independent samples. Performance drops over very small m...
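The sub-sampling protocol described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `error_fn` is a hypothetical stand-in for running the registration on a subset of images and returning its error, and the fixed seed is an assumption added for reproducibility.

```python
import random
from statistics import mean


def subsampled_error(images, subset_size, error_fn, num_samples=5, seed=0):
    """Average error over `num_samples` random subsets of size `subset_size`."""
    rng = random.Random(seed)
    errors = []
    for _ in range(num_samples):
        # Draw a subset without replacement from the meta-image's images.
        subset = rng.sample(images, subset_size)
        errors.append(error_fn(subset))
    return mean(errors)


def error_curve(images, subset_sizes, error_fn, num_samples=5, seed=0):
    """Map each feasible subset size to its averaged error."""
    return {
        k: subsampled_error(images, k, error_fn, num_samples, seed)
        for k in subset_sizes
        if k <= len(images)  # skip sizes larger than the meta-image
    }


# Toy stand-in: error shrinks as more images are available.
toy_images = list(range(20))
curve = error_curve(toy_images, [2, 5, 10, 50], lambda s: 1.0 / len(s))
```

Averaging over several independent draws reduces the chance that a single lucky or unlucky subset dominates the reported error at a given subset size.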

Implementation Details

9.1. The reference model

First, we extract DINOv2 [38] dense features for each landmark image rendered from Google Earth Studio. We resize each image to 1400×1400 and then use the pretrained backbone dinov2_vits14, which outputs a dense 100×100 feature map. We chose the DINOv2 variant with an embedding size of 384. We use the DINO implementation facebookresea...
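The 100×100 feature-map size follows from the backbone's 14-pixel patch size: a 1400×1400 input is split into 1400 / 14 = 100 patches per side, each carrying a 384-dimensional embedding. The sketch below only reproduces this shape arithmetic; the helper names are hypothetical and no model is loaded.

```python
def patch_grid(height, width, patch_size=14):
    """Number of ViT patches along each image axis."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def dense_feature_shape(height, width, embed_dim=384, patch_size=14):
    """Shape (rows, cols, channels) of the dense feature map for one image."""
    rows, cols = patch_grid(height, width, patch_size)
    return rows, cols, embed_dim


print(dense_feature_shape(1400, 1400))  # (100, 100, 384)
```

Choosing an input size that is an exact multiple of 14 avoids any cropping or padding before the image is split into patches.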

The WikiEarth Benchmark

We rendered images around each landmark using Google Earth Studio; the camera trajectories for each landmark will be published with the benchmark. The Google Earth Studio rendering UI is presented in Fig. 16. After rendering the images, we create a COLMAP reconstruction using the rendered images of the landmark from Google Earth Studio. We use CO...
