pith. machine review for the scientific record.

arxiv: 2605.07978 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 03:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-view 3D reconstruction · satellite imagery · UAV drone · ground images · 6-DoF pose estimation · feed-forward model · Cross3R · cross-view localization

The pith

A single UAV image supplies cues for 6-DoF poses and 3D structure in feed-forward reconstruction from satellite and ground views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Nadir satellite images alone provide no direct information on camera roll, pitch, or altitude, so prior cross-view methods are restricted to 3-DoF estimates that assume planar motion and zero tilt. These assumptions fail when terrain slopes or cameras are tilted. The paper introduces one UAV image as an intermediate viewpoint that reveals the missing 3D structure and supplies the needed orientation and height cues, while requiring only spatial overlap with the ground image rather than a known relative pose. Cross3R is a single-pass feed-forward model that accepts any combination of satellite tile, UAV image, and ground image and outputs the joint 3D point cloud, all 6-DoF camera poses, and the on-tile position plus yaw of the perspective cameras. The model is trained on the new CrossGeo dataset of 278K tri-view images and outperforms feed-forward baselines on that data as well as dedicated cross-view methods on KITTI without any KITTI training.
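To see concretely what the 3-DoF restriction costs, note that a planar pose (x, y, yaw) is a full 6-DoF pose with roll and pitch pinned to zero and altitude fixed to a constant. The numpy sketch below (illustrative arithmetic, not the paper's code) measures the rotation error that no planar estimate can remove once terrain tilts the camera.

```python
import numpy as np

def pose_3dof(x, y, yaw, fixed_alt=0.0):
    """Planar 3-DoF pose (x, y, yaw) lifted to a 4x4 rigid transform.
    Roll and pitch are implicitly zero and altitude is a constant:
    exactly the assumptions prior cross-view methods bake in."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = [x, y, fixed_alt]
    return T

def pose_6dof(x, y, z, roll, pitch, yaw):
    """Full 6-DoF pose (ZYX Euler): the quantity Cross3R is said to
    recover once a UAV view supplies the missing roll/pitch/altitude cues."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# A camera tilted 5 degrees by sloping terrain: no choice of (x, y, yaw)
# lets the planar model represent that tilt.
true_T = pose_6dof(10.0, 4.0, 1.6, 0.0, np.deg2rad(5.0), np.deg2rad(30.0))
planar_T = pose_3dof(10.0, 4.0, np.deg2rad(30.0), fixed_alt=1.6)
residual = true_T[:3, :3] @ planar_T[:3, :3].T
angle = np.degrees(np.arccos(np.clip((np.trace(residual) - 1) / 2, -1, 1)))
print(f"irreducible rotation error of the planar model: {angle:.1f} deg")  # 5.0 deg
```

On a 5° slope the planar parameterization is wrong by exactly those 5° regardless of how well (x, y, yaw) is fitted; this is the failure mode the intermediate UAV view is meant to repair.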

Core claim

Cross3R ingests a satellite tile together with a UAV image, a ground image, or both, and in a single forward pass recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile (x,y) position and yaw of each perspective camera.
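Read as an interface, the claim fixes an input/output contract. The following shape-only mock, with hypothetical names (cross3r_forward, CrossViewOutput) that are not from the paper, spells out what that contract implies; the abstract does not describe the architecture behind it.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class CrossViewOutput:
    points: np.ndarray   # (N, 3) joint cross-view point cloud
    poses: list          # one 4x4 6-DoF pose per input camera
    tile_xy: list        # on-tile (x, y) per perspective camera
    tile_yaw: list       # on-tile yaw per perspective camera (radians)

def cross3r_forward(satellite: np.ndarray,
                    uav: Optional[np.ndarray] = None,
                    ground: Optional[np.ndarray] = None) -> CrossViewOutput:
    """Shape-only mock of the single forward pass: a satellite tile plus
    any combination of UAV and ground images goes in; a point cloud,
    per-camera 6-DoF poses, and on-tile localization come out."""
    perspective_views = [v for v in (uav, ground) if v is not None]
    if not perspective_views:
        raise ValueError("need at least one perspective view alongside the tile")
    n_cams = 1 + len(perspective_views)            # satellite + perspective cameras
    return CrossViewOutput(
        points=np.zeros((0, 3)),                   # placeholder geometry
        poses=[np.eye(4) for _ in range(n_cams)],  # placeholder 6-DoF poses
        tile_xy=[(0.0, 0.0)] * len(perspective_views),
        tile_yaw=[0.0] * len(perspective_views),
    )

out = cross3r_forward(satellite=np.zeros((512, 512, 3)), uav=np.zeros((480, 640, 3)))
print(len(out.poses), len(out.tile_xy))  # 2 cameras total, 1 localized on the tile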

What carries the argument

The Cross3R feed-forward model, which jointly processes satellite, UAV, and ground images to estimate 3D points and full 6-DoF poses without requiring known relative pose between views.

Load-bearing premise

That one UAV image with only spatial overlap is enough to supply reliable roll, pitch, altitude, and 3D structure cues that the satellite view lacks, and that the model trained on CrossGeo generalizes to new scenes without domain-specific retraining.

What would settle it

Measure the model's estimated roll and pitch errors on a held-out set of images from sloped terrain with independent IMU ground truth; if the 6-DoF errors are no smaller than those of a 3-DoF baseline restricted to planar motion, the central claim does not hold.
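A minimal sketch of that protocol, assuming per-frame rotation matrices from the model and from IMU ground truth; the geodesic and Euler error arithmetic below is standard practice, not a procedure the paper specifies.

```python
import numpy as np

def geodesic_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_t = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def roll_pitch_err_deg(R_est, R_gt):
    """Roll/pitch components of the residual rotation (ZYX Euler convention)."""
    R = R_est @ R_gt.T
    pitch = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return abs(roll), abs(pitch)

# The planar 3-DoF baseline predicts roll = pitch = 0 by construction, so its
# error on sloped terrain equals the true tilt. The central claim survives only
# if the 6-DoF model does strictly better on the same IMU-ground-truthed frames.
def claim_holds(errs_6dof, errs_3dof):
    return np.mean(errs_6dof) < np.mean(errs_3dof)
```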

Figures

Figures reproduced from arXiv: 2605.07978 by Qiwei Wang, Xianghui Ze, Yujiao Shi, Zhongyao Tuo.

Figure 1. Cross3R ingests a satellite tile along with one or two perspective views (UAV, ground, or …)
Figure 2. CrossGeo data sources. (a) Ground views and their coarse depth from …
Figure 3. Overview of Cross3R. Satellite, UAV, and ground images are encoded and processed through …
Figure 4. Per-sample altitude redefinition. Raw satellite poses in CrossGeo are anchored at 5,726 m above ground (Section 2.1), and backpropagating through translations that large destabilizes training. We therefore re-anchor altitudes locally at dataset-preparation time. CrossGeo is collected as image pairs, each containing two ground, two UAV, and two satellite views of overlap… (a code sketch follows the figure list)
Figure 5. Predicted point clouds on three CrossGeo samples (left to right: satellite …)
Figure 6. Cross-view localization and pixel matching on four CrossGeo samples (left two: ground–…)
Figure 7. Per-cell value is metric_{sat+grd} − metric_{tri} for ground position (left, m) and ground yaw (right, °); red = the UAV helps, blue = the UAV hurts.
Figure 10. CrossGeo tri-view coverage and statistics: (a) a representative tri-view sample, (b) global …
Figure 9. Overview of the five core modalities in the CrossGeo dataset. Row 1: representative scene …
Figure 11. CrossGeo depth-acquisition and tri-view alignment pipeline: each modality produces a …
Figure 12. Qualitative ground-depth samples and the PCC quality filter.
Figure 13. Failure (top row) and success (bottom row) cases of Cross3R on CrossGeo.
Figure 14. Zero-shot cross-view localization, pixel matching, and tri-view reconstruction on KITTI …
Figure 15. Out-of-distribution cross-view localization, pixel matching, and tri-view reconstruction …
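The altitude re-anchoring described in Figure 4 reduces to one line of arithmetic: subtract a per-sample anchor from all camera altitudes, so relative geometry is untouched while translation magnitudes drop from kilometres to metres. A minimal sketch under that reading (array layout assumed, not taken from the paper):

```python
import numpy as np

def reanchor_altitudes(camera_translations: np.ndarray) -> np.ndarray:
    """Shift all camera altitudes in one sample so the lowest camera sits at
    zero. The relative geometry the loss constrains is unchanged, but the
    translation magnitudes the network must regress stay well-scaled.
    camera_translations: (num_views, 3) array of (x, y, z) positions."""
    t = camera_translations.copy()
    t[:, 2] -= t[:, 2].min()  # per-sample local anchor
    return t

# Raw sample: satellite anchored ~5,726 m above ground, UAV at ~120 m, ground at ~2 m.
raw = np.array([[0.0, 0.0, 5726.0], [30.0, -12.0, 120.0], [25.0, -8.0, 2.0]])
print(reanchor_altitudes(raw)[:, 2])  # -> [5724.  118.    0.]
```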
Original abstract

Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Cross3R, a flexible feed-forward model that ingests a satellite tile with optional UAV and/or ground images to recover, in one forward pass, a cross-view 3D point cloud, 6-DoF poses for all input cameras, and the on-tile (x,y) position plus yaw for each perspective camera. It introduces the CrossGeo dataset (278K tri-view images across 85 global scenes) for training and evaluation, claiming consistent outperformance over feed-forward 3D baselines on CrossGeo for point-cloud reconstruction, 6-DoF pose estimation, and cross-view localization, plus generalization to KITTI without KITTI-specific training.

Significance. If the empirical claims hold after detailed validation, the work offers a meaningful advance in cross-view 3D reconstruction by using an intermediate UAV view to relax planar-motion and zero-tilt assumptions, enabling full 6-DoF recovery from nadir satellite imagery. The large-scale, multi-continent CrossGeo dataset is a clear contribution. The feed-forward, input-flexible design is practically attractive for applications in localization and mapping.

major comments (2)
  1. [Abstract] Abstract: the claims of outperformance on CrossGeo and KITTI are presented without any architectural details, loss functions, training procedure, or error analysis. These omissions are load-bearing because the central contribution is an empirical demonstration of a new model on a new dataset; without them, reproducibility and the source of gains cannot be assessed.
  2. [Abstract] Abstract and introduction: the key modeling assumption that a single UAV image with only spatial overlap (no known relative pose) supplies reliable cues for roll, pitch, altitude, and 3D structure is stated but not accompanied by ablations, sensitivity analysis, or failure-case discussion. This assumption directly underpins the 6-DoF claims and generalization statements.
minor comments (1)
  1. [Abstract] Abstract: the phrasing of the KITTI generalization result (outperforms dedicated methods 'on most metrics') would be clearer if the specific metrics and the magnitude of improvement were summarized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the central modeling assumption. We address each major comment below, clarifying where the full manuscript already provides the requested details and proposing targeted revisions to improve accessibility and explicitness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claims of outperformance on CrossGeo and KITTI are presented without any architectural details, loss functions, training procedure, or error analysis. These omissions are load-bearing because the central contribution is an empirical demonstration of a new model on a new dataset; without them, reproducibility and the source of gains cannot be assessed.

    Authors: The abstract is intentionally concise as a high-level overview. The full manuscript provides the architectural details in Section 3, the loss functions and training procedure in Section 4, and error analysis together with ablation studies in Section 5. To directly address the concern about assessing reproducibility and sources of gains from the abstract, we will revise it to include a brief outline of the model components, training approach, and key evaluation metrics. revision: yes

  2. Referee: [Abstract] Abstract and introduction: the key modeling assumption that a single UAV image with only spatial overlap (no known relative pose) supplies reliable cues for roll, pitch, altitude, and 3D structure is stated but not accompanied by ablations, sensitivity analysis, or failure-case discussion. This assumption directly underpins the 6-DoF claims and generalization statements.

    Authors: The assumption is central and is supported by empirical evidence already present in the manuscript. Section 5.2 contains ablations isolating the UAV view's contribution to 6-DoF recovery, Section 5.3 provides sensitivity analysis across overlap ratios and pose variations, and Section 5.4 discusses failure cases where the UAV cue is insufficient. To make this linkage more explicit in the introduction (as requested), we will add a short paragraph summarizing these results while retaining the existing detailed analysis in the experiments section. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new feed-forward model (Cross3R) and a new tri-view dataset (CrossGeo) for joint 3D reconstruction and pose estimation from satellite/UAV/ground images. All central claims rest on training the model on CrossGeo and reporting empirical metrics on CrossGeo plus zero-shot generalization to KITTI. No equations, derivations, fitted parameters, or self-citations are presented that reduce any output to the inputs by construction. The argument is self-contained as a standard empirical ML contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard machine-learning assumptions that a sufficiently large and diverse image dataset allows a feed-forward network to learn cross-view 3D geometry; no explicit free parameters or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption A neural network trained on multi-view image pairs can infer 3D structure and 6-DoF poses from spatial overlap alone
    Implicit in the claim that the UAV image supplies the missing cues without known relative pose.

pith-pipeline@v0.9.0 · 5603 in / 1370 out tokens · 41556 ms · 2026-05-11T03:01:49.269218+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

  1. [1]

    Pearson correlation coefficient

    Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise Reduction in Speech Processing, pages 1–4. Springer, 2009

  2. [2]

    Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024

  3. [3]

    Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion

    Arda Duzceker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, and Marc Pollefeys. Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15324–15333, 2021

  4. [4]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014

  5. [5]

    Light3r-sfm: Towards feed-forward structure-from-motion

    Sven Elflein, Qunjie Zhou, and Laura Leal-Taixé. Light3r-sfm: Towards feed-forward structure-from-motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16774–16784, 2025

  6. [6]

    Multi-view stereo: A tutorial

    Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015

  7. [7]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  8. [8]

    Panovggt: Feed-forward 3d reconstruction from panoramic imagery

    Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, and Yujiao Shi. Panovggt: Feed-forward 3d reconstruction from panoramic imagery. arXiv preprint arXiv:2603.17571, 2026

  9. [9]

    Towards high-resolution large-scale multi-view stereo

    Vu Hoang Hiep, Renaud Keriven, Patrick Labatut, and Jean-Philippe Pons. Towards high-resolution large-scale multi-view stereo. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1430–1437. IEEE, 2009

  10. [10]

    Mvsanywhere: Zero-shot multi-view stereo

    Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, and Jamie Watson. Mvsanywhere: Zero-shot multi-view stereo. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11493–11504, 2025

  11. [11]

    Large scale multi-view stereopsis evaluation

    Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014

  12. [12]

    Game4loc: A uav geo-localization benchmark from game data

    Yuxiang Ji, Boyong He, Zhuoyue Tan, and Liaoni Wu. Game4loc: A uav geo-localization benchmark from game data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3913–3921, 2025

  13. [13]

    Image matching across wide baselines: From paper to practice

    Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021

  14. [14]

    Ultrra challenge 2025

    Neil Joshi, Joshua Carney, Nathanael Kuo, Homer Li, Cheng Peng, and Myron Brown. Ultrra challenge 2025, 2024. URL https://dx.doi.org/10.21227/2zs6-ht63

  15. [15]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  16. [16]

    Pidloc: Cross-view pose optimization network inspired by pid controllers

    Wooju Lee, Juhye Park, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, and Hyun Myung. Pidloc: Cross-view pose optimization network inspired by pid controllers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21981–21990, 2025

  17. [17]

    Slicematch: Geometry-guided aggregation for cross-view pose estimation

    Ted Lentsch, Zimin Xia, Holger Caesar, and Julian FP Kooij. Slicematch: Geometry-guided aggregation for cross-view pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17225–17234, 2023

  18. [18]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024

  19. [19]

    Cvd-sfm: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes

    Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, and Brendan Englot. Cvd-sfm: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10741–10748. IEEE, 2025

  20. [20]

    Bevformer: Learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020–2036, 2024

  21. [21]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  22. [22]

    Pixel-perfect structure-from-motion with featuremetric refinement

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5987–5997, 2021

  23. [23]

    Lending orientation to neural networks for cross-view geo-localization

    Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5624–5633, 2019

  24. [24]

    Slam3r: Real-time dense scene reconstruction from monocular rgb videos

    Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16651–16662, 2025

  25. [25]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  26. [26]

    Global structure-from-motion revisited

    Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision, pages 58–77. Springer, 2024

  27. [27]

    OrienterNet: Visual Localization in 2D Public Maps with Neural Matching

    Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. OrienterNet: Visual Localization in 2D Public Maps with Neural Matching. In CVPR, 2023

  28. [28]

    Snap: Self-supervised neural maps for visual positioning and semantic understanding

    Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, and Simon Lynen. Snap: Self-supervised neural maps for visual positioning and semantic understanding. Advances in Neural Information Processing Systems, 36:7697–7729, 2023

  29. [29]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016

  30. [30]

    A vote-and-verify strategy for fast spatial verification in image retrieval

    Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016

  31. [31]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017

  32. [32]

    Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image

    Yujiao Shi and Hongdong Li. Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17010–17020, 2022

  33. [33]

    Accurate 3-dof camera geo-localization via ground-to-satellite image matching

    Yujiao Shi, Xin Yu, Liu Liu, Dylan Campbell, Piotr Koniusz, and Hongdong Li. Accurate 3-dof camera geo-localization via ground-to-satellite image matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):2682–2697, 2022

  34. [34]

    Weakly-supervised camera localization by ground-to-satellite image registration

    Yujiao Shi, Hongdong Li, Akhil Perincherry, and Ankit Vora. Weakly-supervised camera localization by ground-to-satellite image registration. In European Conference on Computer Vision, pages 39–57. Springer, 2024

  35. [35]

    Learning dense flow field for highly-accurate cross-view camera localization

    Zhenbo Song, Jianfeng Lu, Yujiao Shi, et al. Learning dense flow field for highly-accurate cross-view camera localization. Advances in Neural Information Processing Systems, 36:70612–70625, 2023

  36. [36]

    Geodistill: Geometry-guided self-distillation for weakly supervised cross-view localization

    Shaowen Tong, Zimin Xia, Alexandre Alahi, Xuming He, and Yujiao Shi. Geodistill: Geometry-guided self-distillation for weakly supervised cross-view localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25357–25366, 2025

  37. [37]

    Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

    Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21674–21684, 2025

  38. [38]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024

  39. [39]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  40. [40]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  41. [41]

    Bevsplat: Resolving height ambiguity via feature-based gaussian primitives for weakly-supervised cross-view localization

    Qiwei Wang, Shaoxun Wu, and Yujiao Shi. Bevsplat: Resolving height ambiguity via feature-based gaussian primitives for weakly-supervised cross-view localization. arXiv preprint arXiv:2502.09080, 2025

  42. [42]

    View from above: Orthogonal-view aware cross-view localization

    Shan Wang, Chuong Nguyen, Jiawei Liu, Yanhao Zhang, Sundaram Muthu, Fahira Afzal Maken, Kaihao Zhang, and Hongdong Li. View from above: Orthogonal-view aware cross-view localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14843–14852, 2024

  43. [43]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  44. [44]

    Fine-grained cross-view geo-localization using a correlation-aware homography estimator

    Xiaolong Wang, Runsen Xu, Zhuofan Cui, Zeyu Wan, and Yu Zhang. Fine-grained cross-view geo-localization using a correlation-aware homography estimator. Advances in Neural Information Processing Systems, 36:5301–5319, 2023

  45. [45]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  46. [46]

    Depth anything with any prior

    Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, and Zhou Zhao. Depth anything with any prior. arXiv preprint arXiv:2505.10565, 2025

  47. [47]

    Flying co-stereo: Enabling long-range aerial dense mapping via collaborative stereo vision of dynamic-baseline

    Zhaoying Wang, Xingxing Zuo, and Wei Dong. Flying co-stereo: Enabling long-range aerial dense mapping via collaborative stereo vision of dynamic-baseline. IEEE Transactions on Robotics, 2026

  48. [48]

    Wide-area image geolocalization with aerial reference imagery

    Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015. doi: 10.1109/ICCV.2015.451

  49. [49]

    FG²: Fine-grained cross-view localization by fine-grained feature matching

    Zimin Xia and Alexandre Alahi. FG²: Fine-grained cross-view localization by fine-grained feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6362–6372, 2025

  50. [50]

    Convolutional cross-view pose estimation

    Zimin Xia, Olaf Booij, and Julian FP Kooij. Convolutional cross-view pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3813–3831, 2023

  51. [51]

    Uav-visloc: A large-scale dataset for uav visual localization

    Wenjia Xu, Yaxuan Yao, Jiaqi Cao, Zhiwei Wei, Chunbo Liu, Jiuniu Wang, and Mugen Peng. Uav-visloc: A large-scale dataset for uav visual localization. arXiv preprint arXiv:2405.11936, 2024

  52. [52]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

  53. [53]

    Articulated pose estimation with flexible mixtures-of-parts

    Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pages 1385–1392. IEEE, 2011

  54. [54]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018

  55. [55]

    Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark

    Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, and Tao Tan. Exploring the best way for uav visual localization under low-altitude multi-view observation condition: a benchmark. arXiv preprint arXiv:2503.10692, 2025

  56. [56]

    Learning to find good correspondences

    Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018

  57. [57]

    Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion

    Qitao Zhao, Amy Lin, Jeff Tan, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6317–6326, 2025

  58. [58]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1395–1403, 2020

  59. [59]

    Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite

    Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2023. doi: 10.1109/TCSVT.2023.3249204

  60. [60]

    Vigor: Cross-view image geo-localization beyond one-to-one retrieval

    Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross-view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2021