pith. machine review for the scientific record.

arxiv: 2605.07978 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 03:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-view 3D reconstruction · satellite imagery · UAV drone · ground images · 6-DoF pose estimation · feed-forward model · Cross3R · cross-view localization

The pith

A single UAV image supplies cues for 6-DoF poses and 3D structure in feed-forward reconstruction from satellite and ground views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Nadir satellite images alone provide no direct information on camera roll, pitch, or altitude, so prior cross-view methods are restricted to 3-DoF estimates that assume planar motion and zero tilt. These assumptions fail when terrain slopes or cameras are tilted. The paper introduces one UAV image as an intermediate viewpoint that reveals the missing 3D structure and supplies the needed orientation and height cues, while requiring only spatial overlap with the ground image rather than a known relative pose. Cross3R is a single-pass feed-forward model that accepts any combination of satellite tile, UAV image, and ground image and outputs the joint 3D point cloud, all 6-DoF camera poses, and the on-tile position plus yaw of the perspective cameras. The model is trained on the new CrossGeo dataset of 278K tri-view images and outperforms feed-forward baselines on that data as well as dedicated cross-view methods on KITTI without any KITTI training.
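To see concretely what the 3-DoF restriction costs, note that a planar pose (x, y, yaw) is a full 6-DoF pose with roll and pitch pinned to zero and altitude fixed to a constant. The numpy sketch below (illustrative arithmetic, not the paper's code) measures the rotation error that no planar estimate can remove once terrain tilts the camera.

```python
import numpy as np

def pose_3dof(x, y, yaw, fixed_alt=0.0):
    """Planar 3-DoF pose (x, y, yaw) lifted to a 4x4 rigid transform.
    Roll and pitch are implicitly zero and altitude is a constant:
    exactly the assumptions prior cross-view methods bake in."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = [x, y, fixed_alt]
    return T

def pose_6dof(x, y, z, roll, pitch, yaw):
    """Full 6-DoF pose (ZYX Euler): the quantity Cross3R is said to
    recover once a UAV view supplies the missing roll/pitch/altitude cues."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# A camera tilted 5 degrees by sloping terrain: no choice of (x, y, yaw)
# lets the planar model represent that tilt.
true_T = pose_6dof(10.0, 4.0, 1.6, 0.0, np.deg2rad(5.0), np.deg2rad(30.0))
planar_T = pose_3dof(10.0, 4.0, np.deg2rad(30.0), fixed_alt=1.6)
residual = true_T[:3, :3] @ planar_T[:3, :3].T
angle = np.degrees(np.arccos(np.clip((np.trace(residual) - 1) / 2, -1, 1)))
print(f"irreducible rotation error of the planar model: {angle:.1f} deg")  # 5.0 deg
```

On a 5° slope the planar parameterization is wrong by exactly those 5° regardless of how well (x, y, yaw) is fitted; this is the failure mode the intermediate UAV view is meant to repair.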

Core claim

Cross3R ingests a satellite tile together with a UAV image, a ground image, or both, and in a single forward pass recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile (x,y) position and yaw of each perspective camera.
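Read as an interface, the claim fixes an input/output contract. The following shape-only mock, with hypothetical names (cross3r_forward, CrossViewOutput) that are not from the paper, spells out what that contract implies; the abstract does not describe the architecture behind it.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class CrossViewOutput:
    points: np.ndarray   # (N, 3) joint cross-view point cloud
    poses: list          # one 4x4 6-DoF pose per input camera
    tile_xy: list        # on-tile (x, y) per perspective camera
    tile_yaw: list       # on-tile yaw per perspective camera (radians)

def cross3r_forward(satellite: np.ndarray,
                    uav: Optional[np.ndarray] = None,
                    ground: Optional[np.ndarray] = None) -> CrossViewOutput:
    """Shape-only mock of the single forward pass: a satellite tile plus
    any combination of UAV and ground images goes in; a point cloud,
    per-camera 6-DoF poses, and on-tile localization come out."""
    perspective_views = [v for v in (uav, ground) if v is not None]
    if not perspective_views:
        raise ValueError("need at least one perspective view alongside the tile")
    n_cams = 1 + len(perspective_views)            # satellite + perspective cameras
    return CrossViewOutput(
        points=np.zeros((0, 3)),                   # placeholder geometry
        poses=[np.eye(4) for _ in range(n_cams)],  # placeholder 6-DoF poses
        tile_xy=[(0.0, 0.0)] * len(perspective_views),
        tile_yaw=[0.0] * len(perspective_views),
    )

out = cross3r_forward(satellite=np.zeros((512, 512, 3)), uav=np.zeros((480, 640, 3)))
print(len(out.poses), len(out.tile_xy))  # 2 cameras total, 1 localized on the tile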

What carries the argument

The Cross3R feed-forward model, which jointly processes satellite, UAV, and ground images to estimate 3D points and full 6-DoF poses without requiring known relative pose between views.

Load-bearing premise

That one UAV image with only spatial overlap is enough to supply reliable roll, pitch, altitude, and 3D structure cues that the satellite view lacks, and that the model trained on CrossGeo generalizes to new scenes without domain-specific retraining.

What would settle it

Measure the model's estimated roll and pitch errors on a held-out set of images from sloped terrain with independent IMU ground truth; if the 6-DoF errors are no smaller than those of a 3-DoF baseline restricted to planar motion, the central claim does not hold.
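A minimal sketch of that protocol, assuming per-frame rotation matrices from the model and from IMU ground truth; the geodesic and Euler error arithmetic below is standard practice, not a procedure the paper specifies.

```python
import numpy as np

def geodesic_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_t = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def roll_pitch_err_deg(R_est, R_gt):
    """Roll/pitch components of the residual rotation (ZYX Euler convention)."""
    R = R_est @ R_gt.T
    pitch = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return abs(roll), abs(pitch)

# The planar 3-DoF baseline predicts roll = pitch = 0 by construction, so its
# error on sloped terrain equals the true tilt. The central claim survives only
# if the 6-DoF model does strictly better on the same IMU-ground-truthed frames.
def claim_holds(errs_6dof, errs_3dof):
    return np.mean(errs_6dof) < np.mean(errs_3dof)
```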

Figures

Figures reproduced from arXiv: 2605.07978 by Qiwei Wang, Xianghui Ze, Yujiao Shi, Zhongyao Tuo.

Figure 1. Cross3R ingests a satellite tile along with one or two perspective views (UAV, ground, or …)
Figure 2. CrossGeo data sources. (a) Ground views and their coarse depth from …
Figure 3. Overview of Cross3R. Satellite, UAV, and ground images are encoded and processed through …
Figure 4. Per-sample altitude redefinition. Raw satellite poses in CrossGeo are anchored at 5,726 m above ground (Section 2.1), and backpropagating through translations that large destabilizes training. We therefore re-anchor altitudes locally at dataset-preparation time. CrossGeo is collected as image pairs, each containing two ground, two UAV, and two satellite views of overlap… (a code sketch follows the figure list)
Figure 5. Predicted point clouds on three CrossGeo samples (left to right: satellite …)
Figure 6. Cross-view localization and pixel matching on four CrossGeo samples (left two: ground–…)
Figure 7. Per-cell value is metric_{sat+grd} − metric_{tri} for ground position (left, m) and ground yaw (right, °); red = the UAV helps, blue = the UAV hurts.
Figure 10. CrossGeo tri-view coverage and statistics: (a) a representative tri-view sample, (b) global …
Figure 9. Overview of the five core modalities in the CrossGeo dataset. Row 1: representative scene …
Figure 11. CrossGeo depth-acquisition and tri-view alignment pipeline: each modality produces a …
Figure 12. Qualitative ground-depth samples and the PCC quality filter.
Figure 13. Failure (top row) and success (bottom row) cases of Cross3R on CrossGeo.
Figure 14. Zero-shot cross-view localization, pixel matching, and tri-view reconstruction on KITTI …
Figure 15. Out-of-distribution cross-view localization, pixel matching, and tri-view reconstruction …
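The altitude re-anchoring described in Figure 4 reduces to one line of arithmetic: subtract a per-sample anchor from all camera altitudes, so relative geometry is untouched while translation magnitudes drop from kilometres to metres. A minimal sketch under that reading (array layout assumed, not taken from the paper):

```python
import numpy as np

def reanchor_altitudes(camera_translations: np.ndarray) -> np.ndarray:
    """Shift all camera altitudes in one sample so the lowest camera sits at
    zero. The relative geometry the loss constrains is unchanged, but the
    translation magnitudes the network must regress stay well-scaled.
    camera_translations: (num_views, 3) array of (x, y, z) positions."""
    t = camera_translations.copy()
    t[:, 2] -= t[:, 2].min()  # per-sample local anchor
    return t

# Raw sample: satellite anchored ~5,726 m above ground, UAV at ~120 m, ground at ~2 m.
raw = np.array([[0.0, 0.0, 5726.0], [30.0, -12.0, 120.0], [25.0, -8.0, 2.0]])
print(reanchor_altitudes(raw)[:, 2])  # -> [5724.  118.    0.]
```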
Original abstract

Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Cross3R, a flexible feed-forward model that ingests a satellite tile with optional UAV and/or ground images to recover, in one forward pass, a cross-view 3D point cloud, 6-DoF poses for all input cameras, and the on-tile (x,y) position plus yaw for each perspective camera. It introduces the CrossGeo dataset (278K tri-view images across 85 global scenes) for training and evaluation, claiming consistent outperformance over feed-forward 3D baselines on CrossGeo for point-cloud reconstruction, 6-DoF pose estimation, and cross-view localization, plus generalization to KITTI without KITTI-specific training.

Significance. If the empirical claims hold after detailed validation, the work offers a meaningful advance in cross-view 3D reconstruction by using an intermediate UAV view to relax planar-motion and zero-tilt assumptions, enabling full 6-DoF recovery from nadir satellite imagery. The large-scale, multi-continent CrossGeo dataset is a clear contribution. The feed-forward, input-flexible design is practically attractive for applications in localization and mapping.

major comments (2)
  1. [Abstract] Abstract: the claims of outperformance on CrossGeo and KITTI are presented without any architectural details, loss functions, training procedure, or error analysis. These omissions are load-bearing because the central contribution is an empirical demonstration of a new model on a new dataset; without them, reproducibility and the source of gains cannot be assessed.
  2. [Abstract] Abstract and introduction: the key modeling assumption that a single UAV image with only spatial overlap (no known relative pose) supplies reliable cues for roll, pitch, altitude, and 3D structure is stated but not accompanied by ablations, sensitivity analysis, or failure-case discussion. This assumption directly underpins the 6-DoF claims and generalization statements.
minor comments (1)
  1. [Abstract] Abstract: the phrasing of the KITTI generalization result (outperforms dedicated methods 'on most metrics') would be clearer if the specific metrics and the magnitude of improvement were summarized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the central modeling assumption. We address each major comment below, clarifying where the full manuscript already provides the requested details and proposing targeted revisions to improve accessibility and explicitness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claims of outperformance on CrossGeo and KITTI are presented without any architectural details, loss functions, training procedure, or error analysis. These omissions are load-bearing because the central contribution is an empirical demonstration of a new model on a new dataset; without them, reproducibility and the source of gains cannot be assessed.

    Authors: The abstract is intentionally concise as a high-level overview. The full manuscript provides the architectural details in Section 3, the loss functions and training procedure in Section 4, and error analysis together with ablation studies in Section 5. To directly address the concern about assessing reproducibility and sources of gains from the abstract, we will revise it to include a brief outline of the model components, training approach, and key evaluation metrics. revision: yes

  2. Referee: [Abstract] Abstract and introduction: the key modeling assumption that a single UAV image with only spatial overlap (no known relative pose) supplies reliable cues for roll, pitch, altitude, and 3D structure is stated but not accompanied by ablations, sensitivity analysis, or failure-case discussion. This assumption directly underpins the 6-DoF claims and generalization statements.

    Authors: The assumption is central and is supported by empirical evidence already present in the manuscript. Section 5.2 contains ablations isolating the UAV view's contribution to 6-DoF recovery, Section 5.3 provides sensitivity analysis across overlap ratios and pose variations, and Section 5.4 discusses failure cases where the UAV cue is insufficient. To make this linkage more explicit in the introduction (as requested), we will add a short paragraph summarizing these results while retaining the existing detailed analysis in the experiments section. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new feed-forward model (Cross3R) and a new tri-view dataset (CrossGeo) for joint 3D reconstruction and pose estimation from satellite/UAV/ground images. All central claims rest on training the model on CrossGeo and reporting empirical metrics on CrossGeo plus zero-shot generalization to KITTI. No equations, derivations, fitted parameters, or self-citations are presented that reduce any output to the inputs by construction. The argument is self-contained as a standard empirical ML contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard machine-learning assumptions that a sufficiently large and diverse image dataset allows a feed-forward network to learn cross-view 3D geometry; no explicit free parameters or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption A neural network trained on multi-view image pairs can infer 3D structure and 6-DoF poses from spatial overlap alone
    Implicit in the claim that the UAV image supplies the missing cues without known relative pose.

pith-pipeline@v0.9.0 · 5603 in / 1370 out tokens · 41556 ms · 2026-05-11T03:01:49.269218+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

  1. [1]

    Pearson correlation coefficient

    Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise Reduction in Speech Processing, pages 1–4. Springer, 2009

  2. [2]

    Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024

  3. [3]

    Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion

    Arda Duzceker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, and Marc Pollefeys. Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15324–15333, 2021

  4. [4]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014

  5. [5]

    Light3r-sfm: Towards feed-forward structure-from-motion

    Sven Elflein, Qunjie Zhou, and Laura Leal-Taixé. Light3r-sfm: Towards feed-forward structure-from-motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16774–16784, 2025

  6. [6]

    Multi-view stereo: A tutorial

    Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015

  7. [7]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  8. [8]

    Panovggt: Feed-forward 3d reconstruction from panoramic imagery

    Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, and Yujiao Shi. Panovggt: Feed-forward 3d reconstruction from panoramic imagery. arXiv preprint arXiv:2603.17571, 2026

  9. [9]

    Towards high-resolution large-scale multi-view stereo

    Vu Hoang Hiep, Renaud Keriven, Patrick Labatut, and Jean-Philippe Pons. Towards high-resolution large-scale multi-view stereo. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1430–1437. IEEE, 2009

  10. [10]

    Mvsanywhere: Zero-shot multi-view stereo

    Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, and Jamie Watson. Mvsanywhere: Zero-shot multi-view stereo. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11493–11504, 2025

  11. [11]

    Large scale multi-view stereopsis evaluation

    Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014

  12. [12]

    Game4loc: A uav geo-localization benchmark from game data

    Yuxiang Ji, Boyong He, Zhuoyue Tan, and Liaoni Wu. Game4loc: A uav geo-localization benchmark from game data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3913–3921, 2025

  13. [13]

    Image matching across wide baselines: From paper to practice

    Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021

  14. [14]

    Ultrra challenge 2025

    Neil Joshi, Joshua Carney, Nathanael Kuo, Homer Li, Cheng Peng, and Myron Brown. Ultrra challenge 2025, 2024. URL https://dx.doi.org/10.21227/2zs6-ht63

  15. [15]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  16. [16]

    Pidloc: Cross-view pose optimization network inspired by pid controllers

    Wooju Lee, Juhye Park, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, and Hyun Myung. Pidloc: Cross-view pose optimization network inspired by pid controllers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21981–21990, 2025

  17. [17]

    Slicematch: Geometry-guided aggregation for cross-view pose estimation

    Ted Lentsch, Zimin Xia, Holger Caesar, and Julian FP Kooij. Slicematch: Geometry-guided aggregation for cross-view pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17225–17234, 2023

  18. [18]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024

  19. [19]

    Cvd-sfm: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes

    Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, and Brendan Englot. Cvd-sfm: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10741–10748. IEEE, 2025

  20. [20]

    Bevformer: Learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020–2036, 2024

  21. [21]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  22. [22]

    Pixel-perfect structure-from-motion with featuremetric refinement

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5987–5997, 2021

  23. [23]

    Lending orientation to neural networks for cross-view geo-localization

    Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5624–5633, 2019

  24. [24]

    Slam3r: Real-time dense scene reconstruction from monocular rgb videos

    Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16651–16662, 2025

  25. [25]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  26. [26]

    Global structure-from-motion revisited

    Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision, pages 58–77. Springer, 2024

  27. [27]

    OrienterNet: Visual Localization in 2D Public Maps with Neural Matching

    Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. OrienterNet: Visual Localization in 2D Public Maps with Neural Matching. In CVPR, 2023

  28. [28]

    Snap: Self-supervised neural maps for visual positioning and semantic understanding

    Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, and Simon Lynen. Snap: Self-supervised neural maps for visual positioning and semantic understanding. Advances in Neural Information Processing Systems, 36:7697–7729, 2023

  29. [29]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016

  30. [30]

    A vote-and-verify strategy for fast spatial verification in image retrieval

    Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016

  31. [31]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017

  32. [32]

    Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image

    Yujiao Shi and Hongdong Li. Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17010–17020, 2022

  33. [33]

    Accurate 3-dof camera geo-localization via ground-to-satellite image matching

    Yujiao Shi, Xin Yu, Liu Liu, Dylan Campbell, Piotr Koniusz, and Hongdong Li. Accurate 3-dof camera geo-localization via ground-to-satellite image matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):2682–2697, 2022

  34. [34]

    Weakly-supervised camera localization by ground-to-satellite image registration

    Yujiao Shi, Hongdong Li, Akhil Perincherry, and Ankit Vora. Weakly-supervised camera localization by ground-to-satellite image registration. In European Conference on Computer Vision, pages 39–57. Springer, 2024

  35. [35]

    Learning dense flow field for highly-accurate cross-view camera localization

    Zhenbo Song, Jianfeng Lu, Yujiao Shi, et al. Learning dense flow field for highly-accurate cross-view camera localization. Advances in Neural Information Processing Systems, 36:70612–70625, 2023

  36. [36]

    Geodistill: Geometry-guided self-distillation for weakly supervised cross-view localization

    Shaowen Tong, Zimin Xia, Alexandre Alahi, Xuming He, and Yujiao Shi. Geodistill: Geometry-guided self-distillation for weakly supervised cross-view localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25357–25366, 2025

  37. [37]

    Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

    Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21674–21684, 2025

  38. [38]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024

  39. [39]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  40. [40]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  41. [41]

    Bevsplat: Resolving height ambiguity via feature-based gaussian primitives for weakly-supervised cross-view localization

    Qiwei Wang, Shaoxun Wu, and Yujiao Shi. Bevsplat: Resolving height ambiguity via feature-based gaussian primitives for weakly-supervised cross-view localization. arXiv preprint arXiv:2502.09080, 2025

  42. [42]

    View from above: Orthogonal-view aware cross-view localization

    Shan Wang, Chuong Nguyen, Jiawei Liu, Yanhao Zhang, Sundaram Muthu, Fahira Afzal Maken, Kaihao Zhang, and Hongdong Li. View from above: Orthogonal-view aware cross-view localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14843–14852, 2024

  43. [43]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  44. [44]

    Fine-grained cross-view geo-localization using a correlation-aware homography estimator

    Xiaolong Wang, Runsen Xu, Zhuofan Cui, Zeyu Wan, and Yu Zhang. Fine-grained cross-view geo-localization using a correlation-aware homography estimator. Advances in Neural Information Processing Systems, 36:5301–5319, 2023

  45. [45]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  46. [46]

    Depth anything with any prior

    Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, and Zhou Zhao. Depth anything with any prior. arXiv preprint arXiv:2505.10565, 2025

  47. [47]

    Flying co-stereo: Enabling long-range aerial dense mapping via collaborative stereo vision of dynamic-baseline

    Zhaoying Wang, Xingxing Zuo, and Wei Dong. Flying co-stereo: Enabling long-range aerial dense mapping via collaborative stereo vision of dynamic-baseline. IEEE Transactions on Robotics, 2026

  48. [48]

    Wide-area image geolocalization with aerial reference imagery

    Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015. doi: 10.1109/ICCV.2015.451

  49. [49]

    FG²: Fine-grained cross-view localization by fine-grained feature matching

    Zimin Xia and Alexandre Alahi. FG²: Fine-grained cross-view localization by fine-grained feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6362–6372, 2025

  50. [50]

    Convolutional cross-view pose estimation

    Zimin Xia, Olaf Booij, and Julian FP Kooij. Convolutional cross-view pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3813–3831, 2023

  51. [51]

    Uav-visloc: A large-scale dataset for uav visual localization

    Wenjia Xu, Yaxuan Yao, Jiaqi Cao, Zhiwei Wei, Chunbo Liu, Jiuniu Wang, and Mugen Peng. Uav-visloc: A large-scale dataset for uav visual localization. arXiv preprint arXiv:2405.11936, 2024

  52. [52]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

  53. [53]

    Articulated pose estimation with flexible mixtures-of-parts

    Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR 2011, pages 1385–1392. IEEE, 2011

  54. [54]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018

  55. [55]

    Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark

    Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, and Tao Tan. Exploring the best way for uav visual localization under low-altitude multi-view observation condition: a benchmark. arXiv preprint arXiv:2503.10692, 2025

  56. [56]

    Learning to find good correspondences

    Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018

  57. [57]

    Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion

    Qitao Zhao, Amy Lin, Jeff Tan, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6317–6326, 2025

  58. [58]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1395–1403, 2020

  59. [59]

    Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite

    Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2023. doi: 10.1109/TCSVT.2023.3249204

  60. [60]

    Vigor: Cross-view image geo-localization beyond one-to-one retrieval

    Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross-view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2021