Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams
Pith reviewed 2026-05-16 15:12 UTC · model grok-4.3
The pith
Cross-view dual-encoder networks trained on synthetic image pairs and foundation-model semantic segmentation enable accurate rover localization in aerial maps using particle filters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dual-encoder cross-view networks that ingest ground-view images and predict their location inside an aerial map can be made to generalize from synthetic training data to real planetary images when the networks are supervised with semantic segmentation masks produced by vision foundation models; particle-filter state estimation on the network outputs then yields accurate position estimates along both simple and complex rover trajectories.
What carries the argument
Cross-view-localising dual-encoder deep neural networks that map ground-view images to positions inside an aerial map, guided by semantic segmentation from vision foundation models and trained on high-volume synthetic ground-aerial pairs.
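To make the matching step concrete, here is a minimal sketch of how a dual-encoder output could be turned into a location distribution over an aerial map. It is an illustration under stated assumptions, not the paper's implementation: the embeddings are hand-written stand-ins for encoder outputs, and the names `match_distribution` and `temperature`, along with the cosine-similarity-plus-softmax scoring, are hypothetical choices.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a, b))

def match_distribution(ground_emb, aerial_embs, temperature=0.1):
    """Softmax over similarities between one ground embedding and each aerial cell.

    Assumption: the aerial map is split into cells, each with its own embedding.
    """
    sims = [cosine_similarity(ground_emb, e) for e in aerial_embs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-D embeddings standing in for the two encoders' outputs.
ground = [0.9, 0.1, 0.0]
aerial_cells = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = match_distribution(ground, aerial_cells)
best_cell = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the ground embedding is closest to the first aerial cell, so that cell receives the largest probability. A distribution of this kind is one plausible input to the particle filter described in the abstract.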
If this is right
- Accurate position estimates are obtained over both simple and complex trajectories when particle filters combine successive cross-view network outputs.
- Localization performance remains usable even when only monocular ground-view RGB images with limited field of view are available.
- The same synthetic-plus-foundation-model pipeline produces usable cross-view matches on the new planetary-analogue real dataset.
- Ground-aerial robotic teams can therefore perform local map-based localization without requiring large quantities of labeled real flight data.
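The first consequence above depends on a particle filter fusing successive network outputs. The following is a minimal 2-D sketch of that idea, assuming the cross-view network yields a noisy position measurement per frame and that odometry-like motion increments are available. Both assumptions, and every name and noise value used here, are hypothetical rather than taken from the paper.

```python
import math
import random

def gaussian_likelihood(particle, measurement, sigma):
    """Unnormalised Gaussian likelihood of a particle given a 2-D measurement."""
    dx = particle[0] - measurement[0]
    dy = particle[1] - measurement[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def particle_filter_step(particles, motion, measurement, rng,
                         motion_noise=0.05, meas_sigma=0.5):
    """One predict / update / resample cycle."""
    # Predict: shift every particle by the motion increment plus Gaussian noise.
    moved = [(x + motion[0] + rng.gauss(0, motion_noise),
              y + motion[1] + rng.gauss(0, motion_noise)) for x, y in particles]
    # Update: weight each particle by how well it explains the measurement.
    weights = [gaussian_likelihood(p, measurement, meas_sigma) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # fall back to uniform weights
    # Resample with replacement in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))

def estimate(particles):
    """Mean position of the particle set."""
    n = len(particles)
    return (sum(p[0] for p in particles) / n, sum(p[1] for p in particles) / n)

rng = random.Random(0)
# Start with no prior knowledge: particles spread over a 10 x 10 map.
particles = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(2000)]
true_pos = (2.0, 3.0)
for _ in range(20):
    motion = (0.2, 0.1)
    true_pos = (true_pos[0] + motion[0], true_pos[1] + motion[1])
    # Simulated stand-in for a cross-view network output: truth plus noise.
    measurement = (true_pos[0] + rng.gauss(0, 0.3),
                   true_pos[1] + rng.gauss(0, 0.3))
    particles = particle_filter_step(particles, motion, measurement, rng)

est = estimate(particles)
error = math.hypot(est[0] - true_pos[0], est[1] - true_pos[1])
```

The filter averages over many noisy per-frame measurements, which is why a sequence of ground-view images can give a tighter estimate than any single frame.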
Where Pith is reading between the lines
- The method could be tested on orbital imagery of other bodies by swapping the aerial map source while keeping the same synthetic-training recipe.
- If the semantic segmentation step is replaced by a different foundation model, the localization error on real trajectories would indicate how sensitive the pipeline is to the choice of segmentation prior.
- Extending the particle filter to also estimate heading or velocity from the same ground-view sequence would turn the current position-only estimator into a full pose tracker without extra sensors.
Load-bearing premise
Semantic segmentation masks from vision foundation models plus large synthetic training sets are enough to make the learned cross-view matching reliable on real planetary images.
What would settle it
On the contributed real-world rover dataset, replace the synthetic-trained cross-view network with a network trained only on real images, then check whether the particle-filter position error rises above the error reported for the synthetic-trained network.
Original abstract
Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a cross-view localization method for planetary rovers that uses dual-encoder neural networks trained on synthetic image pairs. Vision foundation models provide semantic segmentation to help bridge the domain gap to real planetary images. New real-world rover trajectory datasets with ground-truth labels and corresponding high-volume synthetic datasets are contributed. Particle filters are applied for state estimation, with the claim that this yields accurate position estimates on both simple and complex trajectories from sequences of monocular ground-view RGB images.
Significance. If the domain-transfer claims hold, the approach would be valuable for planetary robotics, where labeled real data is scarce, by enabling localization against aerial maps without extensive real-world training. The release of new real and synthetic cross-view datasets is a clear positive contribution that could support future work. The combination of VFMs, synthetic data, and particle filtering is a reasonable empirical strategy for this setting.
major comments (2)
- [Abstract] Abstract: the central claim of 'accurate position estimation' on new real datasets is asserted without any reported quantitative metrics (e.g., mean position error, success rate, or error distributions), baselines, or ablation results, so the soundness of the generalization claim cannot be evaluated from the manuscript text.
- [Method/Experiments] Method and Experiments sections: the key assumption that VFM-derived semantic segmentation closes the domain gap for planetary surfaces (lighting, texture, and scale variations absent from typical VFM pre-training) is not supported by any quantitative evidence such as real-vs-synthetic error deltas, segmentation-quality ablations, or comparisons of localization performance with and without the VFM component.
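For reference, the metrics named in the first major comment can be computed as in the sketch below. The trajectory values are made-up illustrations, not results from the paper, and the 1.0-unit success threshold is an arbitrary assumption.

```python
import math

def position_errors(estimates, ground_truth):
    """Euclidean distance between each estimated and true 2-D position."""
    return [math.hypot(ex - gx, ey - gy)
            for (ex, ey), (gx, gy) in zip(estimates, ground_truth)]

def mean_position_error(errors):
    """Average of the per-frame position errors."""
    return sum(errors) / len(errors)

def success_rate(errors, threshold):
    """Fraction of frames whose error is at or below the threshold."""
    return sum(1 for e in errors if e <= threshold) / len(errors)

# Illustrative values only; not taken from the paper.
estimates = [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0)]
truth = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0)]

errors = position_errors(estimates, truth)
mpe = mean_position_error(errors)
sr = success_rate(errors, threshold=1.0)
```

Reporting the full `errors` list (or its distribution) alongside the two summary numbers would address the referee's request for error distributions as well.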
minor comments (2)
- [Abstract] Abstract: consider adding one or two concrete performance numbers or explicit references to result tables/figures to make the claims more informative.
- [Method] Notation: clarify whether the dual-encoder architecture is symmetric or asymmetric and how the cross-view matching loss is formulated.
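As an illustration of what a stated loss formulation could look like, the sketch below implements a symmetric InfoNCE-style contrastive loss over matched ground-aerial embedding pairs. This is purely an assumed example of a common cross-view matching loss. The material summarised here does not say which loss the paper uses, and the function names and the `temperature` value are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_diagonal_nll(sim, temperature):
    """Mean negative log-softmax of the diagonal entry in each row of `sim`."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)

def symmetric_info_nce(ground_embs, aerial_embs, temperature=0.1):
    """Average of the ground-to-aerial and aerial-to-ground contrastive losses.

    Assumption: row i of each input is a matched pair, and the embeddings
    are already unit-normalised.
    """
    sim = [[dot(g, a) for a in aerial_embs] for g in ground_embs]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (mean_diagonal_nll(sim, temperature)
                  + mean_diagonal_nll(sim_t, temperature))

# Correctly paired embeddings give a low loss; swapped pairs give a high one.
matched = symmetric_info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = symmetric_info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

A loss of this form treats both encoders identically, so it works whether the architecture is symmetric (shared weights) or asymmetric (separate weights per view), which is the distinction the referee asks the authors to clarify.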
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract and experimental evidence require strengthening with explicit quantitative results to support the claims. We will revise the manuscript accordingly and address each point below.
- Referee: [Abstract] Abstract: the central claim of 'accurate position estimation' on new real datasets is asserted without any reported quantitative metrics (e.g., mean position error, success rate, or error distributions), baselines, or ablation results, so the soundness of the generalization claim cannot be evaluated from the manuscript text.
  Authors: We agree that the abstract should include quantitative metrics to substantiate the claim of accurate position estimation. The experiments section reports mean position errors, success rates, error distributions, and comparisons against baselines on the new real planetary analogue trajectories. We will revise the abstract to incorporate key quantitative results (e.g., average localization error and success rates) along with brief references to the baselines and ablations, enabling direct evaluation of the generalization claims.
  Revision: yes
- Referee: [Method/Experiments] Method and Experiments sections: the key assumption that VFM-derived semantic segmentation closes the domain gap for planetary surfaces (lighting, texture, and scale variations absent from typical VFM pre-training) is not supported by any quantitative evidence such as real-vs-synthetic error deltas, segmentation-quality ablations, or comparisons of localization performance with and without the VFM component.
  Authors: The current manuscript includes direct comparisons of localization performance with and without the VFM semantic segmentation component, showing measurable improvements in domain transfer on both synthetic and real data. We acknowledge, however, that additional quantitative support such as segmentation-quality metrics on real images and explicit real-vs-synthetic error deltas would further strengthen the argument. We will add these ablations and metrics to the revised Experiments section.
  Revision: yes
Circularity Check
No circularity: empirical method with new datasets and no self-referential derivations
The paper presents an empirical pipeline for cross-view localization: dual-encoder networks trained on synthetic image pairs (augmented by VFM semantic segmentation) and evaluated via particle filters on a newly contributed real-world planetary analogue dataset. No equations, derivations, uniqueness theorems, or predictions are claimed that reduce to fitted parameters or self-citations by construction. All load-bearing elements (domain-gap bridging, trajectory accuracy) rest on experimental results from the contributed datasets rather than definitional equivalences or self-citation chains. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision foundation models produce semantic segmentations that generalize to real planetary images after training on synthetic data
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images... Using particle filters for state estimation with the cross-view networks allows accurate position estimation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.