
arxiv: 2601.09107 · v1 · submitted 2026-01-14 · 💻 cs.CV · cs.RO

Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams

Pith reviewed 2026-05-16 15:12 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: cross-view localization, planetary robotics, vision foundation models, domain generalization, synthetic data, particle filters, rover localization, aerial mapping

The pith

Cross-view dual-encoder networks trained on synthetic image pairs and foundation-model semantic segmentation enable accurate rover localization in aerial maps using particle filters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that rovers can determine their position inside a local aerial map from sequences of limited-field-of-view ground-view RGB images. It does this by training dual-encoder networks on large synthetic ground-aerial pairs and using vision foundation models to produce semantic segmentation that reduces the appearance gap between simulation and real planetary terrain. Particle filters then combine the network outputs over time to track both simple and complex trajectories. The approach matters because labeled real planetary data is scarce, so any method that generalizes from synthetic training directly supports larger-scale ground-aerial missions. The authors also release a new real-world rover trajectory dataset captured in a planetary analogue facility together with matching synthetic pairs.

Core claim

Dual-encoder cross-view networks that ingest ground-view images and predict their location inside an aerial map can be made to generalize from synthetic training data to real planetary images when the networks are supervised with semantic segmentation masks produced by vision foundation models; particle-filter state estimation on the network outputs then yields accurate position estimates along both simple and complex rover trajectories.

What carries the argument

Cross-view-localising dual-encoder deep neural networks that map ground-view images to positions inside an aerial map, guided by semantic segmentation from vision foundation models and trained on high-volume synthetic ground-aerial pairs.
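
As a rough illustration of what a dual-encoder matcher does at inference time, the sketch below scores a single ground-view embedding against embeddings of candidate aerial-map cells. It assumes each encoder emits a fixed-length vector and that cosine similarity is the matching score; neither detail is confirmed by this summary. The function names, the 2x2 grid, and the toy vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_map(ground_embedding, aerial_embeddings):
    """Score every aerial-map cell against one ground-view embedding.

    aerial_embeddings maps a (row, col) cell to that cell's embedding.
    """
    return {
        cell: cosine_similarity(ground_embedding, embedding)
        for cell, embedding in aerial_embeddings.items()
    }


def best_cell(scores):
    """Return the cell with the highest similarity score."""
    return max(scores, key=scores.get)


# Toy stand-ins for encoder outputs (hypothetical values, not from the paper).
aerial = {
    (0, 0): [1.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 0): [0.0, 0.0, 1.0],
    (1, 1): [0.7, 0.7, 0.0],
}
ground = [0.1, 0.9, 0.0]
scores = score_map(ground, aerial)
print(best_cell(scores))
```

Taking the single best cell discards the rest of the score map; a sequential estimator can instead treat the whole map as a measurement likelihood.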

If this is right

  • Accurate position estimates are obtained over both simple and complex trajectories when particle filters combine successive cross-view network outputs.
  • Localization performance remains usable even when only monocular ground-view RGB images with limited field of view are available.
  • The same synthetic-plus-foundation-model pipeline produces usable cross-view matches on the new planetary-analogue real dataset.
  • Ground-aerial robotic teams can therefore perform local map-based localization without requiring large quantities of labeled real flight data.
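
The first point above can be sketched as a bootstrap particle filter over 2D map positions. This is a minimal sketch under stated assumptions, not the authors' implementation: the measurement likelihood is a synthetic Gaussian centred on the simulated true position, standing in for whatever score the cross-view network would produce, and the motion increments, noise levels, particle count, and map size are all hypothetical.

```python
import math
import random


def predict(particles, delta, motion_std, rng):
    """Move each particle by the odometry increment plus Gaussian noise."""
    dx, dy = delta
    return [
        (x + dx + rng.gauss(0.0, motion_std), y + dy + rng.gauss(0.0, motion_std))
        for x, y in particles
    ]


def update(particles, likelihood):
    """Return normalised weights from a measurement likelihood function."""
    weights = [likelihood(p) for p in particles]
    total = sum(weights)
    if total <= 0.0:
        # Degenerate measurement: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Multinomial resampling with replacement."""
    return rng.choices(particles, weights=weights, k=len(particles))


def estimate(particles):
    """Mean position of the particle set."""
    n = len(particles)
    return (sum(x for x, _ in particles) / n, sum(y for _, y in particles) / n)


def gaussian_likelihood(center, sigma):
    """Stand-in for a network-derived likelihood (assumption: isotropic Gaussian)."""
    cx, cy = center

    def likelihood(p):
        d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2
        return math.exp(-d2 / (2.0 * sigma * sigma))

    return likelihood


def run_filter(steps, n_particles=500, map_size=10.0, seed=0):
    """Track a simulated rover driving in a straight line across a square map."""
    rng = random.Random(seed)
    particles = [
        (rng.uniform(0.0, map_size), rng.uniform(0.0, map_size))
        for _ in range(n_particles)
    ]
    truth = (2.0, 2.0)
    delta = (0.5, 0.25)  # hypothetical per-step odometry
    for _ in range(steps):
        truth = (truth[0] + delta[0], truth[1] + delta[1])
        particles = predict(particles, delta, 0.1, rng)
        weights = update(particles, gaussian_likelihood(truth, 1.0))
        particles = resample(particles, weights, rng)
    return truth, estimate(particles)


truth, est = run_filter(10)
print(truth, est)
```

Starting from a uniform prior over the map, repeated predict, update, and resample steps concentrate the particles around the simulated rover. With a real network, the likelihood would be neither noise-free nor unimodal, which is where multi-hypothesis tracking matters.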

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on orbital imagery of other bodies by swapping the aerial map source while keeping the same synthetic-training recipe.
  • If the semantic segmentation step is replaced by a different foundation model, the localization error on real trajectories would indicate how sensitive the pipeline is to the choice of segmentation prior.
  • Extending the particle filter to also estimate heading or velocity from the same ground-view sequence would turn the current position-only estimator into a full pose tracker without extra sensors.
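
To make the last extension concrete, the sketch below shows one way a particle state could carry heading alongside position. It is purely illustrative and goes beyond what the paper describes: the state layout `(x, y, theta)`, the forward-and-turn motion model, and the noise parameters are all assumptions.

```python
import math
import random


def wrap_angle(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def predict_pose(particle, forward, turn, rng, pos_std=0.05, heading_std=0.02):
    """Apply a turn-then-drive motion model with Gaussian noise to one particle.

    particle is (x, y, theta); forward is distance driven, turn is heading change.
    """
    x, y, theta = particle
    theta = wrap_angle(theta + turn + rng.gauss(0.0, heading_std))
    step = forward + rng.gauss(0.0, pos_std)
    return (x + step * math.cos(theta), y + step * math.sin(theta), theta)


def mean_heading(particles):
    """Circular mean of particle headings, which handles the wrap at +/- pi."""
    s = sum(math.sin(p[2]) for p in particles)
    c = sum(math.cos(p[2]) for p in particles)
    return math.atan2(s, c)


rng = random.Random(0)
pose = predict_pose((0.0, 0.0, 0.0), forward=1.0, turn=math.pi / 2, rng=rng)
print(pose)
```

The circular mean matters because headings near +pi and -pi point almost the same way, yet their arithmetic mean is near zero, which is the opposite direction.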

Load-bearing premise

Semantic segmentation masks from vision foundation models plus large synthetic training sets are enough to make the learned cross-view matching reliable on real planetary images.

What would settle it

On the contributed real-world rover dataset, replace the synthetic-trained cross-view network with a network trained only on real images and measure whether particle-filter position error increases beyond the reported synthetic-trained performance.

Figures

Figures reproduced from arXiv: 2601.09107 by Alberto Candela, David Harvey, Feras Dayoub, Lachlan Holden, Tat-Jun Chin.

Figure 1: A high-level overview outlining our method of estimating the state of a rover within an aerial image.
Figure 2: Full rock segmentation pipeline using LLMDet and SAM 2.
Figure 3: The network follows a dual-encoder structure, with …
Figure 3: Dual-encoder cross-view localising network structure, …
Figure 4: Examples of ground view (bottom) and rectified aerial …
Figure 6: The six trajectories, A to F, used in the validation set …
Figure 9: Qualitative examples of particle filter runs A (left) …
Figure 7: Comparison of the distribution of particle filter …
Figure 8: Comparison of particle filter errors between automatic …
Original abstract

Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a cross-view localization method for planetary rovers that uses dual-encoder neural networks trained on synthetic image pairs. Vision foundation models provide semantic segmentation to help bridge the domain gap to real planetary images. New real-world rover trajectory datasets with ground-truth labels and corresponding high-volume synthetic datasets are contributed. Particle filters are applied for state estimation, with the claim that this yields accurate position estimates on both simple and complex trajectories from sequences of monocular ground-view RGB images.

Significance. If the domain-transfer claims hold, the approach would be valuable for planetary robotics, where labeled real data is scarce, by enabling localization against aerial maps without extensive real-world training. The release of new real and synthetic cross-view datasets is a clear positive contribution that could support future work. The combination of VFMs, synthetic data, and particle filtering is a reasonable empirical strategy for this setting.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'accurate position estimation' on new real datasets is asserted without any reported quantitative metrics (e.g., mean position error, success rate, or error distributions), baselines, or ablation results, so the soundness of the generalization claim cannot be evaluated from the manuscript text.
  2. [Method/Experiments] Method and Experiments sections: the key assumption that VFM-derived semantic segmentation closes the domain gap for planetary surfaces (lighting, texture, and scale variations absent from typical VFM pre-training) is not supported by any quantitative evidence such as real-vs-synthetic error deltas, segmentation-quality ablations, or comparisons of localization performance with and without the VFM component.
minor comments (2)
  1. [Abstract] Abstract: consider adding one or two concrete performance numbers or explicit references to result tables/figures to make the claims more informative.
  2. [Method] Notation: clarify whether the dual-encoder architecture is symmetric or asymmetric and how the cross-view matching loss is formulated.
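
The metrics named in major comment 1 are cheap to compute once estimated and ground-truth trajectories are available. The sketch below is a minimal illustration, not the paper's evaluation code: it assumes 2D positions in a shared map frame, and the `success_threshold` parameter and the toy trajectories are hypothetical.

```python
import math


def position_errors(estimates, ground_truth):
    """Euclidean error at each time step between estimated and true positions."""
    if len(estimates) != len(ground_truth):
        raise ValueError("trajectories must have the same length")
    return [
        math.hypot(ex - gx, ey - gy)
        for (ex, ey), (gx, gy) in zip(estimates, ground_truth)
    ]


def summarise(errors, success_threshold):
    """Mean, median, and maximum error, plus the fraction of steps within the threshold."""
    ordered = sorted(errors)
    n = len(ordered)
    if n == 0:
        raise ValueError("no errors to summarise")
    if n % 2:
        median = ordered[n // 2]
    else:
        median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])
    return {
        "mean": sum(ordered) / n,
        "median": median,
        "max": ordered[-1],
        "success_rate": sum(e <= success_threshold for e in ordered) / n,
    }


# Toy trajectories (hypothetical values, in arbitrary map units).
estimates = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0), (0.0, 0.0)]
errors = position_errors(estimates, ground_truth)
summary = summarise(errors, success_threshold=2.0)
print(summary)
```

Reporting the median and maximum alongside the mean exposes occasional large failures that a mean alone can hide, which bears on the referee's request for error distributions.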

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract and experimental evidence require strengthening with explicit quantitative results to support the claims. We will revise the manuscript accordingly and address each point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'accurate position estimation' on new real datasets is asserted without any reported quantitative metrics (e.g., mean position error, success rate, or error distributions), baselines, or ablation results, so the soundness of the generalization claim cannot be evaluated from the manuscript text.

    Authors: We agree that the abstract should include quantitative metrics to substantiate the claim of accurate position estimation. The experiments section reports mean position errors, success rates, error distributions, and comparisons against baselines on the new real planetary analogue trajectories. We will revise the abstract to incorporate key quantitative results (e.g., average localization error and success rates) along with brief references to the baselines and ablations, enabling direct evaluation of the generalization claims.

    Revision: yes

  2. Referee: [Method/Experiments] Method and Experiments sections: the key assumption that VFM-derived semantic segmentation closes the domain gap for planetary surfaces (lighting, texture, and scale variations absent from typical VFM pre-training) is not supported by any quantitative evidence such as real-vs-synthetic error deltas, segmentation-quality ablations, or comparisons of localization performance with and without the VFM component.

    Authors: The current manuscript includes direct comparisons of localization performance with and without the VFM semantic segmentation component, showing measurable improvements in domain transfer on both synthetic and real data. We acknowledge, however, that additional quantitative support such as segmentation-quality metrics on real images and explicit real-vs-synthetic error deltas would further strengthen the argument. We will add these ablations and metrics to the revised Experiments section.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with new datasets and no self-referential derivations

The paper presents an empirical pipeline for cross-view localization: dual-encoder networks trained on synthetic image pairs (augmented by VFM semantic segmentation) and evaluated via particle filters on a newly contributed real-world planetary analogue dataset. No equations, derivations, uniqueness theorems, or predictions are claimed that reduce to fitted parameters or self-citations by construction. All load-bearing elements (domain-gap bridging, trajectory accuracy) rest on experimental results from the contributed datasets rather than definitional equivalences or self-citation chains. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that foundation-model semantic segmentation transfers effectively from general images to planetary scenes and that synthetic data can stand in for scarce real labeled trajectories.

axioms (1)
  • Domain assumption: Vision foundation models produce semantic segmentations that generalize to real planetary images after training on synthetic data.
    Invoked to justify bridging the domain gap without extensive real labeled data.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)

    Paper passage: "We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images... Using particle filters for state estimation with the cross-view networks allows accurate position estimation"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
