
arxiv: 2604.10217 · v3 · submitted 2026-04-11 · 💻 cs.CV

Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords SAR-optical registration · image matching · zero-shot transfer · pretrained models · remote sensing · cross-modal matching · SpaceNet9 · disaster response

The pith

Pretrained image matchers achieve 3-pixel accuracy on SAR-optical satellite registration without cross-modal training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates twenty-four pretrained matcher families in a zero-shot setting on SpaceNet9 and two additional cross-modal benchmarks. RoMa, which has no explicit cross-modal training, ties XoFTR (trained for visible-thermal matching) for the lowest reported mean error of 3.0 pixels on the labeled SpaceNet9 training scenes, while other matchers trained for cross-modal tasks perform similarly or worse. The authors offer, as a working hypothesis, that foundation-model features may supply enough modality invariance to handle the task. The evaluation also finds that protocol decisions around geometry model, tile size, and inlier gating can shift accuracy by up to 33× for a single matcher, sometimes outweighing the difference between matchers. These outcomes matter for disaster-response applications that rely on fast, accurate alignment of optical and radar satellite imagery.

Core claim

Pretrained image matchers exhibit asymmetric transfer to SAR-optical satellite registration, with RoMa achieving a mean error of 3.0 pixels on the labeled SpaceNet9 training scenes without any cross-modal training. Matchers with explicit cross-modal training do not uniformly outperform those without it, and MatchAnything-ELoFTR trained on synthetic pairs reaches 3.4 pixels. 3D-reconstruction matchers remain fragile under default settings, while deployment protocol choices such as affine geometry, tile size, and inlier gating shift accuracy by up to 33 times for the same matcher.

What carries the argument

Zero-shot tiled-inference protocol with robust geometric filtering and tie-point-grounded metrics applied to large satellite images across multiple benchmarks.
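
The tiling step of such a protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `tile_windows` and `to_global`, the `overlap` parameter, and the rule that the last window sits flush with the image border are all introduced here, and the tile sizes the paper actually sweeps are not given on this page.

```python
from typing import Iterator, List, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive
Point = Tuple[float, float]         # (x, y) in pixels


def _starts(size: int, tile: int, stride: int) -> List[int]:
    """Window start offsets along one axis, always in the same order."""
    if size <= tile:
        return [0]  # image smaller than a tile: one clipped window
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # assumption: last window is flush with the border
    return starts


def tile_windows(height: int, width: int, tile: int, overlap: int) -> Iterator[Window]:
    """Yield overlapping windows that cover a height x width image."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    stride = tile - overlap
    for r in _starts(height, tile, stride):
        for c in _starts(width, tile, stride):
            yield (r, c, min(r + tile, height), min(c + tile, width))


def to_global(points: Sequence[Point], window: Window) -> List[Point]:
    """Shift tile-local (x, y) keypoints into full-image coordinates."""
    r0, c0, _, _ = window
    return [(x + c0, y + r0) for x, y in points]
```

Because the windows are generated in a fixed order with no random sampling, rerunning the sketch on the same image yields the same tiles, which is the property a deterministic protocol needs. Matches found per tile would be passed through `to_global` before any geometric filtering.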

If this is right

  • Matchers without cross-modal training can match or exceed the performance of those trained for cross-modal tasks.
  • Foundation-model features may contribute to modality invariance that partially substitutes for explicit cross-modal supervision.
  • Protocol choices in geometry model, tile size, and inlier gating can affect accuracy more than the choice of matcher itself.
  • 3D-reconstruction matchers are highly protocol-sensitive and remain fragile for traditional 2D image matching under default settings.
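
The geometry-model and inlier-gating choices named in these points can be made concrete with a small affine RANSAC. The sketch below is a plain-Python stand-in, not the estimator the paper uses: `affine_from_three`, `residual`, and `ransac_affine` are hypothetical names, and the defaults (`threshold_px=3.0`, `iterations=200`, `min_inliers=8`, `seed=0`) are illustrative values chosen here, not settings reported by the authors.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
# (a, b, tx, c, d, ty) with x' = a*x + b*y + tx and y' = c*x + d*y + ty
Affine = Tuple[float, float, float, float, float, float]


def affine_from_three(src: Sequence[Point], dst: Sequence[Point]) -> Optional[Affine]:
    """Exact affine through three correspondences (Cramer's rule); None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-9:
        return None

    def solve(v1: float, v2: float, v3: float) -> Tuple[float, float, float]:
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        t = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, t

    a, b, tx = solve(dst[0][0], dst[1][0], dst[2][0])
    c, d, ty = solve(dst[0][1], dst[1][1], dst[2][1])
    return (a, b, tx, c, d, ty)


def residual(model: Affine, s: Point, d: Point) -> float:
    """Reprojection error in pixels for one correspondence."""
    a, b, tx, c, dd, ty = model
    return math.hypot(a * s[0] + b * s[1] + tx - d[0], c * s[0] + dd * s[1] + ty - d[1])


def ransac_affine(src: Sequence[Point], dst: Sequence[Point], threshold_px: float = 3.0,
                  iterations: int = 200, min_inliers: int = 8,
                  seed: int = 0) -> Tuple[Optional[Affine], List[int]]:
    """Return (model, inlier indices); model is None when the inlier gate rejects the fit."""
    rng = random.Random(seed)  # fixed seed keeps repeated runs identical
    n = len(src)
    best_model: Optional[Affine] = None
    best_inliers: List[int] = []
    if n < 3:
        return None, []
    for _ in range(iterations):
        i, j, k = rng.sample(range(n), 3)
        model = affine_from_three([src[i], src[j], src[k]], [dst[i], dst[j], dst[k]])
        if model is None:
            continue  # degenerate (collinear) sample
        inliers = [m for m in range(n) if residual(model, src[m], dst[m]) <= threshold_px]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    if len(best_inliers) < min_inliers:
        return None, best_inliers  # inlier gate: too little support, report failure
    return best_model, best_inliers
```

The three knobs exposed here (`threshold_px`, `min_inliers`, and the choice of an affine rather than a richer model) are the kind of protocol settings whose effect the paper reports as comparable to swapping matchers. A production pipeline would normally refit the model on all inliers; that step is omitted to keep the sketch short.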

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General-purpose pretrained matchers could lower the barrier to using existing tools for additional remote-sensing alignment tasks.
  • Focusing on protocol optimization might yield faster practical gains than training new cross-modal matchers.
  • Extending the evaluation to unlabeled real-time disaster imagery would test whether the observed error levels hold outside the benchmark.

Load-bearing premise

The SpaceNet9 benchmark and tie-point-grounded metrics under the chosen tiled-inference protocol are representative of real-world SAR-optical registration needs in disaster-response scenarios.

What would settle it

Showing that RoMa exceeds 3 pixels mean error on a new held-out set of SAR-optical pairs from different sensors or regions, or that protocol variations produce smaller accuracy shifts in operational disaster imagery, would challenge the central findings.
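
Checking the first condition comes down to scoring an estimated transform against reference tie points. The sketch below shows one way to do that, assuming an affine model and tie points given as (SAR, optical) pixel pairs. The names `apply_affine`, `mean_tie_point_error`, and `success_rate` are hypothetical, and reading Success@10 as "the fraction of scenes whose mean error is at most 10 px" is an assumption made here; this page does not define the metric.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
# (a, b, tx, c, d, ty) with x' = a*x + b*y + tx and y' = c*x + d*y + ty
Affine = Tuple[float, float, float, float, float, float]


def apply_affine(model: Affine, p: Point) -> Point:
    """Map one (x, y) point through the affine model."""
    a, b, tx, c, d, ty = model
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def mean_tie_point_error(model: Affine, tie_points: Sequence[Tuple[Point, Point]]) -> float:
    """Mean Euclidean distance (px) between warped SAR points and optical references."""
    if not tie_points:
        raise ValueError("at least one tie point is required")
    errors = [math.dist(apply_affine(model, sar), opt) for sar, opt in tie_points]
    return sum(errors) / len(errors)


def success_rate(scene_errors: Sequence[float], threshold_px: float = 10.0) -> float:
    """Assumed Success@threshold: share of scenes with mean error <= threshold_px."""
    if not scene_errors:
        raise ValueError("at least one scene error is required")
    return sum(e <= threshold_px for e in scene_errors) / len(scene_errors)
```

For a held-out test, `mean_tie_point_error` would be computed per scene and compared against the 3-pixel figure, with the per-scene values also fed to `success_rate`.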

Figures

Figures reproduced from arXiv: 2604.10217 by Alex Stoken, Gabriele Berton, Isaac Corley.

Figure 1: Zero-shot SAR-optical registration on SpaceNet9. Left column: full-resolution optical scene (top) and full-resolution SAR scene (bottom). Right column: tiled correspondence gallery from a zero-shot pretrained matcher; each panel shows one randomly sampled, tiepoint-projected overlapping region from the same geographic area in both modalities. Green lines denote affine-RANSAC inliers; orange lines denote ou…
Figure 2: Protocol sensitivity per matcher (SpaceNet9). Y-axis: mean error (px, ↓ better). Each data point is one result from the threshold-robustness, keypoint-budget, or inlier-gating ablations, aggregated across seven matchers (MA-ELoFTR, MINIMA-RoMa, RoMa, XoFTR, LoFTR, MINIMA-XoFTR, XFeat; 133 total runs). Intra-matcher variance often matches or exceeds inter-matcher differences, confirming that protocol choic…
Figure 3: Core protocol effects on SpaceNet9. Geometry choice and tiling interact strongly with matcher performance.
[Plot labels spilled from the extraction, apparently belonging to Figure 4: title "Threshold Robustness on SpaceNet9"; x-axis "RANSAC reprojection threshold (px)", 0 to 20; y-axis "Success@10", 0 to 1; series loftr, matchanything-eloftr, minima-roma, minima-xoftr, roma, romav2, xfeat, xoftr.]
Figure 4: RANSAC threshold robustness. Success@10 (↑ better) as a function of reprojection threshold (px) for seven matchers under fixed tiled affine settings. MA-ELoFTR and MINIMA-RoMa maintain >90% success at permissive thresholds; XoFTR and RoMa follow closely. Sparser or less robust matchers (LoFTR, XFeat) are more threshold-sensitive, confirming threshold selection as a first-order deployment concern.
Figure 5: SpaceNet9 qualitative correspondence gallery. Panels show top matchers under their best zero-shot configurations; green lines denote affine-RANSAC inliers and orange lines matched outliers.
[Heatmap labels spilled from the extraction, apparently belonging to Figure 6: normalization schemes Identity, Z-Score, CLAHE; matchers MINIMA-RoMa, LoFTR, RoMa, RoMa+Tiny-RoMa, SP-LightGlue, RoMa+LoFTR, DUSt3R, XFeat, RoMaV2, MASt3R, MINIMA-RoMa-Tiny, Tiny-RoMa; cell values are truncated.]
Figure 6: Normalization × matcher heatmap (SRIF). Cell values are mean corner reprojection error (px). Column-wise spread illustrates that normalization choice can have marked performance impacts on matchability. More spread indicates a matcher with greater sensitivity to normalization scheme. MINIMA-RoMa with Z-Score achieves the lowest error.
Original abstract

Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
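
To put the abstract's affine example on the same scale as the 33× figure, the two quoted errors can be turned into a fold change and a relative reduction. This is a small arithmetic sketch; the function names are hypothetical, and the configurations behind the 33× extreme are not stated on this page, so only the affine example is computed.

```python
def fold_change(before_px: float, after_px: float) -> float:
    """How many times larger the 'before' error is than the 'after' error."""
    return before_px / after_px


def relative_reduction(before_px: float, after_px: float) -> float:
    """Fraction of the 'before' error removed by the change."""
    return (before_px - after_px) / before_px


# Values quoted in the abstract: mean error falls from 12.34 px to 9.74 px.
print(round(fold_change(12.34, 9.74), 2))         # 1.27
print(round(relative_reduction(12.34, 9.74), 3))  # 0.211
```

On these numbers, switching to affine geometry accounts for roughly a 1.27× change (about a 21% reduction), which suggests the 33× extreme reflects other or combined protocol choices rather than the geometry model alone.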

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates twenty-four pretrained image matchers in a zero-shot setting for cross-modal SAR-optical satellite registration on SpaceNet9 and two other benchmarks. It uses a deterministic protocol involving tiled inference, robust geometric filtering, and tie-point metrics. Key findings include RoMa achieving the lowest mean error of 3.0 pixels on SpaceNet9 without cross-modal training, asymmetric performance of cross-modal trained matchers, high sensitivity to protocol choices (up to 33× variation), and a working hypothesis on foundation model features aiding modality invariance.

Significance. If the results hold, the study offers important practical insights for deploying pretrained matchers in SAR-optical registration for applications like disaster response. Strengths include the coherent deterministic protocol, tiled inference for large images, and tie-point-grounded metrics that provide a solid empirical foundation. It challenges assumptions about the need for cross-modal training and emphasizes protocol importance in benchmarking.

major comments (3)
  1. [Abstract] The reported mean errors (RoMa at 3.0 px) are given without error bars, confidence intervals, or details on the number of evaluated pairs and data splits, undermining the ability to assess whether the differences (e.g., vs. 3.4 px for MatchAnything-ELoFTR) are statistically meaningful or robust.
  2. [Results section on protocol sensitivity] The claim that deployment protocol choices shift accuracy by up to 33× is central to the practical recommendations, yet the manuscript does not specify the exact baseline and modified configurations or provide a table breaking down the contributions of each protocol element (tile size, geometry model, inlier gating) to this factor.
  3. [Discussion on working hypothesis] The suggestion that DINOv2 features may contribute to modality invariance is a key interpretive claim, but it is not supported by any ablation studies or feature analysis within the evaluated matchers, making it speculative rather than substantiated by the experiments.
minor comments (2)
  1. [Abstract] Clarify whether the 3.0 px is the absolute lowest or the lowest among a particular category of matchers.
  2. [Methods] The description of the 'tiled large-image inference' protocol could benefit from a diagram or pseudocode to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, with proposed revisions to improve transparency and rigor where the comments are valid.

  1. Referee: [Abstract] The reported mean errors (RoMa at 3.0 px) are given without error bars, confidence intervals, or details on the number of evaluated pairs and data splits, undermining the ability to assess whether the differences (e.g., vs. 3.4 px for MatchAnything-ELoFTR) are statistically meaningful or robust.

    Authors: The protocol is fully deterministic with no stochastic elements, so error bars or confidence intervals from repeated sampling are not applicable. We will revise the abstract and results to report the exact number of evaluated image pairs and the data splits (SpaceNet9 training scenes and the two additional benchmarks), providing full context for the means. This addresses the core concern about transparency without altering the reported values. revision: partial

  2. Referee: [Results section on protocol sensitivity] The claim that deployment protocol choices shift accuracy by up to 33× is central to the practical recommendations, yet the manuscript does not specify the exact baseline and modified configurations or provide a table breaking down the contributions of each protocol element (tile size, geometry model, inlier gating) to this factor.

    Authors: We agree that a breakdown is needed for the claim to be fully actionable. In the revised manuscript we will add a table in the results section that defines the baseline (default matcher settings) versus modified configurations, with quantitative contributions from tile size, geometry model (including the affine example reducing error from 12.34 to 9.74 px), and inlier gating. This will explicitly account for the observed 33× variation. revision: yes

  3. Referee: [Discussion on working hypothesis] The suggestion that DINOv2 features may contribute to modality invariance is a key interpretive claim, but it is not supported by any ablation studies or feature analysis within the evaluated matchers, making it speculative rather than substantiated by the experiments.

    Authors: We present the statement explicitly as a 'working hypothesis' because it is an interpretation of the zero-shot results (RoMa matching cross-modal-trained matchers without such training). No ablations or feature analyses were conducted, as the study scope was limited to evaluating existing pretrained matchers. We will revise the discussion to strengthen the caveats, label it more clearly as a hypothesis for future investigation, and avoid any implication of direct substantiation. revision: partial

Circularity Check

0 steps flagged

Pure empirical benchmarking with no derivations or self-referential reductions

This is a zero-shot empirical evaluation study that measures the performance of 24 pretrained matcher families on SpaceNet9 and two other cross-modal benchmarks using a fixed tiled-inference protocol and tie-point metrics. The manuscript contains no equations, no fitted parameters, no predictions derived from author-defined quantities, and no load-bearing self-citations or uniqueness theorems. All reported numbers (e.g., RoMa at 3.0 px mean error, 33× protocol sensitivity) are direct experimental outcomes on external public datasets; the central claims therefore stand or fall on the representativeness of the chosen benchmarks rather than on any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that standard image-matching benchmarks and tie-point metrics serve as valid proxies for operational SAR-optical registration; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption SpaceNet9 and the additional cross-modal benchmarks are representative of real disaster-response satellite registration tasks.
    Invoked when generalizing the 3 px error and protocol-sensitivity findings to practical use.

pith-pipeline@v0.9.0 · 5618 in / 1251 out tokens · 46014 ms · 2026-05-10T16:49:49.546005+00:00 · methodology

