Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Alex Stoken; Gabriele Berton; Isaac Corley

arxiv: 2604.10217 · v3 · submitted 2026-04-11 · 💻 cs.CV

Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Isaac Corley , Alex Stoken , Gabriele Berton This is my paper

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords SAR-optical registrationimage matchingzero-shot transferpretrained modelsremote sensingcross-modal matchingSpaceNet9disaster response

0 comments

The pith

Pretrained image matchers achieve 3-pixel accuracy on SAR-optical satellite registration without cross-modal training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates twenty-four pretrained matcher families in a zero-shot setting on SAR-optical benchmarks including SpaceNet9. It shows that matchers without explicit cross-modal training, such as RoMa, reach the lowest mean errors of 3.0 pixels, while some trained for cross-modal tasks perform similarly or worse. This points to foundation-model features possibly supplying enough modality invariance to handle the task. The evaluation also finds that protocol decisions around geometry models, tile sizes, and filtering can change accuracy by up to 33 times, often outweighing the difference between matchers. These outcomes matter for disaster-response applications that rely on fast, accurate alignment of optical and radar satellite imagery.

Core claim

Pretrained image matchers exhibit asymmetric transfer to SAR-optical satellite registration, with RoMa achieving a mean error of 3.0 pixels on the labeled SpaceNet9 training scenes without any cross-modal training. Matchers with explicit cross-modal training do not uniformly outperform those without it, and MatchAnything-ELoFTR trained on synthetic pairs reaches 3.4 pixels. 3D-reconstruction matchers remain fragile under default settings, while deployment protocol choices such as affine geometry, tile size, and inlier gating shift accuracy by up to 33 times for the same matcher.

What carries the argument

Zero-shot tiled-inference protocol with robust geometric filtering and tie-point-grounded metrics applied to large satellite images across multiple benchmarks.

If this is right

Matchers without cross-modal training can match or exceed the performance of those trained for cross-modal tasks.
Foundation-model features may contribute to modality invariance that partially substitutes for explicit cross-modal supervision.
Protocol choices in geometry model, tile size, and inlier gating can affect accuracy more than the choice of matcher itself.
3D-reconstruction matchers are highly protocol-sensitive and remain fragile for traditional 2D image matching under default settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

General-purpose pretrained matchers could lower the barrier to using existing tools for additional remote-sensing alignment tasks.
Focusing on protocol optimization might yield faster practical gains than training new cross-modal matchers.
Extending the evaluation to unlabeled real-time disaster imagery would test whether the observed error levels hold outside the benchmark.

Load-bearing premise

The SpaceNet9 benchmark and tie-point-grounded metrics under the chosen tiled-inference protocol are representative of real-world SAR-optical registration needs in disaster-response scenarios.

What would settle it

Showing that RoMa exceeds 3 pixels mean error on a new held-out set of SAR-optical pairs from different sensors or regions, or that protocol variations produce smaller accuracy shifts in operational disaster imagery, would challenge the central findings.

Figures

Figures reproduced from arXiv: 2604.10217 by Alex Stoken, Gabriele Berton, Isaac Corley.

**Figure 1.** Figure 1: Zero-shot SAR–optical registration on SpaceNet9. Left column: full-resolution optical scene (top) and full-resolution SAR scene (bottom). Right column: tiled correspondence gallery from a zero-shot pretrained matcher—each panel shows one randomly sampled, tiepoint-projected overlapping region from the same geographic area in both modalities. Green lines denote affine-RANSAC inliers; orange lines denote ou… view at source ↗

**Figure 2.** Figure 2: Protocol sensitivity per matcher (SpaceNet9). Yaxis: mean error (px, ↓ better). Each data point is one result from the threshold-robustness, keypoint-budget, or inliergating ablations, aggregated across seven matchers (MA-ELoFTR, MINIMA-RoMa, RoMa, XoFTR, LoFTR, MINIMA-XoFTR, XFeat; 133 total runs). Intra-matcher variance often matches or exceeds inter-matcher differences, confirming that protocol choic… view at source ↗

**Figure 3.** Figure 3: Core protocol effects on SpaceNet9. Geometry choice and tiling interact strongly with matcher performance. 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 RANSAC reprojection threshold (px) 0.0 0.2 0.4 0.6 0.8 1.0 Success@10 Threshold Robustness on SpaceNet9 loftr S@10 matchanything-eloftr S@10 minima-roma S@10 minima-xoftr S@10 roma S@10 romav2 S@10 xfeat S@10 xoftr S@10 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: RANSAC threshold robustness. Success@10 (↑ better) as a function of reprojection threshold (px) for seven matchers under fixed tiled affine settings. MA-ELoFTR and MINIMARoMa maintain >90% success at permissive thresholds; XoFTR and RoMa follow closely. Sparser or less robust matchers (LoFTR, XFeat) are more threshold-sensitive—confirming threshold selection as a first-order deployment concern. trade-o… view at source ↗

**Figure 5.** Figure 5: SpaceNet9 qualitative correspondence gallery. Panels show top matchers under their best zero-shot configurations; green lines denote affine-RANSAC inliers and orange lines matched outliers. Identity Z-Score CLAHE MINIMA-RoMa LoFTR RoMa RoMa+Tiny-RoMa SP-LightGlue RoMa+LoFTR DUSt3R XFeat RoMaV2 MASt3R MINIMA-RoMa-Tiny Tiny-RoMa 48.2 47.0 47.7 59.3 63.1 63.3 69.9 74.1 64.9 75.7 71.9 63.9 72.3 73.7 67.0 80.7 … view at source ↗

**Figure 6.** Figure 6: Normalization × matcher heatmap (SRIF). Cell values are mean corner reprojection error (px). Column-wise spread illustrates that normalization choice can have marked performance impacts on matchability. More spread indicates a matcher with greater sensitivity to normalization scheme. MINIMA-RoMa with Z-Score achieves the lowest error. under speckle and radiometric inversion; we offer this as a plausible e… view at source ↗

read the original abstract

Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster-response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RoMa hits low error zero-shot on SAR-optical but protocol tweaks move the numbers more than swapping matchers.

read the letter

The key thing here is that some off-the-shelf matchers like RoMa can register SAR and optical satellite images with mean errors around 3 pixels in a zero-shot setup, without any cross-modal training. At the same time, the paper shows that small changes in how you run the matcher—such as the geometry model or tile size—can shift those errors by up to 33 times, which is often a bigger effect than picking a different matcher. They also note that explicit cross-modal training does not reliably beat general models, and that 3D reconstruction matchers fall apart under default settings. What stands out as new is the scale: they tested 24 matcher families under one consistent protocol on SpaceNet9 plus two other benchmarks. The tiled inference for large images and the use of tie-point metrics give a reproducible basis for the numbers. The asymmetric transfer observation and the fragility flags are practical points that prior scattered tests had not quantified at this level. The soft spots are mostly around generalization. SpaceNet9 consists of labeled training scenes that are likely more aligned and less noisy than real disaster scenarios, which often include temporal decorrelation, incidence angle differences, or lower signal quality. The paper already demonstrates high protocol sensitivity, so if the test data is atypically favorable, the reported performance could look better than what users would see in the field. There are also no error bars or statistical comparisons mentioned, which makes it harder to judge how reliable the ordering of methods is. This paper is for applied researchers in remote sensing who need to do cross-modal registration for things like disaster response. A practitioner could use it to decide which pretrained matcher to try first and what pipeline settings to watch. It has clear thinking on the empirical side with no obvious internal contradictions, so it deserves a serious referee to verify the implementation and see if the conclusions hold under closer scrutiny. I would recommend sending it to peer review. The benchmarking effort is substantial and the protocol sensitivity result is worth documenting for the community.

Referee Report

3 major / 2 minor

Summary. The paper evaluates twenty-four pretrained image matchers in a zero-shot setting for cross-modal SAR-optical satellite registration on SpaceNet9 and two other benchmarks. It uses a deterministic protocol involving tiled inference, robust geometric filtering, and tie-point metrics. Key findings include RoMa achieving the lowest mean error of 3.0 pixels on SpaceNet9 without cross-modal training, asymmetric performance of cross-modal trained matchers, high sensitivity to protocol choices (up to 33× variation), and a working hypothesis on foundation model features aiding modality invariance.

Significance. If the results hold, the study offers important practical insights for deploying pretrained matchers in SAR-optical registration for applications like disaster response. Strengths include the coherent deterministic protocol, tiled inference for large images, and tie-point-grounded metrics that provide a solid empirical foundation. It challenges assumptions about the need for cross-modal training and emphasizes protocol importance in benchmarking.

major comments (3)

[Abstract] The reported mean errors (RoMa at 3.0 px) are given without error bars, confidence intervals, or details on the number of evaluated pairs and data splits, undermining the ability to assess whether the differences (e.g., vs. 3.4 px for MatchAnything-ELoFTR) are statistically meaningful or robust.
[Results section on protocol sensitivity] The claim that deployment protocol choices shift accuracy by up to 33× is central to the practical recommendations, yet the manuscript does not specify the exact baseline and modified configurations or provide a table breaking down the contributions of each protocol element (tile size, geometry model, inlier gating) to this factor.
[Discussion on working hypothesis] The suggestion that DINOv2 features may contribute to modality invariance is a key interpretive claim, but it is not supported by any ablation studies or feature analysis within the evaluated matchers, making it speculative rather than substantiated by the experiments.

minor comments (2)

[Abstract] Clarify whether the 3.0 px is the absolute lowest or the lowest among a particular category of matchers.
[Methods] The description of the 'tiled large-image inference' protocol could benefit from a diagram or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, with proposed revisions to improve transparency and rigor where the comments are valid.

read point-by-point responses

Referee: [Abstract] The reported mean errors (RoMa at 3.0 px) are given without error bars, confidence intervals, or details on the number of evaluated pairs and data splits, undermining the ability to assess whether the differences (e.g., vs. 3.4 px for MatchAnything-ELoFTR) are statistically meaningful or robust.

Authors: The protocol is fully deterministic with no stochastic elements, so error bars or confidence intervals from repeated sampling are not applicable. We will revise the abstract and results to report the exact number of evaluated image pairs and the data splits (SpaceNet9 training scenes and the two additional benchmarks), providing full context for the means. This addresses the core concern about transparency without altering the reported values. revision: partial
Referee: [Results section on protocol sensitivity] The claim that deployment protocol choices shift accuracy by up to 33× is central to the practical recommendations, yet the manuscript does not specify the exact baseline and modified configurations or provide a table breaking down the contributions of each protocol element (tile size, geometry model, inlier gating) to this factor.

Authors: We agree that a breakdown is needed for the claim to be fully actionable. In the revised manuscript we will add a table in the results section that defines the baseline (default matcher settings) versus modified configurations, with quantitative contributions from tile size, geometry model (including the affine example reducing error from 12.34 to 9.74 px), and inlier gating. This will explicitly account for the observed 33× variation. revision: yes
Referee: [Discussion on working hypothesis] The suggestion that DINOv2 features may contribute to modality invariance is a key interpretive claim, but it is not supported by any ablation studies or feature analysis within the evaluated matchers, making it speculative rather than substantiated by the experiments.

Authors: We present the statement explicitly as a 'working hypothesis' because it is an interpretation of the zero-shot results (RoMa matching cross-modal-trained matchers without such training). No ablations or feature analyses were conducted, as the study scope was limited to evaluating existing pretrained matchers. We will revise the discussion to strengthen the caveats, label it more clearly as a hypothesis for future investigation, and avoid any implication of direct substantiation. revision: partial

Circularity Check

0 steps flagged

Pure empirical benchmarking with no derivations or self-referential reductions

full rationale

This is a zero-shot empirical evaluation study that measures the performance of 24 pretrained matcher families on SpaceNet9 and two other cross-modal benchmarks using a fixed tiled-inference protocol and tie-point metrics. The manuscript contains no equations, no fitted parameters, no predictions derived from author-defined quantities, and no load-bearing self-citations or uniqueness theorems. All reported numbers (e.g., RoMa at 3.0 px mean error, 33× protocol sensitivity) are direct experimental outcomes on external public datasets; the central claims therefore stand or fall on the representativeness of the chosen benchmarks rather than on any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that standard image-matching benchmarks and tie-point metrics serve as valid proxies for operational SAR-optical registration; no free parameters or invented entities are introduced.

axioms (1)

domain assumption SpaceNet9 and the additional cross-modal benchmarks are representative of real disaster-response satellite registration tasks.
Invoked when generalizing the 3 px error and protocol-sensitivity findings to practical use.

pith-pipeline@v0.9.0 · 5618 in / 1251 out tokens · 46014 ms · 2026-05-10T16:49:49.546005+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1]

vismatch: Wrapper of 50+ image matching models with a unified interface.https: //github.com/gmberton/vismatch, 2026

Gabriele Berton and contributors. vismatch: Wrapper of 50+ image matching models with a unified interface.https: //github.com/gmberton/vismatch, 2026. GitHub repository, accessed 2026-02-21. 4

work page 2026
[2]

Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography

Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone. Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024. 4

work page 2024
[3]

Spacenet9 final report (4th place), 2025

Giovanni Cavallin. Spacenet9 final report (4th place), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025
[4]

SuperPoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. SuperPoint: Self-supervised interest point detection and description. InCVPR Workshops, 2018. 4, 5

work page 2018
[5]

Less biased noise scale estimation for threshold-robust ransac

Johan Edstedt. Less biased noise scale estimation for threshold-robust ransac. InCVPR 2025 Workshops (IMW),

work page 2025
[6]

DeDoDe: Detect, don’t describe— describe, don’t detect for local feature matching

Johan Edstedt, Georg Athanasiadis, Marten B ¨ulow, and M˚arten Wadenb ¨ack. DeDoDe: Detect, don’t describe— describe, don’t detect for local feature matching. In3DV,

work page
[7]

RoMa: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. RoMa: Robust dense feature matching. InCVPR, 2024. 3, 4, 5, 7

work page 2024
[8]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 4

work page 1981
[9]

Bacastow

Ronny H”ansch, Jacob Arndt, Philipe Dias, Abhishek Pot- nis, Dalton Lunga, Desiree Petrie, and Todd M. Bacastow. Introducing spacenet9 – cross-modal satellite imagery regis- tration for natural disaster responses. InIGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Sympo- sium, 2024. 2, 3

work page 2024
[10]

Spacenet 9-cross-sensor alignment of optical and sar im- agery.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

Ronny H ¨ansch, Jacob Arndt, Abhishek Potnis, Philipe Dias, Peter Novotn `y, Fabio Pacifici, and Todd M Bacastow. Spacenet 9-cross-sensor alignment of optical and sar im- agery.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026. 2

work page 2026
[11]

arXiv preprint arXiv:2501.07556 (2025)

Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, and Xiaowei Zhou. MatchAnything: Universal cross-modality image matching with large-scale pre-training. arXiv preprint arXiv:2501.07556, 2025. 2, 3, 4, 5

work page arXiv 2025
[12]

Hughes, Michael Schmitt, Lichao Mou, Yuanyuan Wang, and Xiao Xiang Zhu

Lloyd H. Hughes, Michael Schmitt, Lichao Mou, Yuanyuan Wang, and Xiao Xiang Zhu. Identifying corresponding patches in SAR and optical images with a pseudo-siamese CNN.IEEE Geoscience and Remote Sensing Letters, 15(5): 784–788, 2018. 2, 3

work page 2018
[13]

A deep learning framework for matching of SAR and optical imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 169:166–179, 2020

Lloyd Haydn Hughes, Diego Marcos, Sylvain Lobry, Devis Tuia, and Michael Schmitt. A deep learning framework for matching of SAR and optical imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 169:166–179, 2020. 2

work page 2020
[14]

MINIMA: Modality invariant image matching

Xingyu Jiang, Jiangwei Ma, Xinying Hu, Yao Tai, Chengjie Wang, and Jian Yang. MINIMA: Modality invariant image matching. InNeurIPS, 2024. 3, 4, 5, 7

work page 2024
[15]

Spacenet9 final report (2nd place), 2025

Motoki Kimura. Spacenet9 final report (2nd place), 2025. Winning technical report (SpaceNet9 Challenge). 3, 4 9

work page 2025
[16]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3D with MASt3R. InECCV, 2024. 4, 5

work page 2024
[17]

Multimodal image matching: A scale-invariant algorithm and an open dataset.ISPRS Journal of Photogrammetry and Remote Sensing, 204:77–88, 2023

Jiayuan Li, Qingwu Hu, and Yongjun Zhang. Multimodal image matching: A scale-invariant algorithm and an open dataset.ISPRS Journal of Photogrammetry and Remote Sensing, 204:77–88, 2023. 2, 3

work page 2023
[18]

LightGlue: Local feature matching at light speed

Philipp Lindenberger, Paul-Erik Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. InICCV,

work page
[19]

David G. Lowe. Distinctive image features from scale- invariant keypoints.International Journal of Computer Vi- sion, 60(2):91–110, 2004. 2, 4, 5

work page 2004
[20]

Working hard to know your neighbor’s mar- gins: Local descriptor learning loss

Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovi ´c, and Jiˇr´ı Matas. Working hard to know your neighbor’s mar- gins: Local descriptor learning loss. InNeurIPS, 2017. 4, 5

work page 2017
[21]

Spacenet9 final report (1st place), 2025

Andrea Nascetti. Spacenet9 final report (1st place), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025
[22]

DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024. 4, 7

work page 2024
[23]

Spacenet9 final report (5th place),

Jes ´us Orozco G ´omez. Spacenet9 final report (5th place),

work page
[24]

Winning technical report (SpaceNet9 Challenge). 3

work page
[25]

Nascimento

Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Mar- tins, and Erickson R. Nascimento. XFeat: Accelerated fea- tures for lightweight image matching. InCVPR, 2024. 4

work page 2024
[26]

Spacenet9 final report (3rd place), 2025

Roman Pyankov. Spacenet9 final report (3rd place), 2025. Winning technical report (SpaceNet9 Challenge). 3, 4

work page 2025
[27]

Depth any canopy: Leveraging depth foundation models for canopy height estimation

Daniele Rege Cambrin, Isaac Corley, and Paolo Garza. Depth any canopy: Leveraging depth foundation models for canopy height estimation. InEuropean Conference on Com- puter Vision, pages 71–86. Springer, 2024. 3

work page 2024
[28]

Kornia: an open source differentiable computer vision library for pytorch

Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: an open source differentiable computer vision library for pytorch. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3674–3683, 2020. 4

work page 2020
[29]

Advancing earth observation through machine learning: A torchgeo tutorial.arXiv preprint arXiv:2603.02386, 2026

Caleb Robinson, Nils Lehmann, Adam J Stewart, Burak Ekim, Heng Fang, Isaac A Corley, and Mauricio Cordeiro. Advancing earth observation through machine learning: A torchgeo tutorial.arXiv preprint arXiv:2603.02386, 2026. 2

work page arXiv 2026
[30]

Mission critical–satellite data is a dis- tinct modality in machine learning.arXiv preprint arXiv:2402.01444, 2024

Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a dis- tinct modality in machine learning.arXiv preprint arXiv:2402.01444, 2024. 2, 3

work page arXiv 2024
[31]

Cross-modal satellite imagery registration

Kelly Schroeder. Cross-modal satellite imagery registration. https://spacenet.ai/sn9- challenge/, 2025. SpaceNet 9 overview page, accessed 2026-02-21. 2

work page 2025
[32]

GIM: Learn- ing generalizable image matcher from internet videos

Xuelun Shen, Zhipeng Yin, Xin Wang, Xuehui Chen, Zijin Chen, Xiao Bai, Jian Wang, and Hongbo Gao. GIM: Learn- ing generalizable image matcher from internet videos. In ICLR, 2024. 4, 5

work page 2024
[33]

Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

work page 2025
[34]

Spacenet9 challenge repository

SpaceNetChallenge. Spacenet9 challenge repository. https : / / github . com / SpaceNetChallenge / SpaceNet9, 2026. GitHub repository, accessed 2026-02-

work page 2026
[35]

LoFTR: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. InCVPR, 2021. 4, 5

work page 2021
[36]

Spacenet9 final report (top graduate), 2025

Dongli Tan. Spacenet9 final report (top graduate), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025
[37]

Spacenet 9: Cross-modal satellite im- agery registration.https : / / www

Topcoder. Spacenet 9: Cross-modal satellite im- agery registration.https : / / www . topcoder . com/challenges/9620f66a- 767e- 40ac- 81d5- 5cc61274b186, 2025. Challenge page, accessed 2026- 02-21. 2

work page 2025
[38]

Aydın Alatan

¨Onder Tuzcuo ˘glu, Aybora K ¨oksal, Bu ˘gra Sofu, Sinan Kalkan, and A. Aydın Alatan. XoFTR: Cross-modal fea- ture matching transformer. InCVPR 2024 Workshops (IMW),

work page 2024
[39]

DISK: Learning local features with policy gradient

Michał Tyszkiewicz, Pascal Fua, and Vincent Lepetit. DISK: Learning local features with policy gradient. InNeurIPS,

work page
[40]

Spacenet9 final report (top undergrad- uate), 2025

Poojan Vachharajani. Spacenet9 final report (top undergrad- uate), 2025. Winning technical report (SpaceNet9 Chal- lenge). 3

work page 2025
[41]

DUSt3R: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and J ´erˆome Revaud. DUSt3R: Geometric 3d vision made easy. InCVPR, 2024. 4, 5

work page 2024
[42]

SOMA- 1M: A large-scale SAR-optical multi-resolution alignment dataset for multi-task remote sensing, 2026

Peihao Wu, Yongxiang Yao, Yi Wan, Wenfei Zhang, Ruipeng Zhao, Jiayuan Li, and Yongjun Zhang. SOMA- 1M: A large-scale SAR-optical multi-resolution alignment dataset for multi-task remote sensing, 2026. 3, 9

work page 2026
[43]

Yuming Xiang, Rongshu Tao, Feng Wang, Hongjian You, and Bing Han. Automatic registration of optical and SAR images via improved phase congruency model.IEEE Jour- nal of Selected Topics in Applied Earth Observations and Remote Sensing, 13:5847–5861, 2020. 2

work page 2020
[44]

Han Zhang, Weiping Ni, Weidong Yan, Deliang Xiang, Jun- zheng Wu, Xiaoliang Yang, and Hui Bian. Registration of multimodal remote sensing image based on deep fully con- volutional neural network.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(8): 3028–3042, 2019. 2

work page 2019
[45]

Xiaoming Zhao, Xingming Wu, Jiabi Miao, Weihai Chen, Peter C. Y . Chen, and Zhengguo Li. ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. InIEEE Transactions on Instrumentation and Measurement, 2023. 4, 5 10

work page 2023

[1] [1]

vismatch: Wrapper of 50+ image matching models with a unified interface.https: //github.com/gmberton/vismatch, 2026

Gabriele Berton and contributors. vismatch: Wrapper of 50+ image matching models with a unified interface.https: //github.com/gmberton/vismatch, 2026. GitHub repository, accessed 2026-02-21. 4

work page 2026

[2] [2]

Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography

Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone. Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024. 4

work page 2024

[3] [3]

Spacenet9 final report (4th place), 2025

Giovanni Cavallin. Spacenet9 final report (4th place), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025

[4] [4]

SuperPoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. SuperPoint: Self-supervised interest point detection and description. InCVPR Workshops, 2018. 4, 5

work page 2018

[5] [5]

Less biased noise scale estimation for threshold-robust ransac

Johan Edstedt. Less biased noise scale estimation for threshold-robust ransac. InCVPR 2025 Workshops (IMW),

work page 2025

[6] [6]

DeDoDe: Detect, don’t describe— describe, don’t detect for local feature matching

Johan Edstedt, Georg Athanasiadis, Marten B ¨ulow, and M˚arten Wadenb ¨ack. DeDoDe: Detect, don’t describe— describe, don’t detect for local feature matching. In3DV,

work page

[7] [7]

RoMa: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. RoMa: Robust dense feature matching. InCVPR, 2024. 3, 4, 5, 7

work page 2024

[8] [8]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 4

work page 1981

[9] [9]

Bacastow

Ronny H”ansch, Jacob Arndt, Philipe Dias, Abhishek Pot- nis, Dalton Lunga, Desiree Petrie, and Todd M. Bacastow. Introducing spacenet9 – cross-modal satellite imagery regis- tration for natural disaster responses. InIGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Sympo- sium, 2024. 2, 3

work page 2024

[10] [10]

Spacenet 9-cross-sensor alignment of optical and sar im- agery.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

Ronny H ¨ansch, Jacob Arndt, Abhishek Potnis, Philipe Dias, Peter Novotn `y, Fabio Pacifici, and Todd M Bacastow. Spacenet 9-cross-sensor alignment of optical and sar im- agery.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026. 2

work page 2026

[11] [11]

arXiv preprint arXiv:2501.07556 (2025)

Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, and Xiaowei Zhou. MatchAnything: Universal cross-modality image matching with large-scale pre-training. arXiv preprint arXiv:2501.07556, 2025. 2, 3, 4, 5

work page arXiv 2025

[12] [12]

Hughes, Michael Schmitt, Lichao Mou, Yuanyuan Wang, and Xiao Xiang Zhu

Lloyd H. Hughes, Michael Schmitt, Lichao Mou, Yuanyuan Wang, and Xiao Xiang Zhu. Identifying corresponding patches in SAR and optical images with a pseudo-siamese CNN.IEEE Geoscience and Remote Sensing Letters, 15(5): 784–788, 2018. 2, 3

work page 2018

[13] [13]

A deep learning framework for matching of SAR and optical imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 169:166–179, 2020

Lloyd Haydn Hughes, Diego Marcos, Sylvain Lobry, Devis Tuia, and Michael Schmitt. A deep learning framework for matching of SAR and optical imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 169:166–179, 2020. 2

work page 2020

[14] [14]

MINIMA: Modality invariant image matching

Xingyu Jiang, Jiangwei Ma, Xinying Hu, Yao Tai, Chengjie Wang, and Jian Yang. MINIMA: Modality invariant image matching. InNeurIPS, 2024. 3, 4, 5, 7

work page 2024

[15] [15]

Spacenet9 final report (2nd place), 2025

Motoki Kimura. Spacenet9 final report (2nd place), 2025. Winning technical report (SpaceNet9 Challenge). 3, 4 9

work page 2025

[16] [16]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3D with MASt3R. InECCV, 2024. 4, 5

work page 2024

[17] [17]

Multimodal image matching: A scale-invariant algorithm and an open dataset.ISPRS Journal of Photogrammetry and Remote Sensing, 204:77–88, 2023

Jiayuan Li, Qingwu Hu, and Yongjun Zhang. Multimodal image matching: A scale-invariant algorithm and an open dataset.ISPRS Journal of Photogrammetry and Remote Sensing, 204:77–88, 2023. 2, 3

work page 2023

[18] [18]

LightGlue: Local feature matching at light speed

Philipp Lindenberger, Paul-Erik Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. InICCV,

work page

[19] [19]

David G. Lowe. Distinctive image features from scale- invariant keypoints.International Journal of Computer Vi- sion, 60(2):91–110, 2004. 2, 4, 5

work page 2004

[20] [20]

Working hard to know your neighbor’s mar- gins: Local descriptor learning loss

Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovi ´c, and Jiˇr´ı Matas. Working hard to know your neighbor’s mar- gins: Local descriptor learning loss. InNeurIPS, 2017. 4, 5

work page 2017

[21] [21]

Spacenet9 final report (1st place), 2025

Andrea Nascetti. Spacenet9 final report (1st place), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025

[22] [22]

DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024. 4, 7

work page 2024

[23] [23]

Spacenet9 final report (5th place),

Jes ´us Orozco G ´omez. Spacenet9 final report (5th place),

work page

[24] [24]

Winning technical report (SpaceNet9 Challenge). 3

work page

[25] [25]

Nascimento

Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Mar- tins, and Erickson R. Nascimento. XFeat: Accelerated fea- tures for lightweight image matching. InCVPR, 2024. 4

work page 2024

[26] [26]

Spacenet9 final report (3rd place), 2025

Roman Pyankov. Spacenet9 final report (3rd place), 2025. Winning technical report (SpaceNet9 Challenge). 3, 4

work page 2025

[27] [27]

Depth any canopy: Leveraging depth foundation models for canopy height estimation

Daniele Rege Cambrin, Isaac Corley, and Paolo Garza. Depth any canopy: Leveraging depth foundation models for canopy height estimation. InEuropean Conference on Com- puter Vision, pages 71–86. Springer, 2024. 3

work page 2024

[28] [28]

Kornia: an open source differentiable computer vision library for pytorch

Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: an open source differentiable computer vision library for pytorch. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3674–3683, 2020. 4

work page 2020

[29] [29]

Advancing earth observation through machine learning: A torchgeo tutorial.arXiv preprint arXiv:2603.02386, 2026

Caleb Robinson, Nils Lehmann, Adam J Stewart, Burak Ekim, Heng Fang, Isaac A Corley, and Mauricio Cordeiro. Advancing earth observation through machine learning: A torchgeo tutorial.arXiv preprint arXiv:2603.02386, 2026. 2

work page arXiv 2026

[30] [30]

Mission critical–satellite data is a dis- tinct modality in machine learning.arXiv preprint arXiv:2402.01444, 2024

Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a dis- tinct modality in machine learning.arXiv preprint arXiv:2402.01444, 2024. 2, 3

work page arXiv 2024

[31] [31]

Cross-modal satellite imagery registration

Kelly Schroeder. Cross-modal satellite imagery registration. https://spacenet.ai/sn9- challenge/, 2025. SpaceNet 9 overview page, accessed 2026-02-21. 2

work page 2025

[32] [32]

GIM: Learn- ing generalizable image matcher from internet videos

Xuelun Shen, Zhipeng Yin, Xin Wang, Xuehui Chen, Zijin Chen, Xiao Bai, Jian Wang, and Hongbo Gao. GIM: Learn- ing generalizable image matcher from internet videos. In ICLR, 2024. 4, 5

work page 2024

[33] [33]

Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

work page 2025

[34] [34]

Spacenet9 challenge repository

SpaceNetChallenge. Spacenet9 challenge repository. https : / / github . com / SpaceNetChallenge / SpaceNet9, 2026. GitHub repository, accessed 2026-02-

work page 2026

[35] [35]

LoFTR: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. InCVPR, 2021. 4, 5

work page 2021

[36] [36]

Spacenet9 final report (top graduate), 2025

Dongli Tan. Spacenet9 final report (top graduate), 2025. Winning technical report (SpaceNet9 Challenge). 3

work page 2025

[37] [37]

Spacenet 9: Cross-modal satellite im- agery registration.https : / / www

Topcoder. Spacenet 9: Cross-modal satellite im- agery registration.https : / / www . topcoder . com/challenges/9620f66a- 767e- 40ac- 81d5- 5cc61274b186, 2025. Challenge page, accessed 2026- 02-21. 2

work page 2025

[38] [38]

Aydın Alatan

¨Onder Tuzcuo ˘glu, Aybora K ¨oksal, Bu ˘gra Sofu, Sinan Kalkan, and A. Aydın Alatan. XoFTR: Cross-modal fea- ture matching transformer. InCVPR 2024 Workshops (IMW),

work page 2024

[39] [39]

DISK: Learning local features with policy gradient

Michał Tyszkiewicz, Pascal Fua, and Vincent Lepetit. DISK: Learning local features with policy gradient. InNeurIPS,

work page

[40] [40]

Spacenet9 final report (top undergrad- uate), 2025

Poojan Vachharajani. Spacenet9 final report (top undergrad- uate), 2025. Winning technical report (SpaceNet9 Chal- lenge). 3

work page 2025

[41] [41]

DUSt3R: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and J ´erˆome Revaud. DUSt3R: Geometric 3d vision made easy. InCVPR, 2024. 4, 5

work page 2024

[42] [42]

SOMA- 1M: A large-scale SAR-optical multi-resolution alignment dataset for multi-task remote sensing, 2026

Peihao Wu, Yongxiang Yao, Yi Wan, Wenfei Zhang, Ruipeng Zhao, Jiayuan Li, and Yongjun Zhang. SOMA- 1M: A large-scale SAR-optical multi-resolution alignment dataset for multi-task remote sensing, 2026. 3, 9

work page 2026

[43] [43]

Yuming Xiang, Rongshu Tao, Feng Wang, Hongjian You, and Bing Han. Automatic registration of optical and SAR images via improved phase congruency model.IEEE Jour- nal of Selected Topics in Applied Earth Observations and Remote Sensing, 13:5847–5861, 2020. 2

work page 2020

[44] [44]

Han Zhang, Weiping Ni, Weidong Yan, Deliang Xiang, Jun- zheng Wu, Xiaoliang Yang, and Hui Bian. Registration of multimodal remote sensing image based on deep fully con- volutional neural network.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(8): 3028–3042, 2019. 2

work page 2019

[45] [45]

Xiaoming Zhao, Xingming Wu, Jiabi Miao, Weihai Chen, Peter C. Y . Chen, and Zhengguo Li. ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. InIEEE Transactions on Instrumentation and Measurement, 2023. 4, 5 10

work page 2023