arxiv: 2604.05689 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords cross-modal image registration · feature flow learning · transformer architecture · image alignment · multimodal correspondence · geometric transform

The pith

CRFT uses a transformer to learn a consistent recurrent feature flow that aligns cross-modal images more accurately and robustly than existing methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CRFT as a unified coarse-to-fine framework for cross-modal image registration based on learning a modality-independent feature flow in a transformer. It jointly performs feature alignment and flow estimation, with a coarse stage for global correspondences via multi-scale correlation and a fine stage for local refinement through hierarchical fusion and adaptive reasoning. An iterative discrepancy-guided attention combined with Spatial Geometric Transform recurrently refines the flow to enforce consistency under large variations. A reader would care because reliable registration between different sensor types enables better integration of data in fields like remote sensing and medical imaging where modalities differ significantly in appearance.

Core claim

CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. An iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities.
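
As a rough illustration of how a coarse stage can turn feature correlation into a flow field, the sketch below correlates every position of one feature map with every position of the other and takes the softmax-weighted expected match location. This is a common global-matching recipe assumed here for illustration only; the source does not confirm that CRFT computes its coarse flow this way, and `global_correlation_flow` and the toy features are hypothetical.

```python
import math

def global_correlation_flow(feat_a, feat_b):
    """Dense flow from A to B via all-pairs dot-product correlation.

    feat_a, feat_b: H x W grids (lists of rows) of feature vectors.
    Returns an H x W grid of (dx, dy) displacements.
    Assumption: softmax-weighted expected position, not CRFT's exact head.
    """
    h, w = len(feat_a), len(feat_a[0])
    coords = [(x, y) for y in range(h) for x in range(w)]
    flat_b = [feat_b[y][x] for (x, y) in coords]
    flow = []
    for y in range(h):
        row = []
        for x in range(w):
            fa = feat_a[y][x]
            # Correlation of this A feature with every B feature.
            scores = [sum(p * q for p, q in zip(fa, fb)) for fb in flat_b]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            ex = sum(wt * cx for wt, (cx, _) in zip(weights, coords)) / total
            ey = sum(wt * cy for wt, (_, cy) in zip(weights, coords)) / total
            row.append((ex - x, ey - y))
        flow.append(row)
    return flow

# Toy 1 x 3 feature maps: B is A cyclically shifted one position to the right.
toy_a = [[[10, 0, 0], [0, 10, 0], [0, 0, 10]]]
toy_b = [[[0, 0, 10], [10, 0, 0], [0, 10, 0]]]
toy_flow = global_correlation_flow(toy_a, toy_b)
```

Because the toy features are strongly peaked, each position's softmax is close to one-hot and the recovered displacement is close to the true integer shift. With real features the expectation blends several candidates, which is one reason a later refinement stage is useful.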

What carries the argument

The Consistent-Recurrent Feature Flow Transformer, which learns a modality-independent feature flow representation through iterative discrepancy-guided attention and Spatial Geometric Transform to enforce consistency across modalities.

Load-bearing premise

A single modality-independent feature flow representation learned in the transformer can jointly handle feature alignment and flow estimation while the iterative discrepancy-guided attention with Spatial Geometric Transform enforces consistency under large affine and scale variations.

What would settle it

Registration experiments on a new cross-modal dataset featuring affine and scale variations larger than those in the original tests, where CRFT's accuracy and robustness metrics fall below those of competing state-of-the-art methods.
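
A test of this kind needs image pairs with known ground-truth flow under controlled scale and rotation. The sketch below shows one way such ground truth can be generated from a transform matrix. It is an assumption-laden illustration, not the paper's evaluation protocol: it uses a similarity transform (a special case of affine), and `affine_matrix` and `affine_flow` are hypothetical helper names.

```python
import math

def affine_matrix(scale, angle_deg, tx, ty):
    """2 x 3 matrix for uniform scale + rotation + translation (a similarity
    transform, which is a restricted affine transform)."""
    c = scale * math.cos(math.radians(angle_deg))
    s = scale * math.sin(math.radians(angle_deg))
    return [[c, -s, tx], [s, c, ty]]

def affine_flow(matrix, width, height):
    """Ground-truth flow: where each pixel (x, y) lands minus where it started."""
    (a, b, tx), (c, d, ty) = matrix
    return [
        [(a * x + b * y + tx - x, c * x + d * y + ty - y) for x in range(width)]
        for y in range(height)
    ]

# Doubling the scale with a (1, -1) translation on a 3 x 2 grid.
gt_matrix = affine_matrix(scale=2.0, angle_deg=0.0, tx=1.0, ty=-1.0)
gt_flow = affine_flow(gt_matrix, width=3, height=2)
```

Sweeping `scale` and `angle_deg` beyond the ranges used in the original experiments would give the stress test described above, with registration error measured against `gt_flow`.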

Figures

Figures reproduced from arXiv: 2604.05689 by Mengzhu Ding, Xichao Teng, Xuecong Liu, Zhang Li, Zixuan Sun.

Figure 1: Overview of the proposed CRFT framework. CRFT follows a unified coarse-to-fine pipeline for robust cross-modal image registration. (a) Multi-Scale Feature Extraction: A CNN encoder extracts modality-independent features at {1/2, 1/4, 1/8} resolutions from the input image pair (I_A, I_B). (b) Coarse-Scale Flow Estimation: At 1/8 resolution, transformer-style self-attention (SA) and cross-attention (CA) opera…
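
When a flow field estimated at 1/8 resolution is carried to a finer level, its vectors have to be rescaled along with the grid, since a displacement of one cell at 1/8 resolution spans two cells at 1/4 resolution. A minimal sketch, assuming nearest-neighbour upsampling (the caption does not say which upsampling scheme the paper uses) and a hypothetical helper name `upsample_flow`:

```python
def upsample_flow(flow, factor):
    """Nearest-neighbour upsampling of a flow field.

    flow: H x W grid (list of rows) of (dx, dy) tuples.
    Both the grid and the vector magnitudes grow by `factor`.
    """
    out = []
    for row in flow:
        # Repeat each vector `factor` times horizontally, scaling its length.
        wide = [(dx * factor, dy * factor) for (dx, dy) in row for _ in range(factor)]
        # Repeat the widened row `factor` times vertically.
        out.extend([list(wide) for _ in range(factor)])
    return out

coarse = [[(1.0, -0.5)]]           # one cell at the coarse level
finer = upsample_flow(coarse, 2)   # 2 x 2 cells at the next finer level
```

Forgetting the magnitude rescaling is a frequent source of misaligned initialisations in coarse-to-fine pipelines, which is why it is spelled out here.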
Figure 2: Iterative discrepancy-guided flow refinement. At each iteration, fine-scale features from both modalities are first mapped into a shared feature space (FSFT). The current flow then drives an SGT to align the local window of F^B_{5×5} with F^A_{5×5}. Feature discrepancies between the aligned windows are computed and used to weight an attention-based refinement module, which predicts a residual flow added to the …
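
The loop described in this caption (align with the current flow, measure the discrepancy, predict a residual, add it to the flow) can be imitated on a one-dimensional toy signal. The sketch below is a deliberately simplified stand-in, not the paper's module: the learned attention is replaced by a hand-written softmax over window discrepancies, the SGT warp is replaced by integer-rounded index shifts, and `patch_discrepancy`, `refine_flow`, the window radius, and the temperature are all hypothetical.

```python
import math

def patch_discrepancy(feat_a, feat_b, i, shift, radius):
    """Sum of absolute differences between a window of A centred at i and the
    window of B centred at i + shift (indices clamped at the borders)."""
    total = 0.0
    for k in range(-radius, radius + 1):
        ia = min(max(i + k, 0), len(feat_a) - 1)
        ib = min(max(i + k + shift, 0), len(feat_b) - 1)
        total += abs(feat_a[ia] - feat_b[ib])
    return total

def refine_flow(feat_a, feat_b, iterations=5, radius=2, temperature=0.1):
    """Iteratively add a residual to the flow at every position.

    Each iteration looks one step left and right of the current estimate,
    turns the three window discrepancies into softmax weights (low
    discrepancy means high weight), and uses the weighted mean offset as
    the residual. This softmax is a hand-written stand-in for a learned
    attention module.
    """
    flow = [0.0] * len(feat_a)
    offsets = (-1, 0, 1)
    for _ in range(iterations):
        for i in range(len(feat_a)):
            base = int(round(flow[i]))
            disc = [patch_discrepancy(feat_a, feat_b, i, base + d, radius)
                    for d in offsets]
            best = min(disc)  # shift by the minimum for numerical stability
            weights = [math.exp(-(v - best) / temperature) for v in disc]
            residual = sum(w * d for w, d in zip(weights, offsets)) / sum(weights)
            flow[i] += residual
    return flow

# B is A shifted right by three samples, so the true flow on the bump is +3.
feat_a = [0, 0, 0, 1, 2, 3, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0]
feat_b = [0, 0, 0] + feat_a[:-3]
flow = refine_flow(feat_a, feat_b)
```

On the textured part of the signal the estimate moves about one sample per iteration until the discrepancy is symmetric around it, after which the residual vanishes. In flat regions the discrepancies tie and the flow stays put, which mirrors why texture-poor areas are hard for any local refinement scheme.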
Figure 3: Visual comparison between the predicted flow and ground-truth flow fields. The proposed CRFT achieves geometrically consistent and dense alignment across challenging optical-SAR and optical-infrared image pairs, demonstrating strong robustness to nonlinear radiation and geometric variations.
Figure 4: Quantitative comparison with state-of-the-art methods. For each image pair, 5000 candidate correspondences are uniformly sampled, and only the matches with registration error below 2 pixels are visualized. The density and spatial consistency of the displayed inliers reflect the alignment accuracy of different methods across both OSdataset and RoadScene. CRFT produces the largest number of geometrically con…
Figure 5: CMR curves under varying thresholds on (a) OSdataset and (b) RoadScene. We compare CRFT with representative handcrafted, sparse, and (semi-)dense matching methods by evaluating the CMR across a range of pixel thresholds. CRFT consistently achieves the highest CMR across a wide range of thresholds. … show severe degradation under cross-modal shifts, with AEPE ranging from 17.27 to 50.87 and CMR falling to zer…
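
The two metrics named in these captions can be computed from per-correspondence errors. The sketch below assumes that AEPE is the average endpoint error (mean Euclidean distance between predicted and ground-truth flow vectors) and that CMR is the fraction of correspondences whose error falls below a pixel threshold; the page does not define either term, so both readings, and all function names, are assumptions.

```python
import math

def endpoint_errors(pred_flow, gt_flow):
    """Euclidean distance between each predicted and ground-truth flow vector."""
    return [math.hypot(pu - gu, pv - gv)
            for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)]

def aepe(errors):
    """Average endpoint error over all sampled correspondences."""
    return sum(errors) / len(errors)

def cmr(errors, threshold):
    """Fraction of correspondences with error strictly below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)

def cmr_curve(errors, thresholds):
    """CMR evaluated at several pixel thresholds, as (threshold, rate) pairs."""
    return [(t, cmr(errors, t)) for t in thresholds]

# Four toy correspondences with a zero ground-truth flow.
pred = [(1.0, 0.0), (3.0, 4.0), (0.0, 0.0), (10.0, 0.0)]
gt = [(0.0, 0.0)] * 4
errors = endpoint_errors(pred, gt)
curve = cmr_curve(errors, [1, 2, 6])
```

A single large outlier raises AEPE sharply while leaving CMR at low thresholds unchanged, which is why the two are usually reported together.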
Figure 6: Ablation study on OSdataset. We evaluate the contribution of each component in CRFT by progressively adding the flow estimation (FE), iterative discrepancy-guided optimization (IDGO), and the Iterative Loss (IL) to the XoFTR baseline. The CMR curves across varying thresholds demonstrate that each module brings a clear performance gain, while integrating all modules (X+FE+IDGO+IL) yields the most significan…
Figure 7: Flow prediction visualization on OSdataset. For each pair, we visualize the predicted dense flow, the ground-truth flow, and the corresponding difference map. CRFT produces smooth and geometrically consistent flow fields across the optical-SAR modality gap, indicating accurate geometric correspondence and robustness to strong appearance variations.
Figure 8: Flow prediction visualization on RoadScene. For each pair, we visualize the predicted dense flow, the ground-truth flow, and the corresponding difference map. CRFT maintains stable and consistent flow behavior across heterogeneous modalities, demonstrating robustness to illumination changes, noise patterns, and texture inconsistencies in complex driving environments.
Figure 9: Qualitative comparison on the OSdataset (top) and RoadScene (bottom). The red quadrilateral denotes the ground-truth alignment, while the yellow quadrilateral shows the predicted registration result. CRFT achieves more accurate and geometrically consistent registration under large geometric deformation and modality gaps compared with existing optical-SAR registration baselines.
Figure 10: Checkerboard registration on OSdataset (left) and RoadScene (right). CRFT yields geometrically coherent checkerboard fusion across both optical-SAR and optical-infrared modalities. Zoomed-in regions demonstrate precise alignment of boundaries and textures, confirming the model’s ability to recover fine-scale geometric correspondence under large cross-modal appearance differences.
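
A checkerboard visualisation of the kind described here interleaves square tiles from the reference image and the registered image, so that misalignment shows up as broken edges at tile borders. A minimal sketch, assuming both images are already the same size and stored as nested lists; `checkerboard_mosaic` and the tile size are hypothetical and not taken from the paper's code.

```python
def checkerboard_mosaic(img_a, img_b, tile):
    """Alternate square tiles of side `tile` from two equally sized images."""
    height, width = len(img_a), len(img_a[0])
    return [
        [
            # Tiles whose (column + row) tile index is even come from img_a.
            img_a[y][x] if ((x // tile) + (y // tile)) % 2 == 0 else img_b[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]

# Two constant 4 x 4 "images" make the tile pattern easy to see.
ref = [["A"] * 4 for _ in range(4)]
reg = [["B"] * 4 for _ in range(4)]
mosaic = checkerboard_mosaic(ref, reg, tile=2)
```

With real image pairs, roads, building outlines, and other linear structures should continue unbroken across tile boundaries when the registration is accurate.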
Original abstract

We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces CRFT, a unified coarse-to-fine transformer framework for cross-modal image registration based on learning modality-independent feature flows. The coarse stage uses multi-scale feature correlation for global correspondences, while the fine stage employs hierarchical feature fusion and adaptive spatial reasoning. An iterative discrepancy-guided attention mechanism augmented with a Spatial Geometric Transform (SGT) recurrently refines the flow field to capture spatial inconsistencies and enforce consistency under large affine and scale variations. The central claim is that CRFT consistently outperforms state-of-the-art registration methods in accuracy and robustness across diverse cross-modal datasets, with broader applicability to remote sensing, autonomous navigation, and medical imaging; code and datasets are released publicly.

Significance. If the empirical results are robust, this work could advance cross-modal registration by providing a practical, transformer-based paradigm that jointly addresses feature alignment and flow estimation without modality-specific assumptions. The emphasis on recurrent consistency enforcement and public code release supports reproducibility and potential adoption in applied domains where large deformations are common.

minor comments (3)
  1. The abstract states that 'extensive experiments... demonstrate that CRFT consistently outperforms' but provides no quantitative metrics, specific datasets, or baseline comparisons; moving at least one key result (e.g., average error reduction on a named dataset) into the abstract would strengthen the summary.
  2. Notation for the Spatial Geometric Transform (SGT) and the discrepancy-guided attention is introduced without an explicit equation reference in the high-level description; adding a compact equation block early in §3 would improve readability for readers unfamiliar with recurrent flow refinement.
  3. The claim of a 'modality-independent feature flow representation' is presented as an outcome of joint training; a short ablation isolating the contribution of the recurrent SGT module versus a non-recurrent baseline would help substantiate that this property is not merely an artifact of the training data.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript on CRFT and the recommendation for minor revision. We appreciate the recognition of the framework's potential to advance cross-modal registration through recurrent consistency enforcement and its applicability across domains. No specific major comments were provided in the report, so we have no point-by-point revisions to address at this stage.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a transformer-based architecture for cross-modal registration with coarse-to-fine stages, discrepancy-guided attention, and a Spatial Geometric Transform module. No equations, derivations, or first-principles claims are present in the provided text that reduce performance claims to fitted parameters, self-definitions, or self-citation chains. The central claims rest on empirical outperformance across datasets, which is independent of any internal reduction. This is a standard empirical ML architecture paper with no load-bearing theoretical steps that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated; typical deep-learning registration models contain many hyperparameters, but none are identified here.

pith-pipeline@v0.9.0 · 5502 in / 1024 out tokens · 68459 ms · 2026-05-10T19:34:16.218576+00:00 · methodology

