CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Mengzhu Ding; Xichao Teng; Xuecong Liu; Zhang Li; Zixuan Sun

arxiv: 2604.05689 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3


keywords cross-modal image registration · feature flow learning · transformer architecture · image alignment · multimodal correspondence · geometric transform


The pith

CRFT uses a transformer to learn a consistent recurrent feature flow that aligns cross-modal images more accurately and robustly than existing methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CRFT as a unified coarse-to-fine framework for cross-modal image registration based on learning a modality-independent feature flow in a transformer. It jointly performs feature alignment and flow estimation, with a coarse stage for global correspondences via multi-scale correlation and a fine stage for local refinement through hierarchical fusion and adaptive reasoning. An iterative discrepancy-guided attention mechanism, combined with a Spatial Geometric Transform (SGT), recurrently refines the flow to enforce consistency under large variations. A reader would care because reliable registration between different sensor types enables better integration of data in fields such as remote sensing and medical imaging, where modalities differ significantly in appearance.

Core claim

CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. An iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities.
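
As a rough illustration of the coarse stage's global correspondence step, the sketch below matches every location in one coarse feature map to its best-correlated location in the other and converts that match into a flow vector. It is a minimal single-scale sketch under stated assumptions, not the paper's implementation: the function name `coarse_flow_from_correlation`, the cosine-similarity correlation, and the hard argmax are illustrative choices, and the self- and cross-attention layers mentioned in Figure 1 are omitted.

```python
import numpy as np

def coarse_flow_from_correlation(feat_a, feat_b):
    """Hypothetical helper: all-pairs correlation followed by a hard argmax.

    feat_a, feat_b: arrays of shape (H, W, C) holding coarse features.
    Returns an integer flow of shape (H, W, 2) stored as (dx, dy).
    """
    h, w, c = feat_a.shape
    a = feat_a.reshape(h * w, c)
    b = feat_b.reshape(h * w, c)
    # L2-normalise so the dot product is a cosine similarity (assumption).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    corr = a @ b.T                      # (H*W, H*W) correlation volume
    best = corr.argmax(axis=1)          # best match in B for each pixel of A
    by, bx = np.divmod(best, w)         # flat index -> (row, column)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([bx.reshape(h, w) - xs, by.reshape(h, w) - ys], axis=-1)

# Toy check: B is A shifted by 2 rows and 3 columns (with wrap-around).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 8, 16))
feat_b = np.roll(feat_a, shift=(2, 3), axis=(0, 1))
flow = coarse_flow_from_correlation(feat_a, feat_b)
```

Away from the wrap-around border, the recovered flow equals the applied shift, stored as (dx, dy) = (3, 2). A learned model would replace the random features with encoder outputs and would typically use a soft, differentiable match instead of the argmax.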

What carries the argument

The Consistent-Recurrent Feature Flow Transformer, which learns a modality-independent feature flow representation through iterative discrepancy-guided attention and Spatial Geometric Transform to enforce consistency across modalities.
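
The recurrent pattern described here (warp with the current flow, measure the remaining discrepancy, add a residual update) can be sketched with a toy example. Everything below is an assumption made for illustration: the learned discrepancy-guided attention module is swapped for a plain Lucas-Kanade-style least-squares step, the flow is a single global translation instead of a dense field, and the names `bilinear_sample` and `refine_translation` are hypothetical.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample a 2-D array at float coordinates, clamping at the border."""
    h, w = img.shape
    xs = np.clip(xs, 0.0, w - 1.0)
    ys = np.clip(ys, 0.0, h - 1.0)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    ax, ay = xs - x0, ys - y0
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1]
    bottom = (1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bottom

def refine_translation(feat_a, feat_b, num_iters=10, border=3):
    """Recurrently refine a global (dx, dy) flow from feat_a to feat_b."""
    h, w = feat_a.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    inner = (slice(border, h - border), slice(border, w - border))
    flow = np.zeros(2)
    for _ in range(num_iters):
        # 1. Warp B towards A with the current flow estimate.
        warped = bilinear_sample(feat_b, xs + flow[0], ys + flow[1])
        # 2. Discrepancy between A and the warped B.
        discrepancy = feat_a - warped
        # 3. Stand-in for the learned module: least-squares residual flow.
        gy, gx = np.gradient(warped)
        jac = np.stack([gx[inner].ravel(), gy[inner].ravel()], axis=1)
        residual, *_ = np.linalg.lstsq(jac, discrepancy[inner].ravel(), rcond=None)
        # 4. Residual update of the flow.
        flow = flow + residual
    return flow

def pattern(xs, ys):
    """Smooth synthetic 'feature map' used only for this toy example."""
    return np.sin(0.3 * xs) + np.cos(0.25 * ys)

ys, xs = np.mgrid[0:32, 0:32].astype(float)
true_shift = np.array([0.8, -0.5])           # (dx, dy) from A to B
feat_a = pattern(xs, ys)
feat_b = pattern(xs - true_shift[0], ys - true_shift[1])
estimated = refine_translation(feat_a, feat_b)
```

On this smooth synthetic pair the loop settles close to the true sub-pixel shift within a few iterations. The point is only the warp, compare, and update structure; how CRFT weights attention by the discrepancy is not reproduced here.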

Load-bearing premise

A single modality-independent feature flow representation learned in the transformer can jointly handle feature alignment and flow estimation while the iterative discrepancy-guided attention with Spatial Geometric Transform enforces consistency under large affine and scale variations.

What would settle it

Registration experiments on a new cross-modal dataset featuring affine and scale variations larger than those in the original tests, where CRFT's accuracy and robustness metrics fall below those of competing state-of-the-art methods.
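
One way to set up such a stress test is to synthesize ground-truth flow fields from known transforms of increasing magnitude. The sketch below is a hedged illustration only: it covers similarity transforms (scale, rotation, translation), which are a subset of affine transforms, and the function name `similarity_flow` and its parameters are hypothetical and do not come from the paper or its code release.

```python
import numpy as np

def similarity_flow(height, width, scale=1.0, angle_deg=0.0, shift=(0.0, 0.0)):
    """Dense (dx, dy) flow induced by scaling/rotating about the image
    centre and then translating. Returns an array of shape (H, W, 2)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = scale * np.cos(theta), scale * np.sin(theta)
    x_rel, y_rel = xs - cx, ys - cy
    x_new = cos_t * x_rel - sin_t * y_rel + cx + shift[0]
    y_new = sin_t * x_rel + cos_t * y_rel + cy + shift[1]
    return np.stack([x_new - xs, y_new - ys], axis=-1)

identity = similarity_flow(6, 8)
zoom = similarity_flow(6, 8, scale=2.0)
shifted = similarity_flow(6, 8, shift=(1.5, -2.0))
```

Sweeping `scale` and `angle_deg` beyond the ranges used in the original experiments, warping one modality with the resulting flow, and comparing each method's error against that ground truth would give the kind of evidence this section asks for.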

Figures

Figures reproduced from arXiv: 2604.05689 by Mengzhu Ding, Xichao Teng, Xuecong Liu, Zhang Li, Zixuan Sun.

**Figure 1.** Overview of the proposed CRFT framework. CRFT follows a unified coarse-to-fine pipeline for robust cross-modal image registration. (a) Multi-Scale Feature Extraction: A CNN encoder extracts modality-independent features at {1/2, 1/4, 1/8} resolutions from the input image pair (I_A, I_B). (b) Coarse-Scale Flow Estimation: At 1/8 resolution, transformer-style self-attention (SA) and cross-attention (CA) opera…

**Figure 2.** Iterative discrepancy-guided flow refinement. At each iteration, fine-scale features from both modalities are first mapped into a shared feature space (FSFT). The current flow then drives an SGT to align the local window of F_B^{5×5} with F_A^{5×5}. Feature discrepancies between the aligned windows are computed and used to weight an attention-based refinement module, which predicts a residual flow added to the …

**Figure 3.** Visual comparison between the predicted flow and ground-truth flow fields. The proposed CRFT achieves geometrically consistent and dense alignment across challenging optical-SAR and optical-infrared image pairs, demonstrating strong robustness to nonlinear radiation and geometric variations.

**Figure 4.** Quantitative comparison with state-of-the-art methods. For each image pair, 5000 candidate correspondences are uniformly sampled, and only the matches with registration error below 2 pixels are visualized. The density and spatial consistency of the displayed inliers reflect the alignment accuracy of different methods across both OSdataset and RoadScene. CRFT produces the largest number of geometrically con…

**Figure 5.** CMR curves under varying thresholds on (a) OSdataset and (b) RoadScene. We compare CRFT with representative handcrafted, sparse, and (semi-)dense matching methods by evaluating the CMR across a range of pixel thresholds. CRFT consistently achieves the highest CMR across a wide range of thresholds. … show severe degradation under cross-modal shifts, with AEPE ranging from 17.27 to 50.87 and CMR falling to zer…

**Figure 6.** Ablation study on OSdataset. We evaluate the contribution of each component in CRFT by progressively adding the flow estimation (FE), iterative discrepancy-guided optimization (IDGO), and the Iterative Loss (IL) to the XoFTR baseline. The CMR curves across varying thresholds demonstrate that each module brings a clear performance gain, while integrating all modules (X+FE+IDGO+IL) yields the most significan…

**Figure 7.** Flow prediction visualization on OSdataset. For each pair, we visualize the predicted dense flow, the ground-truth flow, and the corresponding difference map. CRFT produces smooth and geometrically consistent flow fields across the optical-SAR modality gap, indicating accurate geometric correspondence and robustness to strong appearance variations.

**Figure 8.** Flow prediction visualization on RoadScene. For each pair, we visualize the predicted dense flow, the ground-truth flow, and the corresponding difference map. CRFT maintains stable and consistent flow behavior across heterogeneous modalities, demonstrating robustness to illumination changes, noise patterns, and texture inconsistencies in complex driving environments.

**Figure 9.** Qualitative comparison on the OSdataset (top) and RoadScene (bottom). The red quadrilateral denotes the ground-truth alignment, while the yellow quadrilateral shows the predicted registration result. CRFT achieves more accurate and geometrically consistent registration under large geometric deformation and modality gaps compared with existing optical-SAR registration baselines.

**Figure 10.** Checkerboard registration on OSdataset (left) and RoadScene (right). CRFT yields geometrically coherent checkerboard fusion across both optical-SAR and optical-infrared modalities. Zoomed-in regions demonstrate precise alignment of boundaries and textures, confirming the model’s ability to recover fine-scale geometric correspondence under large cross-modal appearance differences.
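
The captions above report results as AEPE and as CMR at a pixel threshold. A minimal sketch of how such numbers could be computed from a predicted and a ground-truth flow is given below. It assumes AEPE is the average endpoint error and CMR is the fraction of correspondences whose endpoint error falls below the threshold; the exact definitions in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Per-correspondence Euclidean distance between two flows (..., 2)."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def aepe(flow_pred, flow_gt):
    """Average endpoint error over all correspondences."""
    return float(endpoint_error(flow_pred, flow_gt).mean())

def cmr(flow_pred, flow_gt, threshold=2.0):
    """Assumed definition: share of correspondences with error < threshold."""
    return float((endpoint_error(flow_pred, flow_gt) < threshold).mean())

# Four toy correspondences with endpoint errors 0, 1, 3 and 5 pixels.
flow_gt = np.zeros((4, 2))
flow_pred = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [3.0, 4.0]])

mean_error = aepe(flow_pred, flow_gt)
rate_at_2px = cmr(flow_pred, flow_gt, threshold=2.0)
curve = [cmr(flow_pred, flow_gt, t) for t in (1.0, 2.0, 4.0, 6.0)]
```

Evaluating `cmr` over a range of thresholds, as in the last line, gives the kind of curve plotted in Figures 5 and 6.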

Original abstract

We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRFT combines recurrent discrepancy-guided attention and a spatial geometric transform inside a coarse-to-fine transformer for cross-modal registration, with public code as the main practical plus.


CRFT puts forward a transformer that learns a modality-independent feature flow for aligning images across modalities. The coarse stage uses multi-scale correlation for global matches, then the fine stage adds hierarchical fusion and an iterative loop that refines the flow via discrepancy-guided attention and a Spatial Geometric Transform module. This recurrent setup is meant to keep consistency while handling large affine and scale shifts. The paper shows the pieces fit together into a single trainable pipeline and releases the code and datasets, which is helpful for anyone who wants to test it directly. That combination of named components is not a direct copy of earlier registration transformers, so the architecture counts as new even if the overall coarse-to-fine shape is familiar. The design choices around enforcing feature-level consistency through the recurrent path are laid out clearly enough to follow.

The experiments are described as covering diverse cross-modal sets and showing better accuracy and robustness than prior methods, which is the central claim. The soft spot is that the abstract and high-level description do not include the actual tables, baselines, or error magnitudes, so it is still unclear how large or consistent the gains really are once you look at the numbers. The assumption that the learned flow stays modality-independent across unseen domains also rests on the empirical results rather than a proof, and that could be sensitive to dataset choice.

The paper is aimed at computer-vision groups working on multimodal registration for remote sensing, navigation, or medical use. Readers who follow transformer applications to geometric tasks would get the most out of it. It has enough of a complete method and claimed empirical backing to go to a serious referee rather than a desk reject, even if the review will likely focus on the strength of the quantitative comparisons.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces CRFT, a unified coarse-to-fine transformer framework for cross-modal image registration based on learning modality-independent feature flows. The coarse stage uses multi-scale feature correlation for global correspondences, while the fine stage employs hierarchical feature fusion and adaptive spatial reasoning. An iterative discrepancy-guided attention mechanism augmented with a Spatial Geometric Transform (SGT) recurrently refines the flow field to capture spatial inconsistencies and enforce consistency under large affine and scale variations. The central claim is that CRFT consistently outperforms state-of-the-art registration methods in accuracy and robustness across diverse cross-modal datasets, with broader applicability to remote sensing, autonomous navigation, and medical imaging; code and datasets are released publicly.

Significance. If the empirical results are robust, this work could advance cross-modal registration by providing a practical, transformer-based paradigm that jointly addresses feature alignment and flow estimation without modality-specific assumptions. The emphasis on recurrent consistency enforcement and public code release supports reproducibility and potential adoption in applied domains where large deformations are common.

minor comments (3)

The abstract states that 'extensive experiments... demonstrate that CRFT consistently outperforms' but provides no quantitative metrics, specific datasets, or baseline comparisons; moving at least one key result (e.g., average error reduction on a named dataset) into the abstract would strengthen the summary.
Notation for the Spatial Geometric Transform (SGT) and the discrepancy-guided attention is introduced without an explicit equation reference in the high-level description; adding a compact equation block early in §3 would improve readability for readers unfamiliar with recurrent flow refinement.
The claim of a 'modality-independent feature flow representation' is presented as an outcome of joint training; a short ablation isolating the contribution of the recurrent SGT module versus a non-recurrent baseline would help substantiate that this property is not merely an artifact of the training data.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript on CRFT and the recommendation for minor revision. We appreciate the recognition of the framework's potential to advance cross-modal registration through recurrent consistency enforcement and its applicability across domains. No specific major comments were provided in the report, so we have no point-by-point revisions to address at this stage.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes a transformer-based architecture for cross-modal registration with coarse-to-fine stages, discrepancy-guided attention, and a Spatial Geometric Transform module. No equations, derivations, or first-principles claims are present in the provided text that reduce performance claims to fitted parameters, self-definitions, or self-citation chains. The central claims rest on empirical outperformance across datasets, which is independent of any internal reduction. This is a standard empirical ML architecture paper with no load-bearing theoretical steps that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated; typical deep-learning registration models contain many hyperparameters, but none are identified here.


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

[1]

Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024

R Archana and PS Eliahim Jeevaraj. Deep learning models for digital image processing: a review.Artificial Intelligence Review, 57(1):11, 2024. 1

work page 2024
[2]

Graphi2p: Image-to-point cloud registration with exploring pattern of correspondence via graph learning

Lin Bie, Shouan Pan, Siqi Li, Yining Zhao, and Yue Gao. Graphi2p: Image-to-point cloud registration with exploring pattern of correspondence via graph learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22161–22171, 2025. 1

work page 2025
[3]

A survey on deep learning in medical image registration: New technologies, uncertainty, evaluation metrics, and beyond.Medical Image Analysis, 100:103385,

Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L Prince, and Yong Du. A survey on deep learning in medical image registration: New technologies, uncertainty, evaluation metrics, and beyond.Medical Image Analysis, 100:103385,

work page
[4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean Conference on Computer Vision, pages 370–386. Springer, 2024. 1

work page 2024
[5]

Dsap: Dynamic sparse attention perception matcher for accurate local feature matching

Kun Dai, Ke Wang, Tao Xie, Tao Sun, Jinhang Zhang, Qingjia Kong, Zhiqiang Jiang, Ruifeng Li, Lijun Zhao, and Mohamed Omar. Dsap: Dynamic sparse attention perception matcher for accurate local feature matching. IEEE Transactions on Instrumentation and Measurement, 73:1–16, 2024. 1

work page 2024
[6]

Redfeat: Recoupling detection and description for multimodal feature learning.IEEE Transactions on Image Processing, 32:591–602, 2022

Yuxin Deng and Jiayi Ma. Redfeat: Recoupling detection and description for multimodal feature learning.IEEE Transactions on Image Processing, 32:591–602, 2022. 2

work page 2022
[7]

Superpoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. InProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, pages 224–236, 2018. 2

work page 2018
[8]

Improve representation for imbalanced regression through geometric constraints

Zijian Dong, Yilei Wu, Chongyao Chen, Yingtian Zou, Yichi Zhang, and Juan Helen Zhou. Improve representation for imbalanced regression through geometric constraints. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5082–5091, 2025. 1

work page 2025
[9]

Dkm: Dense kernelized feature matching for geometry estimation

Johan Edstedt, Ioannis Athanasiadis, M ˚arten Wadenb ¨ack, and Michael Felsberg. Dkm: Dense kernelized feature matching for geometry estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17765–17775, 2023. 2

work page 2023
[10]

Roma: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. Roma: Robust dense feature matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790–19800, 2024. 2

work page 2024
[11]

Colabsfm: Collaborative structure-from-motion by point cloud registra- tion

Johan Edstedt, Andr ´e Mateus, and Alberto Jaenal. Colabsfm: Collaborative structure-from-motion by point cloud registra- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6573–6583, 2025. 1

work page 2025
[12]

Moflow: One-step flow matching for human trajectory forecasting via implicit maximum likelihood estimation based distillation

Yuxiang Fu, Qi Yan, Lele Wang, Ke Li, and Renjie Liao. Moflow: One-step flow matching for human trajectory forecasting via implicit maximum likelihood estimation based distillation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17282–17293,

work page
[13]

Low-latency automotive vision with event cameras.Nature, 629(8014): 1034–1040, 2024

Daniel Gehrig and Davide Scaramuzza. Low-latency automotive vision with event cameras.Nature, 629(8014): 1034–1040, 2024. 1

work page 2024
[14]

Flowformer: A transformer architecture for optical flow

Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. InProc. Eur. Conf. Comput. Vis., pages 668–685, 2022. 2

work page 2022
[15]

Omniglue: Generalizable feature matching with foundation model guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and Andr ´e Araujo. Omniglue: Generalizable feature matching with foundation model guidance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19865–19875, 2024. 2

work page 2024
[16]

A review of multimodal image matching: Methods and applications.Information Fusion, 73:22–71,

Xingyu Jiang, Jiayi Ma, Guobao Xiao, Zhenfeng Shao, and Xiaojie Guo. A review of multimodal image matching: Methods and applications.Information Fusion, 73:22–71,

work page
[17]

Dense-sfm: Structure from motion with dense consistent matching

JongMin Lee and Sungjoo Yoo. Dense-sfm: Structure from motion with dense consistent matching. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6404–6414, 2025. 2

work page 2025
[18]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J ´erˆome Revaud. Grounding image matching in 3d with mast3r. InEuropean Conference on Computer Vision, pages 71–91. Springer,

work page
[19]

Genflow3d: Generative scene flow estimation and prediction on point cloud sequences

Hanlin Li, Wenming Weng, Yueyi Zhang, and Zhiwei Xiong. Genflow3d: Generative scene flow estimation and prediction on point cloud sequences. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27488– 27497, 2025. 2

work page 2025
[20]

Rift: Multi-modal image matching based on radiation-variation insensitive feature transform.IEEE Trans

Jiayuan Li, Qingwu Hu, and Mingyao Ai. Rift: Multi-modal image matching based on radiation-variation insensitive feature transform.IEEE Trans. Image Process., 29:3296– 3310, 2019. 2, 6

work page 2019
[21]

Lnift: Locally normalized image for rotation invariant multimodal feature matching.IEEE Trans

Jiayuan Li, Wangyi Xu, Pengcheng Shi, Yongjun Zhang, and Qingwu Hu. Lnift: Locally normalized image for rotation invariant multimodal feature matching.IEEE Trans. Geosci. Remote Sens., 60:1–14, 2022. 2, 6

work page 2022
[22]

Rift2: Speeding-up rift with a new rotation-invariance technique.arXiv, 2023

Jiayuan Li, Pengcheng Shi, Qingwu Hu, and Yongjun Zhang. Rift2: Speeding-up rift with a new rotation-invariance technique.arXiv, 2023. 2, 6

work page 2023
[23]

Object matching of visible–infrared image based on attention mechanism and feature fusion.Pattern Recognition, 158: 110972, 2025

Wuxin Li, Qian Chen, Guohua Gu, and Xiubao Sui. Object matching of visible–infrared image based on attention mechanism and feature fusion.Pattern Recognition, 158: 110972, 2025. 1

work page 2025
[24]

Implicit correspondence learning for image-to-point cloud registration

Xinjun Li, Wenfei Yang, Jiacheng Deng, Zhixin Cheng, Xu Zhou, and Tianzhu Zhang. Implicit correspondence learning for image-to-point cloud registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16922–16931, 2025. 1

work page 2025
[25]

Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object detection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024

Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming- Ming Cheng, and Jian Yang. Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object detection.Advances in Neural Information Processing Systems, 37:128430–128461, 2024. 1

work page 2024
[26]

Lightglue: Local feature matching at light speed

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023. 2

work page 2023
[27]

A fast algorithm for high accuracy airborne sar geolocation based on local linear approximation.IEEE Trans

Xuecong Liu, Xichao Teng, Zhang Li, Qifeng Yu, and Yijie Bian. A fast algorithm for high accuracy airborne sar geolocation based on local linear approximation.IEEE Trans. Instrum. Meas., 71:1–12, 2022. 1

work page 2022
[28]

Shape-adaptive modality independent region descriptor for multimodal remote sensing image matching

Xuecong Liu, Xichao Teng, Yijie Bian, Zhang Li, and Qifeng Yu. Shape-adaptive modality independent region descriptor for multimodal remote sensing image matching. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 17:18139– 18155, 2024. 2

work page 2024
[29]

Robust multi-sensor image matching based on normalized self-similarity region descriptor.Chin

Xuecong Liu, Xichao Teng, Jing Luo, Zhang Li, Qifeng Yu, and Yijie Bian. Robust multi-sensor image matching based on normalized self-similarity region descriptor.Chin. J. Aeronaut., 37(1):271–286, 2024. 2

work page 2024
[30]

Xuecong Liu, Zixuan Sun, Hongwei Ding, Xin Song, Shuaiying Zhang, and Yongsheng Sun. Gaff: Global attention feature flow network for optical and sar image registration under geometric transformations.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026. 2

work page 2026
[31]

Cross-rejective open-set sar image registration

Shasha Mao, Shiming Lu, Zhaolong Du, Licheng Jiao, Shuiping Gou, Luntian Mou, Xuequan Lu, Lin Xiong, and Yimeng Zhang. Cross-rejective open-set sar image registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23027–23036, 2025. 1

work page 2025
[32]

Cesar, Xiangyang Ji, and Xu-Cheng Yin

Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, and Xu-Cheng Yin. Dpflow: Adaptive optical flow estimation with a dual-pyramid framework. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17810–17820, 2025. 2

work page 2025
[33]

Flowseek: Optical flow made easier with depth foundation models and motion bases

Matteo Poggi and Fabio Tosi. Flowseek: Optical flow made easier with depth foundation models and motion bases. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5667–5679, 2025. 1

work page 2025
[34]

Xfeat: Accelerated features for lightweight image matching

Guilherme Potje, Felipe Cadar, Andr ´e Araujo, Renato Martins, and Erickson R Nascimento. Xfeat: Accelerated features for lightweight image matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2682–2691, 2024. 2

work page 2024
[35]

Must: The first dataset and unified framework for multispectral uav single object tracking

Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, and Jianan Li. Must: The first dataset and unified framework for multispectral uav single object tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16882–16891, 2025. 1

work page 2025
[36]

Minima: Modality invariant image matching

Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, and Xiang Bai. Minima: Modality invariant image matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 6

work page 2025
[37]

Superglue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4938–4947,

work page
[38]

Diff2flow: Training flow matching models via diffusion model alignment

Johannes Schusterbauer, Ming Gui, Frank Fundel, and Bj ¨orn Ommer. Diff2flow: Training flow matching models via diffusion model alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28347– 28357, 2025. 1

work page 2025
[39]

Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation

Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 1599–1610, 2023. 2

work page 2023
[40]

Loftr: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. InProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 8922–8931, 2021. 2

work page 2021
[41]

Os 3 flow: Optical and sar image registration using symmetry-guided semi-dense optical flow.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

Zixuan Sun, Shuaifeng Zhi, Kai Huo, Xuecong Liu, Weidong Jiang, and Yongxiang Liu. Os 3 flow: Optical and sar image registration using symmetry-guided semi-dense optical flow.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024. 2

work page 2024
[42]

Gdros: A geometry- guided dense registration framework for optical-sar images under large geometric transformations.arXiv preprint arXiv:2511.00598, 2025

Zixuan Sun, Shuaifeng Zhi, Ruize Li, Jingyuan Xia, Yongxiang Liu, and Weidong Jiang. Gdros: A geometry- guided dense registration framework for optical-sar images under large geometric transformations.arXiv preprint arXiv:2511.00598, 2025. 2, 6

work page arXiv 2025
[43]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InProc. Eur. Conf. Comput. Vis.,

work page
[44]

Omird: Orientated modality independent region descriptor for optical-to-sar image matching.IEEE Geosci

Xichao Teng, Xuecong Liu, Zhang Li, Qifeng Yu, and Yijie Bian. Omird: Orientated modality independent region descriptor for optical-to-sar image matching.IEEE Geosci. Remote Sens. Lett., 20:1–5, 2023. 2

work page 2023
[45]

Aydin Alatan

¨Onder Tuzcuo ˘glu, Aybora K ¨oksal, Bu ˘gra Sofu, Sinan Kalkan, and A. Aydin Alatan. Xoftr: Cross-modal feature matching transformer. InProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4275–4286, 2024. 2, 6

work page 2024
[46]

Recursive deformable pyramid network for unsupervised medical image registration.IEEE Transactions on Medical Imaging, 43(6):2229–2240, 2024

Haiqiao Wang, Dong Ni, and Yi Wang. Recursive deformable pyramid network for unsupervised medical image registration.IEEE Transactions on Medical Imaging, 43(6):2229–2240, 2024. 1

work page 2024
[47]

Efficient loftr: Semi-dense local feature matching with sparse-like speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient loftr: Semi-dense local feature matching with sparse-like speed. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21666–21675, 2024. 2, 6

work page 2024
[48]

A survey of visual slam in dynamic environment: The evolution from geometric to semantic approaches.IEEE Transactions on Instrumentation and Measurement, 73:1– 21, 2024

Yanan Wang, Yaobin Tian, Jiawei Chen, Kun Xu, and Xilun Ding. A survey of visual slam in dynamic environment: The evolution from geometric to semantic approaches.IEEE Transactions on Instrumentation and Measurement, 73:1– 21, 2024. 1

work page 2024
[49]

Dfm: Differentiable feature matching for anomaly detection

Sheng Wu, Yimi Wang, Xudong Liu, Yuguang Yang, Runqi Wang, Guodong Guo, David Doermann, and Baochang Zhang. Dfm: Differentiable feature matching for anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15224–15233, 2025. 1

work page 2025
[50]

Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, and Radu Timofte. Single-model and any-modality for video object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19156–19166, 2024. 1

[51]

Yuming Xiang, Feng Wang, and Hongjian You. Os-sift: A robust sift-like algorithm for high-resolution optical-to-sar image registration in suburban areas. IEEE Trans. Geosci. Remote Sens., 56(6):3078–3090, 2018. 2

[52]

Yuming Xiang, Rongshu Tao, Feng Wang, Hongjian You, and Bing Han. Automatic registration of optical and sar images via improved phase congruency model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13:5847–5861, 2020. 5

[53]

Yun Xiao, Chunlei Zhang, Yuan Chen, Bo Jiang, and Jin Tang. Adrnet: Affine and deformable registration networks for multimodal remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. 2, 6

[54]

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025. 2

[55]

Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. 5

[56]

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 8121–8130, 2022. 2

[57]

Han Xu, Jiteng Yuan, and Jiayi Ma. Murf: Mutually reinforcing multi-modal image registration and fusion. IEEE Trans. Pattern Anal. Mach. Intell., 45(10):12148–12166,

[58]

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 2, 6

[59]

Bin Yang, Jun Chen, and Mang Ye. Towards grand unified representation learning for unsupervised visible-infrared person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11069–11079, 2023. 1

[60]

Yibin Ye, Xichao Teng, Hongrui Yang, Shuo Chen, Yuli Sun, Yijie Bian, Tao Tan, Zhang Li, and Qifeng Yu. 3mos: a multi-source, multi-resolution, and multi-scene optical-sar dataset with insights for multi-modal image matching. Visual Intelligence, 3(1):1–27, 2025. 1

[61]

Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Sicheng Zhao, Yimian Dai, and Xiangyu Yue. From easy to hard: Progressive active learning framework for infrared small target detection with single point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2588–2598, 2025. 1

[62]

Jintao Zhang, Zimin Xia, Mingyue Dong, Shuhan Shen, Linwei Yue, and Xianwei Zheng. Comatcher: Multi-view collaborative feature matching. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21970–21980, 2025. 1

[63]

Kaining Zhang, Yuxin Deng, Jiayi Ma, and Paolo Favaro. Adapting dense matching for homography estimation with grid-based acceleration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6294–6303, 2025. 1

[64]

Yongjun Zhang, Yongxiang Yao, Yi Wan, Weiyu Liu, Wupeng Yang, Zhi Zheng, and Rang Xiao. Histogram of the orientation of the weighted phase descriptor for multi-modal remote sensing image matching. ISPRS Journal of Photogrammetry and Remote Sensing, 196:1–15, 2023. 2, 6

[65]

Ji Zhao, Banglei Guan, Zibin Liu, and Laurent Kneip. Full-dof egomotion estimation for event cameras using geometric solvers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11515–11524, 2025. 1

[66]

Chongyue Zheng, Shanshan Li, Chengyou Wang, and Bing Zhang. Msg: Robust multimodal remote sensing image matching using side window gaussian space. IEEE Transactions on Geoscience and Remote Sensing, 2025. 2, 6

CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
Supplementary Material

A. Visualization of Registration Resul...
