Pith · machine review for the scientific record

arxiv: 2605.12064 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords: cross-modal image registration · optical-SAR registration · text semantic assistance · remote sensing · coarse-to-fine matching · multi-modal feature enhancement · RemoteCLIP

The pith

TAR uses text semantic priors from remote sensing scenes to improve optical-SAR image registration under large geometric deformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TAR to register optical and SAR images by incorporating text descriptions of scenes and land-cover categories. It extracts multi-scale visual features from both modalities, enhances high-level features through interaction with text features extracted by a frozen RemoteCLIP encoder, and then refines correspondences in a coarse-to-fine process. The goal is to reduce the modality gap that makes direct visual matching difficult when geometric changes are large. A sympathetic reader would care because reliable alignment supports downstream remote sensing tasks such as change detection and data fusion.
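No code accompanies this review, so the pipeline is easiest to pin down as a sketch. The PyTorch skeleton below mirrors the three-stage flow the abstract describes (MSFL, TAFE, CFDM); every module signature here is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TARSketch(nn.Module):
    """Illustrative skeleton of the TAR flow described in the abstract.
    MSFL/TAFE/CFDM internals are placeholders, not the authors' code."""

    def __init__(self, msfl: nn.Module, tafe: nn.Module, cfdm: nn.Module,
                 text_encoder: nn.Module):
        super().__init__()
        self.msfl = msfl  # multi-scale visual feature learning
        self.tafe = tafe  # text-assisted feature enhancement
        self.cfdm = cfdm  # coarse-to-fine dense matching
        self.text_encoder = text_encoder
        # The paper keeps the RemoteCLIP text encoder frozen.
        for p in self.text_encoder.parameters():
            p.requires_grad = False

    def forward(self, opt_img, sar_img, text_tokens):
        # 1. Multi-scale features for both modalities (whether the backbone
        #    is shared across modalities is not specified; shared here).
        opt_coarse, opt_fine = self.msfl(opt_img)
        sar_coarse, sar_fine = self.msfl(sar_img)
        # 2. Frozen text features enhance the high-level (coarse) features.
        with torch.no_grad():
            txt = self.text_encoder(text_tokens)
        opt_coarse = self.tafe(opt_coarse, txt)
        sar_coarse = self.tafe(sar_coarse, txt)
        # 3. Coarse correspondences, then refinement with low-level features.
        return self.cfdm(opt_coarse, sar_coarse, opt_fine, sar_fine)
```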

Core claim

TAR shows that introducing text semantic priors via a frozen RemoteCLIP text encoder into a visual-text interaction module produces more reliable high-level features, which in turn yield stronger coarse matching and overall registration performance than several state-of-the-art methods, with notable gains when large geometric deformations are present.

What carries the argument

The text-assisted feature enhancement (TAFE) module, which builds text descriptors for remote sensing scenes and land-cover objects and fuses them with visual features through visual-text interaction to strengthen high-level representations for coarse matching.
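The paper, as summarized here, does not spell out the fusion operator (the referee flags this below). One common instantiation of visual-text interaction is cross-attention from visual tokens to text descriptors; the sketch below assumes that form and is purely illustrative.

```python
import torch
import torch.nn as nn

class VisualTextInteraction(nn.Module):
    """One plausible TAFE-style fusion: visual tokens attend to frozen text
    descriptors. An assumed form; the paper's exact operator is not given."""

    def __init__(self, vis_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, vis_dim)  # align text dim to visual
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor):
        # vis_feat: (B, H*W, C) flattened high-level visual features
        # txt_feat: (B, T, D) descriptors from the frozen RemoteCLIP encoder
        txt = self.proj_txt(txt_feat)
        enhanced, _ = self.attn(query=vis_feat, key=txt, value=txt)
        return self.norm(vis_feat + enhanced)  # residual enhancement
```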

If this is right

  • Coarse matching becomes more robust because semantic text features supplement appearance-based cues.
  • The coarse-to-fine dense matching (CFDM) pipeline produces denser and more accurate final correspondences (a generic coarse-matching sketch follows this list).
  • Gains are largest precisely when geometric deformations are severe, the regime where purely visual methods degrade most.
  • The frozen text encoder allows the framework to add semantic guidance without additional training cost on the text side.
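For concreteness, coarse matching of the kind CFDM's first stage performs is often a dense cosine similarity with a mutual-nearest-neighbor check and a confidence threshold. The sketch below is that generic matcher, not CFDM's published criterion.

```python
import torch

def coarse_match(feat_a: torch.Tensor, feat_b: torch.Tensor, thr: float = 0.2):
    """Generic coarse matching between two flattened coarse feature maps
    (N, C) and (M, C): cosine similarity + mutual nearest neighbors.
    Illustrative only; the threshold is an assumed choice."""
    a = torch.nn.functional.normalize(feat_a, dim=-1)
    b = torch.nn.functional.normalize(feat_b, dim=-1)
    sim = a @ b.t()                              # (N, M) similarity matrix
    nn_ab = sim.argmax(dim=1)                    # best b for each a
    nn_ba = sim.argmax(dim=0)                    # best a for each b
    idx_a = torch.arange(sim.size(0))
    mutual = nn_ba[nn_ab] == idx_a               # mutual-NN consistency
    keep = mutual & (sim[idx_a, nn_ab] > thr)    # confidence threshold
    return idx_a[keep], nn_ab[keep]              # index pairs (a_i, b_j)
```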

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-prior strategy could be tested on other cross-modal pairs such as optical-infrared if suitable scene text descriptors exist.
  • Allowing limited fine-tuning of the text branch on remote-sensing data might further close remaining modality gaps.
  • Larger language models could supply richer scene descriptions and potentially extend the approach to semantic-level alignment tasks.

Load-bearing premise

That text descriptors from a frozen RemoteCLIP encoder reliably shrink the modality gap and improve feature correspondence without adding domain biases or needing fine-tuning of the text branch.

What would settle it

A test set of optical-SAR image pairs with large deformations where TAR's matching accuracy fails to exceed that of comparable visual-only deep learning baselines.
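Operationally, such a test could warp one modality with large synthetic homographies and score the fraction of predicted correspondences that land within a pixel tolerance. A minimal scoring sketch, with the 3 px tolerance as an assumed (not the paper's) protocol choice:

```python
import torch

def correct_match_rate(pts_a, pts_b, H_true, tol_px: float = 3.0):
    """Score predicted correspondences against a known homography.
    pts_a, pts_b: (K, 2) matched pixel coordinates; H_true: (3, 3) ground
    truth mapping a -> b. Tolerance is a common but assumed choice."""
    ones = torch.ones_like(pts_a[:, :1])
    warped = torch.cat([pts_a, ones], dim=1) @ H_true.t()
    warped = warped[:, :2] / warped[:, 2:3]      # dehomogenize
    err = (warped - pts_b).norm(dim=1)           # reprojection error in px
    return (err < tol_px).float().mean().item()
```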

Figures

Figures reproduced from arXiv: 2605.12064 by Dou Quan, Licheng Jiao, Ning Huyan, Pei He, Shuang Wang, Zhuoyu Cai.

Figure 1: Deep learning methods for image registration and our …
Figure 2: The overall framework of our proposed text semantic-assisted registration (TAR). TAR mainly contains three components: …
Figure 3: Optical (upper) and SAR (lower) images from the …
Figure 4: Optical (upper) and SAR (lower) images from the …
Figure 5: The visualized matching results of different methods on optical and SAR images. The green lines indicate successful …
Original abstract

Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TAR, a text semantic-assisted cross-modal registration framework for optical and SAR images. It uses an MSFL module to extract multi-scale visual features, a TAFE module that constructs text descriptors from remote sensing scenes and land-cover categories and extracts features via a frozen RemoteCLIP text encoder to enhance high-level visual features through visual-text interaction, and a CFDM module for coarse-to-fine dense matching. The central claim is that TAR achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.

Significance. If the results hold, the work could be significant for remote sensing applications by showing how semantic text priors can help bridge the optical-SAR modality gap and improve robustness to large deformations. The use of a frozen pre-trained RemoteCLIP encoder is a strength, as it avoids task-specific fine-tuning overhead while composing existing components (MSFL, coarse-to-fine matching) into a coherent pipeline.

major comments (2)
  1. [Abstract] The claim of stronger matching performance and significant gains under large deformations is stated without any quantitative metrics, error bars, dataset details, or ablation results, which prevents verification of the central experimental claim from the provided text.
  2. [TAFE module] The framework assumes that text semantic priors extracted by the frozen RemoteCLIP encoder reliably reduce the modality gap and improve feature correspondence for SAR images without introducing domain-specific biases or requiring fine-tuning; no controlled experiments, contribution ablations, or bias analysis are described to support that the reported gains are due to the text assistance rather than MSFL or CFDM.
minor comments (1)
  1. [TAFE module] The description of visual-text interaction in TAFE lacks explicit equations or pseudocode for how text features are fused with visual features, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of TAR's potential impact. We address each major comment point-by-point below and will revise the manuscript to strengthen the presentation of results and ablations.

Point-by-point responses
  1. Referee: [Abstract] The claim of stronger matching performance and significant gains under large deformations is stated without any quantitative metrics, error bars, dataset details, or ablation results, which prevents verification of the central experimental claim from the provided text.

    Authors: We agree that the abstract would benefit from including key quantitative results. In the revised version, we will add specific metrics (e.g., matching accuracy improvements of X% on average and Y% under large deformations) along with dataset names and brief reference to ablation outcomes, while keeping the abstract concise. revision: yes

  2. Referee: [TAFE module] The framework assumes that text semantic priors extracted by the frozen RemoteCLIP encoder reliably reduce the modality gap and improve feature correspondence for SAR images without introducing domain-specific biases or requiring fine-tuning; no controlled experiments, contribution ablations, or bias analysis are described to support that the reported gains are due to the text assistance rather than MSFL or CFDM.

    Authors: The manuscript does include ablation studies (Section 4.3) comparing TAR with and without the TAFE module, showing consistent performance drops when text assistance is removed. However, we acknowledge the referee's point that more targeted analysis is needed. We will expand the ablations with controlled experiments on text descriptor variants, add a bias analysis subsection examining domain-specific effects on SAR features, and clarify how the frozen encoder avoids fine-tuning overhead while contributing to the gains beyond MSFL and CFDM. revision: partial
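The promised ablation reduces to scoring the full model against a no-TAFE variant on identical pairs. A minimal harness, reusing correct_match_rate from the sketch above; the model handles and data interface are assumptions, not the manuscript's protocol:

```python
import torch

def tafe_ablation(model_full, model_no_tafe, pairs):
    """Compare matching accuracy with and without text assistance.
    `pairs` yields (opt, sar, text_tokens, H_true); both models are assumed
    to return matched pixel coordinates (pts_a, pts_b). Illustrative only."""
    scores = {"full": [], "no_tafe": []}
    with torch.no_grad():
        for opt, sar, tokens, H_true in pairs:
            pa, pb = model_full(opt, sar, tokens)
            scores["full"].append(correct_match_rate(pa, pb, H_true))
            pa, pb = model_no_tafe(opt, sar, tokens)
            scores["no_tafe"].append(correct_match_rate(pa, pb, H_true))
    # Mean correct-match rate per variant; the gap attributes gains to TAFE.
    return {k: sum(v) / len(v) for k, v in scores.items()}
```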

Circularity Check

0 steps flagged

No significant circularity in compositional framework

Full rationale

The paper describes TAR as a composition of three modules (MSFL for multi-scale visual features, TAFE for injecting frozen RemoteCLIP text features, and CFDM for coarse-to-fine matching) whose claimed benefit is demonstrated solely through external experimental comparisons on cross-modal remote-sensing datasets. No equations appear that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The text-semantic assistance is presented as an additive prior from an independently trained encoder, and performance gains are reported against state-of-the-art baselines rather than derived from the same fitted values, rendering the argument self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard deep-learning assumptions about feature learning and the transferability of a pre-trained vision-language model; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Multi-scale visual features extracted by standard CNN backbones can be meaningfully enhanced by cross-modal text interaction
    Invoked in the TAFE module description
  • domain assumption A frozen RemoteCLIP text encoder produces descriptors that are semantically aligned with remote-sensing visual content
    Central to the text-assisted feature enhancement step
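The second axiom is directly measurable: matched scene-text/image pairs should score higher cosine similarity under RemoteCLIP's two towers than mismatched pairs. A small check, assuming the embeddings have already been computed with RemoteCLIP (the computation itself is outside this sketch):

```python
import torch

def alignment_gap(text_emb: torch.Tensor, img_emb: torch.Tensor) -> float:
    """Mean cosine similarity of matched (text_i, image_i) pairs minus the
    mean over mismatched pairs. A positive gap supports the alignment axiom.
    Embeddings are assumed to come from RemoteCLIP's text/image towers."""
    t = torch.nn.functional.normalize(text_emb, dim=-1)
    v = torch.nn.functional.normalize(img_emb, dim=-1)
    sim = t @ v.t()                               # (N, N) pairwise cosine
    matched = sim.diag().mean()
    n = sim.size(0)
    mismatched = (sim.sum() - sim.diag().sum()) / (n * (n - 1))
    return (matched - mismatched).item()
```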

pith-pipeline@v0.9.0 · 5554 in / 1232 out tokens · 44334 ms · 2026-05-13T06:29:01.665777+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — the paper's claim is directly supported by a theorem in the formal canon.
  • supports — the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — the paper appears to rely on the theorem as machinery.
  • contradicts — the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    Deep learning in remote sensing: A comprehensive review and list of resources,

    X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraun- dorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,”IEEE geoscience and remote sensing magazine, vol. 5, no. 4, pp. 8–36, 2017

  2. [2]

    A review of remote sensing image fusion methods,

    H. Ghassemian, “A review of remote sensing image fusion methods,” Information Fusion, vol. 32, pp. 75–89, 2016

  3. [3]

    The sen1-2 dataset for deep learning in sar-optical data fusion,

    M. Schmitt, L. H. Hughes, and X. X. Zhu, “The sen1-2 dataset for deep learning in sar-optical data fusion,”arXiv preprint arXiv:1807.01569, 2018

  4. [4]

    Image registration methods: a survey,

    B. Zitova and J. Flusser, “Image registration methods: a survey,”Image and vision computing, vol. 21, no. 11, pp. 977–1000, 2003

  5. [5]

    Multimodal remote sensing image registration: a survey,

    B. Zhu and Y . Ye, “Multimodal remote sensing image registration: a survey,”Journal of Image and Graphics, vol. 29, no. 08, pp. 2137–2161, 2024

  6. [6]

    Object detection in optical remote sensing images: A survey and a new benchmark,

    K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,”ISPRS journal of photogrammetry and remote sensing, vol. 159, pp. 296–307, 2020

  7. [7]

    Deep learning- based change detection in remote sensing images: A review,

    A. Shafique, G. Cao, Z. Khan, M. Asad, and M. Aslam, “Deep learning- based change detection in remote sensing images: A review,”Remote Sensing, vol. 14, no. 4, p. 871, 2022

  8. [8]

    Robust registration of multimodal remote sensing images based on structural similarity,

    Y . Ye, J. Shan, L. Bruzzone, and L. Shen, “Robust registration of multimodal remote sensing images based on structural similarity,”IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2941–2958, 2017

  9. [9]

    Deep learning in remote sensing image matching: A survey,

    L. Li, L. Han, Y . Ye, Y . Xiang, and T. Zhang, “Deep learning in remote sensing image matching: A survey,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 225, pp. 88–112, 2025

  10. [10]

    Deep feature correlation learning for multi-modal remote sensing image registration,

    D. Quan, S. Wang, Y . Gu, R. Lei, B. Yang, S. Wei, B. Hou, and L. Jiao, “Deep feature correlation learning for multi-modal remote sensing image registration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022

  11. [11]

    Efficient and robust: A cross-modal registration deep wavelet learning method for remote sensing images,

    D. Quan, H. Wei, S. Wang, Y . Li, J. Chanussot, Y . Guo, B. Hou, and L. Jiao, “Efficient and robust: A cross-modal registration deep wavelet learning method for remote sensing images,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 4739–4754, 2023

  12. [12]

    Self-distillation feature learning network for optical and sar image registration,

    D. Quan, H. Wei, S. Wang, R. Lei, B. Duan, Y . Li, B. Hou, and L. Jiao, “Self-distillation feature learning network for optical and sar image registration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2022

  13. [13]

    Loftr: Detector- free local feature matching with transformers,

    J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “Loftr: Detector- free local feature matching with transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931

  14. [14]

    Match- former: Interleaving attention in transformers for feature matching,

    Q. Wang, J. Zhang, K. Yang, K. Peng, and R. Stiefelhagen, “Match- former: Interleaving attention in transformers for feature matching,” in Proceedings of the Asian conference on computer vision, 2022, pp. 2746–2762

  15. [15]

    Aspanformer: Detector-free image matching with adaptive span transformer,

    H. Chen, Z. Luo, L. Zhou, Y . Tian, M. Zhen, T. Fang, D. Mckinnon, Y . Tsin, and L. Quan, “Aspanformer: Detector-free image matching with adaptive span transformer,” inEuropean conference on computer vision. Springer, 2022, pp. 20–36

  16. [16]

    Dkm: Dense kernelized feature matching for geometry estimation,

    J. Edstedt, I. Athanasiadis, M. Wadenb ¨ack, and M. Felsberg, “Dkm: Dense kernelized feature matching for geometry estimation,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 765–17 775

  17. [17]

    Roma: Robust dense feature matching,

    J. Edstedt, Q. Sun, G. B ¨okman, M. Wadenb¨ack, and M. Felsberg, “Roma: Robust dense feature matching,” inProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2024, pp. 19 790– 19 800

  18. [18]

    Remoteclip: A vision language foundation model for remote sensing,

    F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “Remoteclip: A vision language foundation model for remote sensing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1– 16, 2024

  19. [19]

    Multi-resolution sar and optical remote sensing image registration methods: A review, datasets, and future perspectives,

    W. Zhang, R. Zhao, Y . Yao, Y . Wan, P. Wu, J. Li, Y . Li, and Y . Zhang, “Multi-resolution sar and optical remote sensing image registration methods: A review, datasets, and future perspectives,”arXiv preprint arXiv:2502.01002, 2025

  20. [20]

    Multi-image matching using multi-scale oriented patches,

    M. Brown, R. Szeliski, and S. Winder, “Multi-image matching using multi-scale oriented patches,” in2005 IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 510–517. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, MAY 2026 11

  21. [21]

    Image matching using local symmetry features,

    D. C. Hauagge and N. Snavely, “Image matching using local symmetry features,” in2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 206–213

  22. [22]

    Distinctive image features from scale-invariant keypoints,

    D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004

  23. [23]

    Surf: Speeded up robust features,

    H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” inEuropean conference on computer vision. Springer, 2006, pp. 404–417

  24. [24]

    Fast and robust matching for multimodal remote sensing image registration,

    Y . Ye, L. Bruzzone, J. Shan, F. Bovolo, and Q. Zhu, “Fast and robust matching for multimodal remote sensing image registration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 9059–9070, 2019

  25. [25]

    Rift: Multi-modal image matching based on radiation-variation insensitive feature transform,

    J. Li, Q. Hu, and M. Ai, “Rift: Multi-modal image matching based on radiation-variation insensitive feature transform,”IEEE Transactions on Image Processing, vol. 29, pp. 3296–3310, 2019

  26. [26]

    Lnift: Locally normalized image for rotation invariant multimodal feature matching,

    J. Li, W. Xu, P. Shi, Y . Zhang, and Q. Hu, “Lnift: Locally normalized image for rotation invariant multimodal feature matching,”IEEE Trans- actions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022

  27. [27]

    Robust feature matching for remote sensing image registration via locally linear transforming,

    J. Ma, H. Zhou, J. Zhao, Y . Gao, J. Jiang, and J. Tian, “Robust feature matching for remote sensing image registration via locally linear transforming,”IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 12, pp. 6469–6481, 2015

  28. [28]

    Disk: Learning local features with policy gradient,

    M. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,”Advances in neural information processing sys- tems, vol. 33, pp. 14 254–14 265, 2020

  29. [29]

    Aslfeat: Learning local features of accurate shape and localization,

    Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y . Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6589–6598

  30. [30]

    Efficient and accurate remote sensing image registration with hierarchical mamba networks,

    T. Cao, Z. Zhang, H. Cao, S. Bai, and N. Liu, “Efficient and accurate remote sensing image registration with hierarchical mamba networks,” IEEE Geoscience and Remote Sensing Letters, vol. 23, pp. 1–5, 2025

  31. [31]

    F3net: Adaptive frequency feature filtering network for multimodal remote sensing image registration,

    D. Quan, Z. Wang, S. Wang, Y . Li, B. Ren, M. Kang, J. Chanussot, and L. Jiao, “F3net: Adaptive frequency feature filtering network for multimodal remote sensing image registration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13, 2024

  32. [32]

    A novel rotation and scale equivariant network for optical-sar image matching,

    H. Nie, B. Luo, J. Liu, Z. Fu, W. Liu, C. Wang, and X. Su, “A novel rotation and scale equivariant network for optical-sar image matching,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1– 14, 2024

  33. [33]

    A multiscale framework with unsupervised learning for remote sensing image regis- tration,

    Y . Ye, T. Tang, B. Zhu, C. Yang, B. Li, and S. Hao, “A multiscale framework with unsupervised learning for remote sensing image regis- tration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

  34. [34]

    Adrnet: Affine and deformable registration networks for multimodal remote sensing images,

    Y . Xiao, C. Zhang, Y . Chen, B. Jiang, and J. Tang, “Adrnet: Affine and deformable registration networks for multimodal remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13, 2024

  35. [35]

    Gdros: A geometry- guided dense registration framework for optical–sar images under large geometric transformations,

    Z. Sun, S. Zhi, R. Li, J. Xia, Y . Liu, and W. Jiang, “Gdros: A geometry- guided dense registration framework for optical–sar images under large geometric transformations,”IEEE Transactions on Geoscience and Re- mote Sensing, vol. 63, pp. 1–15, 2025

  36. [36]

    Xoftr: Cross-modal feature matching transformer,

    ¨O. Tuzcuo ˘glu, A. K ¨oksal, B. Sofu, S. Kalkan, and A. A. Alatan, “Xoftr: Cross-modal feature matching transformer,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 4275–4286

  37. [37]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  38. [38]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

  39. [39]

    Rs-clip: Zero shot remote sensing scene classification via contrastive vision-language supervision,

    X. Li, C. Wen, Y . Hu, and N. Zhou, “Rs-clip: Zero shot remote sensing scene classification via contrastive vision-language supervision,”Inter- national Journal of Applied Earth Observation and Geoinformation, vol. 124, p. 103497, 2023

  40. [40]

    Rs5m and georsclip: A large- scale vision-language dataset and a large vision-language model for remote sensing,

    Z. Zhang, T. Zhao, Y . Guo, and J. Yin, “Rs5m and georsclip: A large- scale vision-language dataset and a large vision-language model for remote sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–23, 2024

  41. [41]

    Segclip: Mul- timodal visual-language and prompt learning for high-resolution remote sensing semantic segmentation,

    S. Zhang, B. Zhang, Y . Wu, H. Zhou, J. Jiang, and J. Ma, “Segclip: Mul- timodal visual-language and prompt learning for high-resolution remote sensing semantic segmentation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  42. [42]

    Topicgeo: An efficient unified frame- work for geolocation,

    X. Wang, X. Wang, and S. Gou, “Topicgeo: An efficient unified frame- work for geolocation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8241–8251

  43. [43]

    Text-assisted multimodal adaptive registration and fusion classification network,

    Y . He, B. Xi, G. Li, T. Zheng, Y . Li, C. Xue, and M. Shen, “Text-assisted multimodal adaptive registration and fusion classification network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 64, pp. 1–15, 2025

  44. [44]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  45. [45]

    Focal loss for dense object detection,

    T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

  46. [46]

    Automatic registration of optical and sar images via improved phase congruency model,

    Y . Xiang, R. Tao, F. Wang, H. You, and B. Han, “Automatic registration of optical and sar images via improved phase congruency model,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 5847–5861, 2020

  47. [47]

    Os-flow: A robust algorithm for dense optical and sar image registration,

    Y . Xiang, F. Wang, L. Wan, N. Jiao, and H. You, “Os-flow: A robust algorithm for dense optical and sar image registration,”IEEE Transac- tions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6335–6354, 2019. Zhuoyu Caireceived the B.S. degree in Artificial Intelligence from Xidian University, Xi’an, China, in 2025. He is currently pursuing the M.S....