pith. sign in

arxiv: 2605.19734 · v1 · pith:O6SVGEZFnew · submitted 2026-05-19 · 💻 cs.CV

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

Pith reviewed 2026-05-20 05:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-modal retrievaloptical-SARremote sensingfine-grained object retrievalgeometric constraintsunaligned dataMamba vision
0
0 comments X p. Extension
pith:O6SVGEZF Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{O6SVGEZF}

Prints a linked pith:O6SVGEZF badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

GeoMamba adds geometric feature injection and consistency constraints to enable robust fine-grained retrieval between unaligned optical and SAR remote sensing images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of retrieving specific objects across optical and SAR images when the views are not spatially aligned or paired. It builds a framework that injects structural geometric information into the features and applies layered constraints to keep object shapes intact despite noise and modality gaps. A new dataset of aerospace and maritime categories is introduced to test performance under realistic unaligned conditions. If the approach holds, it would let analysts combine complementary sensor data for more precise object identification without needing perfectly matched training pairs.

Core claim

GeoMamba introduces a Geometric Feature Injection module to enhance cross-modal feature interaction and incorporate structural priors for improved SAR representation robustness, together with a Geometric Consistency Constraint module and deep supervision strategy that applies hierarchical geometric constraints via classical operators to preserve informative object structures during representation learning, supporting effective fine-grained optical-SAR retrieval on the FGOS-as dataset under unaligned conditions.

What carries the argument

The Geometric Feature Injection (GFI) module, which enhances cross-modal interaction and adds structural priors, combined with the Geometric Consistency Constraint (GCC) module that imposes hierarchical geometric constraints using classical operators.

If this is right

  • Cross-modal representations become learnable from unaligned optical-SAR pairs in practical remote sensing scenarios.
  • Object structures remain more intact through the representation learning process.
  • The framework supports all-to-all retrieval settings where queries and gallery items come from mixed modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric injection pattern could extend to other sensor pairs such as optical and infrared for similar unaligned retrieval tasks.
  • The FGOS-as dataset provides a benchmark that highlights limitations of paired-data methods when applied to real-world misaligned imagery.
  • Additional classical geometric operators might be swapped in to target specific object categories like ships or aircraft more precisely.

Load-bearing premise

That injecting geometric features and enforcing consistency constraints with classical operators will sufficiently reduce modality gaps, speckle noise, and structural differences to support reliable cross-modal learning without aligned samples.

What would settle it

A controlled test on the FGOS-as dataset that disables the Geometric Feature Injection and Geometric Consistency Constraint modules and checks whether retrieval performance drops to or below levels achieved by prior methods without these geometric additions.

Figures

Figures reproduced from arXiv: 2605.19734 by Jing Xiao, Liang Liao, Mi Wang, Tiantong Fang, Wujie Zhou, Xiuwei Wang.

Figure 1
Figure 1. Figure 1: Overview of cross-modal retrieval challenges and our proposed paradigm. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed FGOS-as dataset. (a) Workflow of the FGOS-as dataset construction, illustrating the stages of multi-source acquisition, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of the proposed method. (a) GeoMamba framework, which features a dual-stream MambaVision backbone for extracting spatial [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of fine-grained retrieval results. Panel (a) shows the query samples, panel (b) displays the retrieval results by TransOSS, and [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: t-SNE visualization of the learned feature embeddings on the FGOS-as dataset. The axes represent the 2D projected embedding space. Circles denote [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of cross-modal feature activation maps for optical and SAR images across different module combinations. The baseline (b) struggles [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
read the original abstract

Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in all-to-all retrieval setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes GeoMamba, a geometry-driven MambaVision framework for fine-grained optical-SAR object retrieval under unaligned conditions. It introduces a Geometric Feature Injection (GFI) module to enhance cross-modal interaction via structural priors and a Geometric Consistency Constraint (GCC) module with Deep Supervision that applies hierarchical classical operators. The authors construct a new FGOS-as dataset covering 11 aerospace and maritime categories and report that GeoMamba achieves 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting, outperforming existing methods.

Significance. If the experimental claims are substantiated, the work could advance cross-modal retrieval in remote sensing by demonstrating how geometric priors and consistency constraints can mitigate modality gaps, speckle noise, and structural inconsistencies in practical unaligned optical-SAR scenarios. The FGOS-as dataset may provide a valuable benchmark for aerospace and maritime applications.

major comments (3)
  1. [§4 (Experiments)] §4 (Experiments): The central performance claims (63.3% mAP, 77.0% Rank-1) are presented without ablation tables that compare GeoMamba against a plain MambaVision backbone on identical FGOS-as splits and protocol. This omission is load-bearing because the attribution of robustness to GFI + GCC cannot be verified without isolating their contribution from the backbone or dataset properties.
  2. [§3 (Method)] §3 (Method): The descriptions of the Geometric Feature Injection module and the Geometric Consistency Constraint (including how classical operators are applied hierarchically under Deep Supervision) remain high-level. Without explicit equations, algorithmic pseudocode, or implementation details, it is not possible to assess whether these components sufficiently close the modality gap or are reproducible.
  3. [§4.1 (Dataset)] §4.1 (Dataset): No statistics or construction details are provided for FGOS-as (e.g., total images per category, train/test split ratios, selection criteria, or checks against implicit spatial alignment leakage). These are required to confirm that the reported gains are not artifacts of dataset curation under the unaligned all-to-all protocol.
minor comments (1)
  1. [Abstract] Abstract: Consider specifying the total number of images or samples in FGOS-as to give readers immediate context for the scale of the new benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we will incorporate the suggested changes in the next version of the paper to strengthen clarity, reproducibility, and experimental rigor.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments): The central performance claims (63.3% mAP, 77.0% Rank-1) are presented without ablation tables that compare GeoMamba against a plain MambaVision backbone on identical FGOS-as splits and protocol. This omission is load-bearing because the attribution of robustness to GFI + GCC cannot be verified without isolating their contribution from the backbone or dataset properties.

    Authors: We agree that isolating the contributions of the GFI and GCC modules from the underlying MambaVision backbone is important for substantiating our claims. The current manuscript reports overall performance but does not include dedicated ablation tables on the exact same FGOS-as splits and retrieval protocol. In the revised version, we will add these ablation studies, including direct comparisons of the plain backbone versus variants with GFI, GCC, and their combination, to clearly attribute the observed gains. revision: yes

  2. Referee: [§3 (Method)] §3 (Method): The descriptions of the Geometric Feature Injection module and the Geometric Consistency Constraint (including how classical operators are applied hierarchically under Deep Supervision) remain high-level. Without explicit equations, algorithmic pseudocode, or implementation details, it is not possible to assess whether these components sufficiently close the modality gap or are reproducible.

    Authors: We acknowledge that the method section presents the GFI and GCC modules at a conceptual level without sufficient mathematical detail. To improve reproducibility and allow assessment of how these components address modality gaps, we will expand §3 in the revision with explicit equations for the Geometric Feature Injection process and the hierarchical application of classical operators within the GCC module under Deep Supervision. We will also include algorithmic pseudocode and key implementation hyperparameters. revision: yes

  3. Referee: [§4.1 (Dataset)] §4.1 (Dataset): No statistics or construction details are provided for FGOS-as (e.g., total images per category, train/test split ratios, selection criteria, or checks against implicit spatial alignment leakage). These are required to confirm that the reported gains are not artifacts of dataset curation under the unaligned all-to-all protocol.

    Authors: We recognize that detailed dataset documentation is essential for validating the experimental protocol. The current manuscript introduces FGOS-as but omits granular statistics and construction specifics. In the revised version, we will add a dedicated subsection or table in §4.1 reporting the total images per category, train/test split ratios, selection criteria for the 11 aerospace and maritime classes, and explicit checks or design choices to minimize any potential spatial alignment leakage under the unaligned all-to-all setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework and results are self-contained

full rationale

The paper introduces a new architecture (GeoMamba with GFI and GCC modules) and a new dataset (FGOS-as), then reports experimental retrieval metrics on that dataset. No equations, predictions, or first-principles claims are shown to reduce by construction to fitted inputs, self-citations, or renamed known results. The performance numbers (63.3% mAP, 77.0% Rank-1) are presented as outcomes of standard training and evaluation rather than tautological re-statements of the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the modules are presented as novel but their internal assumptions are not detailed.

pith-pipeline@v0.9.0 · 5772 in / 1058 out tokens · 36946 ms · 2026-05-20T05:28:56.648807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 3 internal anchors

  1. [1]

    Detection of building outlines based on the fusion of sar and optical features,

    F. Tupin and M. Roux, “Detection of building outlines based on the fusion of sar and optical features,”ISPRS J. Photogramm. Remote Sens., vol. 58, no. 1-2, pp. 71–82, 2003

  2. [2]

    M4-sar: A multi-resolution, multi-polarization, multi-scene, multi-source dataset and benchmark for optical-sar fusion object detection,

    C. Wang, W. S. Lu, X. M. Li, J. Yang, and L. Luo, “M4-sar: A multi-resolution, multi-polarization, multi-scene, multi-source dataset and benchmark for optical-sar fusion object detection,”arXiv preprint arXiv:2505.10931, 2025

  3. [3]

    Towards robust optical-sar object detection under missing modalities: A dynamic quality-aware fusion framework,

    Z. Zhao, Y . Xu, A. Lu, C. Li, and J. Tang, “Towards robust optical-sar object detection under missing modalities: A dynamic quality-aware fusion framework,”arXiv preprint arXiv:2512.22447, 2025

  4. [4]

    Intelligent ship detection in remote sensing images based on multi-layer convolutional feature fusion,

    Y . Zhang, L. Guo, Z. Wang, Y . Yu, X. Liu, and F. Xu, “Intelligent ship detection in remote sensing images based on multi-layer convolutional feature fusion,”Remote Sens., vol. 12, no. 20, p. 3316, 2020

  5. [5]

    Disaster risk reduction using image fusion of optical and sar data before and after tsunami,

    Y . Kwak, A. Yorozuya, and Y . Iwami, “Disaster risk reduction using image fusion of optical and sar data before and after tsunami,” inProc. IEEE Aerosp. Conf.IEEE, 2016, pp. 1–11

  6. [6]

    Coastline detection using optical and synthetic aperture radar images,

    T. Yu, S. W. Xu, B. Y . Tao, and W. Z. Shao, “Coastline detection using optical and synthetic aperture radar images,”Adv. Space Res., vol. 70, no. 1, pp. 70–84, 2022

  7. [7]

    Optical and synthetic aperture radar image fusion for ship detection and recognition: Current state, challenges, and future prospects,

    Z. Zhang, L. Zhang, J. Wu, and W. Guo, “Optical and synthetic aperture radar image fusion for ship detection and recognition: Current state, challenges, and future prospects,”IEEE Geosci. Remote Sens. Mag., vol. 12, no. 4, pp. 132–168, 2024

  8. [8]

    Dual-modal approach for ship detection: Fusing synthetic aperture radar and optical satellite imagery,

    M. Ahmed, N. El-Sheimy, and H. Leung, “Dual-modal approach for ship detection: Fusing synthetic aperture radar and optical satellite imagery,” Sensors, vol. 25, no. 2, p. 329, 2025

  9. [9]

    Machine learning based aircraft detection using sar & optical images,

    M. Rane and S. Kumar, “Machine learning based aircraft detection using sar & optical images,”ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci., vol. 10, pp. 529–535, 2025

  10. [10]

    A deep cross- modality hashing network for sar and optical remote sensing images retrieval,

    W. Xiong, Z. Xiong, Y . Zhang, Y . Cui, and X. Gu, “A deep cross- modality hashing network for sar and optical remote sensing images retrieval,”IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 5284–5296, 2020

  11. [11]

    Multisensor fusion and explicit semantic preserving-based deep hashing for cross- modal remote sensing image retrieval,

    Y . Sun, S. Feng, Y . Ye, X. Li, J. Kang, Z. Huang, and C. Luo, “Multisensor fusion and explicit semantic preserving-based deep hashing for cross- modal remote sensing image retrieval,”IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–14, 2021

  12. [12]

    Deep multiscale fine-grained hashing for remote sensing cross-modal retrieval,

    J. Huang, Y . Feng, M. Zhou, X. Xiong, Y . Wang, and B. Qiang, “Deep multiscale fine-grained hashing for remote sensing cross-modal retrieval,” IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024

  13. [13]

    Cross-modal retrieval algorithm based on patch aggregation,

    J. Yang and Y . Tang, “Cross-modal retrieval algorithm based on patch aggregation,” inProc. 4th Int. Conf. Electron. Inf. Technol. (EIT). IEEE, 2025, pp. 632–637

  14. [14]

    Dynamic patch selection and dual- granularity alignment for cross-modal retrieval,

    Z. Luo, M. Meng, and J. Wu, “Dynamic patch selection and dual- granularity alignment for cross-modal retrieval,”Neurocomputing, p. 132999, 2026

  15. [15]

    Canonical correlation analysis: An overview with application to learning methods,

    D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,”Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004

  16. [16]

    Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data,

    A. A. Nielsen, “Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data,”IEEE Trans. Image Process., vol. 11, no. 3, pp. 293–305, 2002

  17. [17]

    Distance metric learning for large margin nearest neighbor classification

    K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification.”J. Mach. Learn. Res., vol. 10, no. 2, 2009

  18. [18]

    Data fusion through cross-modality metric learning using similarity-sensitive hashing,

    M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” inProc. IEEE CVPR. IEEE, 2010, pp. 3594–3601

  19. [19]

    Learning hash functions for cross-view similarity search,

    S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” inProc. IJCAI, vol. 22, no. 1, 2011, p. 1360

  20. [20]

    Generalized multiview analysis: A discriminative latent space,

    A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” inProc. IEEE CVPR. IEEE, 2012, pp. 2160–2167

  21. [21]

    Sar-sift: a sift-like algorithm for sar images,

    F. Dellinger, J. Delon, Y . Gousseau, J. Michel, and F. Tupin, “Sar-sift: a sift-like algorithm for sar images,”IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 453–466, 2014

  22. [22]

    A novel active learning method in relevance feedback for content-based remote sensing image retrieval,

    B. Demir and L. Bruzzone, “A novel active learning method in relevance feedback for content-based remote sensing image retrieval,”IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2323–2334, 2014

  23. [23]

    Parameter sharing exploration and hetero- center triplet loss for visible-thermal person re-identification,

    H. Liu, X. Tan, and X. Zhou, “Parameter sharing exploration and hetero- center triplet loss for visible-thermal person re-identification,”IEEE Trans. Multimedia, vol. 23, pp. 4414–4425, 2020

  24. [24]

    Bridging the gap: Multi-level cross-modality joint alignment for visible-infrared person re-identification,

    T. Liang, Y . Jin, W. Liu, T. Wang, S. Feng, and Y . Li, “Bridging the gap: Multi-level cross-modality joint alignment for visible-infrared person re-identification,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 8, pp. 7683–7698, 2024

  25. [25]

    Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,

    H. Park, S. Lee, J. Lee, and B. Ham, “Learning by aligning: Visible- infrared person re-identification using cross-modal correspondences,” in Proc. IEEE/CVF ICCV, 2021, pp. 12 046–12 055

  26. [26]

    Deep learning for person re-identification: A survey and outlook,

    M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2021

  27. [27]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  28. [28]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inProc. ICML. PMLR, 2021, pp. 10 347–10 357

  29. [29]

    Transreid: Transformer-based object re-identification,

    S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “Transreid: Transformer-based object re-identification,” inProc. IEEE/CVF ICCV, 2021, pp. 15 013–15 022

  30. [30]

    A versatile framework for multi- scene person re-identification,

    W.-S. Zheng, J. Yan, and Y .-X. Peng, “A versatile framework for multi- scene person re-identification,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1362–1380, 2024

  31. [31]

    Beyond appearance: a semantic controllable self-supervised learning 13 framework for human-centric visual tasks,

    W. Chen, X. Xu, J. Jia, H. Luo, Y . Wang, F. Wang, R. Jin, and X. Sun, “Beyond appearance: a semantic controllable self-supervised learning 13 framework for human-centric visual tasks,” inProc. IEEE/CVF CVPR, 2023, pp. 15 050–15 061

  32. [32]

    Cross-modal ship re-identification via optical and sar imagery: A novel dataset and method,

    H. Wang, S. Li, J. Yang, Y . Liu, Y . Lv, and Z. Zhou, “Cross-modal ship re-identification via optical and sar imagery: A novel dataset and method,” inProc. IEEE/CVF ICCV, October 2025, pp. 7873–7883

  33. [33]

    Efficiently Modeling Long Sequences with Structured State Spaces

    A. Gu, K. Goel, and C. R ´e, “Efficiently modeling long sequences with structured state spaces,”arXiv preprint arXiv:2111.00396, 2021

  34. [34]

    S4nd: Modeling images and videos as multidimensional signals with state spaces,

    E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. R ´e, “S4nd: Modeling images and videos as multidimensional signals with state spaces,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 2846–2861, 2022

  35. [35]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,”arXiv preprint arXiv:2401.09417, 2024

  36. [36]

    Vmamba: Visual state space model,

    Y . Liu, Y . Tian, Y . Zhao, H. Yu, L. Xie, Y . Wang, Q. Ye, J. Jiao, and Y . Liu, “Vmamba: Visual state space model,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 103 031–103 063, 2024

  37. [37]

    Mambavision: A hybrid mamba- transformer vision backbone,

    A. Hatamizadeh and J. Kautz, “Mambavision: A hybrid mamba- transformer vision backbone,” inProc. IEEE/CVF CVPR, 2025, pp. 25 261–25 270

  38. [38]

    Mambahsi: Spatial–spectral mamba for hyperspectral image classification,

    Y . Li, Y . Luo, L. Zhang, Z. Wang, and B. Du, “Mambahsi: Spatial–spectral mamba for hyperspectral image classification,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–16, 2024

  39. [39]

    Samba: Semantic segmentation of remotely sensed images with state space model,

    Q. Zhu, Y . Cai, Y . b. Fang, Y . Yang, C. Chen, L. Fan, and A. Nguyen, “Samba: Semantic segmentation of remotely sensed images with state space model,”Heliyon, vol. 10, no. 19, 2024

  40. [40]

    Rsmamba: Remote sensing image classification with state space model,

    K. Chen, B. Chen, C. Liu, W. Li, Z. Zou, and Z. Shi, “Rsmamba: Remote sensing image classification with state space model,”IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024

  41. [41]

    Robust registration of multimodal remote sensing images based on structural similarity,

    Y . Ye, J. Shan, L. Bruzzone, and L. Shen, “Robust registration of multimodal remote sensing images based on structural similarity,”IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2941–2958, 2017

  42. [42]

    Robust optical-to-sar image matching based on shape properties,

    Y . Ye, L. Shen, M. Hao, J. Wang, and Z. Xu, “Robust optical-to-sar image matching based on shape properties,”IEEE Geosci. Remote Sens. Lett., vol. 14, no. 4, pp. 564–568, 2017

  43. [43]

    A new perspective on physics guided learning for sar image interpretation,

    Z. Wang, Z. Huang, and M. Datcu, “A new perspective on physics guided learning for sar image interpretation,” inProc. IEEE IGARSS. IEEE, 2023, pp. 1926–1929

  44. [44]

    Deep network for sar target recognition based on attribute scattering center convolutional kernel modulation,

    L. Yi, D. Lan, Z. Ke’er, and D. Yuang, “Deep network for sar target recognition based on attribute scattering center convolutional kernel modulation,”J. Radars, vol. 13, no. 2, pp. 443–456, 2024

  45. [45]

    Sar-gtr: Attributed scattering information guided sar graph transformer recognition algorithm,

    X. Xiong, X. Zhang, W. Jiang, L. Liu, Y . Liu, and T. Liu, “Sar-gtr: Attributed scattering information guided sar graph transformer recognition algorithm,”arXiv preprint arXiv:2505.08547, 2025

  46. [46]

    Deep learning-based sar target recognition: A dual-perspective survey of closed set and open set

    Y . Yang and H. Zhao, “Deep learning-based sar target recognition: A dual-perspective survey of closed set and open set.”Appl. Sci., vol. 15, no. 23, 2025

  47. [47]

    A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios,

    D. Li, Z. Zhang, X. Chen, and K. Huang, “A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios,”IEEE Trans. Image Process., vol. 28, no. 4, pp. 1575–1590, 2019

  48. [48]

    A public dataset for fine-grained ship classification in optical remote sensing images,

    Y . Di, Z. Jiang, and H. Zhang, “A public dataset for fine-grained ship classification in optical remote sensing images,”Remote Sens., vol. 13, no. 4, p. 747, 2021

  49. [49]

    Osdataset2. 0: Sar-optical image matching dataset and evaluation benchmark,

    Y . Xiang, J. Chen, Z. Hong, N. Jiao, F. Wang, H. You, and X. Tong, “Osdataset2. 0: Sar-optical image matching dataset and evaluation benchmark,”J. Radars, vol. 14, pp. 1–13, 2025

  50. [50]

    Bag-of-visual-words and spatial extensions for land-use classification,

    Y . Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” inProc. 18th SIGSPATIAL Int. Conf. Adv. Geogr . Inf. Syst., 2010, pp. 270–279

  51. [51]

    Structural high-resolution satellite image indexing,

    G.-S. Xia, W. Yang, J. Delon, Y . Gousseau, H. Sun, and H. Ma ˆıtre, “Structural high-resolution satellite image indexing,” inProc. ISPRS TC VII Symp., vol. 38, 2010, pp. 298–303

  52. [52]

    Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,

    Y . Li, Y . Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,”IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6521–6536, 2018

  53. [53]

    A deep cross-modal hashing technique for large-scale sar and vhr image retrieval,

    Y . Sun, S. Feng, Y . Ye, X. Li, and J. Kang, “A deep cross-modal hashing technique for large-scale sar and vhr image retrieval,” inProc. BIGSARDATA. IEEE, 2021, pp. 1–4

  54. [54]

    An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval,

    W. Xiong, Z. Xiong, Y . Cui, L. Huang, and R. Yang, “An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 6, pp. 2696–2712, 2022

  55. [55]

    Sar-optical feature matching: A large-scale patch dataset and a deep local descriptor,

    W. Xu, X. Yuan, Q. Hu, and J. Li, “Sar-optical feature matching: A large-scale patch dataset and a deep local descriptor,”Int. J. Appl. Earth Obs. Geoinf., vol. 122, p. 103433, 2023

  56. [56]

    Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,

    X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y . Feng, T. Xuet al., “Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,”ISPRS J. Photogramm. Remote Sens., vol. 184, pp. 116–130, 2022

  57. [57]

    FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition,

    X. Hou, W. Ao, Q. Songet al., “FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition,” Sci. China Inf. Sci., vol. 63, no. 4, p. 140303, 2020

  58. [58]

    Sar- aircraft-1.0: High-resolution sar aircraft detection and recognition dataset,

    W. Zhirui, K. Yuzhuo, Z. Xuan, W. Yuelei, Z. Ting, and S. Xian, “Sar- aircraft-1.0: High-resolution sar aircraft detection and recognition dataset,” J. Radars, vol. 12, no. 4, pp. 906–922, 2023

  59. [59]

    Air- sarship-1.0: High-resolution sar ship detection dataset,

    S. Xian, W. Zhirui, S. Yuanrui, D. Wenhui, Z. Yue, and F. Kun, “Air- sarship-1.0: High-resolution sar ship detection dataset,”J. Radars, vol. 8, no. 6, pp. 852–863, 2019

  60. [60]

    Fair-csar: A benchmark dataset for fine-grained object detection and recognition based on single- look complex sar images,

    Y . Wu, Y . Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y . Zhao, K. Fu, and X. Sun, “Fair-csar: A benchmark dataset for fine-grained object detection and recognition based on single- look complex sar images,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–22, 2025

  61. [61]

    Gaussian-adaptive bilateral filter,

    B.-H. Chen, Y .-S. Tseng, and J.-L. Yin, “Gaussian-adaptive bilateral filter,”IEEE Signal Process. Lett., vol. 27, pp. 1670–1674, 2020

  62. [62]

    Optimized laplacian image sharpening algorithm based on graphic processing unit,

    T. Ma, L. Li, S. Ji, X. Wang, Y . Tian, A. Al-Dhelaan, and M. Al- Rodhaan, “Optimized laplacian image sharpening algorithm based on graphic processing unit,”Physica A, vol. 416, pp. 400–410, 2014

  63. [63]

    Speckle noise reduction in sar imagery using a local adaptive median filter,

    F. Qiu, J. Berglund, J. R. Jensen, P. Thakkar, and D. Ren, “Speckle noise reduction in sar imagery using a local adaptive median filter,”GIScience Remote Sens., vol. 41, no. 3, pp. 244–266, 2004

  64. [64]

    Contrast limited adaptive local histogram equalization method for poor contrast image enhancement,

    I. M. Mohammed and N. A. M. Isa, “Contrast limited adaptive local histogram equalization method for poor contrast image enhancement,” IEEE Access, 2025

  65. [65]

    Physics- guided detector for sar airplanes,

    Z. Huang, L. Liu, S. M. Yang, Z. Wang, G. Cheng, and J. Han, “Physics- guided detector for sar airplanes,”IEEE Trans. Circuits Syst. Video Technol., 2025

  66. [66]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017

  67. [67]

    Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,

    Y . Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,” inProc. IEEE/CVF CVPR, June 2023, pp. 2153–2162

  68. [68]

    Ad- vancing ship re-identification in the wild: The shipreid-2400 benchmark dataset and d2internet baseline method,

    B. Liu, R. Huang, X. Pan, C. Li, J. Sun, J. Dong, and X. Wang, “Ad- vancing ship re-identification in the wild: The shipreid-2400 benchmark dataset and d2internet baseline method,” inProc. 48th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr ., 2025, pp. 106–115

  69. [69]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” inProc. IEEE CVPR, 2009, pp. 248–255