GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

arXiv: 2605.19734 · v1 · pith:O6SVGEZF · submitted 2026-05-19 · 💻 cs.CV

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao, Mi Wang

Pith reviewed 2026-05-20 05:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: cross-modal retrieval · optical-SAR · remote sensing · fine-grained object retrieval · geometric constraints · unaligned data · Mamba vision


The pith

GeoMamba adds geometric feature injection and consistency constraints to enable robust fine-grained retrieval between unaligned optical and SAR remote sensing images.
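The page does not say how GFI injects its structural priors. As a minimal sketch under that caveat, one plausible form pools a precomputed structure map (for instance an edge or corner response) down to the feature resolution and uses it as a residual gate on one modality's feature map. It does not model the cross-modal interaction part of GFI. The function names and the `alpha` strength parameter are hypothetical, and NumPy stands in for a deep learning framework.

```python
import numpy as np

def avg_pool_to(prior: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Block-average an (H, W) map to (out_h, out_w); H and W must be multiples."""
    h, w = prior.shape
    if h % out_h or w % out_w:
        raise ValueError("prior size must be a multiple of the feature size")
    return prior.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))

def inject_geometry(features: np.ndarray, prior: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Hypothetical GFI-style injection: gate (C, h, w) features with a pooled prior.

    prior: (H, W) non-negative structure map, e.g. an edge or corner response.
    alpha: hypothetical injection strength.
    """
    _, h, w = features.shape
    g = avg_pool_to(prior, h, w)
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)   # rescale gate to [0, 1]
    return features * (1.0 + alpha * g[None, :, :])  # residual-style amplification

if __name__ == "__main__":
    feats = np.ones((4, 2, 2))       # 4 channels on a 2x2 feature grid
    prior = np.zeros((8, 8))
    prior[:4, :4] = 1.0              # structure only in the top-left quadrant
    print(inject_geometry(feats, prior)[0])
```

A gate of the form `1 + alpha * g` leaves features unchanged wherever the normalized prior is zero (and everywhere if the prior is constant), so this variant can only amplify structured regions; a learned, cross-modal gate would be the natural replacement in a real model.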

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of retrieving specific objects across optical and SAR images when the views are not spatially aligned or paired. It builds a framework that injects structural geometric information into the features and applies layered constraints to keep object shapes intact despite noise and modality gaps. A new dataset of aerospace and maritime categories is introduced to test performance under realistic unaligned conditions. If the approach holds, it would let analysts combine complementary sensor data for more precise object identification without needing perfectly matched training pairs.

Core claim

GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors to make SAR representations more robust. A Geometric Consistency Constraint (GCC) module, combined with a deep supervision strategy, applies hierarchical geometric constraints via classical operators to preserve informative object structures during representation learning. Together, these components support effective fine-grained optical-SAR retrieval on the FGOS-as dataset under unaligned conditions.
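The Lean-theorem section of this page quotes the paper as naming two classical operators for GCC: the Sobel operator for optical contours and the Harris detector for SAR scattering centers. The sketch below computes both responses in plain NumPy, assuming 3×3 Sobel kernels, a Gaussian window with `sigma=1.0`, and the common Harris constant `k=0.05`; the paper's actual kernel sizes and constants are not given here.

```python
import numpy as np

# 3x3 Sobel kernels: SOBEL_X responds to horizontal intensity change, SOBEL_Y to vertical.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size 2-D cross-correlation with replicate padding (odd-sized kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude: a contour map for optical images."""
    return np.hypot(filter2d(img, SOBEL_X), filter2d(img, SOBEL_Y))

def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 2-D Gaussian window with radius round(3 * sigma)."""
    radius = max(1, int(round(3 * sigma)))
    ax = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-ax ** 2 / (2 * sigma ** 2))
    k2 = np.outer(g, g)
    return k2 / k2.sum()

def harris_response(img: np.ndarray, sigma: float = 1.0, k: float = 0.05) -> np.ndarray:
    """Harris corner response R = det(M) - k * trace(M)^2 (point-like structures)."""
    ix = filter2d(img, SOBEL_X)
    iy = filter2d(img, SOBEL_Y)
    w = gaussian_kernel(sigma)
    sxx = filter2d(ix * ix, w)   # entries of the structure tensor M,
    syy = filter2d(iy * iy, w)   # each smoothed by the Gaussian window
    sxy = filter2d(ix * iy, w)
    return sxx * syy - sxy ** 2 - k * (sxx + syy) ** 2
```

On a bright square, the Sobel map lights up along all four sides, while the Harris response is positive at the corners, negative along the straight sides, and zero in flat regions. That contour-versus-point distinction appears to be what the paper's choice of one operator per modality relies on.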

What carries the argument

The Geometric Feature Injection (GFI) module, which enhances cross-modal interaction and adds structural priors, combined with the Geometric Consistency Constraint (GCC) module that imposes hierarchical geometric constraints using classical operators.
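How the hierarchical constraints are wired is not specified on this page; the referee report asks for exactly these equations. One hypothetical reading attaches an auxiliary head to each backbone stage that predicts a structure map, then penalizes its L1 distance to the classical-operator map (Sobel or Harris) average-pooled to that stage's resolution, summed with per-stage weights. The head outputs, the L1 choice, and the weights are all assumptions.

```python
import numpy as np

def block_mean(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (H, W) map by an integer factor in both directions."""
    h, w = x.shape
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by the pooling factor")
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def normalize01(x: np.ndarray) -> np.ndarray:
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x, dtype=float)

def geometric_consistency_loss(stage_preds, target, weights=None):
    """Hypothetical deep-supervision loss over backbone stages.

    stage_preds: list of (h_s, w_s) structure maps from hypothetical per-stage heads.
    target:      (H, W) classical-operator map, pooled to each stage's resolution.
    weights:     per-stage weights (hypothetical; default 1.0 each).
    """
    weights = [1.0] * len(stage_preds) if weights is None else weights
    target = normalize01(target)
    total = 0.0
    for pred, wgt in zip(stage_preds, weights):
        factor = target.shape[0] // pred.shape[0]
        tgt = block_mean(target, factor)
        if tgt.shape != pred.shape:
            raise ValueError(f"stage map {pred.shape} does not tile target {target.shape}")
        total += wgt * float(np.abs(pred - tgt).mean())  # per-stage L1 term
    return total
```

Average pooling keeps the coarse stages supervised by a blurred version of the same geometry, so every stage is pulled toward one consistent structure map rather than toward unrelated targets.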

If this is right

Cross-modal representations become learnable from unaligned optical-SAR pairs in practical remote sensing scenarios.
Object structures remain more intact through the representation learning process.
The framework supports all-to-all retrieval settings where queries and gallery items come from mixed modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric injection pattern could extend to other sensor pairs such as optical and infrared for similar unaligned retrieval tasks.
The FGOS-as dataset provides a benchmark that highlights limitations of paired-data methods when applied to real-world misaligned imagery.
Additional classical geometric operators might be swapped in to target specific object categories like ships or aircraft more precisely.

Load-bearing premise

That injecting geometric features and enforcing consistency constraints with classical operators will sufficiently reduce modality gaps, speckle noise, and structural differences to support reliable cross-modal learning without aligned samples.
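The premise hinges on speckle. A standard model (assumed here, not taken from the paper) treats fully developed speckle in an L-look SAR intensity image as multiplicative noise drawn from a unit-mean Gamma distribution with shape L and scale 1/L. The sketch simulates it on a perfectly flat scene and measures a simple central-difference gradient: any nonzero response is spurious structure that a gradient-based prior would pick up, and it shrinks as the number of looks grows.

```python
import numpy as np

def add_speckle(intensity: np.ndarray, looks: int, rng: np.random.Generator) -> np.ndarray:
    """Multiply an intensity image by unit-mean Gamma(shape=looks, scale=1/looks) speckle."""
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=intensity.shape)
    return intensity * noise

def mean_abs_gradient(img: np.ndarray) -> float:
    """Mean absolute central-difference gradient; exactly 0 for a constant image."""
    gx = img[:, 2:] - img[:, :-2]
    gy = img[2:, :] - img[:-2, :]
    return float(np.abs(gx).mean() + np.abs(gy).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.ones((128, 128))   # homogeneous scene: no true structure
    for looks in (1, 4, 16):
        noisy = add_speckle(flat, looks, rng)
        print(looks, round(noisy.mean(), 3), round(mean_abs_gradient(noisy), 3))
```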

What would settle it

A controlled test on the FGOS-as dataset that disables the Geometric Feature Injection and Geometric Consistency Constraint modules and checks whether retrieval performance drops to or below levels achieved by prior methods without these geometric additions.

Figures

Figures reproduced from arXiv: 2605.19734 by Jing Xiao, Liang Liao, Mi Wang, Tiantong Fang, Wujie Zhou, Xiuwei Wang.

**Figure 2.** Overview of the proposed FGOS-as dataset. (a) Workflow of the FGOS-as dataset construction, illustrating the stages of multi-source acquisition, …

**Figure 3.** Architecture of the proposed method. (a) GeoMamba framework, which features a dual-stream MambaVision backbone for extracting spatial …

**Figure 4.** Qualitative comparison of fine-grained retrieval results. Panel (a) shows the query samples, panel (b) displays the retrieval results by TransOSS, and …

**Figure 5.** t-SNE visualization of the learned feature embeddings on the FGOS-as dataset. The axes represent the 2D projected embedding space. Circles denote …

**Figure 6.** Visualization of cross-modal feature activation maps for optical and SAR images across different module combinations. The baseline (b) struggles …
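Figure 3 describes a dual-stream MambaVision backbone, and MambaVision [37] builds on structured state-space models. For readers new to SSMs, the sketch below shows the core recurrence for a single channel with a diagonal state matrix: zero-order-hold discretization gives Ā = exp(ΔA) and B̄ = (exp(ΔA) − 1)/A · B, then h_t = Ā h_{t−1} + B̄ x_t and y_t = C h_t. This is only the plain linear time-invariant scan; Mamba's selective variant makes Δ, B, and C input-dependent, and MambaVision adds further mixer and self-attention blocks, none of which are shown.

```python
import numpy as np

def ssm_scan(x, a, b, c, delta):
    """Linear time-invariant diagonal SSM with zero-order-hold discretization.

    x:     (T,) input sequence for one channel.
    a:     (N,) diagonal of the continuous state matrix A (all entries < 0).
    b, c:  (N,) input and output projections B and C.
    delta: discretization step size.
    Returns y of shape (T,).
    """
    x, a, b, c = (np.asarray(v, dtype=float) for v in (x, a, b, c))
    if np.any(a >= 0):
        raise ValueError("diagonal entries of A must be negative for a stable scan")
    a_bar = np.exp(delta * a)          # A_bar = exp(delta * A)
    b_bar = (a_bar - 1.0) / a * b      # B_bar = (exp(delta * A) - 1) / A * B
    h = np.zeros_like(a)
    y = np.empty_like(x)
    for t, xt in enumerate(x):
        h = a_bar * h + b_bar * xt     # state update
        y[t] = c @ h                   # readout
    return y
```

With A = −1, B = C = 1, and a constant input of 1, the output rises toward the steady state −C·B/A = 1. Making `delta` a function of the input is the step that turns this into a selective scan.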

Original abstract

Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.
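The abstract reports mAP and Rank-1 in an all-to-all setting without defining the protocol on this page. A minimal sketch, assuming every image (optical or SAR) is used once as a query against all other images, a gallery item counts as a match when it shares the query's label, and queries without any match are skipped: Rank-1 is the fraction of queries whose nearest neighbor is a match, and AP averages the precision at the rank of each true match.

```python
import numpy as np

def rank1_and_map(dist, labels):
    """Rank-1 accuracy and mAP for all-to-all retrieval.

    dist:   (N, N) pairwise distances over all images, optical and SAR pooled.
    labels: (N,) ground-truth labels; equal labels count as a match.
    Each image queries all *other* images; queries with no match are skipped.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    hits, aps = [], []
    for q in range(n):
        gallery = np.delete(np.arange(n), q)                # exclude the query itself
        order = gallery[np.argsort(dist[q, gallery], kind="stable")]
        matches = labels[order] == labels[q]
        if not matches.any():
            continue
        hits.append(bool(matches[0]))                       # top-1 result is a match
        ranks = np.flatnonzero(matches) + 1                 # 1-based ranks of matches
        aps.append(np.mean(np.arange(1, len(ranks) + 1) / ranks))  # precision at each hit
    if not aps:
        raise ValueError("no query has a matching gallery item")
    return float(np.mean(hits)), float(np.mean(aps))
```

Whether the paper restricts galleries to the opposite modality, or matches by object identity rather than fine-grained class, would change both numbers; the sketch pools everything, which is one reading of "all-to-all".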

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoMamba adds GFI and GCC modules plus a new FGOS-as dataset for unaligned optical-SAR retrieval, but the 63.3% mAP claim rests on unshown ablations.


The main takeaway is that this paper targets unaligned optical-SAR fine-grained retrieval with a Mamba-based model that injects geometric priors and applies consistency constraints. They also release FGOS-as, a dataset with 11 aerospace and maritime categories built for realistic, non-aligned conditions. That combination is the concrete new element here. The approach makes sense on paper: GFI brings structural information into the features to handle speckle and modality gaps, while GCC uses classical operators under deep supervision to keep object geometry intact during learning. Those choices directly address the practical issues the abstract lays out. The reported 63.3% mAP and 77.0% Rank-1 on all-to-all retrieval look like a step forward if the gains hold up.

The soft spots are the missing pieces that would let a reader verify the attribution. The abstract gives no ablation tables isolating GFI or GCC against a plain MambaVision baseline, no error bars, and no breakdown of dataset construction or splits. Without those, it is difficult to separate real module contributions from possible artifacts in how the new data was collected or balanced. The stress-test note flags exactly this gap, and the abstract-level description does not close it.

This paper is for researchers working on cross-modal retrieval in remote sensing or earth observation. A reader already following Mamba variants or geometry-aware multimodal methods could extract usable ideas from the modules and the dataset release. It deserves peer review. The problem is well-defined, the dataset is new, and the framework is specific enough that referees can check the experiments and ask for the necessary controls.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes GeoMamba, a geometry-driven MambaVision framework for fine-grained optical-SAR object retrieval under unaligned conditions. It introduces a Geometric Feature Injection (GFI) module to enhance cross-modal interaction via structural priors and a Geometric Consistency Constraint (GCC) module with Deep Supervision that applies hierarchical classical operators. The authors construct a new FGOS-as dataset covering 11 aerospace and maritime categories and report that GeoMamba achieves 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting, outperforming existing methods.

Significance. If the experimental claims are substantiated, the work could advance cross-modal retrieval in remote sensing by demonstrating how geometric priors and consistency constraints can mitigate modality gaps, speckle noise, and structural inconsistencies in practical unaligned optical-SAR scenarios. The FGOS-as dataset may provide a valuable benchmark for aerospace and maritime applications.

major comments (3)

[§4 (Experiments)] The central performance claims (63.3% mAP, 77.0% Rank-1) are presented without ablation tables that compare GeoMamba against a plain MambaVision backbone on identical FGOS-as splits and protocol. This omission is load-bearing because the attribution of robustness to GFI + GCC cannot be verified without isolating their contribution from the backbone or dataset properties.
[§3 (Method)] The descriptions of the Geometric Feature Injection module and the Geometric Consistency Constraint (including how classical operators are applied hierarchically under Deep Supervision) remain high-level. Without explicit equations, algorithmic pseudocode, or implementation details, it is not possible to assess whether these components sufficiently close the modality gap or are reproducible.
[§4.1 (Dataset)] No statistics or construction details are provided for FGOS-as (e.g., total images per category, train/test split ratios, selection criteria, or checks against implicit spatial alignment leakage). These are required to confirm that the reported gains are not artifacts of dataset curation under the unaligned all-to-all protocol.
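The FGOS-as metadata schema is not described on this page, so the following is a hypothetical sketch of the kind of check the referee requests. It assumes each image carries a modality tag, an object identity, and a scene identifier. It reports identities or scenes shared between train and test (relevant if the protocol uses identity-disjoint splits) and objects whose optical and SAR images come from the same scene, which could indicate co-registered, implicitly aligned pairs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    # Hypothetical metadata fields; the real FGOS-as schema is not described here.
    modality: str   # "optical" or "sar"
    object_id: str  # identity of the imaged object
    scene_id: str   # acquisition scene or geographic tile

def split_leakage(train: list[Sample], test: list[Sample]) -> dict[str, set[str]]:
    """Identities and scenes present in both splits (expected empty for disjoint splits)."""
    return {
        "object_id": {s.object_id for s in train} & {s.object_id for s in test},
        "scene_id": {s.scene_id for s in train} & {s.scene_id for s in test},
    }

def colocated_pairs(samples: list[Sample]) -> set[str]:
    """Objects with both an optical and a SAR image from the same scene,
    a possible sign of co-registered (implicitly aligned) pairs."""
    modalities: dict[tuple[str, str], set[str]] = {}
    for s in samples:
        modalities.setdefault((s.object_id, s.scene_id), set()).add(s.modality)
    return {oid for (oid, _), mods in modalities.items() if {"optical", "sar"} <= mods}
```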

minor comments (1)

[Abstract] Consider specifying the total number of images or samples in FGOS-as to give readers immediate context for the scale of the new benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are warranted, we will incorporate the suggested changes in the next version of the paper to strengthen clarity, reproducibility, and experimental rigor.


Referee: [§4 (Experiments)] The central performance claims (63.3% mAP, 77.0% Rank-1) are presented without ablation tables that compare GeoMamba against a plain MambaVision backbone on identical FGOS-as splits and protocol. This omission is load-bearing because the attribution of robustness to GFI + GCC cannot be verified without isolating their contribution from the backbone or dataset properties.

Authors: We agree that isolating the contributions of the GFI and GCC modules from the underlying MambaVision backbone is important for substantiating our claims. The current manuscript reports overall performance but does not include dedicated ablation tables on the exact same FGOS-as splits and retrieval protocol. In the revised version, we will add these ablation studies, including direct comparisons of the plain backbone versus variants with GFI, GCC, and their combination, to clearly attribute the observed gains. Revision: yes.
Referee: [§3 (Method)] The descriptions of the Geometric Feature Injection module and the Geometric Consistency Constraint (including how classical operators are applied hierarchically under Deep Supervision) remain high-level. Without explicit equations, algorithmic pseudocode, or implementation details, it is not possible to assess whether these components sufficiently close the modality gap or are reproducible.

Authors: We acknowledge that the method section presents the GFI and GCC modules at a conceptual level without sufficient mathematical detail. To improve reproducibility and allow assessment of how these components address modality gaps, we will expand §3 in the revision with explicit equations for the Geometric Feature Injection process and the hierarchical application of classical operators within the GCC module under Deep Supervision. We will also include algorithmic pseudocode and key implementation hyperparameters. Revision: yes.
Referee: [§4.1 (Dataset)] No statistics or construction details are provided for FGOS-as (e.g., total images per category, train/test split ratios, selection criteria, or checks against implicit spatial alignment leakage). These are required to confirm that the reported gains are not artifacts of dataset curation under the unaligned all-to-all protocol.

Authors: We recognize that detailed dataset documentation is essential for validating the experimental protocol. The current manuscript introduces FGOS-as but omits granular statistics and construction specifics. In the revised version, we will add a dedicated subsection or table in §4.1 reporting the total images per category, train/test split ratios, selection criteria for the 11 aerospace and maritime classes, and explicit checks or design choices to minimize any potential spatial alignment leakage under the unaligned all-to-all setting. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical framework and results are self-contained

full rationale

The paper introduces a new architecture (GeoMamba with GFI and GCC modules) and a new dataset (FGOS-as), then reports experimental retrieval metrics on that dataset. No equations, predictions, or first-principles claims are shown to reduce by construction to fitted inputs, self-citations, or renamed known results. The performance numbers (63.3% mAP, 77.0% Rank-1) are presented as outcomes of standard training and evaluation rather than tautological re-statements of the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the modules are presented as novel but their internal assumptions are not detailed.

pith-pipeline@v0.9.0 · 5772 in / 1058 out tokens · 36946 ms · 2026-05-20T05:28:56.648807+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem:

Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors... Geometric Consistency Constraint (GCC) module... using classical operators, i.e., the Sobel operator for optical contours and the Harris detector for SAR scattering centers.
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem:

dual-stream MambaVision backbone... State Space Models (SSMs)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 3 internal anchors

[1] F. Tupin and M. Roux, “Detection of building outlines based on the fusion of SAR and optical features,” ISPRS J. Photogramm. Remote Sens., vol. 58, no. 1-2, pp. 71–82, 2003.
[2] C. Wang, W. S. Lu, X. M. Li, J. Yang, and L. Luo, “M4-SAR: A multi-resolution, multi-polarization, multi-scene, multi-source dataset and benchmark for optical-SAR fusion object detection,” arXiv preprint arXiv:2505.10931, 2025.
[3] Z. Zhao, Y. Xu, A. Lu, C. Li, and J. Tang, “Towards robust optical-SAR object detection under missing modalities: A dynamic quality-aware fusion framework,” arXiv preprint arXiv:2512.22447, 2025.
[4] Y. Zhang, L. Guo, Z. Wang, Y. Yu, X. Liu, and F. Xu, “Intelligent ship detection in remote sensing images based on multi-layer convolutional feature fusion,” Remote Sens., vol. 12, no. 20, p. 3316, 2020.
[5] Y. Kwak, A. Yorozuya, and Y. Iwami, “Disaster risk reduction using image fusion of optical and SAR data before and after tsunami,” in Proc. IEEE Aerosp. Conf. IEEE, 2016, pp. 1–11.
[6] T. Yu, S. W. Xu, B. Y. Tao, and W. Z. Shao, “Coastline detection using optical and synthetic aperture radar images,” Adv. Space Res., vol. 70, no. 1, pp. 70–84, 2022.
[7] Z. Zhang, L. Zhang, J. Wu, and W. Guo, “Optical and synthetic aperture radar image fusion for ship detection and recognition: Current state, challenges, and future prospects,” IEEE Geosci. Remote Sens. Mag., vol. 12, no. 4, pp. 132–168, 2024.
[8] M. Ahmed, N. El-Sheimy, and H. Leung, “Dual-modal approach for ship detection: Fusing synthetic aperture radar and optical satellite imagery,” Sensors, vol. 25, no. 2, p. 329, 2025.
[9] M. Rane and S. Kumar, “Machine learning based aircraft detection using SAR & optical images,” ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci., vol. 10, pp. 529–535, 2025.
[10] W. Xiong, Z. Xiong, Y. Zhang, Y. Cui, and X. Gu, “A deep cross-modality hashing network for SAR and optical remote sensing images retrieval,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 5284–5296, 2020.
[11] Y. Sun, S. Feng, Y. Ye, X. Li, J. Kang, Z. Huang, and C. Luo, “Multisensor fusion and explicit semantic preserving-based deep hashing for cross-modal remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–14, 2021.
[12] J. Huang, Y. Feng, M. Zhou, X. Xiong, Y. Wang, and B. Qiang, “Deep multiscale fine-grained hashing for remote sensing cross-modal retrieval,” IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024.
[13] J. Yang and Y. Tang, “Cross-modal retrieval algorithm based on patch aggregation,” in Proc. 4th Int. Conf. Electron. Inf. Technol. (EIT). IEEE, 2025, pp. 632–637.
[14] Z. Luo, M. Meng, and J. Wu, “Dynamic patch selection and dual-granularity alignment for cross-modal retrieval,” Neurocomputing, p. 132999, 2026.
[15] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[16] A. A. Nielsen, “Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data,” IEEE Trans. Image Process., vol. 11, no. 3, pp. 293–305, 2002.
[17] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, no. 2, 2009.
[18] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in Proc. IEEE CVPR. IEEE, 2010, pp. 3594–3601.
[19] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in Proc. IJCAI, vol. 22, no. 1, 2011, p. 1360.
[20] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE CVPR. IEEE, 2012, pp. 2160–2167.
[21] F. Dellinger, J. Delon, Y. Gousseau, J. Michel, and F. Tupin, “SAR-SIFT: A SIFT-like algorithm for SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 453–466, 2014.
[22] B. Demir and L. Bruzzone, “A novel active learning method in relevance feedback for content-based remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2323–2334, 2014.
[23] H. Liu, X. Tan, and X. Zhou, “Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification,” IEEE Trans. Multimedia, vol. 23, pp. 4414–4425, 2020.
[24] T. Liang, Y. Jin, W. Liu, T. Wang, S. Feng, and Y. Li, “Bridging the gap: Multi-level cross-modality joint alignment for visible-infrared person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 8, pp. 7683–7698, 2024.
[25] H. Park, S. Lee, J. Lee, and B. Ham, “Learning by aligning: Visible-infrared person re-identification using cross-modal correspondences,” in Proc. IEEE/CVF ICCV, 2021, pp. 12046–12055.
[26] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2021.
[27] A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[28] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. ICML. PMLR, 2021, pp. 10347–10357.
[29] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “TransReID: Transformer-based object re-identification,” in Proc. IEEE/CVF ICCV, 2021, pp. 15013–15022.
[30] W.-S. Zheng, J. Yan, and Y.-X. Peng, “A versatile framework for multi-scene person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1362–1380, 2024.
[31] W. Chen, X. Xu, J. Jia, H. Luo, Y. Wang, F. Wang, R. Jin, and X. Sun, “Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks,” in Proc. IEEE/CVF CVPR, 2023, pp. 15050–15061.
[32] H. Wang, S. Li, J. Yang, Y. Liu, Y. Lv, and Z. Zhou, “Cross-modal ship re-identification via optical and SAR imagery: A novel dataset and method,” in Proc. IEEE/CVF ICCV, Oct. 2025, pp. 7873–7883.
[33] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
[34] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. Ré, “S4ND: Modeling images and videos as multidimensional signals with state spaces,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 2846–2861, 2022.
[35] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision Mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
[36] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “VMamba: Visual state space model,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 103031–103063, 2024.
[37] A. Hatamizadeh and J. Kautz, “MambaVision: A hybrid Mamba-Transformer vision backbone,” in Proc. IEEE/CVF CVPR, 2025, pp. 25261–25270.
[38] Y. Li, Y. Luo, L. Zhang, Z. Wang, and B. Du, “MambaHSI: Spatial–spectral Mamba for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–16, 2024.
[39] Q. Zhu, Y. Cai, Y. b. Fang, Y. Yang, C. Chen, L. Fan, and A. Nguyen, “Samba: Semantic segmentation of remotely sensed images with state space model,” Heliyon, vol. 10, no. 19, 2024.
[40] K. Chen, B. Chen, C. Liu, W. Li, Z. Zou, and Z. Shi, “RSMamba: Remote sensing image classification with state space model,” IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024.
[41] Y. Ye, J. Shan, L. Bruzzone, and L. Shen, “Robust registration of multimodal remote sensing images based on structural similarity,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2941–2958, 2017.
[42] Y. Ye, L. Shen, M. Hao, J. Wang, and Z. Xu, “Robust optical-to-SAR image matching based on shape properties,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 4, pp. 564–568, 2017.
[43] Z. Wang, Z. Huang, and M. Datcu, “A new perspective on physics guided learning for SAR image interpretation,” in Proc. IEEE IGARSS. IEEE, 2023, pp. 1926–1929.
[44] L. Yi, D. Lan, Z. Ke’er, and D. Yuang, “Deep network for SAR target recognition based on attribute scattering center convolutional kernel modulation,” J. Radars, vol. 13, no. 2, pp. 443–456, 2024.
[45] X. Xiong, X. Zhang, W. Jiang, L. Liu, Y. Liu, and T. Liu, “SAR-GTR: Attributed scattering information guided SAR graph transformer recognition algorithm,” arXiv preprint arXiv:2505.08547, 2025.
[46] Y. Yang and H. Zhao, “Deep learning-based SAR target recognition: A dual-perspective survey of closed set and open set,” Appl. Sci., vol. 15, no. 23, 2025.
[47] D. Li, Z. Zhang, X. Chen, and K. Huang, “A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1575–1590, 2019.
[48] Y. Di, Z. Jiang, and H. Zhang, “A public dataset for fine-grained ship classification in optical remote sensing images,” Remote Sens., vol. 13, no. 4, p. 747, 2021.
[49] Y. Xiang, J. Chen, Z. Hong, N. Jiao, F. Wang, H. You, and X. Tong, “OSDataset2.0: SAR-optical image matching dataset and evaluation benchmark,” J. Radars, vol. 14, pp. 1–13, 2025.
[50] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proc. 18th SIGSPATIAL Int. Conf. Adv. Geogr. Inf. Syst., 2010, pp. 270–279.
[51]

Structural high-resolution satellite image indexing,

G.-S. Xia, W. Yang, J. Delon, Y . Gousseau, H. Sun, and H. Ma ˆıtre, “Structural high-resolution satellite image indexing,” inProc. ISPRS TC VII Symp., vol. 38, 2010, pp. 298–303

work page 2010
[52]

Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,

Y . Li, Y . Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,”IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6521–6536, 2018

work page 2018
[53]

A deep cross-modal hashing technique for large-scale sar and vhr image retrieval,

Y . Sun, S. Feng, Y . Ye, X. Li, and J. Kang, “A deep cross-modal hashing technique for large-scale sar and vhr image retrieval,” inProc. BIGSARDATA. IEEE, 2021, pp. 1–4

work page 2021
[54]

An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval,

W. Xiong, Z. Xiong, Y . Cui, L. Huang, and R. Yang, “An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 6, pp. 2696–2712, 2022

work page 2022
[55]

Sar-optical feature matching: A large-scale patch dataset and a deep local descriptor,

W. Xu, X. Yuan, Q. Hu, and J. Li, “Sar-optical feature matching: A large-scale patch dataset and a deep local descriptor,”Int. J. Appl. Earth Obs. Geoinf., vol. 122, p. 103433, 2023

work page 2023
[56]

Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,

X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y . Feng, T. Xuet al., “Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,”ISPRS J. Photogramm. Remote Sens., vol. 184, pp. 116–130, 2022

work page 2022
[57]

FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition,

X. Hou, W. Ao, Q. Songet al., “FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition,” Sci. China Inf. Sci., vol. 63, no. 4, p. 140303, 2020

work page 2020
[58]

Sar- aircraft-1.0: High-resolution sar aircraft detection and recognition dataset,

W. Zhirui, K. Yuzhuo, Z. Xuan, W. Yuelei, Z. Ting, and S. Xian, “Sar- aircraft-1.0: High-resolution sar aircraft detection and recognition dataset,” J. Radars, vol. 12, no. 4, pp. 906–922, 2023

work page 2023
[59]

Air- sarship-1.0: High-resolution sar ship detection dataset,

S. Xian, W. Zhirui, S. Yuanrui, D. Wenhui, Z. Yue, and F. Kun, “Air- sarship-1.0: High-resolution sar ship detection dataset,”J. Radars, vol. 8, no. 6, pp. 852–863, 2019

work page 2019
[60]

Fair-csar: A benchmark dataset for fine-grained object detection and recognition based on single- look complex sar images,

Y . Wu, Y . Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y . Zhao, K. Fu, and X. Sun, “Fair-csar: A benchmark dataset for fine-grained object detection and recognition based on single- look complex sar images,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–22, 2025

work page 2025
[61]

Gaussian-adaptive bilateral filter,

B.-H. Chen, Y .-S. Tseng, and J.-L. Yin, “Gaussian-adaptive bilateral filter,”IEEE Signal Process. Lett., vol. 27, pp. 1670–1674, 2020

work page 2020
[62]

Optimized laplacian image sharpening algorithm based on graphic processing unit,

T. Ma, L. Li, S. Ji, X. Wang, Y . Tian, A. Al-Dhelaan, and M. Al- Rodhaan, “Optimized laplacian image sharpening algorithm based on graphic processing unit,”Physica A, vol. 416, pp. 400–410, 2014

work page 2014
[63]

Speckle noise reduction in sar imagery using a local adaptive median filter,

F. Qiu, J. Berglund, J. R. Jensen, P. Thakkar, and D. Ren, “Speckle noise reduction in sar imagery using a local adaptive median filter,”GIScience Remote Sens., vol. 41, no. 3, pp. 244–266, 2004

work page 2004
[64]

Contrast limited adaptive local histogram equalization method for poor contrast image enhancement,

I. M. Mohammed and N. A. M. Isa, “Contrast limited adaptive local histogram equalization method for poor contrast image enhancement,” IEEE Access, 2025

work page 2025
[65]

Physics- guided detector for sar airplanes,

Z. Huang, L. Liu, S. M. Yang, Z. Wang, G. Cheng, and J. Han, “Physics- guided detector for sar airplanes,”IEEE Trans. Circuits Syst. Video Technol., 2025

work page 2025
[66]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017

work page 2017
[67]

Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,

Y . Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re- identification,” inProc. IEEE/CVF CVPR, June 2023, pp. 2153–2162

work page 2023
[68]

Ad- vancing ship re-identification in the wild: The shipreid-2400 benchmark dataset and d2internet baseline method,

B. Liu, R. Huang, X. Pan, C. Li, J. Sun, J. Dong, and X. Wang, “Ad- vancing ship re-identification in the wild: The shipreid-2400 benchmark dataset and d2internet baseline method,” inProc. 48th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr ., 2025, pp. 106–115

work page 2025
[69]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” inProc. IEEE CVPR, 2009, pp. 248–255

work page 2009

[1] F. Tupin and M. Roux, “Detection of building outlines based on the fusion of SAR and optical features,” ISPRS J. Photogramm. Remote Sens., vol. 58, no. 1-2, pp. 71–82, 2003.

[2] C. Wang, W. S. Lu, X. M. Li, J. Yang, and L. Luo, “M4-SAR: A multi-resolution, multi-polarization, multi-scene, multi-source dataset and benchmark for optical-SAR fusion object detection,” arXiv preprint arXiv:2505.10931, 2025.

[3] Z. Zhao, Y. Xu, A. Lu, C. Li, and J. Tang, “Towards robust optical-SAR object detection under missing modalities: A dynamic quality-aware fusion framework,” arXiv preprint arXiv:2512.22447, 2025.

[4] Y. Zhang, L. Guo, Z. Wang, Y. Yu, X. Liu, and F. Xu, “Intelligent ship detection in remote sensing images based on multi-layer convolutional feature fusion,” Remote Sens., vol. 12, no. 20, p. 3316, 2020.

[5] Y. Kwak, A. Yorozuya, and Y. Iwami, “Disaster risk reduction using image fusion of optical and SAR data before and after tsunami,” in Proc. IEEE Aerosp. Conf. IEEE, 2016, pp. 1–11.

[6] T. Yu, S. W. Xu, B. Y. Tao, and W. Z. Shao, “Coastline detection using optical and synthetic aperture radar images,” Adv. Space Res., vol. 70, no. 1, pp. 70–84, 2022.

[7] Z. Zhang, L. Zhang, J. Wu, and W. Guo, “Optical and synthetic aperture radar image fusion for ship detection and recognition: Current state, challenges, and future prospects,” IEEE Geosci. Remote Sens. Mag., vol. 12, no. 4, pp. 132–168, 2024.

[8] M. Ahmed, N. El-Sheimy, and H. Leung, “Dual-modal approach for ship detection: Fusing synthetic aperture radar and optical satellite imagery,” Sensors, vol. 25, no. 2, p. 329, 2025.

[9] M. Rane and S. Kumar, “Machine learning based aircraft detection using SAR & optical images,” ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci., vol. 10, pp. 529–535, 2025.

[10] W. Xiong, Z. Xiong, Y. Zhang, Y. Cui, and X. Gu, “A deep cross-modality hashing network for SAR and optical remote sensing images retrieval,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 5284–5296, 2020.

[11] Y. Sun, S. Feng, Y. Ye, X. Li, J. Kang, Z. Huang, and C. Luo, “Multisensor fusion and explicit semantic preserving-based deep hashing for cross-modal remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–14, 2021.

[12] J. Huang, Y. Feng, M. Zhou, X. Xiong, Y. Wang, and B. Qiang, “Deep multiscale fine-grained hashing for remote sensing cross-modal retrieval,” IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024.

[13] J. Yang and Y. Tang, “Cross-modal retrieval algorithm based on patch aggregation,” in Proc. 4th Int. Conf. Electron. Inf. Technol. (EIT). IEEE, 2025, pp. 632–637.

[14] Z. Luo, M. Meng, and J. Wu, “Dynamic patch selection and dual-granularity alignment for cross-modal retrieval,” Neurocomputing, p. 132999, 2026.

[15] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.

[16] A. A. Nielsen, “Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data,” IEEE Trans. Image Process., vol. 11, no. 3, pp. 293–305, 2002.

[17] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, no. 2, 2009.

[18] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in Proc. IEEE CVPR. IEEE, 2010, pp. 3594–3601.

[19] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in Proc. IJCAI, vol. 22, no. 1, 2011, p. 1360.

[20] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE CVPR. IEEE, 2012, pp. 2160–2167.

[21] F. Dellinger, J. Delon, Y. Gousseau, J. Michel, and F. Tupin, “SAR-SIFT: A SIFT-like algorithm for SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 453–466, 2014.

[22] B. Demir and L. Bruzzone, “A novel active learning method in relevance feedback for content-based remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2323–2334, 2014.

[23] H. Liu, X. Tan, and X. Zhou, “Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification,” IEEE Trans. Multimedia, vol. 23, pp. 4414–4425, 2020.

[24] T. Liang, Y. Jin, W. Liu, T. Wang, S. Feng, and Y. Li, “Bridging the gap: Multi-level cross-modality joint alignment for visible-infrared person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 8, pp. 7683–7698, 2024.

[25] H. Park, S. Lee, J. Lee, and B. Ham, “Learning by aligning: Visible-infrared person re-identification using cross-modal correspondences,” in Proc. IEEE/CVF ICCV, 2021, pp. 12046–12055.

[26] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2021.

[27] A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

[28] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. ICML. PMLR, 2021, pp. 10347–10357.

[29] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “TransReID: Transformer-based object re-identification,” in Proc. IEEE/CVF ICCV, 2021, pp. 15013–15022.

[30] W.-S. Zheng, J. Yan, and Y.-X. Peng, “A versatile framework for multi-scene person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1362–1380, 2024.

[31] W. Chen, X. Xu, J. Jia, H. Luo, Y. Wang, F. Wang, R. Jin, and X. Sun, “Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks,” in Proc. IEEE/CVF CVPR, 2023, pp. 15050–15061.

[32] H. Wang, S. Li, J. Yang, Y. Liu, Y. Lv, and Z. Zhou, “Cross-modal ship re-identification via optical and SAR imagery: A novel dataset and method,” in Proc. IEEE/CVF ICCV, October 2025, pp. 7873–7883.

[33] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.

[34] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. Ré, “S4ND: Modeling images and videos as multidimensional signals with state spaces,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, pp. 2846–2861, 2022.

[35] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision Mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.

[36] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “VMamba: Visual state space model,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 103031–103063, 2024.

[37] A. Hatamizadeh and J. Kautz, “MambaVision: A hybrid Mamba-Transformer vision backbone,” in Proc. IEEE/CVF CVPR, 2025, pp. 25261–25270.

[38] Y. Li, Y. Luo, L. Zhang, Z. Wang, and B. Du, “MambaHSI: Spatial–spectral Mamba for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–16, 2024.

[39] Q. Zhu, Y. Cai, Y. b. Fang, Y. Yang, C. Chen, L. Fan, and A. Nguyen, “Samba: Semantic segmentation of remotely sensed images with state space model,” Heliyon, vol. 10, no. 19, 2024.

[40] K. Chen, B. Chen, C. Liu, W. Li, Z. Zou, and Z. Shi, “RSMamba: Remote sensing image classification with state space model,” IEEE Geosci. Remote Sens. Lett., vol. 21, pp. 1–5, 2024.

[41] Y. Ye, J. Shan, L. Bruzzone, and L. Shen, “Robust registration of multimodal remote sensing images based on structural similarity,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2941–2958, 2017.

[42] Y. Ye, L. Shen, M. Hao, J. Wang, and Z. Xu, “Robust optical-to-SAR image matching based on shape properties,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 4, pp. 564–568, 2017.

[43] Z. Wang, Z. Huang, and M. Datcu, “A new perspective on physics guided learning for SAR image interpretation,” in Proc. IEEE IGARSS. IEEE, 2023, pp. 1926–1929.

[44] L. Yi, D. Lan, Z. Ke’er, and D. Yuang, “Deep network for SAR target recognition based on attribute scattering center convolutional kernel modulation,” J. Radars, vol. 13, no. 2, pp. 443–456, 2024.

[45] X. Xiong, X. Zhang, W. Jiang, L. Liu, Y. Liu, and T. Liu, “SAR-GTR: Attributed scattering information guided SAR graph transformer recognition algorithm,” arXiv preprint arXiv:2505.08547, 2025.

[46] Y. Yang and H. Zhao, “Deep learning-based SAR target recognition: A dual-perspective survey of closed set and open set,” Appl. Sci., vol. 15, no. 23, 2025.

[47] D. Li, Z. Zhang, X. Chen, and K. Huang, “A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1575–1590, 2019.

[48] Y. Di, Z. Jiang, and H. Zhang, “A public dataset for fine-grained ship classification in optical remote sensing images,” Remote Sens., vol. 13, no. 4, p. 747, 2021.

[49] Y. Xiang, J. Chen, Z. Hong, N. Jiao, F. Wang, H. You, and X. Tong, “OSDataset2.0: SAR-optical image matching dataset and evaluation benchmark,” J. Radars, vol. 14, pp. 1–13, 2025.

[50] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proc. 18th SIGSPATIAL Int. Conf. Adv. Geogr. Inf. Syst., 2010, pp. 270–279.

[51] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître, “Structural high-resolution satellite image indexing,” in Proc. ISPRS TC VII Symp., vol. 38, 2010, pp. 298–303.

[52] Y. Li, Y. Zhang, X. Huang, and J. Ma, “Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6521–6536, 2018.

[53] Y. Sun, S. Feng, Y. Ye, X. Li, and J. Kang, “A deep cross-modal hashing technique for large-scale SAR and VHR image retrieval,” in Proc. BIGSARDATA. IEEE, 2021, pp. 1–4.

[54] W. Xiong, Z. Xiong, Y. Cui, L. Huang, and R. Yang, “An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 6, pp. 2696–2712, 2022.

[55] W. Xu, X. Yuan, Q. Hu, and J. Li, “SAR-optical feature matching: A large-scale patch dataset and a deep local descriptor,” Int. J. Appl. Earth Obs. Geoinf., vol. 122, p. 103433, 2023.

[56] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y. Feng, T. Xu et al., “FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,” ISPRS J. Photogramm. Remote Sens., vol. 184, pp. 116–130, 2022.

[57] X. Hou, W. Ao, Q. Song et al., “FUSAR-Ship: Building a high-resolution SAR-AIS matchup dataset of Gaofen-3 for ship detection and recognition,” Sci. China Inf. Sci., vol. 63, no. 4, p. 140303, 2020.

[58] W. Zhirui, K. Yuzhuo, Z. Xuan, W. Yuelei, Z. Ting, and S. Xian, “SAR-AIRcraft-1.0: High-resolution SAR aircraft detection and recognition dataset,” J. Radars, vol. 12, no. 4, pp. 906–922, 2023.

[59] S. Xian, W. Zhirui, S. Yuanrui, D. Wenhui, Z. Yue, and F. Kun, “AIR-SARShip-1.0: High-resolution SAR ship detection dataset,” J. Radars, vol. 8, no. 6, pp. 852–863, 2019.

[60] Y. Wu, Y. Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y. Zhao, K. Fu, and X. Sun, “FAIR-CSAR: A benchmark dataset for fine-grained object detection and recognition based on single-look complex SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–22, 2025.

[61] B.-H. Chen, Y.-S. Tseng, and J.-L. Yin, “Gaussian-adaptive bilateral filter,” IEEE Signal Process. Lett., vol. 27, pp. 1670–1674, 2020.

[62] T. Ma, L. Li, S. Ji, X. Wang, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, “Optimized Laplacian image sharpening algorithm based on graphic processing unit,” Physica A, vol. 416, pp. 400–410, 2014.

[63] F. Qiu, J. Berglund, J. R. Jensen, P. Thakkar, and D. Ren, “Speckle noise reduction in SAR imagery using a local adaptive median filter,” GIScience Remote Sens., vol. 41, no. 3, pp. 244–266, 2004.

[64] I. M. Mohammed and N. A. M. Isa, “Contrast limited adaptive local histogram equalization method for poor contrast image enhancement,” IEEE Access, 2025.

[65] Z. Huang, L. Liu, S. M. Yang, Z. Wang, G. Cheng, and J. Han, “Physics-guided detector for SAR airplanes,” IEEE Trans. Circuits Syst. Video Technol., 2025.

[66] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 30, 2017.

[67] Y. Zhang and H. Wang, “Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification,” in Proc. IEEE/CVF CVPR, June 2023, pp. 2153–2162.

[68] B. Liu, R. Huang, X. Pan, C. Li, J. Sun, J. Dong, and X. Wang, “Advancing ship re-identification in the wild: The ShipReID-2400 benchmark dataset and D2InterNet baseline method,” in Proc. 48th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2025, pp. 106–115.

[69] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE CVPR, 2009, pp. 248–255.