arxiv: 2509.23310 · v3 · submitted 2025-09-27 · 💻 cs.CV

Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification

Pith reviewed 2026-05-18 12:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal remote sensing · diffusion models · feature fusion · land-cover classification · multi-branch network · mutual learning · modality masking

The pith

A balanced diffusion-guided fusion framework uses modality-masked diffusion features to hierarchically guide a multi-branch network and improve multimodal remote sensing classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that an adaptive modality masking strategy can prevent one sensor from dominating when training diffusion models on multimodal remote sensing data. These balanced diffusion features then serve as hierarchical guides for extracting features in a multi-branch network that combines convolutional, state-space, and attention-based processing, supported by mutual learning across branches. If successful, this would allow more effective use of complementary information from different sensors in land-cover classification tasks, which are important for monitoring changes in the environment and managing resources. Experiments on four datasets show the approach outperforms prior methods.

Core claim

The authors claim that their balanced diffusion-guided fusion framework corrects modality imbalance in pre-trained multimodal DDPMs through an adaptive modality masking strategy. The resulting diffusion features hierarchically guide feature extraction in a multi-branch network with CNN, Mamba, and transformer components, using feature fusion, group channel attention, and cross-attention. A mutual learning strategy aligns the branches by matching their probability entropy and feature similarity. Together, the authors report, these components deliver superior classification performance across four multimodal remote sensing datasets.
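The mutual learning idea (aligning probability entropy and feature similarity across branches) can be illustrated with a minimal sketch. This page does not give the paper's loss, so the squared entropy gap, the cosine term, the weights `w_entropy` and `w_feature`, and the function name `mutual_learning_penalty` are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def mutual_learning_penalty(branch_logits, branch_features,
                            w_entropy=1.0, w_feature=1.0):
    """Hypothetical pairwise penalty: small when branches agree.

    branch_logits / branch_features: dicts keyed by branch name.
    Averages, over all branch pairs, the squared gap in prediction
    entropy plus (1 - cosine similarity) of the branch features.
    """
    names = sorted(branch_logits)
    entropies = {n: entropy(softmax(branch_logits[n])) for n in names}
    total, pairs = 0.0, 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            entropy_gap = (entropies[a] - entropies[b]) ** 2
            feature_gap = 1.0 - cosine_similarity(branch_features[a],
                                                  branch_features[b])
            total += w_entropy * entropy_gap + w_feature * feature_gap
            pairs += 1
    return total / pairs if pairs else 0.0
```

In this sketch the penalty is close to zero when all branches produce the same logits and features, and grows as their confidence levels or feature directions diverge.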

What carries the argument

The balanced diffusion-guided fusion (BDGF) framework, which applies adaptive modality masking to DDPMs for balanced features and uses those features to hierarchically guide a multi-branch network through fusion, attention, and mutual learning.
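Cross-attention is one of the three guidance mechanisms named here. As a rough sketch of how diffusion features could guide a branch, the code below lets branch tokens act as queries over diffusion tokens and adds the attended result back as a residual. It omits the learned query/key/value projections a real layer would have, and the function names and the residual design are assumptions rather than the paper's architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended

def guide_with_diffusion(branch_tokens, diffusion_tokens):
    """Hypothetical guidance step: branch tokens query the diffusion
    tokens, and the attended features are added as a residual."""
    attended = cross_attention(branch_tokens, diffusion_tokens,
                               diffusion_tokens)
    return [[b + a for b, a in zip(b_tok, a_tok)]
            for b_tok, a_tok in zip(branch_tokens, attended)]
```

Each output token is the original branch token plus a convex combination of the diffusion tokens, so the guidance can shift branch features toward the diffusion representation without replacing them.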

Load-bearing premise

The adaptive modality masking strategy successfully produces a modality-balanced data distribution in the DDPMs without discarding critical complementary information from any sensor.
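To make the premise concrete, here is a minimal sketch of masking the dominant modality before the standard DDPM forward step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The page does not spell out the paper's masking rule, so the linear schedule, its `p_start` and `p_end` values, the element-wise zeroing, and every function name below are assumptions chosen for illustration.

```python
import math
import random

def mask_probability(iteration, total_iterations, p_start=0.5, p_end=0.0):
    """Hypothetical linear schedule: mask heavily early, not at all late."""
    frac = min(max(iteration / total_iterations, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

def mask_modality(pixels, prob, rng):
    """Zero each element independently with probability `prob`."""
    return [0.0 if rng.random() < prob else x for x in pixels]

def forward_diffuse(x0, alpha_bar_t, rng):
    """Standard DDPM forward step with cumulative signal rate alpha_bar_t."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

def masked_noisy_input(spectral, auxiliary, iteration, total_iterations,
                       alpha_bar_t, seed=0):
    """Mask only the (assumed dominant) spectral modality, concatenate it
    with the auxiliary modality, then apply forward diffusion."""
    rng = random.Random(seed)
    prob = mask_probability(iteration, total_iterations)
    masked_spectral = mask_modality(spectral, prob, rng)
    return forward_diffuse(masked_spectral + auxiliary, alpha_bar_t, rng)
```

The worry stated in the premise shows up directly in this sketch: every zeroed spectral element is information the denoiser never sees in that step, so the schedule has to balance reduced dominance against lost detail.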

What would settle it

Running the full classification experiments on the four multimodal datasets both with and without the adaptive modality masking step and checking for a significant drop in accuracy when masking is removed would test whether the balancing is essential to the gains.

Figures

Figures reproduced from arXiv: 2509.23310 by Hao Liu, Lorenzo Bruzzone, Maoguo Gong, Mingyang Zhang, Yongjie Zheng, Yuhan Kang.

Figure 1: Illustration of the proposed BDGF framework.
Figure 2: Structure of the adaptive modality masking strategy. In the forward diffusion process, the strategy consists of adding an iteration-varying structure mask and sample mask …
Figure 4: Flowchart of diffusion features guide CNN-based network.
Figure 6: Flowchart of the proposed Mamba-based network.
Figure 7: Illustration of the mutual learning module.
Figure 9: 2D t-SNE embeddings of per-branch features on the LCZ HK dataset. (a)–(c) represents features from the CNN, Transformer, and Mamba branches, respectively.
Figure 11: Classification maps and OA% obtained on the Berlin dataset using several …
Figure 12: Classification maps and OA% obtained on the Yellow River Estuary dataset using …
Figure 13: Classification maps and OA% obtained on the LCZ HK dataset using several …
Figure 14: OA% versus the number of labeled samples on the four considered datasets.
Original abstract

Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Balanced Diffusion-Guided Fusion (BDGF) framework for multimodal remote sensing land-cover classification. It proposes an adaptive modality masking strategy for pre-training multimodal DDPMs so that they learn a modality-balanced rather than spectral-dominated data distribution. The resulting diffusion features then hierarchically guide feature extraction in a multi-branch network (CNN, Mamba, transformer) via feature fusion, group channel attention, and cross-attention. A mutual learning strategy aligns probability entropy and feature similarity across branches. Superior classification performance is reported on four public multimodal remote sensing datasets, with code released.

Significance. If the central claims hold, the work provides a timely engineering contribution by explicitly addressing modality imbalance in diffusion pre-training for remote sensing and combining it with recent architectures like Mamba for hierarchical guidance. The public code release is a clear strength supporting reproducibility. The significance hinges on whether the balanced diffusion features deliver gains beyond what the multi-branch architecture alone would achieve.

major comments (2)
  1. [Method (adaptive modality masking)] The adaptive modality masking strategy (described in the abstract and method) is load-bearing for the claim that balanced diffusion features drive the performance gains. No quantitative validation is provided (e.g., per-modality loss statistics, distribution histograms, or ablation comparing masked vs. unmasked pre-training) to confirm that masking equalizes modality influence without discarding critical complementary spectral or structural information from any sensor. If masking trades off unique features, the downstream results could be explained by the multi-branch network alone.
  2. [§4 (experiments)] The manuscript reports superior performance on four datasets but provides no error bars, statistical significance tests, or comprehensive ablation studies isolating the contribution of the masking strategy, hierarchical diffusion guidance, and mutual learning. This weakens verification of the central claim that the balanced diffusion features are responsible for the improvements.
minor comments (2)
  1. [Abstract] The abstract could briefly name the four datasets and report one or two key quantitative metrics (e.g., overall accuracy gains) to strengthen the summary of results.
  2. [Notation and method] Ensure consistent notation for the diffusion features and attention modules across sections; minor inconsistencies in variable naming could confuse readers.
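The per-modality loss statistics requested in major comment 1 could be gathered with a diagnostic along these lines. This is a sketch under stated assumptions: it supposes the denoiser's predicted and true noise are flat vectors whose channel ranges map to modalities, and the names `per_modality_mse`, `imbalance_ratio`, and `modality_slices` are hypothetical, not taken from the paper or its code.

```python
from statistics import mean

def per_modality_mse(pred_noise, true_noise, modality_slices):
    """Mean squared denoising error restricted to each modality's range.

    modality_slices maps a modality name to a (start, stop) index pair.
    """
    losses = {}
    for name, (start, stop) in modality_slices.items():
        errors = [(p - t) ** 2
                  for p, t in zip(pred_noise[start:stop],
                                  true_noise[start:stop])]
        losses[name] = mean(errors)
    return losses

def imbalance_ratio(losses):
    """Largest per-modality loss divided by the smallest one."""
    lowest = min(losses.values())
    highest = max(losses.values())
    return float("inf") if lowest == 0 else highest / lowest
```

Tracking this ratio during pre-training, with and without masking, is one way to check whether one sensor is being fitted much better than the others. A ratio that stays near 1 would be consistent with balanced training, though it would not by itself show that complementary information is preserved.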

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully reviewed the major comments and will incorporate revisions to strengthen the validation of the adaptive modality masking strategy and enhance the experimental analysis with additional statistical measures and ablations.

Point-by-point responses
  1. Referee: [Method (adaptive modality masking)] The adaptive modality masking strategy (described in the abstract and method) is load-bearing for the claim that balanced diffusion features drive the performance gains. No quantitative validation is provided (e.g., per-modality loss statistics, distribution histograms, or ablation comparing masked vs. unmasked pre-training) to confirm that masking equalizes modality influence without discarding critical complementary spectral or structural information from any sensor. If masking trades off unique features, the downstream results could be explained by the multi-branch network alone.

    Authors: We appreciate the referee's emphasis on this point, as the adaptive modality masking is central to our claim of achieving modality-balanced diffusion features. The current manuscript describes the strategy in detail and integrates it into the pre-training process to mitigate spectral dominance. However, we acknowledge that explicit quantitative evidence, such as per-modality loss curves or histograms showing balanced influence, would further substantiate that critical complementary information is preserved. In the revised manuscript, we will add these analyses along with an ablation comparing masked versus unmasked pre-training on the downstream classification task. This will demonstrate that the masking equalizes modality contributions without discarding unique spectral or structural details from any sensor.

    Revision: yes

  2. Referee: [§4 (experiments)] The manuscript reports superior performance on four datasets but provides no error bars, statistical significance tests, or comprehensive ablation studies isolating the contribution of the masking strategy, hierarchical diffusion guidance, and mutual learning. This weakens verification of the central claim that the balanced diffusion features are responsible for the improvements.

    Authors: We agree that the experimental section would benefit from greater statistical rigor and more targeted ablations to isolate each component's contribution. The current results demonstrate consistent superiority across four datasets, but we recognize the value of error bars from multiple runs and significance testing to confirm the gains are not due to the multi-branch architecture alone. In the revised version, we will report mean and standard deviation over multiple independent runs, include statistical significance tests (such as paired t-tests against baselines), and expand the ablation studies to separately evaluate the adaptive modality masking, hierarchical diffusion guidance mechanisms, and mutual learning strategy. These additions will more clearly attribute performance improvements to the balanced diffusion features.

    Revision: yes
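The statistics promised in response 2 (mean and standard deviation over runs, plus a paired t-test against a baseline) can be computed with the standard library alone. The sketch below assumes each run of the method is paired with a baseline run on the same split or seed. The function names are illustrative, and no scores from the paper are used.

```python
import math
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return mean(scores), stdev(scores)

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for matched runs.

    Assumes both lists have the same length (at least 2) and that
    entry i of each list comes from the same split or seed.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1
```

The returned statistic is compared against a Student t distribution with `n - 1` degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` computes the same statistic along with a p-value.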

Circularity Check

0 steps flagged

No significant circularity; framework validated on external public datasets with released code

Full rationale

The paper presents BDGF as an engineering framework: adaptive modality masking during DDPM pre-training, hierarchical diffusion-feature guidance via fusion/attention mechanisms across CNN/Mamba/transformer branches, and mutual learning via entropy/similarity alignment. All central performance claims are tied to empirical results on four independent public multimodal remote sensing datasets rather than any equation or parameter that reduces by construction to the inputs. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation; the method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of the proposed components (masking, hierarchical guidance, mutual learning) whose internal hyperparameters and training details are not specified in the abstract; no explicit free parameters, axioms, or invented entities are named.
