Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

Erdem Akagündüz; Irem Ulku; Ömer Özgür Tanrıöver

arxiv: 2604.15856 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 09:02 UTC · model grok-4.3

keywords: multimodal semantic segmentation · missing modalities · structured latent projection · remote sensing · modality-specific information · adaptive transfer · complementary information

The pith

CBC-SLP structures latent representations into shared and modality-specific parts with adaptive transfer to keep complementary information available under both full and missing modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing multimodal segmentation approaches lose modality-specific details when they force everything into one shared representation, and that baking a split into shared and specific components directly into the network architecture avoids this loss while still handling random modality dropouts. It does so by projecting features into those two kinds of components and routing the specific ones only when their source modality is present. A reader should care because real deployments of multispectral sensors routinely face missing channels from clouds, failures, or cost, yet most current models either degrade when all data arrive or fail when some do not.

Core claim

By decomposing the latent space into shared and modality-specific components and routing the specific components to the decoder only according to the current availability mask, the model recovers complementary information that a single shared representation discards, yielding higher segmentation accuracy on three remote-sensing benchmarks whether all modalities are present or some are randomly absent.

What carries the argument

Structured latent projection that splits encoder outputs into a shared component and per-modality specific components, then adaptively transfers the specific parts to the decoder according to the modality availability mask.
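
The split-and-route pattern can be illustrated with a minimal sketch. It is not the authors' implementation: CBC-SLP learns its projections, whereas here a fixed index split stands in for the projection, shared parts are averaged over the available modalities, and specific parts of absent modalities are zero-filled. The function names, the `shared_dim` parameter, and the modality names `rgb` and `nir` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def split_latent(z: Sequence[float], shared_dim: int) -> Tuple[List[float], List[float]]:
    """Stand-in for a learned projection: the first `shared_dim` entries are
    treated as the shared part, the rest as the modality-specific part."""
    return list(z[:shared_dim]), list(z[shared_dim:])


def route_to_decoder(
    latents: Dict[str, Sequence[float]],
    available: Dict[str, bool],
    shared_dim: int,
) -> List[float]:
    """Build a decoder input from per-modality latents and an availability mask.

    Assumptions (not from the paper): shared parts are averaged over the
    available modalities; specific parts of absent modalities are zero-filled
    so the decoder input keeps a fixed length.
    """
    shared_parts: List[List[float]] = []
    specific_out: List[float] = []
    for name, z in latents.items():  # dicts preserve insertion order
        shared, specific = split_latent(z, shared_dim)
        if available[name]:
            shared_parts.append(shared)
            specific_out.extend(specific)
        else:
            specific_out.extend([0.0] * len(specific))
    if not shared_parts:
        raise ValueError("at least one modality must be available")
    shared_mean = [sum(col) / len(shared_parts) for col in zip(*shared_parts)]
    return shared_mean + specific_out


# Toy latents, chosen only to make the routing visible.
latents = {"rgb": [1.0, 2.0, 3.0, 4.0], "nir": [3.0, 4.0, 5.0, 6.0]}
full = route_to_decoder(latents, {"rgb": True, "nir": True}, shared_dim=2)
rgb_only = route_to_decoder(latents, {"rgb": True, "nir": False}, shared_dim=2)
```

With both modalities present, the decoder input carries the averaged shared part followed by both specific parts. With `nir` absent, only the `rgb` shared part is used and the `nir` specific slot is zeroed, so the decoder still receives a vector of the same length.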

If this is right

Segmentation remains accurate even when sensors fail or data are incomplete.
Modality-specific details that would be averaged away in a shared embedding can still contribute to the final map.
The inductive bias is enforced by architecture rather than an extra loss term.
The same split-and-route pattern works across full-modality and partial-modality test regimes without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be tested on other multimodal tasks where perfect cross-modal alignment is known to hurt downstream performance.
One could measure how much of the recovered accuracy comes from each modality-specific branch by ablating them individually.
The approach may reduce the need for modality-specific data augmentation or imputation modules.
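
The per-branch ablation suggested above could be organized as a leave-one-branch-out loop. The sketch below assumes a hypothetical `evaluate` callback that returns a segmentation score for a given set of enabled modality-specific branches; the toy scorer and its numbers are invented purely to exercise the function and say nothing about the paper's results.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def branch_contributions(
    evaluate: Callable[[FrozenSet[str]], float],
    modalities: Iterable[str],
) -> Dict[str, float]:
    """Score drop when one modality-specific branch is disabled at a time.

    `evaluate` is a hypothetical callback that returns a segmentation score
    given the set of specific branches left enabled.
    """
    all_branches = frozenset(modalities)
    baseline = evaluate(all_branches)
    return {
        name: baseline - evaluate(all_branches - {name})
        for name in sorted(all_branches)
    }


# Toy scorer with made-up numbers, used only to exercise the function.
def toy_evaluate(enabled: FrozenSet[str]) -> float:
    gains = {"rgb": 0.05, "nir": 0.02}
    return 0.5 + sum(gains[name] for name in enabled)


drops = branch_contributions(toy_evaluate, ["rgb", "nir"])
```

A larger drop for one branch would indicate that more of the recovered accuracy flows through that modality's specific component.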

Load-bearing premise

That splitting latents into shared and modality-specific parts and routing them according to the availability mask will preserve complementary information without creating new accuracy losses or needing extensive tuning of the split itself.

What would settle it

An experiment on the same three datasets in which CBC-SLP shows no accuracy gain over a pure shared-representation baseline when modalities are fully available, or fails to improve under dropout.
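
Deciding whether there is "no accuracy gain" requires a comparison that accounts for run-to-run noise. One possible check, sketched here under assumptions, is an exact paired sign-flip permutation test on per-seed scores of CBC-SLP and a shared-representation baseline. The function name is hypothetical and the scores are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_sign_flip_test(model: Sequence[float], baseline: Sequence[float]) -> Dict[str, float]:
    """Exact two-sided sign-flip permutation test on paired per-run scores."""
    if len(model) != len(baseline) or len(model) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [m - b for m, b in zip(model, baseline)]
    observed = abs(sum(diffs))
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate all 2^n sign assignments.
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return {
        "mean_diff": mean(diffs),
        "std_diff": stdev(diffs),
        "p_value": hits / total,
    }


# Placeholder scores for three hypothetical seeds, not results from the paper.
result = paired_sign_flip_test([3.0, 5.0, 7.0], [2.0, 3.0, 4.0])
```

Because the smallest attainable two-sided p-value with n paired runs is 2 / 2^n, at least six runs per setting are needed before this test can fall below 0.05.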

Figures

Figures reproduced from arXiv: 2604.15856 by Erdem Akagündüz, Irem Ulku, Ömer Özgür Tanrıöver.

**Figure 2.** Cross-modality t-distributed stochastic neighbor embedding distributions on DSTL, Potsdam, and Hunan image sets.

**Figure 3.** Framework of the proposed CBC-SLP model that learns structured multimodal feature representations for robust semantic ...

**Figure 4.** Qualitative semantic segmentation comparison on the DSTL image set under different modality-availability settings.

**Figure 5.** Qualitative semantic segmentation comparison on the Potsdam image set under different modality-availability settings.

**Figure 6.** Qualitative semantic segmentation comparison on the Hunan image set under different modality-availability settings.

**Figure 7.** Information gap across different modality availability ...

Original abstract

Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CBC-SLP splits latents into shared and modality-specific parts as an architectural choice to avoid the usual shared-rep trade-off in missing-modality segmentation.

The paper's main contribution is CBC-SLP, an architecture that projects latents into structured shared and modality-specific parts, then routes them adaptively to the decoder using the availability mask. This is presented as an inductive bias drawn from modality alignment theory, avoiding the need for a separate loss term. The authors report better segmentation results than existing multimodal models on three remote sensing datasets, both when all modalities are present and when some are dropped. The experiments also suggest the model preserves complementary information that shared representations tend to lose. Public code is a good sign for reproducibility.

Soft spots include limited information in the abstract about the experimental setup, such as how baselines were selected or whether results include error bars and significance tests. The projection mechanism might require careful design choices that could affect generality, though nothing indicates a major issue with the core logic. The central claims seem to hold up based on the described construction.

This is relevant for anyone working on robust semantic segmentation with multispectral or multimodal data where sensor reliability varies. Readers focused on inductive biases in multimodal learning would get direct value from the design. I recommend putting it through peer review. The idea is clear, the problem is real, and the empirical support looks promising enough to justify referee time.

Referee Report

2 major / 2 minor

Summary. The paper introduces CBC-SLP, a multimodal semantic segmentation architecture for remote sensing imagery that structures latent representations into shared and modality-specific components. These are adaptively transferred to the decoder according to a random modality availability mask. The design is motivated by theoretical observations that perfectly aligned multimodal representations can be suboptimal for downstream tasks; the inductive bias is embedded in the architecture rather than enforced via an auxiliary loss. The authors report that CBC-SLP outperforms existing state-of-the-art multimodal models on three remote-sensing datasets under both full-modality and random missing-modality regimes, and provide empirical evidence that the structured projection recovers complementary modality-specific information that shared-representation baselines discard.

Significance. If the reported gains prove robust under rigorous controls, the work is significant for practical multimodal remote-sensing applications where sensor dropout is routine. Encoding the shared/specific decomposition directly in the architecture (rather than via loss terms) is a clean inductive-bias choice that aligns with the cited theoretical motivation. Public code release supports reproducibility. The empirical demonstration that complementary information can be recovered without sacrificing full-modality performance challenges the prevailing shared-representation paradigm and could influence future multimodal segmentation designs.

major comments (2)

[Experiments section] The central claim of consistent outperformance across full and missing-modality regimes is load-bearing, yet the abstract (and presumably the experimental write-up) provides no information on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is impossible to determine whether the reported gains exceed experimental noise.
[Method section, structured latent projection] The adaptive transfer mechanism according to the modality availability mask is the key architectural novelty. The manuscript should explicitly state how the mask is sampled during training (e.g., uniform random dropout rate per modality or per sample) and whether the same mask distribution is used at test time; any mismatch would undermine the robustness claims.
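
To make the second comment concrete, the following sketch shows one way a mask protocol could be specified. It is an assumption for illustration, not the sampling scheme used in the paper: each modality is kept independently with probability `keep_prob` during training, the all-missing case is resampled, and evaluation enumerates every non-empty availability pattern. Function and parameter names are hypothetical.

```python
import random
from itertools import product
from typing import List, Tuple


def sample_mask(num_modalities: int, keep_prob: float, rng: random.Random) -> Tuple[bool, ...]:
    """Draw an independent keep/drop flag per modality, resampling the
    all-missing case so at least one modality is always present."""
    if num_modalities < 1:
        raise ValueError("num_modalities must be at least 1")
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    while True:
        mask = tuple(rng.random() < keep_prob for _ in range(num_modalities))
        if any(mask):
            return mask


def all_test_scenarios(num_modalities: int) -> List[Tuple[bool, ...]]:
    """Enumerate every non-empty availability pattern for evaluation."""
    return [m for m in product((True, False), repeat=num_modalities) if any(m)]


rng = random.Random(0)  # fixed seed so the training masks are reproducible
train_masks = [sample_mask(3, keep_prob=0.5, rng=rng) for _ in range(1000)]
scenarios = all_test_scenarios(3)  # 2^3 - 1 = 7 patterns
```

Reporting both the training-time `keep_prob` and the exact list of test-time patterns would let a reader check whether the two distributions match.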

minor comments (2)

[Abstract] The acronym CBC-SLP is introduced without expansion; the abstract should spell out the full name on first use.
[Introduction / Experiments] The abstract refers to the three datasets only generically, while the figure captions identify them as DSTL, Potsdam, and Hunan; the manuscript should name them and cite the corresponding references in the introduction or experimental setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comments point by point below.

Referee: [Experiments section] The central claim of consistent outperformance across full and missing-modality regimes is load-bearing, yet the abstract (and presumably the experimental write-up) provides no information on the number of independent runs, standard deviations, or statistical significance tests. Without these, it is impossible to determine whether the reported gains exceed experimental noise.

Authors: We agree that including the number of independent runs, standard deviations, and statistical significance tests would make the performance claims more robust. We will revise the Experiments section to add these details from our experimental setup.
Revision: yes

Referee: [Method section, structured latent projection] The adaptive transfer mechanism according to the modality availability mask is the key architectural novelty. The manuscript should explicitly state how the mask is sampled during training (e.g., uniform random dropout rate per modality or per sample) and whether the same mask distribution is used at test time; any mismatch would undermine the robustness claims.

Authors: Thank you for pointing this out. We will update the Method section to explicitly describe the sampling of the random modality availability mask during training and confirm that the same distribution is used at test time.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper proposes CBC-SLP as a new architectural inductive bias that structures latent representations into shared and modality-specific components with mask-adaptive transfer, directly encoding the motivation about preserving complementary information. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed known results; the central claims rest on empirical validation across full and missing-modality regimes on three datasets rather than any self-referential derivation. The reference to theoretical results on modality alignment is presented as external inspiration, not a self-citation chain or uniqueness theorem imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are detailed beyond the general neural network training process and the cited theoretical results on modality alignment.

axioms (1)

Domain assumption: Perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks.
Invoked in the abstract as inspiration for avoiding full alignment via shared representations.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
cs.CV · 2026-05 · unverdicted · novelty 6.0

A method that derives a sampling distribution over modality-missing scenarios from latent-space distortions improves fine-tuning performance for multimodal semantic segmentation on remote sensing datasets compared to ...


S. Zhou, Y . Feng, S. Li, D. Zheng, F. Fang, Y . Liu, and B. Wan, “Dsm- assisted unsupervised domain adaptive network for semantic segmenta- tion of remote sensing imagery,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023

work page 2023

[9] [9]

Dkdfn: Domain knowledge-guided deep collaborative fusion network for mul- timodal unitemporal remote sensing land cover classification,

Y . Li, Y . Zhou, Y . Zhang, L. Zhong, J. Wang, and J. Chen, “Dkdfn: Domain knowledge-guided deep collaborative fusion network for mul- timodal unitemporal remote sensing land cover classification,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 186, pp. 170– 189, 2022

work page 2022

[10] [10]

Segcr: A multimodal and multitask complementary fusion network for remote sensing semantic segmentation and cloud removal,

S. Wu, J. Zhu, Y . Gu, W. Han, W. Jiang, and J. Geng, “Segcr: A multimodal and multitask complementary fusion network for remote sensing semantic segmentation and cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, 2025

work page 2025

[11] [11]

Multimodal heterogeneous hypergraph learning for incomplete multimodal semantic segmentation of remote sensing images,

W. Han, J. Geng, Z. Xu, and W. Jiang, “Multimodal heterogeneous hypergraph learning for incomplete multimodal semantic segmentation of remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, 2025

work page 2025

[12] [12]

Multimodal learning with transform- ers: A survey,

P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transform- ers: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12 113–12 132, 2023

work page 2023

[13] [13]

Multimodal classification of remote sensing images: A review and future directions,

L. G ´omez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, “Multimodal classification of remote sensing images: A review and future directions,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1560–1584, 2015

work page 2015

[14] [14]

Missing data reconstruction in remote sensing image with a unified spatial–temporal– spectral deep convolutional neural network,

Q. Zhang, Q. Yuan, C. Zeng, X. Li, and Y . Wei, “Missing data reconstruction in remote sensing image with a unified spatial–temporal– spectral deep convolutional neural network,”IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 8, pp. 4274–4288, 2018

work page 2018

[15] [15]

Msh-net: Modality- shared hallucination with joint adaptation distillation for remote sensing image classification using missing modalities,

S. Wei, Y . Luo, X. Ma, P. Ren, and C. Luo, “Msh-net: Modality- shared hallucination with joint adaptation distillation for remote sensing image classification using missing modalities,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023

work page 2023

[16] [16]

Challenges and opportunities of multimodality and data fusion in remote sensing,

M. Dalla Mura, S. Prasad, F. Pacifici, P. Gamba, J. Chanussot, and J. A. Benediktsson, “Challenges and opportunities of multimodality and data fusion in remote sensing,”Proceedings of the IEEE, vol. 103, no. 9, pp. 1585–1601, 2015

work page 2015

[17] [17]

Decrecnet: a decoupling-reconstruction network for restoring the missing information of optical remote sensing images,

W. Liu, H. Cui, Y . Jiang, G. Zhang, X. Li, H. Li, Y . Chen, and J. Yang, “Decrecnet: a decoupling-reconstruction network for restoring the missing information of optical remote sensing images,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 9777–9801, 2023

work page 2023

[18] [18]

Missing information reconstruction integrating isophote constraint and color-structure control for remote sensing data,

X. Yu, J. Pan, J. Xu, and M. Wang, “Missing information reconstruction integrating isophote constraint and color-structure control for remote sensing data,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 208, pp. 261–278, 2024

work page 2024

[19] [19]

Em-eof: Gap-filling in incom- plete sar displacement time series,

A. Hippert-Ferrer, Y . Yan, and P. Bolon, “Em-eof: Gap-filling in incom- plete sar displacement time series,”IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 7, pp. 5794–5811, 2020

work page 2020

[20] [20]

Remote sensing meta modal representation for missing modality land cover mapping: From earthmiss dataset to metars method,

Y . Zhou, A. Ma, J. Wang, Z. Chen, and Y . Zhong, “Remote sensing meta modal representation for missing modality land cover mapping: From earthmiss dataset to metars method,”Remote Sensing of Environment, vol. 333, p. 115132, 2026

work page 2026

[21] [21]

A novel approach to incomplete multimodal learning for remote sensing data fusion,

Y . Chen, M. Zhao, and L. Bruzzone, “A novel approach to incomplete multimodal learning for remote sensing data fusion,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

work page 2024

[22] [22]

Robust multimodal learning with missing modalities via parameter-efficient adaptation,

M. K. Reza, A. Prater-Bennette, and M. S. Asif, “Robust multimodal learning with missing modalities via parameter-efficient adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 742–754, 2025

work page 2025

[23] [23]

Robsense: A robust multi-modal foundation model for remote sensing with static, temporal, and incomplete data adaptability,

M. K. Do, K. Han, P. Lai, K. T. Phan, and W. Xiang, “Robsense: A robust multi-modal foundation model for remote sensing with static, temporal, and incomplete data adaptability,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 7427–7436

work page 2025

[24] [24]

Understanding and constructing latent modal- ity structures in multi-modal representation learning,

Q. Jiang, C. Chen, H. Zhao, L. Chen, Q. Ping, S. D. Tran, Y . Xu, B. Zeng, and T. Chilimbi, “Understanding and constructing latent modal- ity structures in multi-modal representation learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7661–7671

work page 2023

[25] [25]

Simmlm: A simple framework for multi- modal learning with missing modality,

S. Li, C. Chen, and J. Han, “Simmlm: A simple framework for multi- modal learning with missing modality,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 24 068–24 077

work page 2025

[26] [26]

Freefusion: Infrared and visible image fusion via cross reconstruction learning,

W. Zhao, H. Cui, H. Wang, Y . He, and H. Lu, “Freefusion: Infrared and visible image fusion via cross reconstruction learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 9, pp. 8040–8056, 2025

work page 2025

[27] [27]

Cross-band correlation- aware interactive fusion for multispectral images,

I. Ulku, O. O. Tanriover, and E. Akag ¨und¨uz, “Cross-band correlation- aware interactive fusion for multispectral images,”IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2025

work page 2025

[28] [28]

Boosting multimodal learning via disentan- gled gradient learning,

S. Wei, C. Luo, and Y . Luo, “Boosting multimodal learning via disentan- gled gradient learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22 879–22 888

work page 2025

[29] [29]

Ea-gans: edge-aware generative adversarial networks for cross-modality mr image synthesis,

B. Yu, L. Zhou, L. Wang, Y . Shi, J. Fripp, and P. Bourgeat, “Ea-gans: edge-aware generative adversarial networks for cross-modality mr image synthesis,”IEEE transactions on medical imaging, vol. 38, no. 7, pp. 1750–1762, 2019

work page 2019

[30] [30]

Rfnet: Region-aware fusion network for incomplete multi-modal brain tumor segmentation,

Y . Ding, X. Yu, and Y . Yang, “Rfnet: Region-aware fusion network for incomplete multi-modal brain tumor segmentation,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3975–3984

work page 2021

[31] [31]

Kmd: Koopman multi-modality decomposition for generalized brain tumor segmentation under incom- plete modalities,

T. Liu, H. Jiang, and K. Huang, “Kmd: Koopman multi-modality decomposition for generalized brain tumor segmentation under incom- plete modalities,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15 663–15 671

work page 2025

[32] [32]

Mmanet: Margin-aware distillation and modality-aware regularization for incomplete multimodal learning,

S. Wei, C. Luo, and Y . Luo, “Mmanet: Margin-aware distillation and modality-aware regularization for incomplete multimodal learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 039–20 049

work page 2023

[33] [33]

Incomplete multi-modal brain tumor segmentation via learnable sorting state space model,

Z. Zhang, Y . Lu, F. Ma, Y . Zhang, H. Yue, and X. Sun, “Incomplete multi-modal brain tumor segmentation via learnable sorting state space model,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25 982–25 992

work page 2025

[34] [34]

Mmmvit: Multiscale multimodal vision transformer for brain tumor segmentation with missing modalities,

C. Qiu, Y . Song, Y . Liu, Y . Zhu, K. Han, V . S. Sheng, and Z. Liu, “Mmmvit: Multiscale multimodal vision transformer for brain tumor segmentation with missing modalities,”Biomedical Signal Processing and Control, vol. 90, p. 105827, 2024

work page 2024

[35] [35]

DSTL Satellite Imagery Feature Detection,

D. S. I. F. Detection, “DSTL Satellite Imagery Feature Detection,” Kaggle competition, 2016, [Online]. Available: https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection. Accessed: Jan. 29, 2026

work page 2016

[36] [36]

Band reconstruction using a modified unet for sentinel-2 images,

I. C. Neagoe, D. Faur, C. Vaduva, and M. Datcu, “Band reconstruction using a modified unet for sentinel-2 images,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 6739–6757, 2023

work page 2023

[37] [37]

2D Semantic Labeling Contest: Potsdam,

ISPRS, “2D Semantic Labeling Contest: Potsdam,” ISPRS Benchmark Datasets (UrbanSemLab), 2014, [Online]. Available: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab. Accessed: Jan. 30, 2026

work page 2014

[38] [38]

Dsm building shape refine- ment from combined remote sensing images based on wnet-cgans,

K. Bittner, M. K ¨orner, and P. Reinartz, “Dsm building shape refine- ment from combined remote sensing images based on wnet-cgans,” in IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019, pp. 783–786

work page 2019

[39] [39]

SRTM 90m digital elevation data,

CGIAR-CSI, “SRTM 90m digital elevation data,” Website, [Online]. Available: https://srtm.csi.cgiar.org/. Accessed: Jan. 31, 2026

work page 2026

[40] [40]

A minimax approach to supervised learning,

F. Farnia and D. Tse, “A minimax approach to supervised learning,” Advances in Neural Information Processing Systems, vol. 29, 2016

work page 2016

[41] [41]

Fundamental limits and tradeoffs in invariant represen- tation learning,

H. Zhao, C. Dan, B. Aragam, T. S. Jaakkola, G. J. Gordon, and P. Ravikumar, “Fundamental limits and tradeoffs in invariant represen- tation learning,”Journal of machine learning research, vol. 23, no. 340, pp. 1–49, 2022

work page 2022

[42] [42]

Weighted intersection over union (wiou) for evaluating image segmentation,

Y .-J. Cho, “Weighted intersection over union (wiou) for evaluating image segmentation,”Pattern Recognition Letters, vol. 185, pp. 101–107, 2024

work page 2024

[43] [43]

Multisenseseg: A cost-effective unified multimodal semantic segmentation model for remote sensing,

Q. Wang, W. Chen, Z. Huang, H. Tang, and L. Yang, “Multisenseseg: A cost-effective unified multimodal semantic segmentation model for remote sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–24, 2024

work page 2024

[44] [44]

Cfformer: A cross-fusion transformer framework for the semantic seg- mentation of multi-source remote sensing images,

J. Zhao, M. Zhang, Z. Zhou, Z. Wang, F. Lang, H. Shi, and N. Zheng, “Cfformer: A cross-fusion transformer framework for the semantic seg- mentation of multi-source remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, 2024

work page 2024

[45] [45]

Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,

J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” IEEE Transactions on intelligent transportation systems, vol. 24, no. 12, pp. 14 679–14 694, 2023

work page 2023

[46] [46]

Delivering arbitrary-modal semantic segmentation,

J. Zhang, R. Liu, H. Shi, K. Yang, S. Reiß, K. Peng, H. Fu, K. Wang, and R. Stiefelhagen, “Delivering arbitrary-modal semantic segmentation,” 15 inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1136–1147

work page 2023

[47] [47]

Dformerv2: Geometry self-attention for rgbd semantic segmentation,

B.-W. Yin, J.-L. Cao, M.-M. Cheng, and Q. Hou, “Dformerv2: Geometry self-attention for rgbd semantic segmentation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 345– 19 355

work page 2025

[48] [48]

Missing modality robustness in semi-supervised multi-modal semantic segmentation,

H. Maheshwari, Y .-C. Liu, and Z. Kira, “Missing modality robustness in semi-supervised multi-modal semantic segmentation,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2024, pp. 1020–1030. Irem Ulkureceived B.Sc. degrees in both Electron- ics and Communication Engineering and in Indus- trial Engineering from C ¸ ank...

work page 2024