Recognition: unknown
RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
A self-gated Mamba block improves multimodal segmentation by judging each sensor's reliability before fusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicitly modeling modality reliability inside a state-space architecture allows dynamic regulation of cross-modal interactions. The Reliability-aware Self-Gated Mamba Block performs reliability-aware feature selection and aggregation instead of indiscriminate mixing. When combined with local cross-gated modulation, this yields improved semantic segmentation accuracy on both RGB-D and RGB-T tasks.
What carries the argument
The Reliability-aware Self-Gated Mamba Block (RSGMB), which applies self-gating to estimate and leverage per-modality reliability for selective feature aggregation.
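Neither the review nor the abstract gives the gating equations, so the sketch below shows one plausible form of reliability-aware self-gated fusion. The module name, the pooled statistics, and the gate heads are illustrative assumptions, not the authors' actual RSGMB, which additionally embeds the gate inside a Mamba state-space block.

```python
# Minimal sketch of reliability-aware self-gated fusion (hypothetical layer
# names and shapes; NOT the authors' exact RSGMB, whose internals are not
# specified in the abstract).
import torch
import torch.nn as nn

class ReliabilityGatedFusion(nn.Module):
    """Weights each modality's features by a learned reliability gate before fusion."""

    def __init__(self, channels: int):
        super().__init__()
        # One gate head per modality: pooled feature statistics -> scalar in (0, 1).
        self.gate_rgb = nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                                      nn.Linear(channels // 4, 1), nn.Sigmoid())
        self.gate_aux = nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                                      nn.Linear(channels // 4, 1), nn.Sigmoid())

    def forward(self, feat_rgb: torch.Tensor, feat_aux: torch.Tensor) -> torch.Tensor:
        # feat_*: (B, C, H, W) feature maps from the RGB and auxiliary branches.
        stats_rgb = feat_rgb.mean(dim=(2, 3))               # (B, C) global statistics
        stats_aux = feat_aux.mean(dim=(2, 3))
        g_rgb = self.gate_rgb(stats_rgb).view(-1, 1, 1, 1)  # per-sample reliability weight
        g_aux = self.gate_aux(stats_aux).view(-1, 1, 1, 1)
        # Reliability-weighted aggregation instead of indiscriminate summation.
        return g_rgb * feat_rgb + g_aux * feat_aux
```

The review's claims hinge on whether a weight of this kind actually tracks modality reliability rather than acting as generic learned modulation.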
If this is right
- Produces state-of-the-art mIoU of 58.8 percent on NYUDepth V2 and 54.0 percent on SUN-RGBD.
- Achieves 61.1 percent and 88.9 percent mIoU on MFNet and PST900 respectively.
- Maintains these results with a model size of 48.6 million parameters.
- Avoids feature degradation by selectively enhancing only reliable cross-modal information.
Where Pith is reading between the lines
- The gating logic could extend to other multimodal perception tasks such as object detection where sensor quality also varies.
- Online reliability scoring might allow models to adapt to changing conditions without full retraining.
- Systematic ablation on datasets with synthetically varied noise would directly test how much the self-gating contributes.
Load-bearing premise
The self-gating mechanism can accurately estimate modality reliability and use that estimate to improve fusion without introducing new errors or needing heavy per-dataset tuning.
What would settle it
A controlled test in which one modality is artificially corrupted with increasing noise levels; if RSGMamba's accuracy falls below that of a non-gated baseline under heavy corruption, the reliability estimation is not functioning as claimed.
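A minimal harness for that test might look like the following; `evaluate_miou`, the two model objects, and the data loader are hypothetical placeholders the experimenter would supply, and the noise levels are arbitrary.

```python
# Hypothetical stress-test harness: only the corruption step is concrete; the
# models, loader, and metric function are placeholders supplied by the caller.
import torch

def corrupt_aux(batch, sigma):
    """Add zero-mean Gaussian noise (std = sigma) to the auxiliary modality only."""
    rgb, aux, label = batch
    return rgb, aux + sigma * torch.randn_like(aux), label

def corruption_sweep(gated_model, baseline_model, val_loader, evaluate_miou,
                     noise_levels=(0.0, 0.1, 0.2, 0.4, 0.8)):
    """Compare mIoU of a reliability-gated model and a non-gated baseline as the
    auxiliary (depth/thermal) input degrades. The reliability claim predicts the
    gated model degrades more gracefully; if it falls below the baseline at high
    sigma, the gate is not functioning as a reliability estimate."""
    for sigma in noise_levels:
        corrupt = lambda b, s=sigma: corrupt_aux(b, s)
        miou_gated = evaluate_miou(gated_model, val_loader, corrupt)
        miou_plain = evaluate_miou(baseline_model, val_loader, corrupt)
        print(f"sigma={sigma:.1f}  gated={miou_gated:.1f}  baseline={miou_plain:.1f}")
```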
Original abstract
Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhancing informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, resulting 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RSGMamba, a multimodal semantic segmentation framework centered on the Reliability-aware Self-Gated Mamba Block (RSGMB) that uses a learned self-gating mechanism to model per-modality reliability and regulate cross-modal feature interactions, plus a lightweight Local Cross-Gated Modulation (LCGM) module for spatial refinement. It reports state-of-the-art mIoU results on RGB-D benchmarks (NYUDepth V2: 58.8%, SUN-RGBD: 54.0%) and RGB-T benchmarks (MFNet: 61.1%, PST900: 88.9%), with gains of +0.4% to +1.6% over prior best methods and a total of 48.6M parameters.
Significance. If the self-gating mechanism can be shown to explicitly estimate and act on modality reliability (rather than serving as generic modulation), the work would usefully extend state-space models to reliability-aware fusion in multimodal settings. The parameter-efficient design and application to both RGB-D and RGB-T tasks are strengths. However, the modest absolute gains make it essential to isolate the contribution of the reliability component from the Mamba backbone and training choices.
major comments (2)
- [Abstract, §3.2 (RSGMB description)] The central claim that RSGMB 'explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism' is not supported by direct evidence such as gate-value statistics, correlation of gate values with injected noise or misalignment, or controlled experiments that separate the reliability term from generic learned modulation.
- [§4 (Experiments)] The reported improvements (+0.4% / +0.7% on NYUDepth V2 / SUN-RGBD and up to +1.6% on the RGB-T sets) are small; without ablations that remove or replace the self-gating component, significance tests, or error analysis, the gains may stem from the Mamba backbone, hyperparameter choices, or training protocol rather than from reliability awareness.
minor comments (2)
- [Abstract and §3] The abstract and method sections would benefit from a concise equation or pseudocode block summarizing the self-gating operation inside RSGMB to clarify how reliability is computed and applied.
- [§4] Table captions and result tables should explicitly list the exact prior methods and their parameter counts for direct comparison with the reported 48.6M figure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of evidence and experimental rigor. We address each major comment below and have revised the manuscript to incorporate additional supporting analysis and ablations.
Point-by-point responses
-
Referee: [Abstract, §3.2 (RSGMB description)] The central claim that RSGMB 'explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism' is not supported by direct evidence such as gate-value statistics, correlation of gate values with injected noise or misalignment, or controlled experiments that separate the reliability term from generic learned modulation.
Authors: We agree that direct empirical validation of the reliability modeling would strengthen the central claim. The self-gating computes per-modality weights from feature statistics in a manner intended to reflect reliability, but we acknowledge that the original submission lacked explicit supporting visualizations or isolation experiments. In the revised manuscript, we have added gate-value statistics under controlled noise injection and misalignment scenarios in an expanded §3.2 (a sketch of such a check appears after these responses), along with a controlled ablation that replaces the learned self-gating with generic modulation to isolate its effect. These additions directly address the request for evidence. Revision: yes.
-
Referee: [§4 (Experiments)] The reported improvements (+0.4% / +0.7% on NYUDepth V2 / SUN-RGBD and up to +1.6% on the RGB-T sets) are small; without ablations that remove or replace the self-gating component, significance tests, or error analysis, the gains may stem from the Mamba backbone, hyperparameter choices, or training protocol rather than from reliability awareness.
Authors: We recognize that the absolute gains are modest and that stronger isolation of the reliability component is warranted. The original experiments include some component ablations, but we agree they do not fully separate the self-gating from the Mamba backbone or training choices. The revised §4 now includes: (i) an ablation that removes the self-gating or replaces it with fixed or non-reliability-aware alternatives, (ii) mean and standard deviation over multiple random seeds, and (iii) qualitative error analysis on subsets with simulated modality degradation. These results help attribute the observed improvements more specifically to the reliability-aware design. Revision: yes.
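The gate-value analysis promised above could be reported as a single correlation statistic: record the gate assigned to the corrupted modality at each injected noise level and test whether it decreases as corruption grows. A sketch, assuming the per-level gate values have already been collected (for example by hooking the gate output during the corruption sweep sketched earlier):

```python
# Sketch of the promised gate-value analysis. gate_values_per_level must be
# gathered separately (hypothetical data-collection step, not shown here).
import numpy as np
from scipy.stats import spearmanr

def gate_noise_correlation(gate_values_per_level):
    """gate_values_per_level: dict {noise sigma -> list of per-sample gate values
    for the corrupted modality}. Returns the Spearman rank correlation between
    injected noise level and mean gate value; a strongly negative rho supports
    the claim that the gate tracks modality reliability rather than acting as
    generic learned modulation."""
    sigmas = sorted(gate_values_per_level)
    means = [float(np.mean(gate_values_per_level[s])) for s in sigmas]
    rho, p_value = spearmanr(sigmas, means)
    return rho, p_value
```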
Circularity Check
No circularity: empirical architecture with no derivations or self-referential reductions
full rationale
The paper proposes an empirical neural architecture (RSGMamba with RSGMB and LCGM blocks) for multimodal segmentation. No equations, closed-form derivations, or first-principles predictions are present in the abstract or described method. Claims of 'explicitly modeling modality reliability' are architectural descriptions, not reductions of outputs to fitted inputs or self-citations. The performance numbers are reported as experimental results on benchmarks, not predictions derived by construction from the model definition itself. No load-bearing steps reduce to tautology.