
arxiv: 2604.11234 · v1 · submitted 2026-04-13 · 💻 cs.CV

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multispectral object detection · text-guided fusion · RGB-IR alignment · consensus discrepancy modeling · semantic bridge · cross-modal interaction · bi-support fusion

The pith

Text semantics serve as a shared bridge to align RGB and IR features by modeling consensus and discrepancy supports in multispectral detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a semantic bridge fusion framework that treats text as a common semantic anchor to align responses from RGB and infrared images under unified category conditions. It projects recalibrated thermal semantic priors onto the RGB branch for mapping fusion while splitting cross-modal evidence into a regular consensus support and a complementary discrepancy support that supplies additional discriminative cues. These supports are introduced through dynamic recalibration as an inductive bias, and a bidirectional semantic alignment module closes the vision-text guidance loop. The approach moves beyond treating text as mere auxiliary enhancement or relying only on stable consensus in attention-based fusion. Experiments confirm gains in detection accuracy on multispectral benchmarks.

Core claim

Text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion; RGB-IR interaction evidence is formulated into regular consensus support and complementary discrepancy support introduced via dynamic recalibration as a structured inductive bias, together with a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement.
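One way to picture the category-conditioned bridge is the rough NumPy sketch below. It is not the authors' implementation: the cosine-similarity scoring, the softmax prior, the additive projection, and every name in it (`category_responses`, `ir_prior`, `rgb_fused`, the toy sizes) are assumptions made for illustration, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def category_responses(feat, text_emb):
    """Cosine similarity between every pixel feature and every class embedding.

    feat: (H, W, D) visual features; text_emb: (K, D) text embeddings.
    Returns an (H, W, K) response map, one channel per category.
    """
    return l2_normalize(feat) @ l2_normalize(text_emb).T

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H, W, D, K = 4, 5, 16, 3                # toy sizes, chosen arbitrarily
text_emb = rng.normal(size=(K, D))      # stand-in for encoded class names
rgb_feat = rng.normal(size=(H, W, D))   # stand-in for RGB backbone features
ir_feat = rng.normal(size=(H, W, D))    # stand-in for IR backbone features

# Both modalities are scored against the same text anchors,
# so their responses live in one shared category space.
rgb_resp = category_responses(rgb_feat, text_emb)
ir_resp = category_responses(ir_feat, text_emb)

# Assumed projection step: the thermal category prior is mapped back
# into feature space through the text embeddings and added to RGB.
ir_prior = softmax(ir_resp)                                # (H, W, K)
rgb_fused = rgb_feat + ir_prior @ l2_normalize(text_emb)   # (H, W, D)
```

The point of the sketch is only that scoring both modalities against identical text anchors makes `rgb_resp` and `ir_resp` directly comparable channel by channel, which raw RGB and IR features are not.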

What carries the argument

Semantic bridge fusion framework with bi-support modeling, where text acts as the shared semantic bridge for category-conditioned alignment and bi-support splits interactions into consensus and discrepancy components for recalibrated fusion.
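The split into two supports can be illustrated with a toy example. The sketch below assumes non-negative, text-aligned response maps and uses an element-wise minimum for consensus and an absolute residual for discrepancy; both operators and the function name `bi_support` are illustrative choices, since the summary does not give the paper's exact formulation.

```python
import numpy as np

def bi_support(rgb_resp, ir_resp):
    """Split two aligned response maps into agreement and disagreement.

    consensus: evidence both modalities report (element-wise minimum, assumed).
    discrepancy: evidence only one modality reports (absolute residual, assumed).
    """
    consensus = np.minimum(rgb_resp, ir_resp)
    discrepancy = np.abs(rgb_resp - ir_resp)
    return consensus, discrepancy

# Toy 2x2 response maps in [0, 1]; the values are made up for illustration.
rgb_resp = np.array([[0.9, 0.1],
                     [0.5, 0.0]])
ir_resp = np.array([[0.8, 0.7],
                    [0.5, 0.2]])

consensus, discrepancy = bi_support(rgb_resp, ir_resp)
```

With these particular operators, `consensus + discrepancy` equals the element-wise maximum of the two inputs, so the discrepancy term is exactly what a consensus-only fusion would discard from the stronger modality. In the top-right cell, for instance, IR responds strongly (0.7) while RGB barely does (0.1), and nearly all of that evidence sits in the discrepancy support.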

If this is right

  • Category-conditioned text alignment reduces granularity mismatch between RGB and IR modalities.
  • Incorporating discrepancy support alongside consensus captures cross-modal cues that standard fusion overlooks.
  • Bidirectional vision-text alignment creates a closed-loop guidance mechanism that strengthens semantic consistency.
  • The overall framework yields higher detection accuracy on multispectral object detection benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bi-support idea could transfer to other cross-sensor fusion settings such as RGB-depth or radar-vision pairs when semantic descriptions are available.
  • Actively modeling discrepancies rather than suppressing them may encourage new fusion designs that treat differences as signal.
  • Combining the framework with richer language models for text semantics offers a direct route to stronger bridging in future sensor setups.

Load-bearing premise

That text semantics can reliably serve as a shared bridge to align the inherently asymmetric granularities of RGB and IR, and that the discrepancy support contains consistently valuable discriminative cues rather than noise.

What would settle it

An ablation on a multispectral dataset with unreliable or absent text annotations in which the text bridge or the discrepancy modeling component yields no accuracy gain over the ablated model, or causes a measurable drop in detection performance.

Figures

Figures reproduced from arXiv: 2604.11234 by Enhao Huang, Gao Huang, Jiaqi Wu, Kangqing Shen, Yang Yue, Yifan Pu, Yulin Wang, Zhen Wang.

Figure 1: Comparison of different fusion paradigms. Unlike vanilla direct RGB–IR fusion and conditional prompt fusion, our …
Figure 2: Visualization of consensus–discrepancy activation …
Figure 3: Overview of the proposed framework. It consists of a semantic-bridge-guided dynamic fusion module for modeling …
Figure 4: Illustration of the proposed bi-support modeling.
Figure 5: Qualitative comparison on seven representative scenes from the FLIR dataset. The examples include daytime perception …
Figure 6: Perception performance comparison on FLIR. The …
Figure 7: Visualization of responses of consensus support …
Figure 9: Qualitative comparison on five representative scenes from the LLVIP dataset.
Figure 10: Qualitative comparison on five representative scenes from the M3FD dataset.
Figure 11: Comparison of different fusion paradigms on the …
Figure 12: Visualization of responses of consensus support …
Figure 13: Population-level trends of consensus and discrep…
Original abstract

Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a semantic bridge fusion framework for text-guided multispectral object detection. It uses text semantics as a shared bridge to align RGB and IR responses under unified category conditions, projects recalibrated thermal priors onto the RGB branch, formulates interaction evidence into regular consensus support and complementary discrepancy support with dynamic recalibration as inductive bias, and adds a bidirectional semantic alignment module for closed-loop vision-text guidance. The authors claim extensive experiments demonstrate superior detection performance on multispectral benchmarks, with code released.

Significance. If the bi-support modeling proves effective at extracting useful discrepancy cues, the framework could advance cross-modal fusion by explicitly addressing granularity asymmetry and overlooked discrepancies via text guidance, offering a structured alternative to standard attention-based methods. The public code release supports reproducibility and enables direct follow-up validation.

major comments (2)
  1. [bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.
  2. [experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.
minor comments (2)
  1. [abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., mAP improvement on a benchmark) to support the superiority claim.
  2. [method] Clarify the exact formulation of the discrepancy support (e.g., whether it is an attention difference or residual) and the recalibration operator to aid reproducibility.
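The concern in major comment 1, that a plain residual mixes signal and noise, can be made concrete with a short worked equation. The additive decomposition below is an assumption introduced here for illustration (the symbols C, S, and N do not come from the paper): each aligned feature map is written as a shared component, a modality-specific signal, and zero-mean sensor noise.

```latex
\begin{align}
F^{\mathrm{rgb}} &= C + S^{\mathrm{rgb}} + N^{\mathrm{rgb}}, \qquad
F^{\mathrm{ir}} = C + S^{\mathrm{ir}} + N^{\mathrm{ir}}, \\
D &= F^{\mathrm{rgb}} - F^{\mathrm{ir}}
   = \underbrace{\left(S^{\mathrm{rgb}} - S^{\mathrm{ir}}\right)}_{\text{useful discrepancy}}
   + \underbrace{\left(N^{\mathrm{rgb}} - N^{\mathrm{ir}}\right)}_{\text{noise}}, \\
\operatorname{Var}\!\left(N^{\mathrm{rgb}} - N^{\mathrm{ir}}\right)
  &= \sigma_{\mathrm{rgb}}^{2} + \sigma_{\mathrm{ir}}^{2}
  \quad \text{if } N^{\mathrm{rgb}} \text{ and } N^{\mathrm{ir}} \text{ are independent.}
\end{align}
```

Under this assumption the shared component cancels in the residual while the two noise variances add, so whatever operates on D afterwards has to carry the whole burden of separating signal from noise. Registration errors are not captured by the additive model and would make the cancellation of C only approximate.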

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: [bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.

    Authors: We appreciate the referee highlighting the need for clearer justification of the discrepancy support. The discrepancy support is formulated as the element-wise residual between text-aligned RGB and IR feature maps after the semantic bridge alignment, intended to capture modality-specific variations that aid detection under granularity asymmetry. The dynamic recalibration employs a learned, modality-aware gating function (implemented via convolutional layers followed by sigmoid activation) that adaptively scales the supports, serving as an inductive bias to down-weight inconsistent or noisy regions. We acknowledge that the original manuscript provided insufficient mathematical derivation and explicit discussion of noise separation. In the revision, we will add a detailed formulation of the bi-support computation, explain the recalibration's role in implicit suppression, and include additional analysis (e.g., visualization of discrepancy maps) to demonstrate separation from common artifacts. Existing ablations already indicate performance gains when discrepancy support is included versus consensus-only fusion. revision: yes

  2. Referee: [experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.

    Authors: We thank the referee for noting this. The abstract is intentionally concise and omits specific numbers per standard practice, but the full Experiments section (Section 4) reports quantitative mAP results on multiple multispectral benchmarks, direct comparisons to state-of-the-art fusion baselines, and comprehensive ablations isolating each proposed component (including consensus vs. bi-support). The public code release further enables verification. To improve accessibility, we will revise the abstract to include a brief summary of key performance gains and ensure all metrics are prominently tabulated in the main text. revision: partial
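The gating described in the first response (convolutional layers followed by a sigmoid that scales the two supports) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released code: a 1x1 convolution is written as a per-pixel linear map over channels, one scalar gate per support is used, the weights are random rather than trained, and the names `recalibrate`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recalibrate(consensus, discrepancy, weight, bias):
    """Gate each support per pixel, then sum the gated supports.

    consensus, discrepancy: (H, W, C) support maps.
    weight: (2C, 2), bias: (2,). Together they act as a 1x1 convolution,
    i.e. the same linear map applied to the channel vector at every pixel.
    """
    stacked = np.concatenate([consensus, discrepancy], axis=-1)  # (H, W, 2C)
    gates = sigmoid(stacked @ weight + bias)                     # (H, W, 2)
    g_con, g_dis = gates[..., :1], gates[..., 1:]                # broadcast over C
    fused = g_con * consensus + g_dis * discrepancy
    return fused, gates

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                            # toy sizes
consensus = rng.random((H, W, C))            # stand-in support maps in [0, 1)
discrepancy = rng.random((H, W, C))
weight = 0.1 * rng.normal(size=(2 * C, 2))   # untrained, illustrative weights
bias = np.zeros(2)

fused, gates = recalibrate(consensus, discrepancy, weight, bias)

# A strongly negative bias on the discrepancy gate closes it almost fully,
# which is the kind of implicit suppression the response refers to.
closed_bias = np.array([0.0, -20.0])
fused_closed, gates_closed = recalibrate(consensus, discrepancy, weight, closed_bias)
```

Because a sigmoid gate never reaches exactly zero, suppression of noisy discrepancy regions in a design like this is learned and soft rather than guaranteed, which is consistent with the referee's remark that no explicit suppression mechanism is stated.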

Circularity Check

0 steps flagged

No circularity: architectural framework with independent design choices

Full rationale

The paper introduces a semantic bridge fusion framework and bi-support modeling as a methodological proposal for multispectral detection. Text is positioned as a shared semantic bridge and discrepancy support is formulated as a complementary term containing 'potentially discriminative cues' via dynamic recalibration; these are explicit design decisions and inductive biases rather than quantities derived from or equivalent to fitted inputs by construction. No equations are shown that reduce outputs to self-defined inputs, no predictions are fitted parameters renamed, and no load-bearing self-citations or uniqueness theorems appear in the abstract or description. The chain is self-contained as a new architecture without reduction to its own data or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the framework rests on domain assumptions about text alignment and discrepancy value rather than new mathematical axioms or invented physical entities.

axioms (2)
  • domain assumption: Text can act as a shared semantic bridge to align RGB and IR under a unified category condition.
    Invoked to justify the projection of thermal prior onto RGB branch.
  • domain assumption: Cross-modal discrepancies contain potentially valuable discriminative cues that should be modeled separately from consensus.
    Basis for introducing consensus and discrepancy supports via dynamic recalibration.

pith-pipeline@v0.9.0 · 5545 in / 1405 out tokens · 80025 ms · 2026-05-10T15:16:47.685828+00:00 · methodology

