Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

Enhao Huang; Gao Huang; Jiaqi Wu; Kangqing Shen; Yang Yue; Yifan Pu; Yulin Wang; Zhen Wang

arxiv: 2604.11234 · v1 · submitted 2026-04-13 · 💻 cs.CV

Jiaqi Wu, Zhen Wang, Enhao Huang, Kangqing Shen, Yulin Wang, Yang Yue, Yifan Pu, Gao Huang

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords multispectral object detection · text-guided fusion · RGB-IR alignment · consensus discrepancy modeling · semantic bridge · cross-modal interaction · bi-support fusion

The pith

Text semantics serve as a shared bridge to align RGB and IR features by modeling consensus and discrepancy supports in multispectral detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a semantic bridge fusion framework that treats text as a common semantic anchor to align responses from RGB and infrared images under unified category conditions. It projects recalibrated thermal semantic priors onto the RGB branch for mapping fusion while splitting cross-modal evidence into a regular consensus support and a complementary discrepancy support that supplies additional discriminative cues. These supports are introduced through dynamic recalibration as an inductive bias, and a bidirectional semantic alignment module closes the vision-text guidance loop. The approach moves beyond treating text as mere auxiliary enhancement or relying only on stable consensus in attention-based fusion. Experiments confirm gains in detection accuracy on multispectral benchmarks.

Core claim

Text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion; RGB-IR interaction evidence is formulated into regular consensus support and complementary discrepancy support introduced via dynamic recalibration as a structured inductive bias, together with a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement.
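
To make the "shared semantic bridge" idea concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that category-conditioned alignment can be approximated by scoring every spatial location of each modality against the same text embedding, and that projecting the thermal prior onto the RGB branch can be approximated by a multiplicative modulation. The function names (`category_response`, `project_thermal_prior`), the cosine scoring, and the toy values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def category_response(features, text_embedding):
    """Score each per-location feature vector against one category's text embedding.

    (cos + 1) / 2 rescales cosine similarity from [-1, 1] to [0, 1], so RGB and IR
    responses end up on the same category-conditioned scale.
    """
    return [(cosine(f, text_embedding) + 1.0) / 2.0 for f in features]

def project_thermal_prior(rgb_features, ir_response):
    """Assumed projection: scale each RGB feature by (1 + thermal response) at that location."""
    return [[x * (1.0 + r) for x in f] for f, r in zip(rgb_features, ir_response)]

# Toy data: 3 spatial locations, 2-dimensional features (hypothetical values).
text_person = [1.0, 0.0]
rgb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ir = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

rgb_resp = category_response(rgb, text_person)   # same text anchor for both modalities
ir_resp = category_response(ir, text_person)
fused_rgb = project_thermal_prior(rgb, ir_resp)
```

Because both response maps are computed against the same text vector, they can be compared location by location even though the raw RGB and IR features live at different scales, which is the property the core claim relies on.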

What carries the argument

Semantic bridge fusion framework with bi-support modeling, where text acts as the shared semantic bridge for category-conditioned alignment and bi-support splits interactions into consensus and discrepancy components for recalibrated fusion.
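
The bi-support split can be illustrated with a short sketch. This is an illustration under assumptions, not the paper's formulation: consensus is taken as the element-wise product of the two category-conditioned responses, discrepancy as their absolute difference, and dynamic recalibration as a sigmoid gate with fixed scalar weights (`w_c`, `w_d`, `bias`), where the paper would presumably learn them. All names and numbers are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bi_support(rgb_resp, ir_resp):
    """Split per-location RGB/IR responses into consensus and discrepancy supports."""
    # Consensus is high only where both modalities respond strongly.
    consensus = [a * b for a, b in zip(rgb_resp, ir_resp)]
    # Discrepancy is high where one modality responds and the other does not.
    discrepancy = [abs(a - b) for a, b in zip(rgb_resp, ir_resp)]
    return consensus, discrepancy

def recalibration_gate(consensus, discrepancy, w_c=1.0, w_d=1.0, bias=0.0):
    """Per-location gate in (0, 1); w_c, w_d, bias stand in for learned parameters."""
    return [sigmoid(w_c * c + w_d * d + bias) for c, d in zip(consensus, discrepancy)]

def fuse(rgb_feat, ir_feat, gate):
    """Gate the summed scalar features at each location (assumed fusion rule)."""
    return [g * (r + i) for r, i, g in zip(rgb_feat, ir_feat, gate)]

# Toy responses at 3 locations: agree-high, disagree, agree-low (hypothetical values).
rgb_resp = [0.9, 0.8, 0.1]
ir_resp = [0.9, 0.1, 0.1]

consensus, discrepancy = bi_support(rgb_resp, ir_resp)
gate_bi = recalibration_gate(consensus, discrepancy)
gate_consensus_only = recalibration_gate(consensus, discrepancy, w_d=0.0)
fused = fuse([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gate_bi)
```

At the second location the modalities disagree, so a consensus-only gate stays near its baseline while the bi-support gate opens further. In this sketch any disagreement raises the gate, whether it comes from a real cue or from misregistration, so the gate parameters alone decide what is kept.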

If this is right

Category-conditioned text alignment reduces granularity mismatch between RGB and IR modalities.
Incorporating discrepancy support alongside consensus captures cross-modal cues that standard fusion overlooks.
Bidirectional vision-text alignment creates a closed-loop guidance mechanism that strengthens semantic consistency.
The overall framework yields higher detection accuracy on multispectral object detection benchmarks.
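
The closed-loop guidance in the third point can be sketched as two attention passes, one in each direction. This is a hedged illustration only: it assumes single-head dot-product attention and a fixed residual mixing weight `alpha`, neither of which is confirmed by the abstract. The function names and toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys_values):
    """Single-head dot-product attention: each query becomes a weighted mean of keys_values."""
    out = []
    for q in queries:
        w = softmax([dot(q, k) for k in keys_values])
        out.append([sum(wi * k[j] for wi, k in zip(w, keys_values)) for j in range(len(q))])
    return out

def mix(old, ctx, alpha):
    """Residual update: keep (1 - alpha) of the old vectors, add alpha of the attended context."""
    return [[(1.0 - alpha) * o + alpha * c for o, c in zip(ov, cv)] for ov, cv in zip(old, ctx)]

def bidirectional_align(visual, text, alpha=0.5):
    # Vision -> text: each category embedding absorbs visual evidence.
    text_new = mix(text, attend(text, visual), alpha)
    # Text -> vision: each location absorbs the refreshed text semantics, closing the loop.
    visual_new = mix(visual, attend(visual, text_new), alpha)
    return visual_new, text_new

# Toy data: 3 fused visual locations and 2 category embeddings (hypothetical values).
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[1.0, 0.0], [0.0, 1.0]]
visual_new, text_new = bidirectional_align(visual, text)
```

Setting `alpha=0.0` disables both passes and returns the inputs unchanged, which gives a convenient baseline for checking whether the loop contributes anything.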

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The bi-support idea could transfer to other cross-sensor fusion settings such as RGB-depth or radar-vision pairs when semantic descriptions are available.
Actively modeling discrepancies rather than suppressing them may encourage new fusion designs that treat differences as signal.
Combining the framework with richer language models for text semantics offers a direct route to stronger bridging in future sensor setups.

Load-bearing premise

That text semantics can reliably serve as a shared bridge to align inherently asymmetric RGB and IR granularities, and that the discrepancy support contains consistently valuable discriminative cues rather than noise.

What would settle it

An ablation on a multispectral dataset with unreliable or absent text annotations. If removing the text bridge or the discrepancy modeling component leaves accuracy unchanged, the component is not contributing; if removal produces a measurable drop in detection performance, it is.
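
Such an ablation reduces to comparing the full model against each ablated variant with a tolerance for run-to-run variation. The helper below is a minimal sketch of that bookkeeping. The variant names, the tolerance, and the scores are placeholders chosen for illustration; they are not results from the paper.

```python
def ablation_deltas(scores, full_key="full"):
    """Return the full model's score minus each ablated variant's score."""
    full = scores[full_key]
    return {name: full - s for name, s in scores.items() if name != full_key}

def verdict(delta, tolerance):
    """Classify a score drop relative to a noise tolerance (same units as the scores)."""
    if delta > tolerance:
        return "component helps"
    if delta < -tolerance:
        return "component hurts"
    return "no measurable effect"

# Placeholder mAP values, NOT taken from the paper.
placeholder_scores = {"full": 50.0, "no_text_bridge": 47.0, "no_discrepancy": 50.0}

deltas = ablation_deltas(placeholder_scores)
report = {name: verdict(d, tolerance=0.5) for name, d in deltas.items()}
```

The tolerance matters as much as the deltas: without an estimate of seed-to-seed variance, a small gain from the discrepancy branch cannot be distinguished from noise.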

Figures

Figures reproduced from arXiv: 2604.11234 by Enhao Huang, Gao Huang, Jiaqi Wu, Kangqing Shen, Yang Yue, Yifan Pu, Yulin Wang, Zhen Wang.

**Figure 1.** Comparison of different fusion paradigms. Unlike vanilla direct RGB–IR fusion and conditional prompt fusion, our ...

**Figure 2.** Visualization of consensus–discrepancy activation ...

**Figure 3.** Overview of the proposed framework. It consists of a semantic-bridge-guided dynamic fusion module for modeling ...

**Figure 4.** Illustration of the proposed bi-support modeling.

**Figure 5.** Qualitative comparison on seven representative scenes from the FLIR dataset. The examples include daytime perception ...

**Figure 6.** Perception performance comparison on FLIR. The ...

**Figure 7.** Visualization of responses of consensus support ...

**Figure 9.** Qualitative comparison on five representative scenes from the LLVIP dataset.

**Figure 10.** Qualitative comparison on five representative scenes from the M3FD dataset.

**Figure 11.** Comparison of different fusion paradigms on the ...

**Figure 12.** Visualization of responses of consensus support ...

**Figure 13.** Population-level trends of consensus and discrep...

Original abstract

Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds text as an explicit semantic bridge plus a split into consensus and discrepancy supports for RGB-IR fusion, but the discrepancy term risks pulling in noise without a clear isolation step.

The main contribution is a fusion architecture that treats text as a shared category-level bridge to align RGB and IR features, then splits the cross-modal evidence into a regular consensus part and a complementary discrepancy part that gets dynamically recalibrated before fusion. They also add a bidirectional semantic alignment module to close the vision-text loop. This is a step beyond plain attention-based multimodal fusion because it explicitly tries to use the text to handle granularity differences and to treat discrepancies as potential signal rather than just averaging them away. The code release is a plus for checking the implementation. The experiments are described as showing gains on standard multispectral benchmarks, which is the usual way these papers demonstrate value.

The soft spot is exactly the one in the stress test: the discrepancy support is presented as containing useful cues, but the description does not show a dedicated filter or separation step that would keep out registration errors, sensor-specific artifacts, or illumination noise. If the discrepancy is computed as a simple difference or residual, the recalibration step has to do all the work of suppressing the bad parts, and it is not obvious from the framing that it does so reliably. Without seeing the actual ablation numbers and failure cases, it is hard to tell whether the added complexity pays off or just adds variance.

This is aimed at people already working on RGB-IR or multispectral detection who want a text-guided variant. It is solid enough on the design side to deserve a serious referee, even if the results section needs close scrutiny on whether the discrepancy modeling actually delivers clean gains.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a semantic bridge fusion framework for text-guided multispectral object detection. It uses text semantics as a shared bridge to align RGB and IR responses under unified category conditions, projects recalibrated thermal priors onto the RGB branch, formulates interaction evidence into regular consensus support and complementary discrepancy support with dynamic recalibration as inductive bias, and adds a bidirectional semantic alignment module for closed-loop vision-text guidance. The authors claim extensive experiments demonstrate superior detection performance on multispectral benchmarks, with code released.

Significance. If the bi-support modeling proves effective at extracting useful discrepancy cues, the framework could advance cross-modal fusion by explicitly addressing granularity asymmetry and overlooked discrepancies via text guidance, offering a structured alternative to standard attention-based methods. The public code release supports reproducibility and enables direct follow-up validation.

major comments (2)

[bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.
[experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.

minor comments (2)

[abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., mAP improvement on a benchmark) to support the superiority claim.
[method] Clarify the exact formulation of the discrepancy support (e.g., whether it is an attention difference or residual) and the recalibration operator to aid reproducibility.
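
For readers less familiar with the metric the referee asks for, the sketch below shows one common way to compute average precision (AP) for a single category from ranked detections, and mAP as the mean over categories. It assumes detections have already been matched to ground truth (the true-positive flags are inputs) and uses all-point interpolation. Benchmarks differ in matching rules and IoU thresholds, so this is illustrative rather than the paper's evaluation protocol.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP for one category.

    scores: confidence per detection; is_true_positive: matching flag per detection;
    num_ground_truth: number of ground-truth boxes for this category.
    """
    if num_ground_truth == 0 or not scores:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the area under the precision-recall envelope.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(per_category_ap):
    """mAP as the unweighted mean of per-category AP values."""
    return sum(per_category_ap) / len(per_category_ap)

# Toy example: 3 detections, 2 ground-truth objects (hypothetical values).
ap_example = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```

Reporting AP per category alongside mAP would also show whether any gain from discrepancy modeling is concentrated in classes where one modality is weak.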

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.

Referee: [bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.

Authors: We appreciate the referee highlighting the need for clearer justification of the discrepancy support. The discrepancy support is formulated as the element-wise residual between text-aligned RGB and IR feature maps after the semantic bridge alignment, intended to capture modality-specific variations that aid detection under granularity asymmetry. The dynamic recalibration employs a learned, modality-aware gating function (implemented via convolutional layers followed by sigmoid activation) that adaptively scales the supports, serving as an inductive bias to down-weight inconsistent or noisy regions. We acknowledge that the original manuscript provided insufficient mathematical derivation and explicit discussion of noise separation. In the revision, we will add a detailed formulation of the bi-support computation, explain the recalibration's role in implicit suppression, and include additional analysis (e.g., visualization of discrepancy maps) to demonstrate separation from common artifacts. Existing ablations already indicate performance gains when discrepancy support is included versus consensus-only fusion.

revision: yes

Referee: [experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.

Authors: We thank the referee for noting this. The abstract is intentionally concise and omits specific numbers per standard practice, but the full Experiments section (Section 4) reports quantitative mAP results on multiple multispectral benchmarks, direct comparisons to state-of-the-art fusion baselines, and comprehensive ablations isolating each proposed component (including consensus vs. bi-support). The public code release further enables verification. To improve accessibility, we will revise the abstract to include a brief summary of key performance gains and ensure all metrics are prominently tabulated in the main text.

revision: partial

Circularity Check

0 steps flagged

No circularity: architectural framework with independent design choices

The paper introduces a semantic bridge fusion framework and bi-support modeling as a methodological proposal for multispectral detection. Text is positioned as a shared semantic bridge and discrepancy support is formulated as a complementary term containing 'potentially discriminative cues' via dynamic recalibration; these are explicit design decisions and inductive biases rather than quantities derived from or equivalent to fitted inputs by construction. No equations are shown that reduce outputs to self-defined inputs, no predictions are fitted parameters renamed, and no load-bearing self-citations or uniqueness theorems appear in the abstract or description. The chain is self-contained as a new architecture without reduction to its own data or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the framework rests on domain assumptions about text alignment and discrepancy value rather than new mathematical axioms or invented physical entities.

axioms (2)

domain assumption: Text can act as a shared semantic bridge to align RGB and IR under a unified category condition.
Invoked to justify the projection of thermal prior onto RGB branch.
domain assumption: Cross-modal discrepancies contain potentially valuable discriminative cues that should be modeled separately from consensus.
Basis for introducing consensus and discrepancy supports via dynamic recalibration.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...

work page 2022
[2]

Jiale Cao, Yanwei Pang, Jin Xie, Fahad Shahbaz Khan, and Ling Shao. 2021. From handcrafted to deep features for pedestrian detection: A survey.IEEE transactions on pattern analysis and machine intelligence44, 9 (2021), 4913–4934

work page 2021
[3]

Yishuo Chen, Boran Wang, Wenbin Zhu, and Jing Yuan. 2024. RGB-IR YOLO combining Modality-Specific Reconstruction and Information Integration. In 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC). 2045–2050. doi:10.1109/YAC63405.2024.10598725

work page doi:10.1109/yac63405.2024.10598725 2024
[4]

Yung-Yao Chen, Sin-Ye Jhong, Hsin-Chun Lin, and Yi-Chen Wu. 2025. Vision- Language-Guided Adaptive Cross-Modal Fusion for Multispectral Object Detec- tion Under Adverse Weather Conditions.IEEE MultiMedia32, 2 (2025), 22–32. doi:10.1109/MMUL.2025.3525559

work page doi:10.1109/mmul.2025.3525559 2025
[5]

Zhenshuai Chen, Wei Xiang, Zhiyuan Lin, Kaixuan Yang, Yunpeng Liu, and Zelin Shi. 2025. Alignment-assisted Frequency Fusion Network for RGB-infrared vehicle detection.Neurocomputing647 (2025), 130505. doi:10.1016/j.neucom. 2025.130505

work page doi:10.1016/j.neucom 2025
[6]

Xiaolong Cheng, Keke Geng, Ziwei Wang, Jinhu Wang, Yuxiao Sun, and Pengbo Ding. 2023. SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment.Multimedia Tools Appl.82, 30 (May 2023), 47773–47792. doi:10.1007/s11042-023-15333-w

work page doi:10.1007/s11042-023-15333-w 2023
[7]

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Guodong Guo, and Baochang Zhang. 2025. Fusion-Mamba for Cross-Modality Object Detection.IEEE Transactions on Multimedia27 (2025), 7392–7406. doi:10.1109/TMM.2025.3599020

work page doi:10.1109/tmm.2025.3599020 2025
[8]

Jiaming Han, Jian Ding, Nan Xue, and Gui-Song Xia. 2021. ReDet: A Rotation- Equivariant Detector for Aerial Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2786–2795

work page 2021
[9]

Xiao He, Chang Tang, Xin Zou, and Wei Zhang. 2023. Multispectral Object Detection via Cross-Modal Conflict-Aware Learning. InProceedings of the 31st ACM International Conference on Multimedia (ACM MM). 1465–1474. doi:10.1145/ 3581783.3612651

work page arXiv 2023
[10]

Lian Huang, Zongju Peng, Fen Chen, Shaosheng Dai, Ziqiang He, and Kesh- eng Liu. 2024. Cross-Modality Interaction for Few-Shot Multispectral Object Detection with Semantic Knowledge.Neural Networks173 (2024), 106156. doi:10.1016/j.neunet.2024.106156

work page doi:10.1016/j.neunet.2024.106156 2024
[11]

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon

work page
[12]

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1037–1045

work page
[13]

Junbo Jang, Chanyeong Park, Heegwang Kim, Jiyoon Lee, and Joonki Paik

work page
[14]

In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV)

Multispectral Object Detection Enhanced by Cross-Modal Information Complementary and Cosine Similarity Channel Resampling Modules. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). 9437–

work page
[15]

doi:10.1109/WACV61041.2025.00914

work page doi:10.1109/wacv61041.2025.00914 2025
[16]

Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. 2021. LLVIP: A visible-infrared paired dataset for low-light vision. InProceedings of the IEEE/CVF international conference on computer vision. 3496–3504

work page 2021
[17]

2020.Ultralytics YOLOv5

Glenn Jocher. 2020.Ultralytics YOLOv5. doi:10.5281/zenodo.3908559

work page doi:10.5281/zenodo.3908559 2020
[18]

2023.Ultralytics YOLOv8

Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023.Ultralytics YOLOv8. https: //github.com/ultralytics/ultralytics

work page 2023
[19]

Xudong Kang, Hui Yin, and Puhong Duan. 2024. Global–Local Feature Fusion Network for Visible–Infrared Vehicle Detection.IEEE Geoscience and Remote Sensing Letters21 (2024), 1–5. doi:10.1109/LGRS.2024.3375634

work page doi:10.1109/lgrs.2024.3375634 2024
[20]

Jung Uk Kim, Sungjune Park, and Yong Man Ro. 2021. Uncertainty-guided cross- modal learning for robust multispectral pedestrian detection.IEEE Transactions on Circuits and Systems for Video Technology32, 3 (2021), 1510–1523

work page 2021
[21]

Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. 2019. Illumination-aware faster R-CNN for robust multispectral pedestrian detection.Pattern Recognition 85 (2019), 161–171

work page 2019
[22]

Hanyun Li, Linsong Xiao, Lihua Cao, Di Wu, Yangfan Liu, Yi Li, Yunfeng Zhang, and Haiyang Bao. 2026. CrossModalNet: A dual-modal object detection network based on cross-modal fusion and channel interaction.Expert Systems with Applications298 (2026), 129677. doi:10.1016/j.eswa.2025.129677

work page doi:10.1016/j.eswa.2025.129677 2026
[23]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProceedings of the 40th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 202). 19730–19742

work page 2023
[24]

Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven C

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2021
[25]

Ting Li, Songtao Li, Shuaifeng Li, Xiaolin Qin, Maoyuan Zhao, Luping Ji, and Mao Ye. 2025. SAM-Guided Semantic Knowledge Fusion for Visible-Infrared Object Detection. InProceedings of the 33rd ACM International Conference on Multimedia (ACM MM). 8835–8844. doi:10.1145/3746027.3755718

work page doi:10.1145/3746027.3755718 2025
[26]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2017. Focal Loss for Dense Object Detection. InProceedings of the IEEE International Conference on Computer Vision (ICCV)

work page 2017
[27]

Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. 2022. Target-aware dual adversarial learning and a multi- scenario multi-modality benchmark to fuse infrared and visible for object detec- tion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5802–5811

work page 2022
[28]

Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, and Risheng Liu. 2024. Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Machine Intelligence47, 4 (2024), 2349–2369

work page 2024
[29]

Jingjing Liu, Shaoting Zhang, Shu Wang, and Dimitris N. Metaxas. 2016. Multi- spectral Deep Neural Networks for Pedestrian Detection. InProceedings of the British Machine Vision Conference (BMVC). 73.1–73.13. doi:10.5244/C.30.73

work page doi:10.5244/c.30.73 2016
[30]

Xiaowen Liu, Hongtao Huo, Jing Li, Shan Pang, and Bowen Zheng. 2024. A Semantic-Driven Coupled Network for Infrared and Visible Image Fusion.Infor- mation Fusion108 (2024), 102352. doi:10.1016/j.inffus.2024.102352

work page doi:10.1016/j.inffus.2024.102352 2024
[31]

Xianhui Liu, Siqi Jiang, Yi Xie, Yuqing Lin, and Siao Liu. 2026. Modal- ity Dominance-Aware Optimization for Embodied RGB-Infrared Perception. arXiv:2601.00598 [cs.CV]

work page arXiv 2026
[32]

Junlin Ouyang, Pengcheng Jin, and Qingwang Wang. 2024. Multimodal Feature- Guided Pretraining for RGB-T Perception.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing17 (2024), 16041–16050. doi:10. 1109/JSTARS.2024.3454054

work page arXiv 2024
[33]

Xiangyu Qin, Enlong Wang, Shihua Zhou, Bin Wang, and Nikola K. Kasabov

work page
[34]

Masset, R

TSPFusion: Text-Guided Semantic Perception for Infrared and Visible Image Fusion.Infrared Physics & Technology153 (2026), 106324. doi:10.1016/j. infrared.2025.106324

work page doi:10.1016/j 2026
[35]

Fang Qingyun, Han Dapeng, and Wang Zhaokui. 2021. Cross-modality fusion transformer for multispectral object detection.arXiv preprint arXiv:2111.00273 (2021)

work page arXiv 2021
[36]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand- hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

work page
[37]

In International conference on machine learning

Learning transferable visual models from natural language supervision. In International conference on machine learning. PmLR, 8748–8763

work page
[38]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Net- works. InAdvances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/ ...

work page 2015
[39]

Jifeng Shen, Yifei Chen, Yue Liu, Xin Zuo, Heng Fan, and Wankou Yang. 2024. ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognition145 (2024), 109913. doi:10.1016/j.patcog. 2023.109913

work page doi:10.1016/j.patcog 2024
[40]

Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, and Haibin Ling. 2026. Multispectral state-space feature fusion: Bridging shared and cross- parametric interactions for object detection.Information Fusion127 (2026), 103895. doi:10.1016/j.inffus.2025.103895

work page doi:10.1016/j.inffus.2025.103895 2026
[41]

Dongdong Sun, Chuanyun Wang, Tian Wang, Qian Gao, Qiong Liu, and Linlin Wang. 2025. CLIPFusion: Infrared and Visible Image Fusion Network Based on Image–Text Large Model and Adaptive Learning.Displays89 (2025), 103042. doi:10.1016/j.displa.2025.103042

work page doi:10.1016/j.displa.2025.103042 2025
[42]

Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. 2022. Drone-based RGB- infrared cross-modality vehicle detection via uncertainty-aware learning.IEEE Transactions on Circuits and Systems for Video Technology32, 10 (2022), 6700– 6713

work page 2022
[43]

Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, and Jiayi Ma. 2025. ControlFusion: A Controllable Image Fusion Network with Language-Vision Degradation Prompts. InAdvances in Neural Information Processing Systems

work page 2025
[44]

Teledyne FLIR LLC. 2021. Teledyne FLIR Free Starter Thermal Dataset for Algo- rithm Training. https://adas-dataset-v2.flirconservator.com/dataset/README. txt. Accessed: 2026-04-02

work page 2021
[45]

Changhai Wang, Zhe Huang, Yuwei Xu, Wanwei Huang, and Yuan Tian. 2026. FDFusion: Efficient text-guided infrared-visible image fusion via fine-tuned light- weight VLM and Dual-branch feature modeling.Infrared Physics & Technology (2026), 106499

work page 2026
[46]

Enlong Wang, Jiawei Li, Jia Lei, Jinyuan Liu, Shihua Zhou, Bin Wang, and Nikola K. Kasabov. 2024. SDFuse: Semantic-Injected Dual-Flow Learning for Infrared and Visible Image Fusion.Expert Systems with Applications252, Part B (2024), 124188. doi:10.1016/j.eswa.2024.124188

work page doi:10.1016/j.eswa.2024.124188 2024
[47]

Huiying Wang, Chunping Wang, Qiang Fu, Binqiang Si, Dongdong Zhang, Renke Kou, Ying Yu, and Changfeng Feng. 2024. YOLOFIV: Object Detection Algorithm for Around-the-Clock Aerial Remote Sensing Images by Fusing Infrared and Visible Features.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing17 (2024), 15269–15287. doi:10.1109/J...

work page doi:10.1109/jstars.2024.3447649 2024
[48]

Huiying Wang, Chunping Wang, Qiang Fu, Dongdong Zhang, Renke Kou, Ying Yu, and Jian Song. 2024. Cross-Modal Oriented Object Detection of UAV Aerial Images Based on Image Feature.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–21. doi:10.1109/TGRS.2024.3367934

work page doi:10.1109/tgrs.2024.3367934 2024
[49]

Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuhan Liu, Zongxu Pan, and Yuxin Hu. 2026. SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection. arXiv:2601.02249 [cs.CV]

work page arXiv 2026
[50]

Chenjia Yang, Xiaoqing Luo, Zhancheng Zhang, Zhiguo Chen, and Xiao jun Wu

work page
[51]

doi:10.1016/j.inffus.2025.102944

KDFuse: A High-Level Vision Task-Driven Infrared and Visible Image Fusion Method Based on Cross-Domain Knowledge Distillation.Information Fusion118 (2025), 102944. doi:10.1016/j.inffus.2025.102944

work page doi:10.1016/j.inffus.2025.102944 2025
[52]

Zengyi Yang, Yunping Li, Xin Tang, and Minghong Xie. 2024. MGFusion: A Multimodal Large Language Model-Guided Information Perception for Infrared and Visible Image Fusion.Frontiers in Neurorobotics18 (2024), 1521603. doi:10. 3389/fnbot.2024.1521603

work page arXiv 2024
[53]

Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. 2024. Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

work page 2024
[54]

Maoxun Yuan, Xiaorong Shi, Nan Wang, Yinyan Wang, and Xingxing Wei. 2024. Improving RGB-Infrared Object Detection with Cascade Alignment-Guided Transformer.Information Fusion105 (2024), 102246. doi:10.1016/j.inffus.2024. 102246

work page doi:10.1016/j.inffus.2024 2024
[55]

Maoxun Yuan and Xingxing Wei. 2024. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–12. doi:10.1109/TGRS.2024.3376819

work page doi:10.1109/tgrs.2024.3376819 2024
[56]

Jun-Seok Yun, Seon-Hoo Park, and Seok Bong Yoo. 2022. Infusion-Net: Inter- and Intra-Weighted Cross-Fusion Network for Multispectral Object Detection. Mathematics10, 21 (2022), 3966. doi:10.3390/math10213966

work page doi:10.3390/math10213966 2022
[57]

Yuqiao Zeng, Tengfei Liang, Yi Jin, and Yidong Li. 2024. MMI-Det: Exploring Multi-Modal Integration for Visible and Infrared Object Detection.IEEE Trans- actions on Circuits and Systems for Video Technology34 (2024), 11198–11213. doi:10.1109/TCSVT.2024.3418965

work page doi:10.1109/tcsvt.2024.3418965 2024
[58]

Hao Zhang, Lei Cao, Xuhui Zuo, Zhenfeng Shao, and Jiayi Ma. 2025. Omnifuse: Composite degradation-robust image fusion with language-driven semantics. IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

[59]

Heng Zhang, Elisa Fromont, Sébastien Lefèvre, and Bruno Avignon. 2020. Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks. In 2020 IEEE International Conference on Image Processing (ICIP). 276–280. doi:10.1109/ICIP40778.2020.9191080

[60]

Heng Zhang, Elisa Fromont, Sebastien Lefevre, and Bruno Avignon. 2021. Guided Attentive Feature Fusion for Multispectral Pedestrian Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 72–80. doi:10.1109/WACV48630.2021.00012

[61]

Jianming Zhang, Xiangnan Shi, Zhijian Feng, Yan Gui, and Jin Wang. 2025. TMCN: Text-Guided Mamba-CNN Dual-Encoder Network for Infrared and Visible Image Fusion. Infrared Physics & Technology 149 (2025), 105895. doi:10.1016/j.infrared.2025.105895

[62]

Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Dense Distinct Query for End-to-End Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7329–7338

[63]

Xingchen Zhang and Yiannis Demiris. 2023. Visible and infrared image fusion using deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 8 (2023), 10535–10554

[64]

Yan Zhang, Huai Yu, Yujie He, Xinya Wang, and Wen Yang. 2023. Illumination-Guided RGBT Object Detection With Inter- and Intra-Modality Fusion. IEEE Transactions on Instrumentation and Measurement 72 (2023), 1–13. doi:10.1109/TIM.2023.3251414

[65]

Tianyi Zhao, Maoxun Yuan, Feng Jiang, Nan Wang, and Xingxing Wei. 2026. Removal Then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection. IEEE Transactions on Intelligent Transportation Systems 27, 2 (2026), 2504–2519. doi:10.1109/TITS.2025.3638627

[66]

Kailai Zhou, Linsen Chen, and Xun Cao. 2020. Improving multispectral pedestrian detection by addressing modality imbalance problems. In European conference on computer vision. Springer, 787–803

[67]

Minghang Zhou, Tianyu Li, Chaofan Qiao, Dongyu Xie, Guoqing Wang, Ningjuan Ruan, Lin Mei, Yang Yang, and Heng Tao Shen. 2025. DMM: Disparity-Guided Multispectral Mamba for Oriented Object Detection in Remote Sensing. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1–13. doi:10.1109/TGRS.2025.3578309

[68]

Mingliang Zhou, Yunyao Li, Guangchao Yang, Xuekai Wei, Huayan Pu, Jun Luo, and Weijia Jia. 2025. COFNet: Contrastive Object-Aware Fusion Using Box-Level Masks for Multispectral Object Detection. IEEE Transactions on Multimedia 27 (2025), 7444–7458. doi:10.1109/TMM.2025.3599097

[69]

Wei Zhou, Yingyuan Wang, Lina Zuo, Yuan Gao, and Yugen Yi. 2024. High-Level Vision Task-Driven Infrared and Visible Image Fusion Approach: Progressive Semantic Enhancement based Multi-Scale Cross-Modality Interactive Network. Measurement 237 (2024), 114977. doi:10.1016/j.measurement.2024.114977

[70]

Haodong Zhu, Wenhao Dong, Linlin Yang, Hong Li, Yuguang Yang, Yangyang Ren, Qingcheng Zhu, Zichao Feng, Changbai Li, Shaohui Lin, Runqi Wang, Xiaoyan Luo, and Baochang Zhang. 2025. WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 11219–11229
