Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection
Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3
The pith
Text semantics serve as a shared bridge to align RGB and IR features by modeling consensus and discrepancy supports in multispectral detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion; RGB-IR interaction evidence is formulated into regular consensus support and complementary discrepancy support introduced via dynamic recalibration as a structured inductive bias, together with a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement.
What carries the argument
Semantic bridge fusion framework with bi-support modeling, where text acts as the shared semantic bridge for category-conditioned alignment and bi-support splits interactions into consensus and discrepancy components for recalibrated fusion.
If this is right
- Category-conditioned text alignment reduces granularity mismatch between RGB and IR modalities.
- Incorporating discrepancy support alongside consensus captures cross-modal cues that standard fusion overlooks.
- Bidirectional vision-text alignment creates a closed-loop guidance mechanism that strengthens semantic consistency.
- The overall framework yields higher detection accuracy on multispectral object detection benchmarks.
Where Pith is reading between the lines
- The bi-support idea could transfer to other cross-sensor fusion settings such as RGB-depth or radar-vision pairs when semantic descriptions are available.
- Actively modeling discrepancies rather than suppressing them may encourage new fusion designs that treat differences as signal.
- Combining the framework with richer language models for text semantics offers a direct route to stronger bridging in future sensor setups.
Load-bearing premise
Text semantics can reliably serve as a shared bridge to align inherently asymmetric RGB and IR granularities and that the discrepancy support contains consistently valuable discriminative cues rather than noise.
What would settle it
An ablation on a multispectral dataset with unreliable or absent text annotations where removing the text bridge or the discrepancy modeling component produces no accuracy gain or a measurable drop in detection performance.
Figures
read the original abstract
Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a semantic bridge fusion framework for text-guided multispectral object detection. It uses text semantics as a shared bridge to align RGB and IR responses under unified category conditions, projects recalibrated thermal priors onto the RGB branch, formulates interaction evidence into regular consensus support and complementary discrepancy support with dynamic recalibration as inductive bias, and adds a bidirectional semantic alignment module for closed-loop vision-text guidance. The authors claim extensive experiments demonstrate superior detection performance on multispectral benchmarks, with code released.
Significance. If the bi-support modeling proves effective at extracting useful discrepancy cues, the framework could advance cross-modal fusion by explicitly addressing granularity asymmetry and overlooked discrepancies via text guidance, offering a structured alternative to standard attention-based methods. The public code release supports reproducibility and enables direct follow-up validation.
major comments (2)
- [bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.
- [experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.
minor comments (2)
- [abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., mAP improvement on a benchmark) to support the superiority claim.
- [method] Clarify the exact formulation of the discrepancy support (e.g., whether it is an attention difference or residual) and the recalibration operator to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: [bi-support modeling] In the bi-support modeling (abstract and proposed framework description): the claim that the complementary discrepancy support 'contains potentially discriminative cues' is load-bearing for superiority over consensus-only fusion, yet no derivation, filtering module, or separation from sensor noise, illumination artifacts, or registration errors is provided; a simple residual computation would mix signal and noise, and the dynamic recalibration step lacks an explicit suppression mechanism.
Authors: We appreciate the referee highlighting the need for clearer justification of the discrepancy support. The discrepancy support is formulated as the element-wise residual between text-aligned RGB and IR feature maps after the semantic bridge alignment, intended to capture modality-specific variations that aid detection under granularity asymmetry. The dynamic recalibration employs a learned, modality-aware gating function (implemented via convolutional layers followed by sigmoid activation) that adaptively scales the supports, serving as an inductive bias to down-weight inconsistent or noisy regions. We acknowledge that the original manuscript provided insufficient mathematical derivation and explicit discussion of noise separation. In the revision, we will add a detailed formulation of the bi-support computation, explain the recalibration's role in implicit suppression, and include additional analysis (e.g., visualization of discrepancy maps) to demonstrate separation from common artifacts. Existing ablations already indicate performance gains when discrepancy support is included versus consensus-only fusion. revision: yes
-
Referee: [experiments] Experiments section: the abstract asserts 'extensive experiments demonstrate... superior detection performance' but supplies no quantitative metrics, ablation results, or baseline comparisons, preventing verification of the central performance claim.
Authors: We thank the referee for noting this. The abstract is intentionally concise and omits specific numbers per standard practice, but the full Experiments section (Section 4) reports quantitative mAP results on multiple multispectral benchmarks, direct comparisons to state-of-the-art fusion baselines, and comprehensive ablations isolating each proposed component (including consensus vs. bi-support). The public code release further enables verification. To improve accessibility, we will revise the abstract to include a brief summary of key performance gains and ensure all metrics are prominently tabulated in the main text. revision: partial
Circularity Check
No circularity: architectural framework with independent design choices
full rationale
The paper introduces a semantic bridge fusion framework and bi-support modeling as a methodological proposal for multispectral detection. Text is positioned as a shared semantic bridge and discrepancy support is formulated as a complementary term containing 'potentially discriminative cues' via dynamic recalibration; these are explicit design decisions and inductive biases rather than quantities derived from or equivalent to fitted inputs by construction. No equations are shown that reduce outputs to self-defined inputs, no predictions are fitted parameters renamed, and no load-bearing self-citations or uniqueness theorems appear in the abstract or description. The chain is self-contained as a new architecture without reduction to its own data or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Text can act as a shared semantic bridge to align RGB and IR under a unified category condition
- domain assumption Cross-modal discrepancies contain potentially valuable discriminative cues that should be modeled separately from consensus
Reference graph
Works this paper leans on
-
[1]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...
work page 2022
-
[2]
Jiale Cao, Yanwei Pang, Jin Xie, Fahad Shahbaz Khan, and Ling Shao. 2021. From handcrafted to deep features for pedestrian detection: A survey.IEEE transactions on pattern analysis and machine intelligence44, 9 (2021), 4913–4934
work page 2021
-
[3]
Yishuo Chen, Boran Wang, Wenbin Zhu, and Jing Yuan. 2024. RGB-IR YOLO combining Modality-Specific Reconstruction and Information Integration. In 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC). 2045–2050. doi:10.1109/YAC63405.2024.10598725
-
[4]
Yung-Yao Chen, Sin-Ye Jhong, Hsin-Chun Lin, and Yi-Chen Wu. 2025. Vision- Language-Guided Adaptive Cross-Modal Fusion for Multispectral Object Detec- tion Under Adverse Weather Conditions.IEEE MultiMedia32, 2 (2025), 22–32. doi:10.1109/MMUL.2025.3525559
-
[5]
Zhenshuai Chen, Wei Xiang, Zhiyuan Lin, Kaixuan Yang, Yunpeng Liu, and Zelin Shi. 2025. Alignment-assisted Frequency Fusion Network for RGB-infrared vehicle detection.Neurocomputing647 (2025), 130505. doi:10.1016/j.neucom. 2025.130505
-
[6]
Xiaolong Cheng, Keke Geng, Ziwei Wang, Jinhu Wang, Yuxiao Sun, and Pengbo Ding. 2023. SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment.Multimedia Tools Appl.82, 30 (May 2023), 47773–47792. doi:10.1007/s11042-023-15333-w
-
[7]
Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Guodong Guo, and Baochang Zhang. 2025. Fusion-Mamba for Cross-Modality Object Detection.IEEE Transactions on Multimedia27 (2025), 7392–7406. doi:10.1109/TMM.2025.3599020
-
[8]
Jiaming Han, Jian Ding, Nan Xue, and Gui-Song Xia. 2021. ReDet: A Rotation- Equivariant Detector for Aerial Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2786–2795
work page 2021
- [9]
-
[10]
Lian Huang, Zongju Peng, Fen Chen, Shaosheng Dai, Ziqiang He, and Kesh- eng Liu. 2024. Cross-Modality Interaction for Few-Shot Multispectral Object Detection with Semantic Knowledge.Neural Networks173 (2024), 106156. doi:10.1016/j.neunet.2024.106156
-
[11]
Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon
-
[12]
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1037–1045
-
[13]
Junbo Jang, Chanyeong Park, Heegwang Kim, Jiyoon Lee, and Joonki Paik
-
[14]
In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV)
Multispectral Object Detection Enhanced by Cross-Modal Information Complementary and Cosine Similarity Channel Resampling Modules. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). 9437–
-
[15]
doi:10.1109/WACV61041.2025.00914
-
[16]
Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. 2021. LLVIP: A visible-infrared paired dataset for low-light vision. InProceedings of the IEEE/CVF international conference on computer vision. 3496–3504
work page 2021
-
[17]
Glenn Jocher. 2020.Ultralytics YOLOv5. doi:10.5281/zenodo.3908559
-
[18]
Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023.Ultralytics YOLOv8. https: //github.com/ultralytics/ultralytics
work page 2023
-
[19]
Xudong Kang, Hui Yin, and Puhong Duan. 2024. Global–Local Feature Fusion Network for Visible–Infrared Vehicle Detection.IEEE Geoscience and Remote Sensing Letters21 (2024), 1–5. doi:10.1109/LGRS.2024.3375634
-
[20]
Jung Uk Kim, Sungjune Park, and Yong Man Ro. 2021. Uncertainty-guided cross- modal learning for robust multispectral pedestrian detection.IEEE Transactions on Circuits and Systems for Video Technology32, 3 (2021), 1510–1523
work page 2021
-
[21]
Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. 2019. Illumination-aware faster R-CNN for robust multispectral pedestrian detection.Pattern Recognition 85 (2019), 161–171
work page 2019
-
[22]
Hanyun Li, Linsong Xiao, Lihua Cao, Di Wu, Yangfan Liu, Yi Li, Yunfeng Zhang, and Haiyang Bao. 2026. CrossModalNet: A dual-modal object detection network based on cross-modal fusion and channel interaction.Expert Systems with Applications298 (2026), 129677. doi:10.1016/j.eswa.2025.129677
-
[23]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProceedings of the 40th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 202). 19730–19742
work page 2023
-
[24]
Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven C
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. InAdvances in Neural Information Processing Systems (NeurIPS)
work page 2021
-
[25]
Ting Li, Songtao Li, Shuaifeng Li, Xiaolin Qin, Maoyuan Zhao, Luping Ji, and Mao Ye. 2025. SAM-Guided Semantic Knowledge Fusion for Visible-Infrared Object Detection. InProceedings of the 33rd ACM International Conference on Multimedia (ACM MM). 8835–8844. doi:10.1145/3746027.3755718
-
[26]
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2017. Focal Loss for Dense Object Detection. InProceedings of the IEEE International Conference on Computer Vision (ICCV)
work page 2017
-
[27]
Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. 2022. Target-aware dual adversarial learning and a multi- scenario multi-modality benchmark to fuse infrared and visible for object detec- tion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5802–5811
work page 2022
-
[28]
Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, and Risheng Liu. 2024. Infrared and visible image fusion: From data compatibility to task adaption.IEEE Transactions on Pattern Analysis and Machine Intelligence47, 4 (2024), 2349–2369
work page 2024
-
[29]
Jingjing Liu, Shaoting Zhang, Shu Wang, and Dimitris N. Metaxas. 2016. Multi- spectral Deep Neural Networks for Pedestrian Detection. InProceedings of the British Machine Vision Conference (BMVC). 73.1–73.13. doi:10.5244/C.30.73
-
[30]
Xiaowen Liu, Hongtao Huo, Jing Li, Shan Pang, and Bowen Zheng. 2024. A Semantic-Driven Coupled Network for Infrared and Visible Image Fusion.Infor- mation Fusion108 (2024), 102352. doi:10.1016/j.inffus.2024.102352
- [31]
- [32]
-
[33]
Xiangyu Qin, Enlong Wang, Shihua Zhou, Bin Wang, and Nikola K. Kasabov
-
[34]
TSPFusion: Text-Guided Semantic Perception for Infrared and Visible Image Fusion.Infrared Physics & Technology153 (2026), 106324. doi:10.1016/j. infrared.2025.106324
work page doi:10.1016/j 2026
- [35]
-
[36]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sand- hini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al
-
[37]
In International conference on machine learning
Learning transferable visual models from natural language supervision. In International conference on machine learning. PmLR, 8748–8763
-
[38]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Net- works. InAdvances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/ ...
work page 2015
-
[39]
Jifeng Shen, Yifei Chen, Yue Liu, Xin Zuo, Heng Fan, and Wankou Yang. 2024. ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection.Pattern Recognition145 (2024), 109913. doi:10.1016/j.patcog. 2023.109913
-
[40]
Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, and Haibin Ling. 2026. Multispectral state-space feature fusion: Bridging shared and cross- parametric interactions for object detection.Information Fusion127 (2026), 103895. doi:10.1016/j.inffus.2025.103895
-
[41]
Dongdong Sun, Chuanyun Wang, Tian Wang, Qian Gao, Qiong Liu, and Linlin Wang. 2025. CLIPFusion: Infrared and Visible Image Fusion Network Based on Image–Text Large Model and Adaptive Learning.Displays89 (2025), 103042. doi:10.1016/j.displa.2025.103042
-
[42]
Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. 2022. Drone-based RGB- infrared cross-modality vehicle detection via uncertainty-aware learning.IEEE Transactions on Circuits and Systems for Video Technology32, 10 (2022), 6700– 6713
work page 2022
-
[43]
Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, and Jiayi Ma. 2025. ControlFusion: A Controllable Image Fusion Network with Language-Vision Degradation Prompts. InAdvances in Neural Information Processing Systems
work page 2025
-
[44]
Teledyne FLIR LLC. 2021. Teledyne FLIR Free Starter Thermal Dataset for Algo- rithm Training. https://adas-dataset-v2.flirconservator.com/dataset/README. txt. Accessed: 2026-04-02
work page 2021
-
[45]
Changhai Wang, Zhe Huang, Yuwei Xu, Wanwei Huang, and Yuan Tian. 2026. FDFusion: Efficient text-guided infrared-visible image fusion via fine-tuned light- weight VLM and Dual-branch feature modeling.Infrared Physics & Technology (2026), 106499
work page 2026
-
[46]
Enlong Wang, Jiawei Li, Jia Lei, Jinyuan Liu, Shihua Zhou, Bin Wang, and Nikola K. Kasabov. 2024. SDFuse: Semantic-Injected Dual-Flow Learning for Infrared and Visible Image Fusion.Expert Systems with Applications252, Part B (2024), 124188. doi:10.1016/j.eswa.2024.124188
-
[47]
Huiying Wang, Chunping Wang, Qiang Fu, Binqiang Si, Dongdong Zhang, Renke Kou, Ying Yu, and Changfeng Feng. 2024. YOLOFIV: Object Detection Algorithm for Around-the-Clock Aerial Remote Sensing Images by Fusing Infrared and Visible Features.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing17 (2024), 15269–15287. doi:10.1109/J...
-
[48]
Huiying Wang, Chunping Wang, Qiang Fu, Dongdong Zhang, Renke Kou, Ying Yu, and Jian Song. 2024. Cross-Modal Oriented Object Detection of UAV Aerial Images Based on Image Feature.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–21. doi:10.1109/TGRS.2024.3367934
- [49]
-
[50]
Chenjia Yang, Xiaoqing Luo, Zhancheng Zhang, Zhiguo Chen, and Xiao jun Wu
-
[51]
doi:10.1016/j.inffus.2025.102944
KDFuse: A High-Level Vision Task-Driven Infrared and Visible Image Fusion Method Based on Cross-Domain Knowledge Distillation.Information Fusion118 (2025), 102944. doi:10.1016/j.inffus.2025.102944
- [52]
-
[53]
Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. 2024. Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
work page 2024
-
[54]
Maoxun Yuan, Xiaorong Shi, Nan Wang, Yinyan Wang, and Xingxing Wei. 2024. Improving RGB-Infrared Object Detection with Cascade Alignment-Guided Transformer.Information Fusion105 (2024), 102246. doi:10.1016/j.inffus.2024. 102246
-
[55]
Maoxun Yuan and Xingxing Wei. 2024. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–12. doi:10.1109/TGRS.2024.3376819
-
[56]
Jun-Seok Yun, Seon-Hoo Park, and Seok Bong Yoo. 2022. Infusion-Net: Inter- and Intra-Weighted Cross-Fusion Network for Multispectral Object Detection. Mathematics10, 21 (2022), 3966. doi:10.3390/math10213966
-
[57]
Yuqiao Zeng, Tengfei Liang, Yi Jin, and Yidong Li. 2024. MMI-Det: Exploring Multi-Modal Integration for Visible and Infrared Object Detection.IEEE Trans- actions on Circuits and Systems for Video Technology34 (2024), 11198–11213. doi:10.1109/TCSVT.2024.3418965
-
[58]
Hao Zhang, Lei Cao, Xuhui Zuo, Zhenfeng Shao, and Jiayi Ma. 2025. Omnifuse: Composite degradation-robust image fusion with language-driven semantics. IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)
work page 2025
-
[59]
Heng Zhang, Elisa Fromont, Sébastien Lefèvre, and Bruno Avignon. 2020. Mul- tispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks. In2020 IEEE International Conference on Image Processing (ICIP). 276–280. doi:10.1109/ICIP40778.2020.9191080
-
[60]
Heng Zhang, Elisa Fromont, Sebastien Lefevre, and Bruno Avignon. 2021. Guided Attentive Feature Fusion for Multispectral Pedestrian Detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). 72–80. doi:10.1109/WACV48630.2021.00012
-
[61]
Jianming Zhang, Xiangnan Shi, Zhijian Feng, Yan Gui, and Jin Wang. 2025. TMCN: Text-Guided Mamba-CNN Dual-Encoder Network for Infrared and Visible Image Fusion.Infrared Physics & Technology149 (2025), 105895. doi:10.1016/j.infrared. 2025.105895
-
[62]
Shilong Zhang, Xinjiang Wang, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wen- wei Zhang, Ping Luo, and Kai Chen. 2023. Dense Distinct Query for End-to-End Object Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7329–7338
work page 2023
-
[63]
Xingchen Zhang and Yiannis Demiris. 2023. Visible and infrared image fusion us- ing deep learning.IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 8 (2023), 10535–10554
work page 2023
- [64]
-
[65]
Tianyi Zhao, Maoxun Yuan, Feng Jiang, Nan Wang, and Xingxing Wei. 2026. Removal Then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection.IEEE Transactions on Intelligent Transportation Systems27, 2 (2026), 2504–2519. doi:10.1109/TITS.2025.3638627
-
[66]
Kailai Zhou, Linsen Chen, and Xun Cao. 2020. Improving multispectral pedestrian detection by addressing modality imbalance problems. InEuropean conference on computer vision. Springer, 787–803
work page 2020
-
[67]
Minghang Zhou, Tianyu Li, Chaofan Qiao, Dongyu Xie, Guoqing Wang, Ningjuan Ruan, Lin Mei, Yang Yang, and Heng Tao Shen. 2025. DMM: Disparity-Guided Multispectral Mamba for Oriented Object Detection in Remote Sensing.IEEE Transactions on Geoscience and Remote Sensing63 (2025), 1–13. doi:10.1109/TGRS. 2025.3578309
-
[68]
Mingliang Zhou, Yunyao Li, Guangchao Yang, Xuekai Wei, Huayan Pu, Jun Luo, and Weijia Jia. 2025. COFNet: Contrastive Object-Aware Fusion Using Box-Level Masks for Multispectral Object Detection.IEEE Transactions on Multimedia27 (2025), 7444–7458. doi:10.1109/TMM.2025.3599097
-
[69]
Wei Zhou, Yingyuan Wang, Lina Zuo, Yuan Gao, and Yugen Yi. 2024. High-Level Vision Task-Driven Infrared and Visible Image Fusion Approach: Progressive Semantic Enhancement based Multi-Scale Cross-Modality Interactive Network. Measurement237 (2024), 114977. doi:10.1016/j.measurement.2024.114977
-
[70]
Haodong Zhu, Wenhao Dong, Linlin Yang, Hong Li, Yuguang Yang, Yangyang Ren, Qingcheng Zhu, Zichao Feng, Changbai Li, Shaohui Lin, Runqi Wang, Xi- aoyan Luo, and Baochang Zhang. 2025. WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV). 11219–11229. Bri...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.