pith. machine review for the scientific record.

arxiv: 2604.16630 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

Tri-Modal Fusion Transformers for UAV-based Object Detection

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 08:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords tri-modal fusion · UAV object detection · vision transformer · RGB thermal event · sensor fusion · MAGE · BiTE · multi-modal detection

The pith

Tri-modal fusion of RGB, thermal, and event data in a vision transformer improves UAV object detection over dual-modal baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that processing RGB, long-wave infrared thermal, and event-camera streams together inside one hierarchical vision transformer yields higher detection accuracy for vehicles seen from UAVs than any pair of those modalities. A sympathetic reader would care because UAV flights routinely encounter low light, motion blur, and rapid scene changes that cripple single-sensor or dual-sensor detectors, so a reliable way to combine the three complementary signals matters for practical tasks such as traffic monitoring or search-and-rescue. The authors introduce two exchange modules, MAGE for gated channel-and-spatial fusion and BiTE for bidirectional token attention, inserted at chosen depths to keep resolution intact before feeding a standard feature pyramid and two-stage detector. They also release a new synchronized 10,489-frame dataset with 24,223 vehicle annotations spanning day and night flights. Sixty-one ablations confirm that full tri-modal fusion beats dual-modal baselines, that fusion depth matters, and that a lighter CSSA variant captures most of the gain at low extra cost.
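For readers who want the wiring made explicit, a minimal sketch follows, assuming a generic two-stream hierarchical encoder with pluggable fusion blocks. The stage modules, the assignment of the event stream, the choice of depths, and the simple summed merge into the pyramid are all placeholders rather than the paper's released implementation.

```python
# Hedged sketch of fusion placement: exchange modules at selected encoder depths,
# fused maps handed onward to a feature pyramid. All concrete choices are assumed.
import torch.nn as nn

class DualStreamFusionBackbone(nn.Module):
    def __init__(self, stages_a, stages_b, fusion_blocks, fuse_at=(1, 2)):
        super().__init__()
        self.stages_a = nn.ModuleList(stages_a)   # e.g. the RGB stream's hierarchical stages
        self.stages_b = nn.ModuleList(stages_b)   # e.g. the thermal/event stream (split is assumed)
        self.fusion = nn.ModuleDict({str(i): blk for i, blk in zip(fuse_at, fusion_blocks)})

    def forward(self, x_a, x_b):
        pyramid = []
        for i, (stage_a, stage_b) in enumerate(zip(self.stages_a, self.stages_b)):
            x_a, x_b = stage_a(x_a), stage_b(x_b)
            if str(i) in self.fusion:             # exchange only at the chosen depths
                x_a, x_b = self.fusion[str(i)](x_a, x_b)
            pyramid.append(x_a + x_b)             # placeholder merge; spatial resolution untouched
        return pyramid                            # consumed by a standard FPN and two-stage detector
```

The only point the sketch carries is the placement logic: each stream keeps its own stage stack and the exchange happens at a handful of depths, which is exactly the knob the fusion-depth ablations vary.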

Core claim

A dual-stream hierarchical vision transformer equipped with Modality-Aware Gated Exchange (MAGE) modules for inter-sensor channel and spatial gating and Bidirectional Token Exchange (BiTE) modules for token-level bidirectional attention produces resolution-preserving fused feature maps from RGB, thermal, and event inputs that raise detection performance in a standard two-stage detector. On the introduced 10,489-frame UAV dataset the tri-modal system outperforms every dual-modal combination, fusion depth exerts a measurable effect, and a lightweight CSSA variant recovers most of the accuracy gain at minimal added cost.

What carries the argument

The Modality-Aware Gated Exchange (MAGE) and Bidirectional Token Exchange (BiTE) modules inserted at selected encoder depths inside the dual-stream hierarchical vision transformer, which gate and exchange information across RGB, thermal, and event streams to produce fused maps for a standard feature pyramid.
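The abstract describes MAGE only as inter-sensor channel and spatial gating, so the following is one plausible rendering rather than the paper's module: the pooled cross-modal context, the reduction ratio, the 7×7 spatial gate, and the residual exchange rule are all assumptions for illustration.

```python
# Hedged sketch of a gated channel-and-spatial exchange between two aligned streams.
import torch
import torch.nn as nn

class GatedExchange(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # per-channel weights from pooled joint context
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(          # per-pixel weights from the joint map
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        joint = torch.cat([a, b], dim=1)
        c_gate = self.channel_gate(joint)
        s_gate = self.spatial_gate(joint)
        # Each stream receives a gated injection from the other; resolution is unchanged.
        return a + c_gate * s_gate * b, b + c_gate * s_gate * a

# Two aligned feature maps of the same shape, e.g. RGB and thermal at one encoder stage.
a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
x, y = GatedExchange(64)(a, b)
assert x.shape == a.shape
```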

If this is right

  • Tri-modal fusion outperforms all tested dual-modal combinations across day and night UAV flights.
  • The depth at which fusion occurs inside the encoder significantly changes final detection accuracy.
  • A lightweight CSSA fusion variant achieves nearly the same gains as the full MAGE+BiTE design at far lower cost.
  • The modular tri-modal backbone supplies the first systematic benchmark for three-way UAV object detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same gated-exchange pattern could be tested on other multi-sensor platforms such as ground robots or autonomous vehicles where RGB, thermal, and event data are available.
  • Event-camera temporal edges may give particular help for detecting fast-moving vehicles that blur in RGB or thermal frames.
  • The released dataset offers a fixed testbed for comparing alternative fusion strategies or even single-modality enhancements without new data collection.

Load-bearing premise

The MAGE and BiTE modules can merge the three sensor streams in a resolution-preserving way without introducing artifacts or discarding critical details, an assumption checked only through ablations on the authors' own dataset.
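To make that premise concrete, here is a hedged sketch of a bidirectional token exchange that leaves the spatial grid untouched; the attention layout and the depthwise-pointwise refinement are guesses at what the abstract's phrasing could mean, not the released BiTE.

```python
# Hedged sketch: tokens are the full-resolution pixel grid, each stream cross-attends
# to the other, and a depthwise-then-pointwise convolution refines the result.
import torch
import torch.nn as nn

class BidirectionalTokenExchange(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, channels, 1),                              # pointwise
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        n, c, h, w = a.shape
        ta = a.flatten(2).transpose(1, 2)          # (N, H*W, C): no downsampling of tokens
        tb = b.flatten(2).transpose(1, 2)
        a2, _ = self.attn_ab(ta, tb, tb)           # a queries b
        b2, _ = self.attn_ba(tb, ta, ta)           # b queries a
        a_out = a + self.refine(a2.transpose(1, 2).reshape(n, c, h, w))
        b_out = b + self.refine(b2.transpose(1, 2).reshape(n, c, h, w))
        return a_out, b_out

x, y = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
xa, xb = BidirectionalTokenExchange(64)(x, y)
assert xa.shape == x.shape                         # the exchange preserves resolution
```

Whether such an exchange also avoids artifacts or information loss is exactly what single-dataset ablations cannot fully establish.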

What would settle it

Run the identical tri-modal network on an independent UAV dataset with different sensor calibration, lighting, or motion statistics, and measure whether the accuracy margin over the best dual-modal baseline shrinks to zero or turns negative.
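In code, the check is a one-liner once an external evaluation exists; the mAP values below are placeholders, not measurements.

```python
# Hedged sketch of the decisive comparison on an independent dataset.
map_external = {                         # hypothetical mAP@0.5 values, for illustration only
    "rgb+thermal": 0.612,
    "rgb+event": 0.587,
    "thermal+event": 0.563,
    "rgb+thermal+event": 0.640,
}

best_dual = max(v for k, v in map_external.items() if k != "rgb+thermal+event")
margin = map_external["rgb+thermal+event"] - best_dual
print(f"tri-modal margin over best dual-modal: {margin:+.3f}")
# The generalization claim holds only if this margin stays clearly positive off the authors' own data.
```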

Figures

Figures reproduced from arXiv: 2604.16630 by Craig Iaboni, Pramod Abichandani.

Figure 1. Example tri-modal UAV dataset samples. Each scene shows synchronized RGB, thermal (LWIR), and event projections with …
Figure 2. Tri-modal UAV payload showing the underside-mounted …
Figure 3. Overview of the tri-modal detection framework. Left: the baseline fusion block, combining Modality-Aware Gated Exchange …
Figure 4. Representative cases where adding events improves …
read the original abstract

Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a tri-modal fusion transformer framework for UAV object detection that processes synchronized RGB, thermal (LWIR), and event camera streams via a dual-stream hierarchical vision transformer. It proposes Modality-Aware Gated Exchange (MAGE) for inter-sensor gating and Bidirectional Token Exchange (BiTE) for token-level attention with refinement, feeding into a standard FPN and two-stage detector. The authors contribute a new 10,489-frame pre-aligned UAV dataset with 24,223 vehicle annotations across day/night conditions and report 61 controlled ablations on fusion placement, mechanisms (including CSSA and GAFF variants), modality subsets, and backbones, claiming tri-modal fusion outperforms all dual-modal baselines with significant effects from fusion depth and a lightweight CSSA recovering most gains.

Significance. If the empirical results hold under broader validation, the work would provide the first systematic benchmark and modular architecture for tri-modal UAV detection, addressing robustness gaps in illumination, blur, and dynamics where single or dual modalities fail. The scale of the ablation study (61 controlled experiments) and the new synchronized multi-modal dataset are clear strengths that could serve as a foundation for future research in sensor fusion for aerial robotics.

major comments (1)
  1. [Experimental Results / Ablation Studies] The assertion that this constitutes the 'first systematic benchmark' for tri-modal UAV detection (abstract and conclusion) rests entirely on internal ablations performed on the authors' newly introduced 10,489-frame dataset. No results are reported on any public multi-modal UAV benchmarks, cross-dataset transfer tests, or held-out sensor/condition splits, which leaves open the possibility that observed tri-modal gains arise from dataset-specific factors such as pre-alignment quality or annotation protocol rather than the MAGE/BiTE modules themselves. This directly weakens the generalizability claim and should be addressed by adding at least one external validation experiment or a clear limitations discussion.
minor comments (2)
  1. [Abstract] The abstract states '61 controlled ablations' but does not specify the exact performance metrics (e.g., mAP@0.5, mAP@0.5:0.95) or statistical measures (error bars, significance tests) used to declare 'significant effect' for fusion depth; this should be clarified for reproducibility.
  2. [Method] Notation for the new modules (MAGE, BiTE, CSSA) is introduced without an explicit comparison table summarizing their parameter counts and computational overhead relative to the baseline transformer blocks; adding this would improve clarity.
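On minor comment 2, the requested table is cheap to produce; here is a hedged sketch of the counting step with stand-in modules, since the paper's actual MAGE, BiTE, and CSSA implementations are not reproduced on this page.

```python
# Hedged sketch: tabulate learnable-parameter counts for fusion modules against a
# plain transformer block. The module definitions are placeholders, not the paper's code.
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

modules = {
    "baseline block": nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    "MAGE (stand-in)": nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 256)),
    "BiTE (stand-in)": nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True),
}

for name, m in modules.items():
    print(f"{name:>18}: {param_count(m):,} params")
```

A FLOPs column measured at the actual fusion resolutions would complete the comparison.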

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [Experimental Results / Ablation Studies] The assertion that this constitutes the 'first systematic benchmark' for tri-modal UAV detection (abstract and conclusion) rests entirely on internal ablations performed on the authors' newly introduced 10,489-frame dataset. No results are reported on any public multi-modal UAV benchmarks, cross-dataset transfer tests, or held-out sensor/condition splits, which leaves open the possibility that observed tri-modal gains arise from dataset-specific factors such as pre-alignment quality or annotation protocol rather than the MAGE/BiTE modules themselves. This directly weakens the generalizability claim and should be addressed by adding at least one external validation experiment or a clear limitations discussion.

    Authors: We thank the referee for highlighting this important point. To the best of our knowledge, no public tri-modal (RGB + thermal + event) UAV datasets with synchronized, pre-aligned streams and object annotations exist, which motivated the creation of our 10,489-frame dataset. The 61 controlled ablations isolate the contributions of MAGE, BiTE, fusion depth, and modality subsets within this setting. We agree that the single-dataset evaluation limits strong generalizability claims and will add a dedicated Limitations section in the revised manuscript. This section will explicitly discuss potential dataset-specific factors (pre-alignment quality, annotation protocol, day/night distribution) and the current absence of external tri-modal benchmarks. We will also revise the phrasing of the 'first systematic benchmark' claim in the abstract and conclusion to 'the first systematic study and benchmark for tri-modal UAV object detection on a dedicated multi-modal dataset' to more accurately reflect the scope. These changes directly address the concern. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study on new dataset

full rationale

The paper contains no derivation chain, equations, or first-principles claims. It proposes MAGE and BiTE modules, introduces a new synchronized RGB-thermal-event UAV dataset, and reports performance via 61 controlled ablations comparing modality subsets, fusion depths, and variants (including CSSA). All results are direct experimental measurements on the authors' data; no parameter is fitted and then relabeled as a prediction, no self-citation is invoked to justify uniqueness or ansatzes, and no known result is renamed as a novel unification. The central claim of tri-modal improvement is therefore an empirical observation, not a quantity that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

Based solely on the abstract; full details on hyperparameters, training procedures, and any additional assumptions are unavailable. The paper introduces new architectural modules as its core contribution.

free parameters (2)
  • Fusion module placement depths
    Selected encoder depths where MAGE and BiTE are applied, chosen through ablations
  • Gating and attention parameters
    Learned channel/spatial gates and token exchange weights during training
axioms (2)
  • domain assumption Vision transformer backbones can be extended to process and fuse multiple sensor modalities
    Assumes standard hierarchical ViT architecture supports tri-modal input streams
  • domain assumption Synchronized RGB, thermal, and event data streams provide complementary information for object detection
    Core premise justifying the fusion approach and dataset collection
invented entities (2)
  • Modality-Aware Gated Exchange (MAGE) module no independent evidence
    purpose: Applies inter-sensor channel and spatial gating at selected encoder depths
    Newly proposed fusion component
  • Bidirectional Token Exchange (BiTE) module no independent evidence
    purpose: Performs bidirectional token-level attention with depthwise-pointwise refinement for fused maps
    Newly proposed fusion component

pith-pipeline@v0.9.0 · 5536 in / 1623 out tokens · 59484 ms · 2026-05-10T08:43:55.619458+00:00 · methodology

discussion (0)


    Rgb-event fu- sion for moving object detection in autonomous driving

    Zhuyun Zhou, Zongwei Wu, R ´emi Boutteau, Fan Yang, C´edric Demonceaux, and Dominique Ginhac. Rgb-event fu- sion for moving object detection in autonomous driving. In 2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 7808–7815. IEEE, 2023. 2