COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection
Pith reviewed 2026-05-18 23:03 UTC · model grok-4.3
The pith
COXNet fuses cross-layer visible and thermal features with dynamic alignment to boost tiny object detection in RGBT drone imagery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COXNet is a novel framework for RGBT tiny object detection that uses three core innovations: the Cross-Layer Fusion Module for combining features across layers and modalities, the Dynamic Alignment and Scale Refinement module for correcting spatial misalignments and preserving scales, and an optimized label assignment strategy based on the GeoShape Similarity Measure. These allow effective leveraging of complementary information between visible and thermal modalities despite challenges like misalignment and occlusion. The approach yields a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods.
What carries the argument
The Cross-Layer Fusion Module, which integrates high-level visible features with low-level thermal features, supported by dynamic alignment, scale refinement, and label assignment based on the GeoShape Similarity Measure.
Load-bearing premise
The three proposed modules are the main reason for the performance improvement rather than other unstated factors like tuning or implementation details.
What would settle it
An ablation study removing the Cross-Layer Fusion Module, the Dynamic Alignment module, or the GeoShape assignment one by one and checking the resulting mAP50 on the RGBTDronePerson dataset.
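A leave-one-out ablation of this kind could be organized roughly as in the sketch below. The module keys, the training settings, and the helper names are hypothetical placeholders, not values taken from the paper. The only point is that every run shares one training configuration and differs from the full model by exactly one disabled module.

```python
# Hypothetical identifiers for the three proposed components.
MODULES = (
    "cross_layer_fusion",
    "dynamic_alignment_scale_refinement",
    "geoshape_assignment",
)

# Placeholder training settings (NOT from the paper); shared by every run
# so that a change in mAP50 can only come from the disabled module.
FIXED_TRAINING = {
    "seed": 0,
    "epochs": 12,
    "optimizer": "sgd",
    "lr_schedule": "step",
    "augmentations": ("horizontal_flip",),
}


def leave_one_out_runs(modules=MODULES, training=FIXED_TRAINING):
    """Return the full model plus one run per removed module."""
    runs = [{"name": "full", "enabled": set(modules), "training": dict(training)}]
    for module in modules:
        runs.append({
            "name": f"without_{module}",
            "enabled": set(modules) - {module},
            "training": dict(training),  # identical copy for every run
        })
    return runs


def map50_drop(results):
    """Given measured mAP50 per run name, return the drop relative to 'full'."""
    full = results["full"]
    return {name: full - score for name, score in results.items() if name != "full"}


if __name__ == "__main__":
    for run in leave_one_out_runs():
        print(run["name"], sorted(run["enabled"]))
```

Each run dictionary would be handed to whatever training script is in use. `map50_drop` then reports how much mAP50 is lost when a given module is removed, which is the quantity the test above calls for.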
Original abstract
Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes COXNet, a framework for RGBT tiny object detection that introduces three modules: the Cross-Layer Fusion Module to combine high-level visible and low-level thermal features, the Dynamic Alignment and Scale Refinement module to address cross-modal spatial misalignments while preserving multi-scale features, and an optimized label assignment strategy based on the GeoShape Similarity Measure. It claims a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods.
Significance. If the performance gains can be robustly attributed to the proposed modules through controlled experiments, the work could advance multimodal detection for small objects in drone-based scenarios involving misalignment, occlusion, and low light, with relevance to surveillance and search-and-rescue applications. The empirical result on an external dataset is presented as an outcome rather than a derived quantity, but stronger isolation of contributions would be needed to establish its impact.
major comments (2)
- The central claim of a 3.32% mAP50 gain attributable to the Cross-Layer Fusion Module, Dynamic Alignment and Scale Refinement module, and GeoShape Similarity Measure is not supported by ablation studies that add modules sequentially to a fixed strong baseline under identical training conditions, optimizer schedules, and data augmentations. Without such controls, the delta could arise from unmentioned implementation details rather than the proposed components.
- Comparisons to state-of-the-art methods appear to rely on previously published numbers rather than re-trained baselines under the same experimental protocol. Given that mAP in object detection is highly sensitive to backbone choice, label assignment hyperparameters, and training details, this weakens the attribution of the reported improvement specifically to the three innovations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the experimental validation of our contributions.
- Referee: The central claim of a 3.32% mAP50 gain attributable to the Cross-Layer Fusion Module, Dynamic Alignment and Scale Refinement module, and GeoShape Similarity Measure is not supported by ablation studies that add modules sequentially to a fixed strong baseline under identical training conditions, optimizer schedules, and data augmentations. Without such controls, the delta could arise from unmentioned implementation details rather than the proposed components.
  Authors: We appreciate the referee's emphasis on rigorous controls for attributing performance gains. Our ablation studies (Section 4.3) do add modules incrementally to a baseline, but we acknowledge that they may not have been conducted under strictly identical schedules and augmentations in all cases. In the revised manuscript, we will re-run and present a new set of ablation experiments using a fixed strong baseline with identical optimizer, learning rate schedule, data augmentations, and training epochs to better isolate the contribution of each module.
  Revision: yes
- Referee: Comparisons to state-of-the-art methods appear to rely on previously published numbers rather than re-trained baselines under the same experimental protocol. Given that mAP in object detection is highly sensitive to backbone choice, label assignment hyperparameters, and training details, this weakens the attribution of the reported improvement specifically to the three innovations.
  Authors: We agree that re-training all SOTA methods under our exact protocol would provide the most direct comparison. Our current results follow common practice by citing the originally reported numbers on the RGBTDronePerson dataset using the same evaluation protocol. To address the concern, we will expand the experimental section with a discussion of implementation differences and, where code is publicly available, include results from re-training the top two competing methods under our training setup for direct comparison.
  Revision: partial
Circularity Check
No circularity in empirical architecture and performance claims
The paper presents COXNet as an empirical CNN framework for RGBT tiny object detection, introducing three modules (Cross-Layer Fusion, Dynamic Alignment and Scale Refinement, GeoShape label assignment) and reporting a 3.32% mAP50 gain on the external RGBTDronePerson dataset. No equations, derivations, or self-referential predictions appear that reduce the claimed improvements to inputs by construction. The result is framed as an experimental outcome rather than a mathematical necessity. Any prior-work citations are peripheral and do not serve as load-bearing justification for the central performance attribution, leaving the chain self-contained and externally falsifiable via re-implementation.
Axiom & Free-Parameter Ledger
free parameters (1)
- GeoShape Similarity Measure hyperparameters
invented entities (2)
- Cross-Layer Fusion Module: no independent evidence
- Dynamic Alignment and Scale Refinement module: no independent evidence
Lean theorems connected to this paper
- Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
Cross-Layer Fusion Module (CLFM) ... Discrete Wavelet Transform (DWT) ... Frequency Fusion and Reconstruction
- Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
GeoShape similarity metric ... dc, dr, IoU ... ψ(a, g) = exp[−(dc + γ·dr + β·(1−IoU))]
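The quoted similarity ψ(a, g) = exp[−(dc + γ·dr + β·(1−IoU))] can be sketched in Python as below. The passage does not define dc or dr, so this sketch assumes dc is a normalized center distance (squared center offset divided by the squared diagonal of the enclosing box) and dr is the absolute difference of log aspect ratios. Those two definitions, the function names, the (x1, y1, x2, y2) box format, and the default weights γ = β = 1 are illustrative assumptions, not the paper's definitions.

```python
import math


def iou(a, g):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], g[0]), max(a[1], g[1])
    ix2, iy2 = min(a[2], g[2]), min(a[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    union = area_a + area_g - inter
    return inter / union if union > 0 else 0.0


def center_distance(a, g):
    """ASSUMED dc: squared center offset / squared enclosing-box diagonal."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    gcx, gcy = (g[0] + g[2]) / 2, (g[1] + g[3]) / 2
    ex1, ey1 = min(a[0], g[0]), min(a[1], g[1])
    ex2, ey2 = max(a[2], g[2]), max(a[3], g[3])
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    offset_sq = (acx - gcx) ** 2 + (acy - gcy) ** 2
    return offset_sq / diag_sq if diag_sq > 0 else 0.0


def shape_distance(a, g):
    """ASSUMED dr: |log aspect ratio of a - log aspect ratio of g|.

    Requires both boxes to have strictly positive width and height.
    """
    aw, ah = a[2] - a[0], a[3] - a[1]
    gw, gh = g[2] - g[0], g[3] - g[1]
    return abs(math.log(aw / ah) - math.log(gw / gh))


def geoshape_similarity(a, g, gamma=1.0, beta=1.0):
    """psi(a, g) = exp(-(dc + gamma * dr + beta * (1 - IoU)))."""
    dc = center_distance(a, g)
    dr = shape_distance(a, g)
    return math.exp(-(dc + gamma * dr + beta * (1.0 - iou(a, g))))


if __name__ == "__main__":
    anchor = (0.0, 0.0, 4.0, 4.0)
    target = (10.0, 10.0, 14.0, 14.0)
    print(geoshape_similarity(anchor, anchor))  # identical boxes
    print(geoshape_similarity(anchor, target))  # disjoint boxes
```

Under these assumptions the score is 1 for identical boxes and stays strictly positive for disjoint ones. A graded score at zero overlap is the property that would make such a measure usable for assigning labels to tiny objects, where IoU alone is often zero.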
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.