COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

Jianan Li; Liqiang Song; Mengqi Zhu; Peiran Peng; Tingfa Xu; Yuqiang Fang

arxiv: 2508.09533 · v2 · submitted 2025-08-13 · 💻 cs.CV · cs.AI

Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li

Pith reviewed 2026-05-18 23:03 UTC · model grok-4.3

keywords: RGBT, tiny object detection, cross-layer fusion, dynamic alignment, scale refinement, GeoShape Similarity Measure, drone imagery, multimodal object detection

The pith

COXNet fuses cross-layer visible and thermal features with dynamic alignment to boost tiny object detection in RGBT drone imagery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops COXNet to detect tiny objects more accurately in combined RGB and thermal images from drones. It proposes a Cross-Layer Fusion Module to merge high-level semantic details from visible images with low-level spatial details from thermal images. A Dynamic Alignment and Scale Refinement module handles misalignments between modalities and keeps multi-scale information intact. An improved label assignment using GeoShape Similarity Measure aids better localization. This matters for applications in surveillance and search and rescue where small objects are hard to spot in challenging conditions.
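
Why misalignment weighs more heavily on tiny objects can be illustrated with a little box arithmetic. The sketch below applies the same pixel offset to a small and a large square box and compares the resulting IoU. The box sizes (8 and 64 pixels) and the 2-pixel offset are illustrative assumptions, not values taken from the paper.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_after_shift(size, shift):
    """IoU between a size x size box and a copy shifted diagonally by `shift` pixels."""
    original = (0.0, 0.0, float(size), float(size))
    shifted = (float(shift), float(shift), float(size + shift), float(size + shift))
    return iou(original, shifted)


tiny_iou = iou_after_shift(8, 2)    # illustrative tiny object
large_iou = iou_after_shift(64, 2)  # illustrative large object
print(f"8 px box:  IoU = {tiny_iou:.3f}")
print(f"64 px box: IoU = {large_iou:.3f}")
```

Under these assumed numbers the small box lands at 36/92 ≈ 0.39, below the 0.5 IoU threshold that mAP50 uses, while the large box stays at 3844/4348 ≈ 0.88. This is one way to read why the summary treats cross-modal alignment as central for tiny objects.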

Core claim

COXNet is a novel framework for RGBT tiny object detection that uses three core innovations: the Cross-Layer Fusion Module for combining features across layers and modalities, the Dynamic Alignment and Scale Refinement module for correcting spatial misalignments and preserving scales, and an optimized label assignment strategy based on the GeoShape Similarity Measure. These allow effective leveraging of complementary information between visible and thermal modalities despite challenges like misalignment and occlusion. The approach yields a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods.

What carries the argument

The Cross-Layer Fusion Module, which integrates high-level visible features with low-level thermal features, supported by dynamic alignment, scale refinement, and GeoShape Similarity-based label assignment.
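
The Figure 9 caption and the passage quoted in the Lean section of this page indicate that the Cross-Layer Fusion Module works with a Discrete Wavelet Transform, a frequency fusion step, and a reconstruction step. The sketch below shows that general pattern on two small same-size maps using a single-level Haar transform in plain Python. It is not the paper's implementation: the fusion rule (average the low-frequency band, keep the larger-magnitude high-frequency coefficient) is a classic textbook rule swapped in for illustration, the function names are hypothetical, and the real module fuses learned feature maps from different layers rather than toy grids.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands: (LL, LH, HL, HH).
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # scaled local sum (low frequency)
            row_lh.append((a - b + c - d) / 2)  # difference across columns
            row_hl.append((a + b - c - d) / 2)  # difference across rows
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution map."""
    h, w = len(ll), len(ll[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            out[2 * i][2 * j] = (s + p + q + r) / 2
            out[2 * i][2 * j + 1] = (s - p + q - r) / 2
            out[2 * i + 1][2 * j] = (s + p - q - r) / 2
            out[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2
    return out


def fuse_bands(vis, thr):
    """Fuse two same-size maps in the wavelet domain (ASSUMED fusion rule)."""
    v_ll, *v_high = haar_dwt2(vis)
    t_ll, *t_high = haar_dwt2(thr)
    # Assumption: average the low-frequency bands of both inputs.
    fused_ll = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(v_ll, t_ll)]
    # Assumption: per coefficient, keep the high-frequency value of larger magnitude.
    fused_high = [
        [[a if abs(a) >= abs(b) else b for a, b in zip(ra, rb)] for ra, rb in zip(bv, bt)]
        for bv, bt in zip(v_high, t_high)
    ]
    return haar_idwt2(fused_ll, *fused_high)


# Toy inputs: a flat "visible" map and a "thermal" map with one bright pixel.
visible = [[2.0] * 4 for _ in range(4)]
thermal = [[0.0] * 4 for _ in range(4)]
thermal[1][2] = 8.0
fused = fuse_bands(visible, thermal)
for row in fused:
    print(row)
```

Because the Haar transform used here is orthonormal, decomposing and reconstructing without fusion returns the input unchanged, which makes the effect of the fusion rule easy to isolate.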

Load-bearing premise

The three proposed modules are the main reason for the performance improvement rather than other unstated factors like tuning or implementation details.

What would settle it

An ablation study that removes the Cross-Layer Fusion Module, the Dynamic Alignment and Scale Refinement module, or the GeoShape-based label assignment one at a time and reports the resulting mAP50 on the RGBTDronePerson dataset.
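
A leave-one-out ablation of this kind can be enumerated mechanically. The sketch below builds the run configurations and computes how much mAP50 each removal costs once scores have been measured. The configuration keys and function names are hypothetical, and no scores are included because none are given on this page.

```python
# Hypothetical configuration keys for the three components named above.
MODULES = ("cross_layer_fusion", "alignment_scale_refinement", "geoshape_assignment")


def leave_one_out_runs(modules=MODULES):
    """Return the full configuration plus one configuration per removed module."""
    runs = {"full": {m: True for m in modules}}
    for removed in modules:
        runs[f"without_{removed}"] = {m: (m != removed) for m in modules}
    return runs


def attributed_drop(map50_by_run):
    """Given measured mAP50 per run name, report the drop caused by each removal."""
    full = map50_by_run["full"]
    return {
        name[len("without_"):]: full - score
        for name, score in map50_by_run.items()
        if name.startswith("without_")
    }


for name, flags in leave_one_out_runs().items():
    print(name, flags)
```

For the comparison to be meaningful, every run would need the same optimizer, schedule, augmentations, and seed handling; only the boolean flags should differ between runs.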

Figures

Figures reproduced from arXiv: 2508.09533 by Jianan Li, Liqiang Song, Mengqi Zhu, Peiran Peng, Tingfa Xu, Yuqiang Fang.

**Figure 2.** Performance–efficiency trade-off on the RGBTDronePerson dataset.

**Figure 3.** Overall architecture of COXNet. (a) COXNet integrates thermal and visible inputs via independent backbones, employing the ...

**Figure 4.** Examples of feature maps on the RGBTDronePerson dataset. The left column shows the original visible and thermal ground truth images, while the ...

**Figure 5.** Qualitative results on the RGBTDronePerson dataset. COXNet outperforms GFL and QFDet, particularly in detecting tiny, occluded objects.

**Figure 6.** Qualitative results on the VTUAV-det dataset under challenging conditions, including cluttered scenes and occlusion.

**Figure 7.** Qualitative results on the NII-CU dataset under low illumination and occlusion conditions. COXNet demonstrates superior performance compared ...

**Figure 8.** Feature map comparison on the RGBTDronePerson dataset. The first ...

**Figure 9.** Visualization of CLFM wavelet decomposition. Low-frequency bands ...

**Figure 10.** Feature map comparison with and without the Adaptive Alignment ...

Original abstract

Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32\% mAP$_{50}$ improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes COXNet, a framework for RGBT tiny object detection that introduces three modules: the Cross-Layer Fusion Module to combine high-level visible and low-level thermal features, the Dynamic Alignment and Scale Refinement module to address cross-modal spatial misalignments while preserving multi-scale features, and an optimized label assignment strategy based on the GeoShape Similarity Measure. It claims a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods.

Significance. If the performance gains can be robustly attributed to the proposed modules through controlled experiments, the work could advance multimodal detection for small objects in drone-based scenarios involving misalignment, occlusion, and low light, with relevance to surveillance and search-and-rescue applications. The empirical result on an external dataset is presented as an outcome rather than a derived quantity, but stronger isolation of contributions would be needed to establish its impact.

major comments (2)

1. The central claim of a 3.32% mAP50 gain attributable to the Cross-Layer Fusion Module, Dynamic Alignment and Scale Refinement module, and GeoShape Similarity Measure is not supported by ablation studies that add modules sequentially to a fixed strong baseline under identical training conditions, optimizer schedules, and data augmentations. Without such controls, the delta could arise from unmentioned implementation details rather than the proposed components.

2. Comparisons to state-of-the-art methods appear to rely on previously published numbers rather than re-trained baselines under the same experimental protocol. Given that mAP in object detection is highly sensitive to backbone choice, label assignment hyperparameters, and training details, this weakens the attribution of the reported improvement specifically to the three innovations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the experimental validation of our contributions.

Point-by-point responses

Referee: The central claim of a 3.32% mAP50 gain attributable to the Cross-Layer Fusion Module, Dynamic Alignment and Scale Refinement module, and GeoShape Similarity Measure is not supported by ablation studies that add modules sequentially to a fixed strong baseline under identical training conditions, optimizer schedules, and data augmentations. Without such controls, the delta could arise from unmentioned implementation details rather than the proposed components.

Authors: We appreciate the referee's emphasis on rigorous controls for attributing performance gains. Our ablation studies (Section 4.3) do add modules incrementally to a baseline, but we acknowledge that they may not have been conducted under strictly identical schedules and augmentations in all cases. In the revised manuscript, we will re-run and present a new set of ablation experiments using a fixed strong baseline with identical optimizer, learning rate schedule, data augmentations, and training epochs to better isolate the contribution of each module.

Revision: yes

Referee: Comparisons to state-of-the-art methods appear to rely on previously published numbers rather than re-trained baselines under the same experimental protocol. Given that mAP in object detection is highly sensitive to backbone choice, label assignment hyperparameters, and training details, this weakens the attribution of the reported improvement specifically to the three innovations.

Authors: We agree that re-training all SOTA methods under our exact protocol would provide the most direct comparison. Our current results follow common practice by citing the originally reported numbers on the RGBTDronePerson dataset using the same evaluation protocol. To address the concern, we will expand the experimental section with a discussion of implementation differences and, where code is publicly available, include results from re-training the top two competing methods under our training setup for direct comparison.

Revision: partial

Circularity Check

0 steps flagged

No circularity in empirical architecture and performance claims

The paper presents COXNet as an empirical CNN framework for RGBT tiny object detection, introducing three modules (Cross-Layer Fusion, Dynamic Alignment and Scale Refinement, GeoShape label assignment) and reporting a 3.32% mAP50 gain on the external RGBTDronePerson dataset. No equations, derivations, or self-referential predictions appear that reduce the claimed improvements to inputs by construction. The result is framed as an experimental outcome rather than a mathematical necessity. Any prior-work citations are peripheral and do not serve as load-bearing justification for the central performance attribution, leaving the chain self-contained and externally falsifiable via re-implementation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The central claim rests on standard deep-learning training assumptions, the representativeness of the RGBTDronePerson dataset, and the effectiveness of the three newly introduced modules; no explicit free parameters or invented physical entities are stated in the abstract.

free parameters (1)

GeoShape Similarity Measure hyperparameters
Parameters controlling the label assignment strategy that are chosen or tuned to improve localization on the target dataset.

invented entities (2)

Cross-Layer Fusion Module (no independent evidence)
Purpose: Fuse high-level visible features with low-level thermal features.
Newly proposed component whose contribution is asserted but not independently verified outside the paper.

Dynamic Alignment and Scale Refinement module (no independent evidence)
Purpose: Correct cross-modal spatial misalignments while preserving multi-scale features.
Newly proposed component whose contribution is asserted but not independently verified outside the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: Cross-Layer Fusion Module (CLFM) ... Discrete Wavelet Transform (DWT) ... Frequency Fusion and Reconstruction

Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: GeoShape similarity metric ... dc, dr, IoU ... ψ(a, g) = exp[−(dc + γ·dr + β·(1−IoU))]
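
The GeoShape expression quoted just above can be evaluated directly once dc and dr are defined. The excerpt does not define them, so the sketch below makes loud assumptions: dc is taken as the centre distance normalised by the diagonal of the smallest enclosing box, dr as the absolute difference of log aspect ratios, and γ = β = 1. Only the outer form ψ(a, g) = exp[−(dc + γ·dr + β·(1−IoU))] comes from the quoted passage; everything else is a placeholder and may differ from the paper.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a, g):
    """ASSUMED d_c: centre distance divided by the enclosing-box diagonal."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    gx, gy = (g[0] + g[2]) / 2, (g[1] + g[3]) / 2
    diag = math.hypot(max(a[2], g[2]) - min(a[0], g[0]),
                      max(a[3], g[3]) - min(a[1], g[1]))
    return math.hypot(ax - gx, ay - gy) / diag if diag > 0 else 0.0


def ratio_distance(a, g):
    """ASSUMED d_r: |log aspect ratio difference|; boxes need positive width and height."""
    ratio_a = (a[2] - a[0]) / (a[3] - a[1])
    ratio_g = (g[2] - g[0]) / (g[3] - g[1])
    return abs(math.log(ratio_a) - math.log(ratio_g))


def geoshape_similarity(a, g, gamma=1.0, beta=1.0):
    """psi(a, g) = exp(-(d_c + gamma * d_r + beta * (1 - IoU))); gamma, beta are placeholders."""
    penalty = (center_distance(a, g)
               + gamma * ratio_distance(a, g)
               + beta * (1.0 - iou(a, g)))
    return math.exp(-penalty)


gt = (10.0, 10.0, 18.0, 18.0)        # illustrative 8 x 8 ground-truth box
near = (12.0, 12.0, 20.0, 20.0)      # overlaps the ground truth
far = (30.0, 30.0, 38.0, 38.0)       # no overlap
farther = (50.0, 50.0, 58.0, 58.0)   # no overlap, further away
for label, box in [("near", near), ("far", far), ("farther", farther)]:
    print(label, round(geoshape_similarity(box, gt), 4))
```

With these assumed definitions the score is exactly 1 for identical boxes and still ranks the two non-overlapping boxes by distance, whereas plain IoU is 0 for both. That property is plausibly what makes such a measure useful for assigning labels to tiny objects.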

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

[1] X. Xiong, M. He, T. Li, G. Zheng, W. Xu, X. Fan, and Y. Zhang, “Adaptive feature fusion and improved attention mechanism based small object detection for uav target tracking,” IEEE Internet of Things Journal, 2024.

[2] C. Zhao, H. Liu, N. Su, and Y. Yan, “Tftn: A transformer-based fusion tracking framework of hyperspectral and rgb,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.

[3] B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2852–2865, 2024.

[4] H. Zhang, S. Wen, Z. Wei, and Z. Chen, “High-resolution feature generator for small ship detection in optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, 2024.

[5] G. Liao, W. Gao, G. Li, J. Wang, and S. Kwong, “Cross-collaborative fusion-encoder network for robust rgb-thermal salient object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7646–7661, 2022.

[6] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022.

[7] G. Guo, P. Chen, X. Yu, Z. Han, Q. Ye, and S. Gao, “Save the tiny, save the all: hierarchical activation network for tiny object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 1, pp. 221–234, 2023.

[8] B. Du, C. Du, and L. Yu, “Megf-net: multi-exposure generation and fusion network for vehicle detection under dim light conditions,” Visual Intelligence, vol. 1, no. 1, p. 28, 2023.

[9] D. Cheng, X. Wang, N. Wang, Z. Wang, X. Wang, and X. Gao, “Cross-modality person re-identification with memory-based contrastive embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 425–432.

[10] M. Yuan, Y. Wang, and X. Wei, “Translation, scale and rotation: cross-modal alignment meets rgb-infrared vehicle detection,” in European Conference on Computer Vision. Springer, 2022, pp. 509–525.

[11] H. Wang, K. Song, L. Huang, H. Wen, and Y. Yan, “Thermal images-aware guided early fusion network for cross-illumination rgb-t salient object detection,” Engineering Applications of Artificial Intelligence, vol. 118, p. 105640, 2023.

[12] Y. Zhang, C. Xu, W. Yang, G. He, H. Yu, L. Yu, and G.-S. Xia, “Drone-based rgbt tiny person detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 204, pp. 61–76, 2023.

[13] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012, 2020.

[14] S. Speth, A. Goncalves, B. Rigault, S. Suzuki, M. Bouazizi, Y. Matsuo, and H. Prendinger, “Deep learning with rgb and thermal images onboard a drone for monitoring operations,” Journal of Field Robotics, vol. 39, no. 6, pp. 840–868, 2022.

[15] R. Li, J. Xiang, F. Sun, Y. Yuan, L. Yuan, and S. Gou, “Multiscale cross-modal homogeneity enhancement and confidence-aware fusion for multispectral pedestrian detection,” IEEE Transactions on Multimedia, vol. 26, pp. 852–863, 2023.

[16] B. Wang, H. Zhao, and Y. Zhuang, “Dacfn: divide-and-conquer fusion network for rgb-t object detection,” International Journal of Machine Learning and Cybernetics, vol. 14, no. 7, pp. 2407–2420, 2023.

[17] J. Kim, H. Kim, T. Kim, N. Kim, and Y. Choi, “Mlpd: Multi-label pedestrian detector in multispectral domain,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7846–7853, 2021.

[18] L. Zhang, Z. Liu, X. Zhu, Z. Song, X. Yang, Z. Lei, and H. Qiao, “Weakly aligned feature fusion for multimodal object detection,” IEEE Transactions on Neural Networks and Learning Systems, 2021.

[19] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion transformer for multispectral object detection,” arXiv preprint arXiv:2111.00273, 2021.

[20] J. Guo, C. Gao, F. Liu, D. Meng, and X. Gao, “Damsdet: Dynamic adaptive multispectral detection transformer with competitive query selection and adaptive feature fusion,” in European Conference on Computer Vision. Springer, 2025, pp. 464–481.

[21] M. Yuan and X. Wei, “C²Former: Calibrated and complementary transformer for rgb-infrared object detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.

[22] T. Kim, S. Shin, Y. Yu, H. G. Kim, and Y. M. Ro, “Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26784–26793.

[23] S. Lai, C. Liu, J. Zhu, B. Kang, Y. Liu, D. Wang, and H. Lu, “Mambavt: Spatio-temporal contextual modeling for robust rgb-t tracking,” arXiv preprint arXiv:2408.07889, 2024.

[24] Q. Zhao, B. Liu, S. Lyu, C. Wang, and H. Zhang, “Tph-yolov5++: Boosting object detection on drone-captured scenarios with cross-layer asymmetric transformer,” Remote Sensing, vol. 15, no. 6, p. 1687, 2023.

[25] C. Yang, Z. Huang, and N. Wang, “Querydet: Cascaded sparse query for accelerating high-resolution small object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13668–13677.

[26] Y. Zhang, C. Wu, W. Guo, T. Zhang, and W. Li, “Cfanet: Efficient detection of uav image based on cross-layer feature aggregation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, 2023.

[27] H. Shen, Z. Wan, X. Wang, and M. Zhang, “Famba-v: Fast vision mamba with cross-layer token fusion,” arXiv preprint arXiv:2409.09808, 2024.

[28] K. Zhou, L. Chen, and X. Cao, “Improving multispectral pedestrian detection by addressing modality imbalance problems,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. Springer, 2020, pp. 787–803.

[29] Y. Qi, L. Qin, S. Zhang, Q. Huang, and H. Yao, “Robust visual tracking via scale-and-state-awareness,” Neurocomputing, vol. 329, pp. 75–85, 2019.

[30] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Rfla: Gaussian receptive field based label assignment for tiny object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 526–543.

[31] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

[32] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[33] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: A simple and strong anchor-free object detector,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1922–1933, 2020.

[34] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.

[35] C. Luo, Z. Chen, L.-A. Tang, A. Shrivastava, Z. Li, H. Chen, and J. Ye, “Tinet: learning invariant networks via knowledge transfer,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1890–1899.

[36] T. Broedermann, C. Sakaridis, D. Dai, and L. Van Gool, “Hrfuser: A multi-resolution sensor fusion architecture for 2d object detection,” in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2023, pp. 4159–4166.

[37] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.

[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.

[39] He is currently an Associate Professor with Space Engineering University, Beijing, China. His research interests include machine learning, computer vision, and data mining. Jianan Li is currently an assistant professor at School of Optics and Photonics, Beijing Institute of Technology, Beijing, China, where he received his B.S. and Ph.D. degree in 2013 ...

[1] [1]

Adaptive feature fusion and improved attention mechanism based small object detection for uav target tracking,

X. Xiong, M. He, T. Li, G. Zheng, W. Xu, X. Fan, and Y . Zhang, “Adaptive feature fusion and improved attention mechanism based small object detection for uav target tracking,” IEEE Internet of Things Journal, 2024

work page 2024

[2] [2]

Tftn: A transformer-based fusion tracking framework of hyperspectral and rgb,

C. Zhao, H. Liu, N. Su, and Y . Yan, “Tftn: A transformer-based fusion tracking framework of hyperspectral and rgb,” IEEE Transactions on Geoscience and Remote Sensing , vol. 60, pp. 1–15, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

work page 2022

[3] [3]

Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,

B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2852–2865, 2024

work page 2024

[4] [4]

High-resolution feature generator for small ship detection in optical remote sensing images,

H. Zhang, S. Wen, Z. Wei, and Z. Chen, “High-resolution feature generator for small ship detection in optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing , 2024

work page 2024

[5] [5]

Cross-collaborative fusion-encoder network for robust rgb-thermal salient object detection,

G. Liao, W. Gao, G. Li, J. Wang, and S. Kwong, “Cross-collaborative fusion-encoder network for robust rgb-thermal salient object detection,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 32, no. 11, pp. 7646–7661, 2022

work page 2022

[6] [6]

Drone-based rgb-infrared cross- modality vehicle detection via uncertainty-aware learning,

Y . Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross- modality vehicle detection via uncertainty-aware learning,” IEEE Trans- actions on Circuits and Systems for Video Technology , vol. 32, no. 10, pp. 6700–6713, 2022

work page 2022

[7] [7]

Save the tiny, save the all: hierarchical activation network for tiny object detection,

G. Guo, P. Chen, X. Yu, Z. Han, Q. Ye, and S. Gao, “Save the tiny, save the all: hierarchical activation network for tiny object detection,” IEEE transactions on circuits and systems for video technology , vol. 34, no. 1, pp. 221–234, 2023

work page 2023

[8] [8]

Megf-net: multi-exposure generation and fusion network for vehicle detection under dim light conditions,

B. Du, C. Du, and L. Yu, “Megf-net: multi-exposure generation and fusion network for vehicle detection under dim light conditions,” Visual Intelligence, vol. 1, no. 1, p. 28, 2023

work page 2023

[9] [9]

Cross- modality person re-identification with memory-based contrastive embed- ding,

D. Cheng, X. Wang, N. Wang, Z. Wang, X. Wang, and X. Gao, “Cross- modality person re-identification with memory-based contrastive embed- ding,” in Proceedings of the AAAI conference on artificial intelligence , vol. 37, no. 1, 2023, pp. 425–432

work page 2023

[10] [10]

Translation, scale and rotation: cross- modal alignment meets rgb-infrared vehicle detection,

M. Yuan, Y . Wang, and X. Wei, “Translation, scale and rotation: cross- modal alignment meets rgb-infrared vehicle detection,” in European Conference on Computer Vision . Springer, 2022, pp. 509–525

work page 2022

[11] [11]

Thermal images- aware guided early fusion network for cross-illumination rgb-t salient object detection,

H. Wang, K. Song, L. Huang, H. Wen, and Y . Yan, “Thermal images- aware guided early fusion network for cross-illumination rgb-t salient object detection,”Engineering Applications of Artificial Intelligence, vol. 118, p. 105640, 2023

work page 2023

[12] [12]

Drone- based rgbt tiny person detection,

Y . Zhang, C. Xu, W. Yang, G. He, H. Yu, L. Yu, and G.-S. Xia, “Drone- based rgbt tiny person detection,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 204, pp. 61–76, 2023

work page 2023

[13] [13]

Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,

X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 002–21 012, 2020

work page 2020

[14] [14]

Deep learning with rgb and thermal images onboard a drone for monitoring operations,

S. Speth, A. Goncalves, B. Rigault, S. Suzuki, M. Bouazizi, Y . Matsuo, and H. Prendinger, “Deep learning with rgb and thermal images onboard a drone for monitoring operations,” Journal of Field Robotics , vol. 39, no. 6, pp. 840–868, 2022

work page 2022

[15] [15]

Multiscale cross-modal homogeneity enhancement and confidence-aware fusion for multispectral pedestrian detection,

R. Li, J. Xiang, F. Sun, Y . Yuan, L. Yuan, and S. Gou, “Multiscale cross-modal homogeneity enhancement and confidence-aware fusion for multispectral pedestrian detection,” IEEE Transactions on Multimedia , vol. 26, pp. 852–863, 2023

work page 2023

[16] B. Wang, H. Zhao, and Y. Zhuang, “Dacfn: divide-and-conquer fusion network for rgb-t object detection,” International Journal of Machine Learning and Cybernetics, vol. 14, no. 7, pp. 2407–2420, 2023.

[17] J. Kim, H. Kim, T. Kim, N. Kim, and Y. Choi, “Mlpd: Multi-label pedestrian detector in multispectral domain,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7846–7853, 2021.

[18] L. Zhang, Z. Liu, X. Zhu, Z. Song, X. Yang, Z. Lei, and H. Qiao, “Weakly aligned feature fusion for multimodal object detection,” IEEE Transactions on Neural Networks and Learning Systems, 2021.

[19] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion transformer for multispectral object detection,” arXiv preprint arXiv:2111.00273, 2021.

[20] J. Guo, C. Gao, F. Liu, D. Meng, and X. Gao, “Damsdet: Dynamic adaptive multispectral detection transformer with competitive query selection and adaptive feature fusion,” in European Conference on Computer Vision. Springer, 2025, pp. 464–481.

[21] M. Yuan and X. Wei, “C²Former: Calibrated and complementary transformer for rgb-infrared object detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.

[22] T. Kim, S. Shin, Y. Yu, H. G. Kim, and Y. M. Ro, “Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26784–26793.

[23] S. Lai, C. Liu, J. Zhu, B. Kang, Y. Liu, D. Wang, and H. Lu, “Mambavt: Spatio-temporal contextual modeling for robust rgb-t tracking,” arXiv preprint arXiv:2408.07889, 2024.

[24] Q. Zhao, B. Liu, S. Lyu, C. Wang, and H. Zhang, “Tph-yolov5++: Boosting object detection on drone-captured scenarios with cross-layer asymmetric transformer,” Remote Sensing, vol. 15, no. 6, p. 1687, 2023.

[25] C. Yang, Z. Huang, and N. Wang, “Querydet: Cascaded sparse query for accelerating high-resolution small object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13668–13677.

[26] Y. Zhang, C. Wu, W. Guo, T. Zhang, and W. Li, “Cfanet: Efficient detection of uav image based on cross-layer feature aggregation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, 2023.

[27] H. Shen, Z. Wan, X. Wang, and M. Zhang, “Famba-v: Fast vision mamba with cross-layer token fusion,” arXiv preprint arXiv:2409.09808, 2024.

[28] K. Zhou, L. Chen, and X. Cao, “Improving multispectral pedestrian detection by addressing modality imbalance problems,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. Springer, 2020, pp. 787–803.

[29] Y. Qi, L. Qin, S. Zhang, Q. Huang, and H. Yao, “Robust visual tracking via scale-and-state-awareness,” Neurocomputing, vol. 329, pp. 75–85, 2019.

[30] C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Rfla: Gaussian receptive field based label assignment for tiny object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 526–543.

[31] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

[32] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[33] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: A simple and strong anchor-free object detector,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1922–1933, 2020.

[34] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9759–9768.

[35] C. Luo, Z. Chen, L.-A. Tang, A. Shrivastava, Z. Li, H. Chen, and J. Ye, “Tinet: learning invariant networks via knowledge transfer,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1890–1899.

[36] T. Broedermann, C. Sakaridis, D. Dai, and L. Van Gool, “Hrfuser: A multi-resolution sensor fusion architecture for 2d object detection,” in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2023, pp. 4159–4166.

[37] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.

[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.

He is currently an Associate Professor with Space Engineering University, Beijing, China. His research interests include machine learning, computer vision, and data mining.

Jianan Li is currently an assistant professor at School of Optics and Photonics, Beijing Institute of Technology, Beijing, China, where he received his B.S. and Ph.D. degree in 2013 ...