arxiv: 2604.03176 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.MM

Recognition: no theorem link

SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Wenfeng Zhang , Jun Ni , Yue Meng , Xiaodong Pei , Wei Hu , Qibing Qin , Lei Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:46 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords UAV object detectionfeature fusiondual-domain edge enhancementmulti-scale detectionfeature pyramid networkVisDroneUAVDT

0 comments

The pith

SFFNet improves UAV object detection by fusing frequency and spatial domain edges to separate small targets from background noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SFFNet to handle the challenges of complex backgrounds and varying object scales in UAV images. It introduces a multi-scale dynamic dual-domain coupling module that extracts edges simultaneously in frequency and spatial domains to decouple objects from noise. A synergistic feature pyramid network then uses linear deformable convolutions and wide-area perception to capture irregular shapes and build contextual links. Six scaled versions of the detector are built, with the largest achieving 36.8 AP on VisDrone and 20.6 AP on UAVDT while lighter variants trade some accuracy for efficiency.

Core claim

The central claim is that dual-domain edge extraction in the MDDC module, combined with the SFPN's adaptive geometric and semantic fusion through deformable convolutions and long-range perception, reliably isolates multi-scale objects from UAV clutter and yields higher detection accuracy than prior single-domain or standard pyramid approaches.

What carries the argument

The multi-scale dynamic dual-domain coupling (MDDC) module that performs dual-driven edge extraction in frequency and spatial domains to separate objects from noise, and the synergistic feature pyramid network (SFPN) that employs linear deformable convolutions plus a wide-area perception module (WPM) to handle irregular shapes and context.

If this is right

The design enables detectors that maintain accuracy across widely different object sizes typical in aerial views.
Resource-constrained applications can use the smaller N or S variants without complete loss of the dual-domain benefit.
Long-range contextual associations reduce errors on objects partially obscured by clutter or at unusual angles.
The overall architecture supports deployment in varied UAV missions by offering a family of models rather than a single fixed network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-domain separation technique could be tested on satellite or ground-based surveillance imagery where background noise similarly overwhelms small targets.
Ablation studies isolating frequency versus spatial contributions would clarify which domain drives most of the reported gain.
If the wide-area perception module generalizes, it might be combined with other pyramid networks to improve detection in non-aerial cluttered scenes.

Load-bearing premise

That the dual-domain edge extraction and deformable fusion steps will continue to distinguish object boundaries from background noise even in new UAV scenes not seen during training.

What would settle it

An experiment that disables the MDDC module or the SFPN and measures whether average precision on VisDrone falls below the performance of existing methods without these components.

Figures

Figures reproduced from arXiv: 2604.03176 by Jun Ni, Lei Huang, Qibing Qin, Wei Hu, Wenfeng Zhang, Xiaodong Pei, Yue Meng.

**Figure 2.** Figure 2: Overview of the SFFNet framework. The framework integrates a backbone network with MDDC for efficient multi-scale dual-domain feature extraction, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The detailed structure of the MDDC module. The MDDC module initially performs multi-scale decomposition on the input feature map to construct the [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The detailed structure of the WPM. WPM employs a parallel structure [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The qualitative results of the ablation experiment for all fine-grained [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: The comparison of detection visualization results between the baseline model and SFFNet. The first three rows are from the VisDrone dataset, and [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: The comparison of heatmap visualization results between the baseline model and SFFNet on the VisDrone dataset. Our model not only suppresses [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and the imbalance of target scales. Traditional methods easily struggle to effectively separate objects from intricate backgrounds and fail to fully leverage the rich multi-scale information contained within images. To address these issues, we have developed a synergistic feature fusion network (SFFNet) with dual-domain edge enhancement specifically tailored for object detection in UAV images. Firstly, the multi-scale dynamic dual-domain coupling (MDDC) module is designed. This component introduces a dual-driven edge extraction architecture that operates in both the frequency and spatial domains, enabling effective decoupling of multi-scale object edges from background noise. Secondly, to further enhance the representation capability of the model's neck in terms of both geometric and semantic information, a synergistic feature pyramid network (SFPN) is proposed. SFPN leverages linear deformable convolutions to adaptively capture irregular object shapes and establishes long-range contextual associations around targets through the designed wide-area perception module (WPM). Moreover, to adapt to the various applications or resource-constrained scenarios, six detectors of different scales (N/S/M/B/L/X) are designed. Experiments on two challenging aerial datasets (VisDrone and UAVDT) demonstrate the outstanding performance of SFFNet-X, achieving 36.8 AP and 20.6 AP, respectively. The lightweight models (N/S) also maintain a balance between detection accuracy and parameter efficiency. The code will be available at https://github.com/CQNU-ZhangLab/SFFNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFFNet adds dual-domain edge decoupling and a wide-area feature pyramid for UAV detection, with reported AP gains on VisDrone and UAVDT that look workable but rest on thin experimental details.

read the letter

This paper introduces SFFNet, which uses a multi-scale dynamic dual-domain coupling module to decouple object edges from background noise in both frequency and spatial domains, and a synergistic feature pyramid network with linear deformable convolutions and a wide-area perception module to capture irregular shapes and long-range context. It does well by focusing on UAV-specific issues like scale imbalance and clutter, testing on two relevant datasets, and providing a suite of models with different sizes for flexibility. The reported 36.8 AP on VisDrone and 20.6 AP on UAVDT for the largest version indicate practical performance, and releasing the code will allow others to verify and extend the work. The soft spots are the limited visibility into the experiments. There are no ablations shown in the abstract to confirm the modules' individual impacts, no error bars, and the comparisons to other methods are not detailed here. This makes it tough to assess how much of an advance it really is over existing approaches. The paper is for computer vision researchers working on aerial image analysis and drone-based detection systems. A reader in that area would find the module ideas worth considering for similar problems. It has enough substance and coherence to merit a serious referee review, even if it requires revisions for stronger evidence. I recommend sending it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes SFFNet, a network for object detection in UAV imagery that addresses scale imbalance and background clutter via two main components: the Multi-scale Dynamic Dual-domain Coupling (MDDC) module, which performs edge decoupling in both frequency and spatial domains, and the Synergistic Feature Pyramid Network (SFPN), which uses linear deformable convolutions plus a Wide-area Perception Module (WPM) to capture irregular shapes and long-range context. Six scaled variants (N/S/M/B/L/X) are introduced; the largest (SFFNet-X) is reported to reach 36.8 AP on VisDrone and 20.6 AP on UAVDT, with lighter variants balancing accuracy and efficiency. Code release is promised.

Significance. If the reported gains are reproducible, the dual-domain edge enhancement and adaptive pyramid design could meaningfully advance detection under the specific constraints of UAV imagery (small objects, heavy clutter, extreme scale variation). The availability of multiple model scales and the commitment to release code are practical strengths that would aid adoption and further research.

major comments (3)

[Experiments] Experiments section: the central performance claims (36.8 AP on VisDrone, 20.6 AP on UAVDT) are given as single-point estimates without error bars, standard deviations across random seeds, or a complete training protocol (optimizer schedule, data-augmentation details, input resolution, etc.). This prevents verification of whether the improvements are statistically reliable or sensitive to implementation choices.
[Ablation studies] Ablation studies: no quantitative breakdown is provided that isolates the contribution of the frequency-domain branch versus the spatial-domain branch inside MDDC, or of the WPM versus the deformable-convolution path inside SFPN. Without these controlled ablations, it is impossible to confirm that the dual-domain coupling and synergistic fusion are the load-bearing reasons for the reported AP gains rather than other factors (backbone choice, training recipe).
[Comparison tables] Comparison tables: the baseline detectors against which SFFNet-X is evaluated are not described with identical training settings or hyper-parameters, making it unclear whether the 36.8 / 20.6 AP numbers reflect architectural superiority or differences in optimization.

minor comments (2)

[Figures] Figure captions and axis labels in the architecture diagrams use inconsistent font sizes and occasionally omit units or module dimensions, reducing readability.
[Method] The notation for the wide-area perception module (WPM) is introduced without an explicit equation or pseudocode, forcing the reader to infer its exact operation from the textual description alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We will revise the manuscript to provide fuller details on training protocols, expanded ablations, and clarified comparisons while preserving the core contributions of the MDDC and SFPN modules.

read point-by-point responses

Referee: [Experiments] Experiments section: the central performance claims (36.8 AP on VisDrone, 20.6 AP on UAVDT) are given as single-point estimates without error bars, standard deviations across random seeds, or a complete training protocol (optimizer schedule, data-augmentation details, input resolution, etc.). This prevents verification of whether the improvements are statistically reliable or sensitive to implementation choices.

Authors: We agree that additional experimental details are necessary for reproducibility. In the revised manuscript we will add a dedicated subsection describing the full training protocol, including the optimizer and schedule, data-augmentation pipeline, and input resolutions used for all reported results. We will also perform the main experiments across multiple random seeds and report mean AP values together with standard deviations on both VisDrone and UAVDT to quantify statistical reliability. revision: yes
Referee: [Ablation studies] Ablation studies: no quantitative breakdown is provided that isolates the contribution of the frequency-domain branch versus the spatial-domain branch inside MDDC, or of the WPM versus the deformable-convolution path inside SFPN. Without these controlled ablations, it is impossible to confirm that the dual-domain coupling and synergistic fusion are the load-bearing reasons for the reported AP gains rather than other factors (backbone choice, training recipe).

Authors: We accept that more granular ablations are required to isolate component contributions. The revised paper will include new controlled ablation tables that separately measure the performance impact of the frequency-domain branch versus the spatial-domain branch within MDDC, and of the Wide-area Perception Module versus the linear deformable convolution path within SFPN, all under otherwise identical settings. revision: yes
Referee: [Comparison tables] Comparison tables: the baseline detectors against which SFFNet-X is evaluated are not described with identical training settings or hyper-parameters, making it unclear whether the 36.8 / 20.6 AP numbers reflect architectural superiority or differences in optimization.

Authors: We will revise the comparison section to state explicitly that every baseline detector was re-trained from scratch using the identical training recipe, hyper-parameters, data splits, and augmentation strategy employed for SFFNet. Any unavoidable differences arising from original public implementations will be noted. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical architecture proposal

full rationale

The paper is an empirical architecture design for UAV object detection. It introduces MDDC and SFPN modules motivated by challenges of scale imbalance and background clutter, then reports measured AP scores on VisDrone and UAVDT. No equations, derivations, or predictions reduce the reported performance to fitted parameters, self-definitions, or self-citation chains by construction. The central claims rest on standard benchmark experiments rather than any internal reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard deep-learning assumptions.

pith-pipeline@v0.9.0 · 5607 in / 1001 out tokens · 25526 ms · 2026-05-13T20:46:58.232811+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

[1]

Image-only real-time incremental UA V image mosaic for multi-strip flight,

F. Zhang, T. Yang, L. Liu, B. Liang, Y . Bai, and J. Li, “Image-only real-time incremental UA V image mosaic for multi-strip flight,”IEEE Trans. Multimedia, vol. 23, pp. 1410–1425, 2021

work page 2021
[2]

Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark,

Z. Liuet al., “Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark,”IEEE Trans. Multimedia, vol. 25, pp. 1462– 1476, 2023

work page 2023
[3]

Gait recognition with drones: A benchmark,

A. Li, S. Hou, Q. Cai, Y . Fu, and Y . Huang, “Gait recognition with drones: A benchmark,”IEEE Trans. Multimedia, vol. 26, pp. 3530– 3540, 2024

work page 2024
[4]

Spatial reliability enhanced correlation filter: An efficient approach for real-time UA V tracking,

C. Fu, J. Jin, F. Ding, Y . Li, and G. Lu, “Spatial reliability enhanced correlation filter: An efficient approach for real-time UA V tracking,” IEEE Trans. Multimedia, vol. 26, pp. 4123–4137, 2024

work page 2024
[5]

ArbiTrack: A novel multi-object tracking framework for a moving UA V to detect and track arbitrarily oriented targets,

Y . Chen, J. Wang, Q. Zhou, and H. Hu, “ArbiTrack: A novel multi-object tracking framework for a moving UA V to detect and track arbitrarily oriented targets,”IEEE Trans. Multimedia, pp. 1–11, 2025

work page 2025
[6]

Automatic detection of civilian and military personnel in reconnaissance missions using a UA V,

N. P. Santos, V . B. Rodrigues, A. B. Pinto, and B. Damas, “Automatic detection of civilian and military personnel in reconnaissance missions using a UA V,” inProc. IEEE Int. Conf. Auto. Robot. Syst. Competitions (ICARSC), Apr. 2023, pp. 157–162

work page 2023
[7]

Automated detection of wildlife using drones: Synthesis, opportunities and con- straints,

E. Corcoran, M. Winsen, A. Sudholz, and G. Hamilton, “Automated detection of wildlife using drones: Synthesis, opportunities and con- straints,”Methods Ecol. Evol., vol. 12, no. 6, pp. 1103–1114, Jun. 2021

work page 2021
[8]

QoE-driven UA V-enabled pseudo- analog wireless video broadcast: A joint optimization of power and trajectory,

X.-W. Tang, X.-L. Huang, and F. Hu, “QoE-driven UA V-enabled pseudo- analog wireless video broadcast: A joint optimization of power and trajectory,”IEEE Trans. Multimedia, vol. 23, pp. 2398–2412, 2021

work page 2021
[9]

Feature pyramid networks for object detection,

T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125

work page 2017
[10]

EfficientDet: Scalable and efficient ob- ject detection,

M. Tan, R. Pang, and Q. V . Le, “EfficientDet: Scalable and efficient ob- ject detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10 781–10 790

work page 2020
[11]

Extended feature pyramid network for small object detection,

C. Deng, M. Wang, L. Liu, Y . Liu, and Y . Jiang, “Extended feature pyramid network for small object detection,”IEEE Trans. Multimedia, vol. 24, pp. 1968–1979, 2022

work page 1968
[12]

Hyperspectral image instance segmentation using spectral–spatial feature pyramid network,

L. Fang, Y . Jiang, Y . Yan, J. Yue, and Y . Deng, “Hyperspectral image instance segmentation using spectral–spatial feature pyramid network,” IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023

work page 2023
[13]

CascadeDumpNet: Enhancing open dumpsite detection through deep learning and AutoML integrated dual-stage ap- proach using high-resolution satellite imagery,

S. Zhang and J. Ma, “CascadeDumpNet: Enhancing open dumpsite detection through deep learning and AutoML integrated dual-stage ap- proach using high-resolution satellite imagery,”Remote Sens. Environ., vol. 313, p. 114349, 2024

work page 2024
[14]

Clustered object detection in aerial images,

F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, “Clustered object detection in aerial images,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8311–8320

work page 2019
[15]

Density map guided object detection in aerial images,

C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, “Density map guided object detection in aerial images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop (CVPRW), Jun. 2020, pp. 190–191

work page 2020
[16]

UFPMP-Det: Toward accurate and efficient object detection on drone imagery,

Y . Huang, J. Chen, and D. Huang, “UFPMP-Det: Toward accurate and efficient object detection on drone imagery,” inProc. AAAI Conf. Artif. Intell., vol. 36, no. 1, Jun. 2022, pp. 1026–1033

work page 2022
[17]

Dense tiny object detection: A scene context guided approach and a unified benchmark,

Z. Zhao, J. Du, C. Li, X. Fang, Y . Xiao, and J. Tang, “Dense tiny object detection: A scene context guided approach and a unified benchmark,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–13, 2024

work page 2024
[18]

High-resolution feature generator for small ship detection in optical remote sensing images,

H. Zhang, S. Wen, Z. Wei, and Z. Chen, “High-resolution feature generator for small ship detection in optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–11, 2024

work page 2024
[19]

Enhancing aerial object detection with selective frequency interaction network,

W. Weng, M. Wei, J. Ren, and F. Shen, “Enhancing aerial object detection with selective frequency interaction network,”IEEE Trans. Artif. Intell., vol. 5, no. 12, pp. 6109–6120, 2024

work page 2024
[20]

Instance-aware spatial-frequency feature fusion detector for oriented object detection in remote-sensing images,

S. Zheng, Z. Wu, Y . Xu, and Z. Wei, “Instance-aware spatial-frequency feature fusion detector for oriented object detection in remote-sensing images,”IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023

work page 2023
[21]

Learning orientation information from frequency-domain for oriented object detection in remote sensing images,

S. Zheng, Z. Wu, Y . Xu, Z. Wei, and A. Plaza, “Learning orientation information from frequency-domain for oriented object detection in remote sensing images,”IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–12, 2022

work page 2022
[22]

Towards generalized UA V object detection: A novel perspective from frequency domain disentanglement,

K. Wang, X. Fu, C. Ge, C. Cao, and Z.-J. Zha, “Towards generalized UA V object detection: A novel perspective from frequency domain disentanglement,”Int. J. Comput. Vision, vol. 132, no. 11, pp. 5410– 5438, 2024

work page 2024
[23]

Path aggregation network for instance segmentation,

S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8759–8768

work page 2018
[24]

AugFPN: Improving multi-scale feature learning for object detection,

C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan, “AugFPN: Improving multi-scale feature learning for object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12 595– 12 604

work page 2020
[25]

Asymptotic feature pyramid network for labeling pixels and regions,

G. Yang, J. Lei, H. Tian, Z. Feng, and R. Liang, “Asymptotic feature pyramid network for labeling pixels and regions,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 9, pp. 7820–7829, 2024

work page 2024
[26]

A DeNoising FPN With Transformer R-CNN for Tiny Object Detection,

H.-I. Liu, Y .-W. Tseng, K.-C. Chang, P.-J. Wang, H.-H. Shuai, and W.-H. Cheng, “A DeNoising FPN With Transformer R-CNN for Tiny Object Detection,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–15, 2024

work page 2024
[27]

Fine-Grained Ship Recognition with Spatial-Aligned Feature Pyramid Network and Adaptive Prototypical Contrastive Learning,

Y . Li, L. Chen, and W. Li, “Fine-Grained Ship Recognition with Spatial-Aligned Feature Pyramid Network and Adaptive Prototypical Contrastive Learning,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–13, 2025

work page 2025
[28]

LDConv: Linear deformable convolution for improving convolutional neural networks,

X. Zhanget al., “LDConv: Linear deformable convolution for improving convolutional neural networks,”Image Vision Comput., vol. 149, p. 105190, 2024

work page 2024
[29]

Large selective kernel network for remote sensing object detection,

Y . Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, and X. Li, “Large selective kernel network for remote sensing object detection,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 16 794– 16 805

work page 2023
[30]

Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection,

X. Yuanet al., “Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection,” 2025. [Online]. Available: https: //arxiv.org/abs/2501.03775

work page arXiv 2025
[31]

Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,

X. Ding, X. Zhang, J. Han, and G. Ding, “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11 963–11 975

work page 2022
[32]

PeLK: Parameter-efficient large kernel convnets with peripheral convolution,

H. Chenet al., “PeLK: Parameter-efficient large kernel convnets with peripheral convolution,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 5557–5567

work page 2024
[33]

[Online]

Ultralytics, “Yolov8,” 2023. [Online]. Available: https://docs.ultralytics. com/models/yolov8/

work page 2023
[34]

Yolov10: Real-time end-to-end object detection,

A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Hanet al., “Yolov10: Real-time end-to-end object detection,”Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 107 984–108 011, Dec. 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

work page 2024
[35]

[Online]

Ultralytics, “Yolo11,” 2024. [Online]. Available: https://docs.ultralytics. com/models/yolo11/

work page 2024
[36]

VisDrone-DET2019: The vision meets drone object detection in image challenge results,

D. Duet al., “VisDrone-DET2019: The vision meets drone object detection in image challenge results,” inProc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 213–226

work page 2019
[37]

The unmanned aerial vehicle benchmark: Object detection and tracking,

D. Du and et al., “The unmanned aerial vehicle benchmark: Object detection and tracking,” inProc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 370–386

work page 2018
[38]

Yolov9: Learning what you want to learn using programmable gradient information,

C.-Y . Wang, I.-H. Yeh, and H.-Y . Mark Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” inProc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2024, pp. 1–21

work page 2024
[39]

YOLOv4: Optimal Speed and Accuracy of Object Detection

A. Bochkovskiy, C.-Y . Wang, and H.-Y . M. Liao, “Yolov4: Optimal Speed and Accuracy of Object Detection,” 2020. [Online]. Available: https://arxiv.org/abs/2004.10934

work page internal anchor Pith review Pith/arXiv arXiv 2020
[40]

[Online]

Ultralytics, “Yolov5,” 2022. [Online]. Available: https://github.com/ ultralytics/yolov5/tree/v7.0

work page 2022
[41]

Faster R-CNN: Towards real- time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- time object detection with region proposal networks,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 28, 2015, pp. 1–9

work page 2015
[42]

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,

X. Liet al., “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,”Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 21 002–21 012, 2020

work page 2020
[43]

AMRNet: Chips Augmentation in Aerial Images Object Detection,

Z. Wei, C. Duan, X. Song, Y . Tian, and H. Wang, “AMRNet: Chips Augmentation in Aerial Images Object Detection,” 2020. [Online]. Available: https://arxiv.org/abs/2009.07168

work page arXiv 2020
[44]

TOOD: Task- aligned One-stage Object Detection ,

C. Feng, Y . Zhong, Y . Gao, M. R. Scott, and W. Huang, “ TOOD: Task- aligned One-stage Object Detection ,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3490–3499

work page 2021
[45]

Coarse-Grained Density Map Guided Object Detection in Aerial Images,

C. Duan, Z. Wei, C. Zhang, S. Qu, and H. Wang, “Coarse-Grained Density Map Guided Object Detection in Aerial Images,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2789–2798

work page 2021
[46]

A Global-Local Self-Adaptive Network for Drone-View Object Detection,

S. Denget al., “A Global-Local Self-Adaptive Network for Drone-View Object Detection,”IEEE Trans. Image Process., vol. 30, pp. 1556–1569, 2021

work page 2021
[47]

YOLOX: Exceeding YOLO Series in 2021,

Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” 2021. [Online]. Available: https://arxiv.org/abs/2107. 08430

work page 2021
[48]

QueryDet: Cascaded sparse query for accelerating high-resolution small object detection,

C. Yang, Z. Huang, and N. Wang, “QueryDet: Cascaded sparse query for accelerating high-resolution small object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 13 668– 13 677

work page 2022
[49]

Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark,

C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark,”ISPRS J. Photogramm. Remote Sens., vol. 190, pp. 79–93, Aug. 2022

work page 2022
[50]

Scale Decoupled Pyramid for Object Detection in Aerial Images,

Y . Ma, L. Chai, and L. Jin, “Scale Decoupled Pyramid for Object Detection in Aerial Images,”IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–14, 2023

work page 2023
[51]

Dense distinct query for end-to-end object detection,

S. Zhang and et al., “Dense distinct query for end-to-end object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7329–7338

work page 2023
[52]

Cascaded Zoom-In Detector for High Resolution Aerial Images,

A. Meethal, E. Granger, and M. Pedersoli, “Cascaded Zoom-In Detector for High Resolution Aerial Images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 2046–2055

work page 2023
[53]

Efficient multi-scale attention module with cross- spatial learning,

D. Ouyanget al., “Efficient multi-scale attention module with cross- spatial learning,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1–5

work page 2023
[54]

Pareto refocusing for drone-view object detection,

J. Leng, M. Mo, Y . Zhou, C. Gao, W. Li, and X. Gao, “Pareto refocusing for drone-view object detection,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1320–1334, 2023

work page 2023
[55]

DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection,

H. Zhang and et al., “DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection,” inProc. Int. Conf. Learn. Represent. (ICLR), May. 2023

work page 2023
[56]

EF-DETR: A lightweight transformer-based object detector with an encoder-free neck,

S. Chenget al., “EF-DETR: A lightweight transformer-based object detector with an encoder-free neck,”IEEE Trans. Ind. Informat., vol. 20, no. 11, pp. 12 994–13 002, 2024

work page 2024
[57]

BRSTD: Bio-inspired remote sensing tiny object detection,

S. Huang, C. Lin, X. Jiang, and Z. Qu, “BRSTD: Bio-inspired remote sensing tiny object detection,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–15, 2024

work page 2024
[58]

No-extra components density map cropping guided object detection in aerial images,

Z. Guo, G. Bi, H. Lv, Y . Feng, Y . Zhang, and M. Sun, “No-extra components density map cropping guided object detection in aerial images,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–13, 2024

work page 2024
[59]

DETRs Beat YOLOs on Real-time Object Detection,

Y . Zhaoet al., “DETRs Beat YOLOs on Real-time Object Detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 16 965–16 974

work page 2024
[60]

DDOD: Dive deeper into the disentanglement of object detector,

Z. Chen and et al., “DDOD: Dive deeper into the disentanglement of object detector,”IEEE Trans. Multimedia, vol. 26, pp. 284–298, Jan. 2024

work page 2024
[61]

SDPDet: Learning scale-separated dynamic proposals for end-to-end drone-view detection,

N. Yin, C. Liu, R. Tian, and X. Qian, “SDPDet: Learning scale-separated dynamic proposals for end-to-end drone-view detection,”IEEE Trans. Multimedia, vol. 26, pp. 7812–7822, 2024

work page 2024
[62]

Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection,

J. Songet al., “Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–13, 2025

work page 2025
[63]

Microsoft COCO: Common Objects in Context,

T.-Y . Linet al., “Microsoft COCO: Common Objects in Context,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2014, pp. 740–755

work page 2014
[64]

Objects as Points

X. Zhou, D. Wang, and P. Kr ¨ahenb¨uhl, “Objects as points,” 2019. [Online]. Available: https://arxiv.org/abs/1904.07850

work page Pith review arXiv 2019
[65]

Adaptive Sparse Convolu- tional Networks With Global Context Enhancement for Faster Object Detection on Drone Images,

B. Du, Y . Huang, J. Chen, and D. Huang, “Adaptive Sparse Convolu- tional Networks With Global Context Enhancement for Faster Object Detection on Drone Images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 13 435–13 444

work page 2023
[66]

SCLNet: A scale-robust complementary learning network for object detection in UA V images,

X. Li, W. Diao, Y . Mao, X. Li, and X. Sun, “SCLNet: A scale-robust complementary learning network for object detection in UA V images,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–19, 2024

work page 2024
[67]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” inProc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2020, pp. 213–229

work page 2020
[68]

DAMO-YOLO : A report on real-time object detection design,

X. Xu, Y . Jiang, W. Chen, Y . Huang, Y . Zhang, and X. Sun, “DAMO-YOLO : A report on real-time object detection design,” 2023. [Online]. Available: https://arxiv.org/abs/2211.15444

work page arXiv 2023
[69]

Multi-branch auxiliary fusion yolo with re- parameterization heterogeneous convolutional for accurate object de- tection,

Z. Yanget al., “Multi-branch auxiliary fusion yolo with re- parameterization heterogeneous convolutional for accurate object de- tection,” inChin. Conf. Pattern Recognit. Comput. Vision (PRCV). Springer, 2025, pp. 492–505

work page 2025