pith. machine review for the scientific record. sign in

arxiv: 2604.03176 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.MM

Recognition: no theorem link

SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:46 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords UAV object detectionfeature fusiondual-domain edge enhancementmulti-scale detectionfeature pyramid networkVisDroneUAVDT
0
0 comments X

The pith

SFFNet improves UAV object detection by fusing frequency and spatial domain edges to separate small targets from background noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SFFNet to handle the challenges of complex backgrounds and varying object scales in UAV images. It introduces a multi-scale dynamic dual-domain coupling module that extracts edges simultaneously in frequency and spatial domains to decouple objects from noise. A synergistic feature pyramid network then uses linear deformable convolutions and wide-area perception to capture irregular shapes and build contextual links. Six scaled versions of the detector are built, with the largest achieving 36.8 AP on VisDrone and 20.6 AP on UAVDT while lighter variants trade some accuracy for efficiency.

Core claim

The central claim is that dual-domain edge extraction in the MDDC module, combined with the SFPN's adaptive geometric and semantic fusion through deformable convolutions and long-range perception, reliably isolates multi-scale objects from UAV clutter and yields higher detection accuracy than prior single-domain or standard pyramid approaches.

What carries the argument

The multi-scale dynamic dual-domain coupling (MDDC) module that performs dual-driven edge extraction in frequency and spatial domains to separate objects from noise, and the synergistic feature pyramid network (SFPN) that employs linear deformable convolutions plus a wide-area perception module (WPM) to handle irregular shapes and context.

If this is right

  • The design enables detectors that maintain accuracy across widely different object sizes typical in aerial views.
  • Resource-constrained applications can use the smaller N or S variants without complete loss of the dual-domain benefit.
  • Long-range contextual associations reduce errors on objects partially obscured by clutter or at unusual angles.
  • The overall architecture supports deployment in varied UAV missions by offering a family of models rather than a single fixed network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-domain separation technique could be tested on satellite or ground-based surveillance imagery where background noise similarly overwhelms small targets.
  • Ablation studies isolating frequency versus spatial contributions would clarify which domain drives most of the reported gain.
  • If the wide-area perception module generalizes, it might be combined with other pyramid networks to improve detection in non-aerial cluttered scenes.

Load-bearing premise

That the dual-domain edge extraction and deformable fusion steps will continue to distinguish object boundaries from background noise even in new UAV scenes not seen during training.

What would settle it

An experiment that disables the MDDC module or the SFPN and measures whether average precision on VisDrone falls below the performance of existing methods without these components.

Figures

Figures reproduced from arXiv: 2604.03176 by Jun Ni, Lei Huang, Qibing Qin, Wei Hu, Wenfeng Zhang, Xiaodong Pei, Yue Meng.

Figure 1
Figure 1. Figure 1: The relationship between the AP value and the number of parameters [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the SFFNet framework. The framework integrates a backbone network with MDDC for efficient multi-scale dual-domain feature extraction, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The detailed structure of the MDDC module. The MDDC module initially performs multi-scale decomposition on the input feature map to construct the [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The detailed structure of the WPM. WPM employs a parallel structure [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The qualitative results of the ablation experiment for all fine-grained [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The comparison of detection visualization results between the baseline model and SFFNet. The first three rows are from the VisDrone dataset, and [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The comparison of heatmap visualization results between the baseline model and SFFNet on the VisDrone dataset. Our model not only suppresses [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and the imbalance of target scales. Traditional methods easily struggle to effectively separate objects from intricate backgrounds and fail to fully leverage the rich multi-scale information contained within images. To address these issues, we have developed a synergistic feature fusion network (SFFNet) with dual-domain edge enhancement specifically tailored for object detection in UAV images. Firstly, the multi-scale dynamic dual-domain coupling (MDDC) module is designed. This component introduces a dual-driven edge extraction architecture that operates in both the frequency and spatial domains, enabling effective decoupling of multi-scale object edges from background noise. Secondly, to further enhance the representation capability of the model's neck in terms of both geometric and semantic information, a synergistic feature pyramid network (SFPN) is proposed. SFPN leverages linear deformable convolutions to adaptively capture irregular object shapes and establishes long-range contextual associations around targets through the designed wide-area perception module (WPM). Moreover, to adapt to the various applications or resource-constrained scenarios, six detectors of different scales (N/S/M/B/L/X) are designed. Experiments on two challenging aerial datasets (VisDrone and UAVDT) demonstrate the outstanding performance of SFFNet-X, achieving 36.8 AP and 20.6 AP, respectively. The lightweight models (N/S) also maintain a balance between detection accuracy and parameter efficiency. The code will be available at https://github.com/CQNU-ZhangLab/SFFNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SFFNet, a network for object detection in UAV imagery that addresses scale imbalance and background clutter via two main components: the Multi-scale Dynamic Dual-domain Coupling (MDDC) module, which performs edge decoupling in both frequency and spatial domains, and the Synergistic Feature Pyramid Network (SFPN), which uses linear deformable convolutions plus a Wide-area Perception Module (WPM) to capture irregular shapes and long-range context. Six scaled variants (N/S/M/B/L/X) are introduced; the largest (SFFNet-X) is reported to reach 36.8 AP on VisDrone and 20.6 AP on UAVDT, with lighter variants balancing accuracy and efficiency. Code release is promised.

Significance. If the reported gains are reproducible, the dual-domain edge enhancement and adaptive pyramid design could meaningfully advance detection under the specific constraints of UAV imagery (small objects, heavy clutter, extreme scale variation). The availability of multiple model scales and the commitment to release code are practical strengths that would aid adoption and further research.

major comments (3)
  1. [Experiments] Experiments section: the central performance claims (36.8 AP on VisDrone, 20.6 AP on UAVDT) are given as single-point estimates without error bars, standard deviations across random seeds, or a complete training protocol (optimizer schedule, data-augmentation details, input resolution, etc.). This prevents verification of whether the improvements are statistically reliable or sensitive to implementation choices.
  2. [Ablation studies] Ablation studies: no quantitative breakdown is provided that isolates the contribution of the frequency-domain branch versus the spatial-domain branch inside MDDC, or of the WPM versus the deformable-convolution path inside SFPN. Without these controlled ablations, it is impossible to confirm that the dual-domain coupling and synergistic fusion are the load-bearing reasons for the reported AP gains rather than other factors (backbone choice, training recipe).
  3. [Comparison tables] Comparison tables: the baseline detectors against which SFFNet-X is evaluated are not described with identical training settings or hyper-parameters, making it unclear whether the 36.8 / 20.6 AP numbers reflect architectural superiority or differences in optimization.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the architecture diagrams use inconsistent font sizes and occasionally omit units or module dimensions, reducing readability.
  2. [Method] The notation for the wide-area perception module (WPM) is introduced without an explicit equation or pseudocode, forcing the reader to infer its exact operation from the textual description alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We will revise the manuscript to provide fuller details on training protocols, expanded ablations, and clarified comparisons while preserving the core contributions of the MDDC and SFPN modules.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central performance claims (36.8 AP on VisDrone, 20.6 AP on UAVDT) are given as single-point estimates without error bars, standard deviations across random seeds, or a complete training protocol (optimizer schedule, data-augmentation details, input resolution, etc.). This prevents verification of whether the improvements are statistically reliable or sensitive to implementation choices.

    Authors: We agree that additional experimental details are necessary for reproducibility. In the revised manuscript we will add a dedicated subsection describing the full training protocol, including the optimizer and schedule, data-augmentation pipeline, and input resolutions used for all reported results. We will also perform the main experiments across multiple random seeds and report mean AP values together with standard deviations on both VisDrone and UAVDT to quantify statistical reliability. revision: yes

  2. Referee: [Ablation studies] Ablation studies: no quantitative breakdown is provided that isolates the contribution of the frequency-domain branch versus the spatial-domain branch inside MDDC, or of the WPM versus the deformable-convolution path inside SFPN. Without these controlled ablations, it is impossible to confirm that the dual-domain coupling and synergistic fusion are the load-bearing reasons for the reported AP gains rather than other factors (backbone choice, training recipe).

    Authors: We accept that more granular ablations are required to isolate component contributions. The revised paper will include new controlled ablation tables that separately measure the performance impact of the frequency-domain branch versus the spatial-domain branch within MDDC, and of the Wide-area Perception Module versus the linear deformable convolution path within SFPN, all under otherwise identical settings. revision: yes

  3. Referee: [Comparison tables] Comparison tables: the baseline detectors against which SFFNet-X is evaluated are not described with identical training settings or hyper-parameters, making it unclear whether the 36.8 / 20.6 AP numbers reflect architectural superiority or differences in optimization.

    Authors: We will revise the comparison section to state explicitly that every baseline detector was re-trained from scratch using the identical training recipe, hyper-parameters, data splits, and augmentation strategy employed for SFFNet. Any unavoidable differences arising from original public implementations will be noted. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical architecture proposal

full rationale

The paper is an empirical architecture design for UAV object detection. It introduces MDDC and SFPN modules motivated by challenges of scale imbalance and background clutter, then reports measured AP scores on VisDrone and UAVDT. No equations, derivations, or predictions reduce the reported performance to fitted parameters, self-definitions, or self-citation chains by construction. The central claims rest on standard benchmark experiments rather than any internal reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard deep-learning assumptions.

pith-pipeline@v0.9.0 · 5607 in / 1001 out tokens · 25526 ms · 2026-05-13T20:46:58.232811+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

  1. [1]

    Image-only real-time incremental UA V image mosaic for multi-strip flight,

    F. Zhang, T. Yang, L. Liu, B. Liang, Y . Bai, and J. Li, “Image-only real-time incremental UA V image mosaic for multi-strip flight,”IEEE Trans. Multimedia, vol. 23, pp. 1410–1425, 2021

  2. [2]

    Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark,

    Z. Liuet al., “Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark,”IEEE Trans. Multimedia, vol. 25, pp. 1462– 1476, 2023

  3. [3]

    Gait recognition with drones: A benchmark,

    A. Li, S. Hou, Q. Cai, Y . Fu, and Y . Huang, “Gait recognition with drones: A benchmark,”IEEE Trans. Multimedia, vol. 26, pp. 3530– 3540, 2024

  4. [4]

    Spatial reliability enhanced correlation filter: An efficient approach for real-time UA V tracking,

    C. Fu, J. Jin, F. Ding, Y . Li, and G. Lu, “Spatial reliability enhanced correlation filter: An efficient approach for real-time UA V tracking,” IEEE Trans. Multimedia, vol. 26, pp. 4123–4137, 2024

  5. [5]

    ArbiTrack: A novel multi-object tracking framework for a moving UA V to detect and track arbitrarily oriented targets,

    Y . Chen, J. Wang, Q. Zhou, and H. Hu, “ArbiTrack: A novel multi-object tracking framework for a moving UA V to detect and track arbitrarily oriented targets,”IEEE Trans. Multimedia, pp. 1–11, 2025

  6. [6]

    Automatic detection of civilian and military personnel in reconnaissance missions using a UA V,

    N. P. Santos, V . B. Rodrigues, A. B. Pinto, and B. Damas, “Automatic detection of civilian and military personnel in reconnaissance missions using a UA V,” inProc. IEEE Int. Conf. Auto. Robot. Syst. Competitions (ICARSC), Apr. 2023, pp. 157–162

  7. [7]

    Automated detection of wildlife using drones: Synthesis, opportunities and con- straints,

    E. Corcoran, M. Winsen, A. Sudholz, and G. Hamilton, “Automated detection of wildlife using drones: Synthesis, opportunities and con- straints,”Methods Ecol. Evol., vol. 12, no. 6, pp. 1103–1114, Jun. 2021

  8. [8]

    QoE-driven UA V-enabled pseudo- analog wireless video broadcast: A joint optimization of power and trajectory,

    X.-W. Tang, X.-L. Huang, and F. Hu, “QoE-driven UA V-enabled pseudo- analog wireless video broadcast: A joint optimization of power and trajectory,”IEEE Trans. Multimedia, vol. 23, pp. 2398–2412, 2021

  9. [9]

    Feature pyramid networks for object detection,

    T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125

  10. [10]

    EfficientDet: Scalable and efficient ob- ject detection,

    M. Tan, R. Pang, and Q. V . Le, “EfficientDet: Scalable and efficient ob- ject detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10 781–10 790

  11. [11]

    Extended feature pyramid network for small object detection,

    C. Deng, M. Wang, L. Liu, Y . Liu, and Y . Jiang, “Extended feature pyramid network for small object detection,”IEEE Trans. Multimedia, vol. 24, pp. 1968–1979, 2022

  12. [12]

    Hyperspectral image instance segmentation using spectral–spatial feature pyramid network,

    L. Fang, Y . Jiang, Y . Yan, J. Yue, and Y . Deng, “Hyperspectral image instance segmentation using spectral–spatial feature pyramid network,” IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023

  13. [13]

    CascadeDumpNet: Enhancing open dumpsite detection through deep learning and AutoML integrated dual-stage ap- proach using high-resolution satellite imagery,

    S. Zhang and J. Ma, “CascadeDumpNet: Enhancing open dumpsite detection through deep learning and AutoML integrated dual-stage ap- proach using high-resolution satellite imagery,”Remote Sens. Environ., vol. 313, p. 114349, 2024

  14. [14]

    Clustered object detection in aerial images,

    F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, “Clustered object detection in aerial images,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8311–8320

  15. [15]

    Density map guided object detection in aerial images,

    C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, “Density map guided object detection in aerial images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop (CVPRW), Jun. 2020, pp. 190–191

  16. [16]

    UFPMP-Det: Toward accurate and efficient object detection on drone imagery,

    Y . Huang, J. Chen, and D. Huang, “UFPMP-Det: Toward accurate and efficient object detection on drone imagery,” inProc. AAAI Conf. Artif. Intell., vol. 36, no. 1, Jun. 2022, pp. 1026–1033

  17. [17]

    Dense tiny object detection: A scene context guided approach and a unified benchmark,

    Z. Zhao, J. Du, C. Li, X. Fang, Y . Xiao, and J. Tang, “Dense tiny object detection: A scene context guided approach and a unified benchmark,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–13, 2024

  18. [18]

    High-resolution feature generator for small ship detection in optical remote sensing images,

    H. Zhang, S. Wen, Z. Wei, and Z. Chen, “High-resolution feature generator for small ship detection in optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–11, 2024

  19. [19]

    Enhancing aerial object detection with selective frequency interaction network,

    W. Weng, M. Wei, J. Ren, and F. Shen, “Enhancing aerial object detection with selective frequency interaction network,”IEEE Trans. Artif. Intell., vol. 5, no. 12, pp. 6109–6120, 2024

  20. [20]

    Instance-aware spatial-frequency feature fusion detector for oriented object detection in remote-sensing images,

    S. Zheng, Z. Wu, Y . Xu, and Z. Wei, “Instance-aware spatial-frequency feature fusion detector for oriented object detection in remote-sensing images,”IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023

  21. [21]

    Learning orientation information from frequency-domain for oriented object detection in remote sensing images,

    S. Zheng, Z. Wu, Y . Xu, Z. Wei, and A. Plaza, “Learning orientation information from frequency-domain for oriented object detection in remote sensing images,”IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–12, 2022

  22. [22]

    Towards generalized UA V object detection: A novel perspective from frequency domain disentanglement,

    K. Wang, X. Fu, C. Ge, C. Cao, and Z.-J. Zha, “Towards generalized UA V object detection: A novel perspective from frequency domain disentanglement,”Int. J. Comput. Vision, vol. 132, no. 11, pp. 5410– 5438, 2024

  23. [23]

    Path aggregation network for instance segmentation,

    S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8759–8768

  24. [24]

    AugFPN: Improving multi-scale feature learning for object detection,

    C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan, “AugFPN: Improving multi-scale feature learning for object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12 595– 12 604

  25. [25]

    Asymptotic feature pyramid network for labeling pixels and regions,

    G. Yang, J. Lei, H. Tian, Z. Feng, and R. Liang, “Asymptotic feature pyramid network for labeling pixels and regions,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 9, pp. 7820–7829, 2024

  26. [26]

    A DeNoising FPN With Transformer R-CNN for Tiny Object Detection,

    H.-I. Liu, Y .-W. Tseng, K.-C. Chang, P.-J. Wang, H.-H. Shuai, and W.-H. Cheng, “A DeNoising FPN With Transformer R-CNN for Tiny Object Detection,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–15, 2024

  27. [27]

    Fine-Grained Ship Recognition with Spatial-Aligned Feature Pyramid Network and Adaptive Prototypical Contrastive Learning,

    Y . Li, L. Chen, and W. Li, “Fine-Grained Ship Recognition with Spatial-Aligned Feature Pyramid Network and Adaptive Prototypical Contrastive Learning,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–13, 2025

  28. [28]

    LDConv: Linear deformable convolution for improving convolutional neural networks,

    X. Zhanget al., “LDConv: Linear deformable convolution for improving convolutional neural networks,”Image Vision Comput., vol. 149, p. 105190, 2024

  29. [29]

    Large selective kernel network for remote sensing object detection,

    Y . Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, and X. Li, “Large selective kernel network for remote sensing object detection,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 16 794– 16 805

  30. [30]

    Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection,

    X. Yuanet al., “Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection,” 2025. [Online]. Available: https: //arxiv.org/abs/2501.03775

  31. [31]

    Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,

    X. Ding, X. Zhang, J. Han, and G. Ding, “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11 963–11 975

  32. [32]

    PeLK: Parameter-efficient large kernel convnets with peripheral convolution,

    H. Chenet al., “PeLK: Parameter-efficient large kernel convnets with peripheral convolution,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 5557–5567

  33. [33]

    [Online]

    Ultralytics, “Yolov8,” 2023. [Online]. Available: https://docs.ultralytics. com/models/yolov8/

  34. [34]

    Yolov10: Real-time end-to-end object detection,

    A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Hanet al., “Yolov10: Real-time end-to-end object detection,”Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 107 984–108 011, Dec. 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

  35. [35]

    [Online]

    Ultralytics, “Yolo11,” 2024. [Online]. Available: https://docs.ultralytics. com/models/yolo11/

  36. [36]

    VisDrone-DET2019: The vision meets drone object detection in image challenge results,

    D. Duet al., “VisDrone-DET2019: The vision meets drone object detection in image challenge results,” inProc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 213–226

  37. [37]

    The unmanned aerial vehicle benchmark: Object detection and tracking,

    D. Du and et al., “The unmanned aerial vehicle benchmark: Object detection and tracking,” inProc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 370–386

  38. [38]

    Yolov9: Learning what you want to learn using programmable gradient information,

    C.-Y . Wang, I.-H. Yeh, and H.-Y . Mark Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” inProc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2024, pp. 1–21

  39. [39]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    A. Bochkovskiy, C.-Y . Wang, and H.-Y . M. Liao, “Yolov4: Optimal Speed and Accuracy of Object Detection,” 2020. [Online]. Available: https://arxiv.org/abs/2004.10934

  40. [40]

    [Online]

    Ultralytics, “Yolov5,” 2022. [Online]. Available: https://github.com/ ultralytics/yolov5/tree/v7.0

  41. [41]

    Faster R-CNN: Towards real- time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- time object detection with region proposal networks,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 28, 2015, pp. 1–9

  42. [42]

    Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,

    X. Liet al., “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,”Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 21 002–21 012, 2020

  43. [43]

    AMRNet: Chips Augmentation in Aerial Images Object Detection,

    Z. Wei, C. Duan, X. Song, Y . Tian, and H. Wang, “AMRNet: Chips Augmentation in Aerial Images Object Detection,” 2020. [Online]. Available: https://arxiv.org/abs/2009.07168

  44. [44]

    TOOD: Task- aligned One-stage Object Detection ,

    C. Feng, Y . Zhong, Y . Gao, M. R. Scott, and W. Huang, “ TOOD: Task- aligned One-stage Object Detection ,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3490–3499

  45. [45]

    Coarse-Grained Density Map Guided Object Detection in Aerial Images,

    C. Duan, Z. Wei, C. Zhang, S. Qu, and H. Wang, “Coarse-Grained Density Map Guided Object Detection in Aerial Images,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2789–2798

  46. [46]

    A Global-Local Self-Adaptive Network for Drone-View Object Detection,

    S. Denget al., “A Global-Local Self-Adaptive Network for Drone-View Object Detection,”IEEE Trans. Image Process., vol. 30, pp. 1556–1569, 2021

  47. [47]

    YOLOX: Exceeding YOLO Series in 2021,

    Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” 2021. [Online]. Available: https://arxiv.org/abs/2107. 08430

  48. [48]

    QueryDet: Cascaded sparse query for accelerating high-resolution small object detection,

    C. Yang, Z. Huang, and N. Wang, “QueryDet: Cascaded sparse query for accelerating high-resolution small object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 13 668– 13 677

  49. [49]

    Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark,

    C. Xu, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, “Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark,”ISPRS J. Photogramm. Remote Sens., vol. 190, pp. 79–93, Aug. 2022

  50. [50]

    Scale Decoupled Pyramid for Object Detection in Aerial Images,

    Y . Ma, L. Chai, and L. Jin, “Scale Decoupled Pyramid for Object Detection in Aerial Images,”IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–14, 2023

  51. [51]

    Dense distinct query for end-to-end object detection,

    S. Zhang and et al., “Dense distinct query for end-to-end object detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7329–7338

  52. [52]

    Cascaded Zoom-In Detector for High Resolution Aerial Images,

    A. Meethal, E. Granger, and M. Pedersoli, “Cascaded Zoom-In Detector for High Resolution Aerial Images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 2046–2055

  53. [53]

    Efficient multi-scale attention module with cross- spatial learning,

    D. Ouyanget al., “Efficient multi-scale attention module with cross- spatial learning,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1–5

  54. [54]

    Pareto refocusing for drone-view object detection,

    J. Leng, M. Mo, Y . Zhou, C. Gao, W. Li, and X. Gao, “Pareto refocusing for drone-view object detection,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1320–1334, 2023

  55. [55]

    DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection,

    H. Zhang and et al., “DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection,” inProc. Int. Conf. Learn. Represent. (ICLR), May. 2023

  56. [56]

    EF-DETR: A lightweight transformer-based object detector with an encoder-free neck,

    S. Chenget al., “EF-DETR: A lightweight transformer-based object detector with an encoder-free neck,”IEEE Trans. Ind. Informat., vol. 20, no. 11, pp. 12 994–13 002, 2024

  57. [57]

    BRSTD: Bio-inspired remote sensing tiny object detection,

    S. Huang, C. Lin, X. Jiang, and Z. Qu, “BRSTD: Bio-inspired remote sensing tiny object detection,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–15, 2024

  58. [58]

    No-extra components density map cropping guided object detection in aerial images,

    Z. Guo, G. Bi, H. Lv, Y . Feng, Y . Zhang, and M. Sun, “No-extra components density map cropping guided object detection in aerial images,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–13, 2024

  59. [59]

    DETRs Beat YOLOs on Real-time Object Detection,

    Y . Zhaoet al., “DETRs Beat YOLOs on Real-time Object Detection,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 16 965–16 974

  60. [60]

    DDOD: Dive deeper into the disentanglement of object detector,

    Z. Chen and et al., “DDOD: Dive deeper into the disentanglement of object detector,”IEEE Trans. Multimedia, vol. 26, pp. 284–298, Jan. 2024

  61. [61]

    SDPDet: Learning scale-separated dynamic proposals for end-to-end drone-view detection,

    N. Yin, C. Liu, R. Tian, and X. Qian, “SDPDet: Learning scale-separated dynamic proposals for end-to-end drone-view detection,”IEEE Trans. Multimedia, vol. 26, pp. 7812–7822, 2024

  62. [62]

    Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection,

    J. Songet al., “Boundary-aware feature fusion with dual-stream attention for remote sensing small object detection,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–13, 2025

  63. [63]

    Microsoft COCO: Common Objects in Context,

    T.-Y . Linet al., “Microsoft COCO: Common Objects in Context,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2014, pp. 740–755

  64. [64]

    Objects as Points

    X. Zhou, D. Wang, and P. Kr ¨ahenb¨uhl, “Objects as points,” 2019. [Online]. Available: https://arxiv.org/abs/1904.07850

  65. [65]

    Adaptive Sparse Convolu- tional Networks With Global Context Enhancement for Faster Object Detection on Drone Images,

    B. Du, Y . Huang, J. Chen, and D. Huang, “Adaptive Sparse Convolu- tional Networks With Global Context Enhancement for Faster Object Detection on Drone Images,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 13 435–13 444

  66. [66]

    SCLNet: A scale-robust complementary learning network for object detection in UA V images,

    X. Li, W. Diao, Y . Mao, X. Li, and X. Sun, “SCLNet: A scale-robust complementary learning network for object detection in UA V images,” IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–19, 2024

  67. [67]

    End-to-end object detection with transformers,

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” inProc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2020, pp. 213–229

  68. [68]

    DAMO-YOLO : A report on real-time object detection design,

    X. Xu, Y . Jiang, W. Chen, Y . Huang, Y . Zhang, and X. Sun, “DAMO-YOLO : A report on real-time object detection design,” 2023. [Online]. Available: https://arxiv.org/abs/2211.15444

  69. [69]

    Multi-branch auxiliary fusion yolo with re- parameterization heterogeneous convolutional for accurate object de- tection,

    Z. Yanget al., “Multi-branch auxiliary fusion yolo with re- parameterization heterogeneous convolutional for accurate object de- tection,” inChin. Conf. Pattern Recognit. Comput. Vision (PRCV). Springer, 2025, pp. 492–505