pith. machine review for the scientific record. sign in

arxiv: 2604.13278 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.LG· eess.IV

Recognition: 2 theorem links

· Lean Theorem

DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3

classification 💻 cs.CV cs.LGeess.IV
keywords UAV object detectiontiny objectslightweight detectionYOLO architectureaerial imageryVisDrone datasetreal-time inference
0
0 comments X

The pith

DroneScan-YOLO boosts tiny object detection in UAV images by 16 mAP points over standard YOLO while preserving real-time speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that YOLO detectors can be made effective for spotting very small objects in drone footage by raising the input resolution, adding a filter pruning step, including an extra detection head for the smallest scales, and changing the loss function to better handle non-overlapping boxes. This matters because drones often fly high enough that targets occupy only a few pixels, yet current models either miss them or run too slowly for practical use. If the approach works, it would let lightweight UAV systems identify bicycles, people, and vehicles from greater distances without needing more powerful onboard computers. The gains are shown mainly on one standard benchmark dataset for aerial detection.

Core claim

By integrating four specific changes into the YOLOv8 architecture—increasing input size to 1280 by 1280 pixels, inserting an RPA-Block that prunes redundant filters using cosine similarity, adding a lightweight MSFD branch for stride-4 detection, and applying a SAL-NWD loss that mixes Wasserstein distance with size-adaptive weighting—the model reaches 55.3 percent mAP at IoU 0.5 and 35.6 percent mAP at 0.5-0.95 on the VisDrone2019 detection test set. These figures exceed the plain YOLOv8 small baseline by 16.6 and 12.3 points, lift recall from 0.374 to 0.518, and require only 4.1 percent more parameters while running at 96.7 frames per second.

What carries the argument

The four coordinated design choices of higher input resolution, RPA-Block dynamic filter pruning, MSFD P2 detection branch at stride 4, and SAL-NWD hybrid loss function.

Load-bearing premise

The reported accuracy gains result from the four proposed components rather than simply using higher resolution or dataset-specific adjustments, and similar gains will appear on other UAV datasets and in varied real-world conditions.

What would settle it

Training and testing the same architecture on a second UAV detection dataset with different object distributions and measuring whether the mAP improvement over baseline shrinks or disappears.

Figures

Figures reproduced from arXiv: 2604.13278 by Yann V. Bellec.

Figure 1
Figure 1. Figure 1: Qualitative results on VisDrone2019-DET validation images. Ground truth annotations (left) vs. DroneScan-YOLO predictions with confidence scores (right). The model successfully detects high-density scenes including challenging small-scale targets such as bicycles, tricycles and pedestrians [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: VisDrone2019-DET training set statistics. Class distribution (top-left) showing the dominance of car instances and scarcity of small-object categories. Bounding box anchors (top-right). Spatial distribution of object centers (bottom-left), concentrated near image center due to UAV capture angle. Bounding box size distribution (bottom-right) confirming the prevalence of sub-32px objects (width and height < … view at source ↗
Figure 3
Figure 3. Figure 3: F1-Confidence curve for DroneScan-YOLO. The optimal confidence threshold of 0.265 yields macro-averaged F1 = 0.56. Per-class curves reveal the expected gap between large objects (car: F1≈0.84) and small-object categories (bicycle, awning-tricycle), consistent with their sub-32px size distribution. 4.4 Architecture Comparison [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Normalized confusion matrices. DroneScan-YOLO (left) vs. YOLOv8s baseline (right). The background row is substantially lighter for DroneScan, indicating a 40% reduction in pedestrian non-detection rate and improved recall across all small-object categories. (a) DroneScan-YOLO — mAP@50 = 0.550 (b) YOLOv8s baseline — mAP@50 = 0.387 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Precision-Recall curves. DroneScan-YOLO (left) consistently dominates the baseline (right) across all 10 VisDrone categories. Most pronounced gains on bicycle (+0.207) and awning￾tricycle (+0.086). 4.5 Per-class Results [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: DroneScan-YOLO training dynamics over 100 epochs at 1280×1280 px. Training and validation losses (box, cls, DFL) decrease monotonically. Validation mAP@50 rises from 0.15 to 0.553. The brief plateau near epoch 85 corresponds to a training interruption and resumption from the last checkpoint. 4.8 TTA Evaluation We evaluated Test-Time Augmentation (TTA), which performs inference on the original image and aug… view at source ↗
Figure 7
Figure 7. Figure 7: presents representative detection examples on VisDrone validation images, illustrating DroneScan-YOLO’s behaviour in diverse real-world scenes. (a) Residential parking scene (b) Urban intersection scene [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
read the original abstract

Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8's TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DroneScan-YOLO, a modified YOLOv8s detector for tiny objects in UAV imagery. It proposes four coordinated changes: 1280x1280 input resolution, RPA-Block (dynamic filter pruning via lazy cosine-similarity with 10-epoch warm-up), MSFD (lightweight P2 stride-4 detection branch adding 114k parameters), and SAL-NWD (hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting inside TaskAligned assignment). On VisDrone2019-DET the model reports 55.3% mAP@50 and 35.6% mAP@50-95, +16.6 and +12.3 points over the YOLOv8s baseline, with recall rising from 0.374 to 0.518, 96.7 FPS, and only +4.1% parameters; gains are largest on tiny classes (e.g., bicycle AP@50 from 0.114 to 0.328).

Significance. If the reported gains are shown to arise from the three architectural/loss innovations rather than resolution scaling alone, the work supplies a practical, real-time UAV detector that directly targets the sub-32 px regime. The empirical numbers on a public benchmark and the emphasis on parameter/FPS efficiency would constitute a useful incremental contribution to aerial object detection.

major comments (3)
  1. [Experiments section] Experiments section (and any ablation tables): the manuscript evaluates the unmodified YOLOv8s baseline only at its conventional 640x640 resolution. No control experiment holds resolution fixed at 1280x1280 while ablating RPA-Block, MSFD, and SAL-NWD. Because tiny-object mAP is known to scale strongly with input resolution and because MSFD explicitly adds the stride-4 head that most directly benefits sub-32 px objects, the central attribution of the +16.6 mAP@50 and +12.3 mAP@50-95 lifts to the four proposed components cannot be verified from the presented results.
  2. [§3.2] §3.2 (MSFD description): the claim that the P2 branch adds only 114,592 parameters (+1.1%) is load-bearing for the “lightweight” assertion, yet the paper provides neither the exact channel configuration of the added head nor a parameter-count breakdown that isolates the contribution of the new detection layer from the rest of the network.
  3. [§3.3] §3.3 (SAL-NWD): the hybrid loss is integrated into TaskAligned assignment, but the manuscript does not report the value or selection procedure for the size-adaptive weighting hyper-parameter, nor does it show an ablation that isolates NWD from the adaptive CIoU term. This leaves open whether the recall improvement (0.374 → 0.518) is driven by the loss or by the higher-resolution input.
minor comments (2)
  1. [Abstract and §3.1] The 10-epoch warm-up period for RPA-Block is mentioned only in the abstract; its effect on convergence and final performance should be quantified or at least stated in the main text for reproducibility.
  2. [Tables and figures] Table captions and axis labels in the experimental figures should explicitly indicate whether each row/curve uses 640x640 or 1280x1280 input so readers can immediately distinguish resolution effects from module effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive comments. We address each major point below, agreeing where additional experiments and details are needed, and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Experiments section] Experiments section (and any ablation tables): the manuscript evaluates the unmodified YOLOv8s baseline only at its conventional 640x640 resolution. No control experiment holds resolution fixed at 1280x1280 while ablating RPA-Block, MSFD, and SAL-NWD. Because tiny-object mAP is known to scale strongly with input resolution and because MSFD explicitly adds the stride-4 head that most directly benefits sub-32 px objects, the central attribution of the +16.6 mAP@50 and +12.3 mAP@50-95 lifts to the four proposed components cannot be verified from the presented results.

    Authors: We agree that a controlled ablation at fixed 1280x1280 resolution would strengthen the attribution of gains to the proposed components. The increased resolution is one of our four coordinated contributions, chosen specifically to address tiny objects, but we acknowledge the referee's point. In the revised manuscript, we will include additional experiments ablating RPA-Block, MSFD, and SAL-NWD while holding input resolution at 1280x1280, and also report the baseline YOLOv8s performance at 1280x1280 for direct comparison. This will allow readers to better assess the individual and combined effects. revision: yes

  2. Referee: [§3.2] §3.2 (MSFD description): the claim that the P2 branch adds only 114,592 parameters (+1.1%) is load-bearing for the “lightweight” assertion, yet the paper provides neither the exact channel configuration of the added head nor a parameter-count breakdown that isolates the contribution of the new detection layer from the rest of the network.

    Authors: We appreciate this observation. The MSFD module was designed to be lightweight, with the P2 branch using reduced channels (specifically, 64 channels for the detection head convolutions). We will revise §3.2 to include the exact channel configuration and add a supplementary table breaking down the parameter counts for each added component, confirming the +114,592 parameters for the new stride-4 head. revision: yes

  3. Referee: [§3.3] §3.3 (SAL-NWD): the hybrid loss is integrated into TaskAligned assignment, but the manuscript does not report the value or selection procedure for the size-adaptive weighting hyper-parameter, nor does it show an ablation that isolates NWD from the adaptive CIoU term. This leaves open whether the recall improvement (0.374 → 0.518) is driven by the loss or by the higher-resolution input.

    Authors: We agree that more details are warranted. The size-adaptive weighting hyper-parameter was set to 0.5 after validation on a held-out subset of VisDrone. We will report this value and the selection procedure in the revised §3.3. Additionally, we will add an ablation study in the experiments section comparing the full SAL-NWD loss against versions using only NWD and only the size-adaptive CIoU, all at the same resolution, to isolate their effects on recall and mAP. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model proposal with external benchmark results

full rationale

The paper proposes four architectural and loss components (RPA-Block, MSFD P2 branch, SAL-NWD loss, 1280x1280 resolution) and reports measured mAP, recall, FPS, and parameter counts on the public VisDrone2019-DET dataset against a YOLOv8s baseline. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-referential definitions. All performance numbers are direct experimental outcomes on an external benchmark; the work contains no self-citation load-bearing claims, ansatz smuggling, or renaming of known results as novel derivations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 3 invented entities

The central claim rests on three newly introduced architectural and loss components plus standard YOLOv8 assumptions; no external benchmarks or formal proofs are supplied in the abstract.

free parameters (2)
  • 10-epoch warm-up period
    Duration chosen for lazy cosine-similarity updates in RPA-Block before pruning begins.
  • 1280x1280 input resolution
    Fixed higher resolution selected to retain spatial detail for sub-32 px objects.
axioms (1)
  • domain assumption YOLOv8 base architecture and TaskAligned label assignment remain valid after the added modules
    All modifications are described as integrated into the existing YOLOv8 pipeline.
invented entities (3)
  • RPA-Block no independent evidence
    purpose: Dynamic filter pruning via lazy cosine-similarity updates
    New pruning mechanism introduced to reduce redundancy while preserving accuracy.
  • MSFD no independent evidence
    purpose: Lightweight P2 detection branch at stride 4
    New branch added to detect objects smaller than the standard stride-8 grid.
  • SAL-NWD no independent evidence
    purpose: Hybrid loss of Normalized Wasserstein Distance plus size-adaptive CIoU
    New loss formulation to provide gradients for non-overlapping tiny boxes.

pith-pipeline@v0.9.0 · 5627 in / 1778 out tokens · 81281 ms · 2026-05-10T15:56:49.174917+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    (2019).VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results

    Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., & Ling, H. (2019).VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. Workshop Vision Meets Drone, ICCV 2019

  2. [2]

    (2025).SOD-YOLO: Enhancing YOLO-Based detection of small objects in UAV imagery

    Wang, P., & Zhao, J. (2025).SOD-YOLO: Enhancing YOLO-Based detection of small objects in UAV imagery. arXiv:2507.12727

  3. [3]

    (2025).Enhancing UAV object detection with an efficient multi-scale feature fusion framework

    Lai, D., Kang, K., Xu, K., Ma, X., Zhang, Y ., Huang, F., & Chen, J. (2025).Enhancing UAV object detection with an efficient multi-scale feature fusion framework. PLoS ONE, 20(10), e0332408

  4. [4]

    (2022).Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark

    Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., & Xia, G. (2022).Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 79–93

  5. [5]

    (2023).Ultralytics YOLOv8(Version 8.0.0) [Computer software].https://github.com/ultralytics/ultralytics

    Jocher, G., Chaurasia, A., & Qiu, J. (2023).Ultralytics YOLOv8(Version 8.0.0) [Computer software].https://github.com/ultralytics/ultralytics

  6. [6]

    (2025).YOLO-LE: A Lightweight and Effi- cient UAV Aerial Image Target Detection Model

    Chen, Z., Zhang, Y ., & Xing, S. (2025).YOLO-LE: A Lightweight and Effi- cient UAV Aerial Image Target Detection Model. Computers, Materials & Continua. DOI: 10.32604/cmc.2025.065238

  7. [7]

    Wan, Z. et al. (2025).DAU-YOLO: A Lightweight and Effective Method for Small Object Detection in UAV Images. Remote Sensing. DOI: 10.3390/rs17101768

  8. [8]

    (2025).Improved YOLO for long range detection of small drones

    Zhou, S., Yang, L., Liu, H., Zhou, C., Liu, J., Wang, Y ., Zhao, S., & Wang, K. (2025).Improved YOLO for long range detection of small drones. Scientific Reports, 15(1), 12280

  9. [9]

    Cheng, H., Zhang, M., & Shi, J. Q. (2024).A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. IEEE TPAMI. DOI: 10.1109/TPAMI.2024.3447085

  10. [10]

    S., & Elsen, E

    Evci, U., Gale, T., Menick, J., Castro, P. S., & Elsen, E. (2020).Rigging the Lottery: Making All Tickets Winners. arXiv:1911.11134. 11

  11. [11]

    Rezatofighi, H. et al. (2019).Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. CVPR 2019

  12. [12]

    (2019).The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Frankle, J., & Carlin, M. (2019).The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019

  13. [13]

    You only look once: Unified, real-time object detection

    Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016).You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016, pp. 779–788. arXiv:1506.02640

  14. [14]

    YOLOv3: An Incremental Improvement

    Redmon, J., & Farhadi, A. (2018).YOLOv3: An Incremental Improvement. arXiv:1804.02767

  15. [15]

    Girshick, Kaiming He, Bharath Hariharan, and Serge J

    Lin, T.-Y ., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017).Feature Pyramid Networks for Object Detection. CVPR 2017, pp. 936–944. arXiv:1612.03144

  16. [16]

    (2018).Path Aggregation Network for Instance Segmentation

    Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018).Path Aggregation Network for Instance Segmentation. CVPR 2018, pp. 8759–8768. arXiv:1803.01534

  17. [17]

    (2020).Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression

    Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020).Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI 2020. arXiv:1911.08287

  18. [18]

    (2018).Squeeze-and-Excitation Networks

    Hu, J., Shen, L., & Sun, G. (2018).Squeeze-and-Excitation Networks. CVPR 2018, pp. 7132–

  19. [19]

    Howard, A. G. et al. (2017).MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861

  20. [20]

    Girshick, and Jian Sun

    Ren, S., He, K., Girshick, R., & Sun, J. (2015).Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS 2015. arXiv:1506.01497

  21. [21]

    Decoupled Weight Decay Regularization

    Loshchilov, I., & Hutter, F. (2019).Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101. 12