arxiv: 2604.13278 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.LG· eess.IV

Recognition: 2 theorem links

· Lean Theorem

DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

Yann V. Bellec

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:56 UTC · model grok-4.3

classification 💻 cs.CV cs.LGeess.IV

keywords UAV object detectiontiny objectslightweight detectionYOLO architectureaerial imageryVisDrone datasetreal-time inference

0 comments

The pith

DroneScan-YOLO boosts tiny object detection in UAV images by 16 mAP points over standard YOLO while preserving real-time speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that YOLO detectors can be made effective for spotting very small objects in drone footage by raising the input resolution, adding a filter pruning step, including an extra detection head for the smallest scales, and changing the loss function to better handle non-overlapping boxes. This matters because drones often fly high enough that targets occupy only a few pixels, yet current models either miss them or run too slowly for practical use. If the approach works, it would let lightweight UAV systems identify bicycles, people, and vehicles from greater distances without needing more powerful onboard computers. The gains are shown mainly on one standard benchmark dataset for aerial detection.

Core claim

By integrating four specific changes into the YOLOv8 architecture—increasing input size to 1280 by 1280 pixels, inserting an RPA-Block that prunes redundant filters using cosine similarity, adding a lightweight MSFD branch for stride-4 detection, and applying a SAL-NWD loss that mixes Wasserstein distance with size-adaptive weighting—the model reaches 55.3 percent mAP at IoU 0.5 and 35.6 percent mAP at 0.5-0.95 on the VisDrone2019 detection test set. These figures exceed the plain YOLOv8 small baseline by 16.6 and 12.3 points, lift recall from 0.374 to 0.518, and require only 4.1 percent more parameters while running at 96.7 frames per second.

What carries the argument

The four coordinated design choices of higher input resolution, RPA-Block dynamic filter pruning, MSFD P2 detection branch at stride 4, and SAL-NWD hybrid loss function.

Load-bearing premise

The reported accuracy gains result from the four proposed components rather than simply using higher resolution or dataset-specific adjustments, and similar gains will appear on other UAV datasets and in varied real-world conditions.

What would settle it

Training and testing the same architecture on a second UAV detection dataset with different object distributions and measuring whether the mAP improvement over baseline shrinks or disappears.

Figures

Figures reproduced from arXiv: 2604.13278 by Yann V. Bellec.

**Figure 1.** Figure 1: Qualitative results on VisDrone2019-DET validation images. Ground truth annotations (left) vs. DroneScan-YOLO predictions with confidence scores (right). The model successfully detects high-density scenes including challenging small-scale targets such as bicycles, tricycles and pedestrians [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: VisDrone2019-DET training set statistics. Class distribution (top-left) showing the dominance of car instances and scarcity of small-object categories. Bounding box anchors (top-right). Spatial distribution of object centers (bottom-left), concentrated near image center due to UAV capture angle. Bounding box size distribution (bottom-right) confirming the prevalence of sub-32px objects (width and height < … view at source ↗

**Figure 3.** Figure 3: F1-Confidence curve for DroneScan-YOLO. The optimal confidence threshold of 0.265 yields macro-averaged F1 = 0.56. Per-class curves reveal the expected gap between large objects (car: F1≈0.84) and small-object categories (bicycle, awning-tricycle), consistent with their sub-32px size distribution. 4.4 Architecture Comparison [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Normalized confusion matrices. DroneScan-YOLO (left) vs. YOLOv8s baseline (right). The background row is substantially lighter for DroneScan, indicating a 40% reduction in pedestrian non-detection rate and improved recall across all small-object categories. (a) DroneScan-YOLO — mAP@50 = 0.550 (b) YOLOv8s baseline — mAP@50 = 0.387 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Precision-Recall curves. DroneScan-YOLO (left) consistently dominates the baseline (right) across all 10 VisDrone categories. Most pronounced gains on bicycle (+0.207) and awningtricycle (+0.086). 4.5 Per-class Results [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: DroneScan-YOLO training dynamics over 100 epochs at 1280×1280 px. Training and validation losses (box, cls, DFL) decrease monotonically. Validation mAP@50 rises from 0.15 to 0.553. The brief plateau near epoch 85 corresponds to a training interruption and resumption from the last checkpoint. 4.8 TTA Evaluation We evaluated Test-Time Augmentation (TTA), which performs inference on the original image and aug… view at source ↗

**Figure 7.** Figure 7: presents representative detection examples on VisDrone validation images, illustrating DroneScan-YOLO’s behaviour in diverse real-world scenes. (a) Residential parking scene (b) Urban intersection scene [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

read the original abstract

Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8's TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The mAP and recall gains on VisDrone look real but are hard to credit to the new modules because the paper scales input to 1280x1280 without showing what a plain YOLOv8 does at that same resolution.

read the letter

The main thing here is that DroneScan-YOLO reports clear lifts on VisDrone for tiny objects, but those lifts are difficult to attribute cleanly to RPA-Block, MSFD, or SAL-NWD because input resolution was raised at the same time as the other changes were added. Standard YOLOv8 baselines run at 640x640, and tiny-object performance is known to improve sharply with more pixels, so the +16.6 mAP@50 and recall jump from 0.374 to 0.518 could be driven mostly by that scaling plus the obvious addition of a stride-4 head. The abstract gives no ablation that holds resolution fixed at 1280 while turning the new pieces on and off, which leaves the central claim under-supported. If the full paper contains those controls, the picture changes; otherwise the numbers describe a combined system rather than isolated contributions. What the work does well is stay practical. The MSFD branch adds only 1.1% parameters, the pruning keeps inference near 97 FPS, and the hybrid loss targets the zero-gradient problem for non-overlapping boxes. Those are reasonable engineering choices for UAV constraints. The paper is aimed at people who need a drop-in detector for drone imagery in surveillance or agriculture. A reader already working on small-object losses or lightweight backbones might borrow the SAL-NWD weighting or the cosine-similarity pruning idea and test them separately. It deserves peer review because the empirical results are concrete and the problem matters, even if the current evidence does not yet separate the effects of resolution from the rest. A referee can ask for the missing ablations and check whether the gains hold on other UAV datasets.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DroneScan-YOLO, a modified YOLOv8s detector for tiny objects in UAV imagery. It proposes four coordinated changes: 1280x1280 input resolution, RPA-Block (dynamic filter pruning via lazy cosine-similarity with 10-epoch warm-up), MSFD (lightweight P2 stride-4 detection branch adding 114k parameters), and SAL-NWD (hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting inside TaskAligned assignment). On VisDrone2019-DET the model reports 55.3% mAP@50 and 35.6% mAP@50-95, +16.6 and +12.3 points over the YOLOv8s baseline, with recall rising from 0.374 to 0.518, 96.7 FPS, and only +4.1% parameters; gains are largest on tiny classes (e.g., bicycle AP@50 from 0.114 to 0.328).

Significance. If the reported gains are shown to arise from the three architectural/loss innovations rather than resolution scaling alone, the work supplies a practical, real-time UAV detector that directly targets the sub-32 px regime. The empirical numbers on a public benchmark and the emphasis on parameter/FPS efficiency would constitute a useful incremental contribution to aerial object detection.

major comments (3)

[Experiments section] Experiments section (and any ablation tables): the manuscript evaluates the unmodified YOLOv8s baseline only at its conventional 640x640 resolution. No control experiment holds resolution fixed at 1280x1280 while ablating RPA-Block, MSFD, and SAL-NWD. Because tiny-object mAP is known to scale strongly with input resolution and because MSFD explicitly adds the stride-4 head that most directly benefits sub-32 px objects, the central attribution of the +16.6 mAP@50 and +12.3 mAP@50-95 lifts to the four proposed components cannot be verified from the presented results.
[§3.2] §3.2 (MSFD description): the claim that the P2 branch adds only 114,592 parameters (+1.1%) is load-bearing for the “lightweight” assertion, yet the paper provides neither the exact channel configuration of the added head nor a parameter-count breakdown that isolates the contribution of the new detection layer from the rest of the network.
[§3.3] §3.3 (SAL-NWD): the hybrid loss is integrated into TaskAligned assignment, but the manuscript does not report the value or selection procedure for the size-adaptive weighting hyper-parameter, nor does it show an ablation that isolates NWD from the adaptive CIoU term. This leaves open whether the recall improvement (0.374 → 0.518) is driven by the loss or by the higher-resolution input.

minor comments (2)

[Abstract and §3.1] The 10-epoch warm-up period for RPA-Block is mentioned only in the abstract; its effect on convergence and final performance should be quantified or at least stated in the main text for reproducibility.
[Tables and figures] Table captions and axis labels in the experimental figures should explicitly indicate whether each row/curve uses 640x640 or 1280x1280 input so readers can immediately distinguish resolution effects from module effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive comments. We address each major point below, agreeing where additional experiments and details are needed, and will incorporate revisions accordingly.

read point-by-point responses

Referee: [Experiments section] Experiments section (and any ablation tables): the manuscript evaluates the unmodified YOLOv8s baseline only at its conventional 640x640 resolution. No control experiment holds resolution fixed at 1280x1280 while ablating RPA-Block, MSFD, and SAL-NWD. Because tiny-object mAP is known to scale strongly with input resolution and because MSFD explicitly adds the stride-4 head that most directly benefits sub-32 px objects, the central attribution of the +16.6 mAP@50 and +12.3 mAP@50-95 lifts to the four proposed components cannot be verified from the presented results.

Authors: We agree that a controlled ablation at fixed 1280x1280 resolution would strengthen the attribution of gains to the proposed components. The increased resolution is one of our four coordinated contributions, chosen specifically to address tiny objects, but we acknowledge the referee's point. In the revised manuscript, we will include additional experiments ablating RPA-Block, MSFD, and SAL-NWD while holding input resolution at 1280x1280, and also report the baseline YOLOv8s performance at 1280x1280 for direct comparison. This will allow readers to better assess the individual and combined effects. revision: yes
Referee: [§3.2] §3.2 (MSFD description): the claim that the P2 branch adds only 114,592 parameters (+1.1%) is load-bearing for the “lightweight” assertion, yet the paper provides neither the exact channel configuration of the added head nor a parameter-count breakdown that isolates the contribution of the new detection layer from the rest of the network.

Authors: We appreciate this observation. The MSFD module was designed to be lightweight, with the P2 branch using reduced channels (specifically, 64 channels for the detection head convolutions). We will revise §3.2 to include the exact channel configuration and add a supplementary table breaking down the parameter counts for each added component, confirming the +114,592 parameters for the new stride-4 head. revision: yes
Referee: [§3.3] §3.3 (SAL-NWD): the hybrid loss is integrated into TaskAligned assignment, but the manuscript does not report the value or selection procedure for the size-adaptive weighting hyper-parameter, nor does it show an ablation that isolates NWD from the adaptive CIoU term. This leaves open whether the recall improvement (0.374 → 0.518) is driven by the loss or by the higher-resolution input.

Authors: We agree that more details are warranted. The size-adaptive weighting hyper-parameter was set to 0.5 after validation on a held-out subset of VisDrone. We will report this value and the selection procedure in the revised §3.3. Additionally, we will add an ablation study in the experiments section comparing the full SAL-NWD loss against versions using only NWD and only the size-adaptive CIoU, all at the same resolution, to isolate their effects on recall and mAP. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model proposal with external benchmark results

full rationale

The paper proposes four architectural and loss components (RPA-Block, MSFD P2 branch, SAL-NWD loss, 1280x1280 resolution) and reports measured mAP, recall, FPS, and parameter counts on the public VisDrone2019-DET dataset against a YOLOv8s baseline. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-referential definitions. All performance numbers are direct experimental outcomes on an external benchmark; the work contains no self-citation load-bearing claims, ansatz smuggling, or renaming of known results as novel derivations. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 3 invented entities

The central claim rests on three newly introduced architectural and loss components plus standard YOLOv8 assumptions; no external benchmarks or formal proofs are supplied in the abstract.

free parameters (2)

10-epoch warm-up period
Duration chosen for lazy cosine-similarity updates in RPA-Block before pruning begins.
1280x1280 input resolution
Fixed higher resolution selected to retain spatial detail for sub-32 px objects.

axioms (1)

domain assumption YOLOv8 base architecture and TaskAligned label assignment remain valid after the added modules
All modifications are described as integrated into the existing YOLOv8 pipeline.

invented entities (3)

RPA-Block no independent evidence
purpose: Dynamic filter pruning via lazy cosine-similarity updates
New pruning mechanism introduced to reduce redundancy while preserving accuracy.
MSFD no independent evidence
purpose: Lightweight P2 detection branch at stride 4
New branch added to detect objects smaller than the standard stride-8 grid.
SAL-NWD no independent evidence
purpose: Hybrid loss of Normalized Wasserstein Distance plus size-adaptive CIoU
New loss formulation to provide gradients for non-overlapping tiny boxes.

pith-pipeline@v0.9.0 · 5627 in / 1778 out tokens · 81281 ms · 2026-05-10T15:56:49.174917+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RPA-Block... cosine similarity matrix Sij = wi·wj / ||wi||·||wj|| ... lazy updates every N=5 epochs
IndisputableMonolith/Foundation/ArithmeticFromLogic LogicNat unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SAL-NWD... Normalized Wasserstein Distance NWD(a,b)=exp(-sqrt(W2(a,b))/C) ... size-adaptive CIoU weighting

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 13 canonical work pages · 3 internal anchors

[1]

(2019).VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results

Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., & Ling, H. (2019).VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. Workshop Vision Meets Drone, ICCV 2019

2019
[2]

(2025).SOD-YOLO: Enhancing YOLO-Based detection of small objects in UAV imagery

Wang, P., & Zhao, J. (2025).SOD-YOLO: Enhancing YOLO-Based detection of small objects in UAV imagery. arXiv:2507.12727

work page arXiv 2025
[3]

(2025).Enhancing UAV object detection with an efficient multi-scale feature fusion framework

Lai, D., Kang, K., Xu, K., Ma, X., Zhang, Y ., Huang, F., & Chen, J. (2025).Enhancing UAV object detection with an efficient multi-scale feature fusion framework. PLoS ONE, 20(10), e0332408

2025
[4]

(2022).Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark

Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., & Xia, G. (2022).Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 79–93

2022
[5]

(2023).Ultralytics YOLOv8(Version 8.0.0) [Computer software].https://github.com/ultralytics/ultralytics

Jocher, G., Chaurasia, A., & Qiu, J. (2023).Ultralytics YOLOv8(Version 8.0.0) [Computer software].https://github.com/ultralytics/ultralytics

2023
[6]

(2025).YOLO-LE: A Lightweight and Effi- cient UAV Aerial Image Target Detection Model

Chen, Z., Zhang, Y ., & Xing, S. (2025).YOLO-LE: A Lightweight and Effi- cient UAV Aerial Image Target Detection Model. Computers, Materials & Continua. DOI: 10.32604/cmc.2025.065238

work page doi:10.32604/cmc.2025.065238 2025
[7]

Wan, Z. et al. (2025).DAU-YOLO: A Lightweight and Effective Method for Small Object Detection in UAV Images. Remote Sensing. DOI: 10.3390/rs17101768

work page doi:10.3390/rs17101768 2025
[8]

(2025).Improved YOLO for long range detection of small drones

Zhou, S., Yang, L., Liu, H., Zhou, C., Liu, J., Wang, Y ., Zhao, S., & Wang, K. (2025).Improved YOLO for long range detection of small drones. Scientific Reports, 15(1), 12280

2025
[9]

Cheng, H., Zhang, M., & Shi, J. Q. (2024).A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. IEEE TPAMI. DOI: 10.1109/TPAMI.2024.3447085

work page doi:10.1109/tpami.2024.3447085 2024
[10]

S., & Elsen, E

Evci, U., Gale, T., Menick, J., Castro, P. S., & Elsen, E. (2020).Rigging the Lottery: Making All Tickets Winners. arXiv:1911.11134. 11

work page arXiv 2020
[11]

Rezatofighi, H. et al. (2019).Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. CVPR 2019

2019
[12]

(2019).The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Frankle, J., & Carlin, M. (2019).The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019

2019
[13]

You only look once: Uniﬁed, real-time object detection

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016).You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016, pp. 779–788. arXiv:1506.02640

work page arXiv 2016
[14]

YOLOv3: An Incremental Improvement

Redmon, J., & Farhadi, A. (2018).YOLOv3: An Incremental Improvement. arXiv:1804.02767

work page internal anchor Pith review arXiv 2018
[15]

Girshick, Kaiming He, Bharath Hariharan, and Serge J

Lin, T.-Y ., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017).Feature Pyramid Networks for Object Detection. CVPR 2017, pp. 936–944. arXiv:1612.03144

work page arXiv 2017
[16]

(2018).Path Aggregation Network for Instance Segmentation

Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018).Path Aggregation Network for Instance Segmentation. CVPR 2018, pp. 8759–8768. arXiv:1803.01534

work page arXiv 2018
[17]

(2020).Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression

Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020).Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI 2020. arXiv:1911.08287

work page arXiv 2020
[18]

(2018).Squeeze-and-Excitation Networks

Hu, J., Shen, L., & Sun, G. (2018).Squeeze-and-Excitation Networks. CVPR 2018, pp. 7132–

2018
[19]

Howard, A. G. et al. (2017).MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861

work page internal anchor Pith review arXiv 2017
[20]

Girshick, and Jian Sun

Ren, S., He, K., Girshick, R., & Sun, J. (2015).Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS 2015. arXiv:1506.01497

work page arXiv 2015
[21]

Decoupled Weight Decay Regularization

Loshchilov, I., & Hutter, F. (2019).Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101. 12

work page internal anchor Pith review Pith/arXiv arXiv 2019