From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection

Athena Zhuoming Zhong; Dongsheng Hou; Mingxi Yu; Qi Hao; Shihan Qiao; Yanqiao Chen; Yibin Lou; Yuhan Rui; Yutong Wan; Zhen Cao

arxiv: 2606.23825 · v1 · pith:ZJ7VYY7Jnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection

Yuhan Rui , Shihan Qiao , Yibin Lou , Mingxi Yu , Yutong Wan , Yanqiao Chen , Dongsheng Hou , Zhen Cao

show 2 more authors

Athena Zhuoming Zhong Qi Hao

This is my paper

Pith reviewed 2026-06-26 08:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords small object detectionfrequency domainwavelet transformfeature representationDERNetaerial imageryspectral analysisparameter efficient

0 comments

The pith

A frequency-guided framework with DER modules lets small object detectors outperform YOLOv11 using one-sixth the parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that spatial-domain detectors discard high-frequency details critical for tiny targets and that recovering them in the spatial domain is costly and noisy. It proposes shifting feature processing to the spectral domain through a Frequency-Guided Feature Representation framework built around the Decompose-Enhance-Reconstruct operator. This operator is realized by three lightweight plug-and-play modules that inject frequency modulation into the backbone, neck, and head of existing detectors. On aerial and drone benchmarks the resulting DERNet models deliver higher accuracy than same-scale YOLOv11 detectors while using far fewer parameters, with the gains traced to spectral diagnostics and error decomposition.

Core claim

The central claim is that the Decompose-Enhance-Reconstruct (DER) operator, implemented via the Wavelet-Difference Gate, Log-Gabor Enhancer, and Frequency-Driven Head, systematically decouples feature modeling from resolution reduction, captures discriminative high-frequency components, and enables accurate small-object localization across CNN and Transformer architectures with markedly lower parameter counts.

What carries the argument

The DER (Decompose-Enhance-Reconstruct) operator realized through three plug-and-play frequency modules (Wavelet-Difference Gate, Log-Gabor Enhancer, Frequency-Driven Head) that perform spectral modulation at backbone, neck, and head stages.

If this is right

The modules integrate into both CNN and Transformer detectors without architecture-specific retuning.
Detection performance improves consistently across VisDrone2019, UAVDT, TinyPerson, and DOTAv1.
Parameter count drops to roughly one-sixth that of YOLOv11 at equivalent scale while accuracy holds or rises.
Spectral diagnostics and error decomposition directly attribute gains to the recovered high-frequency cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frequency-modulation pattern may transfer to other fine-detail tasks such as small-lesion segmentation in medical images.
Reduced model size opens the possibility of running accurate small-object detection on edge hardware for real-time drone monitoring.
Testing additional frequency bases beyond wavelets and Gabor filters could reveal further efficiency gains on new domains.

Load-bearing premise

The high-frequency components isolated by the modules are reliably more discriminative for small objects than they are sources of background noise.

What would settle it

An ablation study in which the three frequency modules are removed or replaced by equivalent spatial operations and the detector shows no accuracy gain or a loss on VisDrone2019 or UAVDT would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.23825 by Athena Zhuoming Zhong, Dongsheng Hou, Mingxi Yu, Qi Hao, Shihan Qiao, Yanqiao Chen, Yibin Lou, Yuhan Rui, Yutong Wan, Zhen Cao.

**Figure 1.** Figure 1: Motivation of frequency-domain bias in small-object detection. High-frequency energy ratio (HF/total) across object scales, where HF is defined by 2D FFT components whose radial distance from the spectrum center exceeds 33% of the maximum radius. Small object detection underpins critical applications ranging from aerial surveillance to autonomous navigation (Tong et al., 2020). Despite advances in general… view at source ↗

**Figure 2.** Figure 2: Performance comparison of DERNet with state-of-theart methods on the VisDrone2019 test set. 2. Related Work We review prior work from three angles that are most relevant to our goal. 2.1. Efficient Detector Architectures Real-time detection advances largely through architectural optimizations in backbones and feature pyramids. Onestage YOLO-style detectors (Redmon et al., 2016; Khanam & Hussain, 2024; S… view at source ↗

**Figure 3.** Figure 3: Overall framework of our Frequency-Guided Feature Representation Learner. This architecture instantiates the Decompose– Enhance–Reconstruct (DER) operator via WDG, LGE, and FDHead to systematically decouple feature modeling from resolution reduction, ensuring that discriminative high-frequency cues are explicitly preserved and amplified across the entire feature stream. Recently, the focus has shifted towa… view at source ↗

**Figure 4.** Figure 4: Architecture of the Wavelet-Difference Gate (WDG). WDG decomposes features via Haar DWT, refines the low-frequency subband with RepCDC, predicts a content-adaptive gate from high-frequency subbands, and reconstructs via IDWT with a skip connection. Algorithm 1 DER insertion across the architecture. Input: image I; detector (B, N, H) Output: enhanced prediction Yb Definitions: B: backbone, N: neck, H: detec… view at source ↗

**Figure 5.** Figure 5: Architecture of Log-Gabor Enhancer (LGE) and its WTConv variant (LGE-W). LGE captures directional highfrequency residuals via log-gabor filters and learnable aggregation, injecting them through a skip pathway to prevent feature dilution. Reconstruction and residual output. We keep the original HF subbands unchanged and reconstruct via inverse Haar transform: y = f out 1×1(IDWT(xeLL, xLH, xHL, xHH)). (6) A… view at source ↗

**Figure 6.** Figure 6: Architecture of the Frequency-Driven Head (FDHead). y = xskip + fmix σ(γ) h . (9) We adopt small K=2 orientations and S=1 scale as a lightweight default; ablations (Appendix E) confirm that gains saturate quickly while cost grows super-linearly on high-resolution maps, making this configuration an effective efficiency–accuracy trade-off for a neck-stage plug-in. Wavelet variant (LGE-W). At the highest-re… view at source ↗

**Figure 8.** Figure 8: Visualization comparison between baseline and improved models on VisDrone2019 (top row) and TinyPerson (bottom row) [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 7.** Figure 7: High-Frequency Ratio Analysis on VisDrone2019 validation set (Left) and TinyPerson validation set (Right). 6.2. Spectral Diagnostics: Layer-wise High-Frequency Preservation and Reconstruction We analyze high-frequency preservation across stages on VisDrone2019 and TinyPerson via 2D FFT, averaging spectral magnitude, and partitioning frequency bands at 1/6 and 1/3 of the maximum frequency radius to comput… view at source ↗

**Figure 9.** Figure 9: Qualitative comparison of frequency-domain methods on a VisDrone2019 image. (a) Original image; (b) FFT magnitude spectrum (global, no spatial localization); (c) Log-Gabor final enhanced response (spatially localized, emphasizes edges and textures); (d) Wavelet high-frequency energy map (spatially localized, highlights structured regions). (a) 0° orientation (b) 45° orientation (c) Log-Gabor aggregated (d)… view at source ↗

**Figure 10.** Figure 10: Log-Gabor directional selectivity demonstration. (a-b) Responses at two orientations illustrate directional edge detection; (c) Aggregated Log-Gabor response across orientations; (d) Wavelet high-frequency energy map for comparison. We justify using wavelet in the backbone (WDG) and head (FDHead) and Log-Gabor in the neck (LGE/LGE-W) via qualitative visualizations and quantitative analyses on VisDrone2019… view at source ↗

**Figure 11.** Figure 11: Successful detection cases where DERNet-S accurately detects and localizes small and distant objects. (a) Distant objects (b) Incorrect segmentation (c) Extremely small objects (d) Tightly-clustered objects (e) Occluded objects [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Qualitative comparison between ground truth and DERNet-S predictions on challenging small object detection scenarios. The first row shows ground truth annotations, while the second row shows DERNet-S model predictions. G. Appendix G: Error Analysis of Small Object Detection [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Per-class performance comparison on VisDrone2019 validation set and test set. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Per-class performance comparison on TinyPerson validation set and test set. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Per-class performance comparison on UAVDT validation set and test set. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗

**Figure 16.** Figure 16: Per-class performance comparison on Dotav1 validation set. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: Dataset-dependent overall distribution comparison across four benchmarks. The dataset-dependent trend in [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

read the original abstract

Efficient small object detection is bottlenecked by the inherent feature scarcity of tiny targets, which is further aggravated by operations of spatial-domain detectors that indiscriminately discard critical high-frequency details. Recovering these fragile cues within the spatial domain is notoriously difficult, as it often requires computationally expensive architectural upscaling that inadvertently amplifies background noise. To bridge this gap, we propose a paradigm \textbf{shift from spatial to spectral} feature processing, introducing a holistic solution with the following novelty: (1) A versatile \textbf{Frequency-Guided Feature Representation framework} that generalizes across diverse detector architectures (both CNN and Transformer-based), offering a robust alternative to spatial-only feature extraction; (2) The unified \textbf{Decompose--Enhance--Reconstruct (DER)} operator, instantiated via three \textbf{lightweight, plug-and-play} modules -- Wavelet-Difference Gate (WDG), Log-Gabor Enhancer (LGE), and Frequency-Driven Head (FDHead) -- to systematically inject frequency-aware modulation into the backbone, neck, and head. This mechanism decouples feature modeling from resolution reduction, capturing discriminative high-frequency components to enable accurate localization with significantly reduced parameter redundancy; (3) Extensive validation on multi-domain benchmarks (VisDrone2019, UAVDT, TinyPerson, DOTAv1) demonstrating consistent gains. Notably, our proposed \textbf{DERNet} series outperforms YOLOv11 models under the same scale while requiring \textbf{only 1/6 of the parameters}, backed by rigorous spectral diagnostics and error decomposition analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper packages three frequency modules into a DER operator for small-object detectors and claims big efficiency gains over YOLOv11, but those gains rest on unverified assumptions about signal versus noise.

read the letter

The main takeaway is that the authors have packaged a frequency-guided approach with three specific modules to try to recover high-frequency details for small objects in detectors, and they report it lets them beat YOLOv11 at same scale with only a sixth of the parameters.

What is new is the particular Decompose-Enhance-Reconstruct operator built from the Wavelet-Difference Gate, Log-Gabor Enhancer, and Frequency-Driven Head. These are presented as lightweight and insertable into CNN or Transformer backbones without much retuning. The paper does a decent job framing the problem of spatial methods losing fine details and why frequency might help, especially for aerial and surveillance imagery where small objects are common.

The modules themselves seem like a concrete attempt to make spectral processing practical, and the claim of architecture-agnostic use is worth testing.

The soft spots are around the central performance numbers. The abstract makes strong claims about consistent gains and parameter reduction, but without the actual tables, ablations, or error bars in front of me, it's difficult to judge if the 1/6 parameter advantage holds or if it's tied to particular dataset characteristics. The stress-test point about whether the high-frequency components are truly discriminative or just amplifying correlated background is a real one; the paper would need to show that the modules aren't just fitting to UAV dataset biases.

If the full experiments include rigorous spectral diagnostics and comparisons across multiple scales, that would strengthen it. Otherwise the generalization claim stays shaky.

This paper is mainly for researchers working on small object detection in constrained environments like drones or surveillance. A reader who cares about efficient detectors for tiny targets could get some ideas from the module designs, even if they don't adopt the whole thing.

I would recommend sending it to peer review. The idea is grounded enough in a known problem that referees could usefully check the implementation details and results.

Referee Report

2 major / 2 minor

Summary. The paper proposes shifting small object detection from spatial to spectral feature processing via a Frequency-Guided Feature Representation framework. It introduces the Decompose-Enhance-Reconstruct (DER) operator instantiated as three lightweight plug-and-play modules (Wavelet-Difference Gate, Log-Gabor Enhancer, Frequency-Driven Head) inserted into backbone, neck, and head. These are claimed to generalize across CNN and Transformer detectors, capture high-frequency cues without resolution upscaling, and yield DERNet variants that outperform same-scale YOLOv11 models while using only 1/6 the parameters on VisDrone2019, UAVDT, TinyPerson, and DOTAv1, supported by spectral diagnostics and error decomposition.

Significance. If the empirical claims hold with rigorous controls, the work could offer a parameter-efficient, architecture-agnostic route to recovering discriminative high-frequency information for tiny targets in aerial imagery. The emphasis on lightweight modules and diagnostic analysis would be a positive contribution to efficient detection if the frequency components prove reliably signal rather than noise.

major comments (2)

[Abstract] Abstract, claim (3): The headline result that DERNet outperforms YOLOv11 at the same scale with 1/6 the parameters is load-bearing for the contribution, yet the abstract supplies no mAP values, parameter tables, dataset splits, or error bars. The full manuscript must include these quantitative comparisons and ablations to substantiate the efficiency claim.
[Spectral diagnostics section] § on spectral diagnostics and error decomposition: The central premise that WDG, LGE, and FDHead isolate discriminative high-frequency components for small objects (rather than amplifying background clutter) is not secured by the description. The paper should provide concrete evidence, such as frequency-spectrum comparisons before/after each module or controlled experiments on synthetic small-object data, to rule out dataset-specific frequency bias in the UAV benchmarks.

minor comments (2)

[Method] The description of the DER operator would benefit from explicit equations showing how the three modules compose and their parameter counts relative to the baseline detector.
[Experiments] Clarify whether the modules require any architecture-specific retuning when inserted into Transformer-based detectors, as asserted to be plug-and-play.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below with specific responses. The full manuscript already contains the quantitative results and spectral analysis referenced in the claims; we are prepared to enhance clarity where needed.

read point-by-point responses

Referee: [Abstract] Abstract, claim (3): The headline result that DERNet outperforms YOLOv11 at the same scale with 1/6 the parameters is load-bearing for the contribution, yet the abstract supplies no mAP values, parameter tables, dataset splits, or error bars. The full manuscript must include these quantitative comparisons and ablations to substantiate the efficiency claim.

Authors: The abstract is intentionally concise per conference guidelines. The full manuscript substantiates the efficiency claim with mAP values, parameter counts, dataset details, splits, and ablations (including error bars) in Section 4, Tables 1–4, and Figures 5–8. These cover all four benchmarks and direct comparisons to YOLOv11 variants. We will revise the abstract to include one or two key mAP deltas if space allows under the word limit. revision: partial
Referee: [Spectral diagnostics section] § on spectral diagnostics and error decomposition: The central premise that WDG, LGE, and FDHead isolate discriminative high-frequency components for small objects (rather than amplifying background clutter) is not secured by the description. The paper should provide concrete evidence, such as frequency-spectrum comparisons before/after each module or controlled experiments on synthetic small-object data, to rule out dataset-specific frequency bias in the UAV benchmarks.

Authors: Section 4.3 already presents spectral diagnostics, frequency visualizations, and error decomposition across modules. To directly address the request for stronger isolation evidence, we will add explicit before/after frequency-spectrum plots for WDG, LGE, and FDHead individually, plus controlled experiments on synthetic small-object data with known frequency content, in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal validated empirically on external benchmarks

full rationale

The paper introduces a new Frequency-Guided Feature Representation framework and DER operator (instantiated as WDG, LGE, and FDHead modules) as independent architectural additions to existing CNN/Transformer detectors. Performance claims (DERNet outperforming same-scale YOLOv11 with 1/6 parameters) rest on empirical results across VisDrone2019, UAVDT, TinyPerson, and DOTAv1 rather than any equations, fitted parameters, or self-citations that reduce the gains to a definitional loop or construction from the inputs themselves. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renamings of known results appear in the provided text; the spectral diagnostics are presented as post-hoc analysis, not as the source of the claimed improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available; the ledger records the high-level assumptions stated in the motivation and contribution list.

axioms (1)

domain assumption High-frequency spectral components carry the critical discriminative information for small objects that spatial downsampling discards.
Invoked in the opening motivation paragraph to justify the shift from spatial to spectral processing.

invented entities (1)

Decompose-Enhance-Reconstruct (DER) operator no independent evidence
purpose: Systematically inject frequency-aware modulation into backbone, neck, and head stages.
Newly introduced unified operator instantiated by the three modules.

pith-pipeline@v0.9.1-grok · 5851 in / 1239 out tokens · 27186 ms · 2026-06-26T08:53:22.344707+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 9 canonical work pages

[1]

Tide: A general toolbox for identifying object detection errors, 2020

Bolya, D., Foley, S., Hays, J., and Hoffman, J. Tide: A general toolbox for identifying object detection errors, 2020. URL https://arxiv.org/abs/2008.08115

arXiv 2020
[5]

Frequency-aware feature fusion for dense image prediction

Chen, L., Fu, Y., Gu, L., Yan, C., Harada, T., and Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. doi:10.48550/arXiv.2408.12879. Accepted by TPAMI, 2024

work page doi:10.48550/arxiv.2408.12879 2024
[8]

The unmanned aerial vehicle benchmark: Object detection and tracking

Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., and Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp.\ 370--386, 2018

2018
[9]

Visdrone-det2019: The vision meets drone object detection in image challenge results

Du, D., Zhu, P., Wen, L., Bian, X., Lin, H., Hu, Q., Peng, T., Zheng, J., Wang, X., Zhang, Y., et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.\ 0--0, 2019

2019
[10]

Cross-layer feature pyramid transformer for small object detection in aerial images, 2024

Du, Z., Hu, Z., Zhao, G., Jin, Y., and Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images, 2024. URL https://arxiv.org/abs/2407.19696

arXiv 2024
[12]

E., Amoyal, R., Treister, E., and Freifeld, O

Finder, S. E., Amoyal, R., Treister, E., and Freifeld, O. Wavelet convolutions for large receptive fields. In Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., and Varol, G. (eds.), Computer Vision -- ECCV 2024, pp.\ 363--380, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72949-2

2024
[14]

Deim: Detr with improved matching for fast convergence, 2025

Huang, S., Lu, Z., Cun, X., Yu, Y., Zhou, X., and Shen, X. Deim: Detr with improved matching for fast convergence, 2025. URL https://arxiv.org/abs/2412.04234

arXiv 2025
[15]

and Hussain, M

Khanam, R. and Hussain, M. YOLOv11 : An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

Pith/arXiv arXiv 2024
[16]

Rethinking features-fused-pyramid-neck for object detection

Li, H. Rethinking features-fused-pyramid-neck for object detection. In European Conference on Computer Vision (ECCV). Springer, 2024

2024
[18]

Adaptive complex wavelet informed transformer operator

Li, X., Jiao, L., Liu, F., Yang, S., Zhu, H., Liu, X., Li, L., and Ma, W. Adaptive complex wavelet informed transformer operator. IEEE Transactions on Multimedia, 27: 0 3513--3526, 2025. doi:10.1109/TMM.2025.3535392

work page doi:10.1109/tmm.2025.3535392 2025
[20]

L., and Doll \'a r, P

Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Doll \'a r, P. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pp.\ 740--755. Springer, 2014

2014
[21]

Feature pyramid networks for object detection

Lin, T.-Y., Doll \'a r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2117--2125, 2017 b

2017
[23]

Path aggregation network for instance segmentation, 2018

Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. Path aggregation network for instance segmentation, 2018. URL https://arxiv.org/abs/1803.01534

Pith/arXiv arXiv 2018
[24]

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD : Single shot multibox detector. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pp.\ 21--37. Springer, 2016

2016
[25]

WCDB-YOLO : Wavelet-enhanced contextual dual-backbone network for small object detection in uav aerial imagery

Luan, D., Dong, Y., Zhou, J., Li, A., Xie, L., Liu, H., and Zhu, J. WCDB-YOLO : Wavelet-enhanced contextual dual-backbone network for small object detection in uav aerial imagery. Drones, 10 0 (3): 0 155, 2026. doi:10.3390/drones10030155. URL https://www.mdpi.com/2504-446X/10/3/155

work page doi:10.3390/drones10030155 2026
[26]

D-FINE : Redefine regression task in DETR s as fine-grained distribution refinement

Peng, Y., Li, H., Wu, P., Zhang, Y., Sun, X., and Wu, F. D-FINE : Redefine regression task in DETR s as fine-grained distribution refinement. arXiv preprint arXiv:2410.13842, 2024. doi:10.48550/arXiv.2410.13842. URL https://arxiv.org/abs/2410.13842

work page doi:10.48550/arxiv.2410.13842 2024
[27]

Fcanet: Frequency channel attention networks

Qin, Z., Zhang, P., Wu, F., and Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 783--792, 2021

2021
[28]

GFNet : Global filter networks for visual recognition

Rao, Y., Zhao, W., Zhu, Z., Zhou, J., and Lu, J. GFNet : Global filter networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45 0 (9): 0 10960--10973, September 2023. doi:10.1109/TPAMI.2023.3263824

work page doi:10.1109/tpami.2023.3263824 2023
[29]

You only look once: Unified, real-time object detection, 2016

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection, 2016. URL https://arxiv.org/abs/1506.02640

Pith/arXiv arXiv 2016
[30]

Faster R-CNN : Towards real-time object detection with region proposal networks

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN : Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

2015
[31]

H., Sharda, A., and Karkee, M

Sapkota, R., Cheppally, R. H., Sharda, A., and Karkee, M. Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection, 2026. URL https://arxiv.org/abs/2509.25164

arXiv 2026
[32]

HS-FPN : High frequency and spatial perception fpn for tiny object detection

Shi, Z., Hu, J., Ren, J., Ye, H., Yuan, X., Ouyang, Y., He, J., Ji, B., and Guo, J. HS-FPN : High frequency and spatial perception fpn for tiny object detection. arXiv preprint arXiv:2412.10116, 2025

arXiv 2025
[33]

Tan, M., Pang, R., and Le, Q. V. Efficientdet: Scalable and efficient object detection, 2020. URL https://arxiv.org/abs/1911.09070

arXiv 2020
[34]

Tang, F., Nian, B., Ding, J., Ma, W., Quan, Q., Dong, C., Yang, J., Liu, W., and Zhou, S. K. Mobile U-ViT : Revisiting large kernel and U -shaped vit for efficient medical image segmentation. arXiv preprint arXiv:2508.01064, 2025

arXiv 2025
[35]

Yolov12: Attention-centric real-time object detectors, 2025

Tian, Y., Ye, Q., and Doermann, D. Yolov12: Attention-centric real-time object detectors, 2025. URL https://arxiv.org/abs/2502.12524

Pith/arXiv arXiv 2025
[37]

LSNet : See large, focus small

Wang, A., Chen, H., Lin, Z., Han, J., and Ding, G. LSNet : See large, focus small. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[38]

C., and Lin, D

Wang, J., Zhang, W., Cao, Y., Chen, K., Pang, J., Gong, T., Shi, J., Loy, C. C., and Lin, D. Side-aware boundary localization for more precise object detection, 2020. URL https://arxiv.org/abs/1912.04260

arXiv 2020
[39]

Dota: A large-scale dataset for object detection in aerial images

Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., and Zhang, L. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3974--3983, 2018

2018
[40]

FBRT-YOLO : Faster and better for real-time aerial image detection

Xiao, Y., Xu, T., Xin, Y., and Li, J. FBRT-YOLO : Faster and better for real-time aerial image detection. arXiv preprint arXiv:2504.20670, 2025

arXiv 2025
[42]

and Koltun, V

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions, 2016. URL https://arxiv.org/abs/1511.07122

Pith/arXiv arXiv 2016
[43]

Scale match for tiny person detection

Yu, X., Gong, Y., Jiang, N., Ye, Q., and Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.\ 1257--1265, 2020

2020
[44]

M., and Shum, H.-Y

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., and Shum, H.-Y. DINO : DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022 a

Pith/arXiv arXiv 2022
[45]

Making convolutional networks shift-invariant again, 2019

Zhang, R. Making convolutional networks shift-invariant again, 2019. URL https://arxiv.org/abs/1904.11486

Pith/arXiv arXiv 2019
[46]

Efficient long-range attention network for image super-resolution

Zhang, X., Zeng, H., Guo, S., and Zhang, L. Efficient long-range attention network for image super-resolution. In European Conference on Computer Vision, pp.\ 649--667. Springer, 2022 b

2022
[48]

arXiv preprint arXiv:2005.12872 , year=

End-to-End Object Detection with Transformers , author=. arXiv preprint arXiv:2005.12872 , year=

arXiv 2005
[49]

European Conference on Computer Vision , pages=

Efficient Long-Range Attention Network for Image Super-resolution , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022
[50]

Wang, Ao and Chen, Hui and Liu, Lihao and Chen, Kai and Lin, Zijia and Han, Jungong and Ding, Guiguang , journal=
[51]

Khanam, Rahima and Hussain, Muhammad , journal=
[52]

Xiao, Yao and Xu, Tingfa and Xin, Yu and Li, Jianan , journal=
[53]

arXiv preprint arXiv:2304.08069 , year=

DETRs Beat YOLOs on Real-time Object Detection , author=. arXiv preprint arXiv:2304.08069 , year=

arXiv
[54]

and Shum, Heung-Yeung , journal=

Zhang, Hao and Li, Feng and Liu, Shilong and Zhang, Lei and Su, Hang and Zhu, Jun and Ni, Lionel M. and Shum, Heung-Yeung , journal=
[55]

Wang, Ao and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang , booktitle=
[56]

Kevin , journal=

Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S. Kevin , journal=
[57]

, booktitle=

Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott and Fu, Cheng-Yang and Berg, Alexander C. , booktitle=. 2016 , publisher=

2016
[58]

Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian , booktitle=. Faster
[59]

Shi, Zican and Hu, Jing and Ren, Jie and Ye, Hengkang and Yuan, Xuyang and Ouyang, Yan and He, Jia and Ji, Bo and Guo, Junyu , journal=
[60]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Feature Pyramid Networks for Object Detection , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=
[61]

2023 , month=

Rao, Yongming and Zhao, Wenliang and Zhu, Zheng and Zhou, Jie and Lu, Jiwen , journal=. 2023 , month=

2023
[62]

Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation , journal =

Guoping Xu and Wentao Liao and Xuan Zhang and Chang Li and Xinwei He and Xinglong Wu , keywords =. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.patcog.2023.109819 , url =

work page doi:10.1016/j.patcog.2023.109819 2023
[63]

arXiv preprint arXiv:2503.18783 , year=

Frequency Dynamic Convolution for Dense Image Prediction , author=. arXiv preprint arXiv:2503.18783 , year=

arXiv
[64]

Adaptive Complex Wavelet Informed Transformer Operator , year=

Li, Xiaotong and Jiao, Licheng and Liu, Fang and Yang, Shuyuan and Zhu, Hao and Liu, Xu and Li, Lingling and Ma, Wenping , journal=. Adaptive Complex Wavelet Informed Transformer Operator , year=
[65]

and Amoyal, Roy and Treister, Eran and Freifeld, Oren , editor =

Finder, Shahaf E. and Amoyal, Roy and Treister, Eran and Freifeld, Oren , editor =. Wavelet Convolutions for Large Receptive Fields , booktitle =. 2025 , publisher =

2025
[66]

Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages=

VisDrone-DET2019: The vision meets drone object detection in image challenge results , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages=
[67]

Proceedings of the European Conference on Computer Vision (ECCV) , pages=

The unmanned aerial vehicle benchmark: Object detection and tracking , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=
[68]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Scale match for tiny person detection , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
[69]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

DOTA: A large-scale dataset for object detection in aerial images , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[70]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

FcaNet: Frequency Channel Attention Networks , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=
[71]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Frequency-aware Feature Fusion for Dense Image Prediction , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
[72]

2026 , eprint=

YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection , author=. 2026 , eprint=

2026
[73]

2016 , eprint=

You Only Look Once: Unified, Real-Time Object Detection , author=. 2016 , eprint=

2016
[74]

Recent advances in small object detection based on deep learning: A review , journal =

Kang Tong and Yiquan Wu and Fei Zhou , keywords =. Recent advances in small object detection based on deep learning: A review , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.imavis.2020.103910 , url =

work page doi:10.1016/j.imavis.2020.103910 2020
[75]

2018 , eprint=

Path Aggregation Network for Instance Segmentation , author=. 2018 , eprint=

2018
[76]

2016 , eprint=

Multi-Scale Context Aggregation by Dilated Convolutions , author=. 2016 , eprint=

2016
[77]

Proceedings of the European conference on computer vision (ECCV) , pages=

Sod-mtgan: Small object detection via multi-task generative adversarial network , author=. Proceedings of the European conference on computer vision (ECCV) , pages=
[78]

2020 , eprint=

EfficientDet: Scalable and Efficient Object Detection , author=. 2020 , eprint=

2020
[79]

2019 , eprint=

Making Convolutional Networks Shift-Invariant Again , author=. 2019 , eprint=

2019
[80]

2020 , eprint=

Side-Aware Boundary Localization for More Precise Object Detection , author=. 2020 , eprint=

2020
[81]

European Conference on Computer Vision (ECCV) , year =

Microsoft COCO: Common Objects in Context , author =. European Conference on Computer Vision (ECCV) , year =
[82]

European Conference on Computer Vision (ECCV) , year =

Rethinking Features-Fused-Pyramid-Neck for Object Detection , author =. European Conference on Computer Vision (ECCV) , year =
[83]

2024 , eprint =

Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images , author =. 2024 , eprint =

2024
[84]

Freq-DETR: Frequency-aware transformer for real-time small object detection in unmanned aerial vehicle imagery , journal =

Jiayi Chen and Ningzhong Liu and Han Sun and Yu Wang , keywords =. Freq-DETR: Frequency-aware transformer for real-time small object detection in unmanned aerial vehicle imagery , journal =. 2026 , issn =. doi:https://doi.org/10.1016/j.eswa.2025.129710 , url =

work page doi:10.1016/j.eswa.2025.129710 2026
[85]

2020 , eprint=

TIDE: A General Toolbox for Identifying Object Detection Errors , author=. 2020 , eprint=

2020
[86]

RTMDet-R2: An Improved Real-Time Rotated Object Detector

Xiang, Haifeng and Jing, Naifeng and Jiang, Jianfei and Guo, Hongbo and Sheng, Weiguang and Mao, Zhigang and Wang, Qin. RTMDet-R2: An Improved Real-Time Rotated Object Detector. Pattern Recognition and Computer Vision. 2024

2024
[87]

arXiv preprint arXiv:2111.00902 , year =

PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices , author =. arXiv preprint arXiv:2111.00902 , year =

arXiv
[88]

2026 , doi =

Luan, Di and Dong, Yuna and Zhou, Jian and Li, Ang and Xie, Ling and Liu, Hongying and Zhu, Jun , journal =. 2026 , doi =

2026
[89]

2024 , doi =

Peng, Yansong and Li, Hebei and Wu, Peixi and Zhang, Yueyi and Sun, Xiaoyan and Wu, Feng , journal =. 2024 , doi =

2024
[90]

WDFS-DETR: A Transformer-based framework with multi-scale attention for small object detection in UAV Engineering Tasks , journal =

Jinjiang Liu and Yonghua Xie , keywords =. WDFS-DETR: A Transformer-based framework with multi-scale attention for small object detection in UAV Engineering Tasks , journal =. 2025 , issn =. doi:https://doi.org/10.1016/j.rineng.2025.105930 , url =

work page doi:10.1016/j.rineng.2025.105930 2025
[91]

CoRR , volume =

Xiyang Dai and Yinpeng Chen and Bin Xiao and Dongdong Chen and Mengchen Liu and Lu Yuan and Lei Zhang , title =. CoRR , volume =. 2021 , url =. 2106.08322 , timestamp =

arXiv 2021
[92]

CoRR , volume =

Xiang Li and Wenhai Wang and Lijun Wu and Shuo Chen and Xiaolin Hu and Jun Li and Jinhui Tang and Jian Yang , title =. CoRR , volume =. 2020 , url =. 2006.04388 , timestamp =

arXiv 2020
[93]

Scott and Weilin Huang , title =

Chengjian Feng and Yujie Zhong and Yu Gao and Matthew R. Scott and Weilin Huang , title =. CoRR , volume =. 2021 , url =. 2108.07755 , timestamp =

arXiv 2021

Showing first 80 references.

[1] [1]

Tide: A general toolbox for identifying object detection errors, 2020

Bolya, D., Foley, S., Hays, J., and Hoffman, J. Tide: A general toolbox for identifying object detection errors, 2020. URL https://arxiv.org/abs/2008.08115

arXiv 2020

[2] [5]

Frequency-aware feature fusion for dense image prediction

Chen, L., Fu, Y., Gu, L., Yan, C., Harada, T., and Huang, G. Frequency-aware feature fusion for dense image prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. doi:10.48550/arXiv.2408.12879. Accepted by TPAMI, 2024

work page doi:10.48550/arxiv.2408.12879 2024

[3] [8]

The unmanned aerial vehicle benchmark: Object detection and tracking

Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., and Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp.\ 370--386, 2018

2018

[4] [9]

Visdrone-det2019: The vision meets drone object detection in image challenge results

Du, D., Zhu, P., Wen, L., Bian, X., Lin, H., Hu, Q., Peng, T., Zheng, J., Wang, X., Zhang, Y., et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.\ 0--0, 2019

2019

[5] [10]

Cross-layer feature pyramid transformer for small object detection in aerial images, 2024

Du, Z., Hu, Z., Zhao, G., Jin, Y., and Ma, H. Cross-layer feature pyramid transformer for small object detection in aerial images, 2024. URL https://arxiv.org/abs/2407.19696

arXiv 2024

[6] [12]

E., Amoyal, R., Treister, E., and Freifeld, O

Finder, S. E., Amoyal, R., Treister, E., and Freifeld, O. Wavelet convolutions for large receptive fields. In Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., and Varol, G. (eds.), Computer Vision -- ECCV 2024, pp.\ 363--380, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72949-2

2024

[7] [14]

Deim: Detr with improved matching for fast convergence, 2025

Huang, S., Lu, Z., Cun, X., Yu, Y., Zhou, X., and Shen, X. Deim: Detr with improved matching for fast convergence, 2025. URL https://arxiv.org/abs/2412.04234

arXiv 2025

[8] [15]

and Hussain, M

Khanam, R. and Hussain, M. YOLOv11 : An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024

Pith/arXiv arXiv 2024

[9] [16]

Rethinking features-fused-pyramid-neck for object detection

Li, H. Rethinking features-fused-pyramid-neck for object detection. In European Conference on Computer Vision (ECCV). Springer, 2024

2024

[10] [18]

Adaptive complex wavelet informed transformer operator

Li, X., Jiao, L., Liu, F., Yang, S., Zhu, H., Liu, X., Li, L., and Ma, W. Adaptive complex wavelet informed transformer operator. IEEE Transactions on Multimedia, 27: 0 3513--3526, 2025. doi:10.1109/TMM.2025.3535392

work page doi:10.1109/tmm.2025.3535392 2025

[11] [20]

L., and Doll \'a r, P

Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Doll \'a r, P. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pp.\ 740--755. Springer, 2014

2014

[12] [21]

Feature pyramid networks for object detection

Lin, T.-Y., Doll \'a r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2117--2125, 2017 b

2017

[13] [23]

Path aggregation network for instance segmentation, 2018

Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. Path aggregation network for instance segmentation, 2018. URL https://arxiv.org/abs/1803.01534

Pith/arXiv arXiv 2018

[14] [24]

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD : Single shot multibox detector. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pp.\ 21--37. Springer, 2016

2016

[15] [25]

WCDB-YOLO : Wavelet-enhanced contextual dual-backbone network for small object detection in uav aerial imagery

Luan, D., Dong, Y., Zhou, J., Li, A., Xie, L., Liu, H., and Zhu, J. WCDB-YOLO : Wavelet-enhanced contextual dual-backbone network for small object detection in uav aerial imagery. Drones, 10 0 (3): 0 155, 2026. doi:10.3390/drones10030155. URL https://www.mdpi.com/2504-446X/10/3/155

work page doi:10.3390/drones10030155 2026

[16] [26]

D-FINE : Redefine regression task in DETR s as fine-grained distribution refinement

Peng, Y., Li, H., Wu, P., Zhang, Y., Sun, X., and Wu, F. D-FINE : Redefine regression task in DETR s as fine-grained distribution refinement. arXiv preprint arXiv:2410.13842, 2024. doi:10.48550/arXiv.2410.13842. URL https://arxiv.org/abs/2410.13842

work page doi:10.48550/arxiv.2410.13842 2024

[17] [27]

Fcanet: Frequency channel attention networks

Qin, Z., Zhang, P., Wu, F., and Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 783--792, 2021

2021

[18] [28]

GFNet : Global filter networks for visual recognition

Rao, Y., Zhao, W., Zhu, Z., Zhou, J., and Lu, J. GFNet : Global filter networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45 0 (9): 0 10960--10973, September 2023. doi:10.1109/TPAMI.2023.3263824

work page doi:10.1109/tpami.2023.3263824 2023

[19] [29]

You only look once: Unified, real-time object detection, 2016

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection, 2016. URL https://arxiv.org/abs/1506.02640

Pith/arXiv arXiv 2016

[20] [30]

Faster R-CNN : Towards real-time object detection with region proposal networks

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN : Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

2015

[21] [31]

H., Sharda, A., and Karkee, M

Sapkota, R., Cheppally, R. H., Sharda, A., and Karkee, M. Yolo26: Key architectural enhancements and performance benchmarking for real-time object detection, 2026. URL https://arxiv.org/abs/2509.25164

arXiv 2026

[22] [32]

HS-FPN : High frequency and spatial perception fpn for tiny object detection

Shi, Z., Hu, J., Ren, J., Ye, H., Yuan, X., Ouyang, Y., He, J., Ji, B., and Guo, J. HS-FPN : High frequency and spatial perception fpn for tiny object detection. arXiv preprint arXiv:2412.10116, 2025

arXiv 2025

[23] [33]

Tan, M., Pang, R., and Le, Q. V. Efficientdet: Scalable and efficient object detection, 2020. URL https://arxiv.org/abs/1911.09070

arXiv 2020

[24] [34]

Tang, F., Nian, B., Ding, J., Ma, W., Quan, Q., Dong, C., Yang, J., Liu, W., and Zhou, S. K. Mobile U-ViT : Revisiting large kernel and U -shaped vit for efficient medical image segmentation. arXiv preprint arXiv:2508.01064, 2025

arXiv 2025

[25] [35]

Yolov12: Attention-centric real-time object detectors, 2025

Tian, Y., Ye, Q., and Doermann, D. Yolov12: Attention-centric real-time object detectors, 2025. URL https://arxiv.org/abs/2502.12524

Pith/arXiv arXiv 2025

[26] [37]

LSNet : See large, focus small

Wang, A., Chen, H., Lin, Z., Han, J., and Ding, G. LSNet : See large, focus small. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[27] [38]

C., and Lin, D

Wang, J., Zhang, W., Cao, Y., Chen, K., Pang, J., Gong, T., Shi, J., Loy, C. C., and Lin, D. Side-aware boundary localization for more precise object detection, 2020. URL https://arxiv.org/abs/1912.04260

arXiv 2020

[28] [39]

Dota: A large-scale dataset for object detection in aerial images

Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., and Zhang, L. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3974--3983, 2018

2018

[29] [40]

FBRT-YOLO : Faster and better for real-time aerial image detection

Xiao, Y., Xu, T., Xin, Y., and Li, J. FBRT-YOLO : Faster and better for real-time aerial image detection. arXiv preprint arXiv:2504.20670, 2025

arXiv 2025

[30] [42]

and Koltun, V

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions, 2016. URL https://arxiv.org/abs/1511.07122

Pith/arXiv arXiv 2016

[31] [43]

Scale match for tiny person detection

Yu, X., Gong, Y., Jiang, N., Ye, Q., and Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.\ 1257--1265, 2020

2020

[32] [44]

M., and Shum, H.-Y

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., and Shum, H.-Y. DINO : DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022 a

Pith/arXiv arXiv 2022

[33] [45]

Making convolutional networks shift-invariant again, 2019

Zhang, R. Making convolutional networks shift-invariant again, 2019. URL https://arxiv.org/abs/1904.11486

Pith/arXiv arXiv 2019

[34] [46]

Efficient long-range attention network for image super-resolution

Zhang, X., Zeng, H., Guo, S., and Zhang, L. Efficient long-range attention network for image super-resolution. In European Conference on Computer Vision, pp.\ 649--667. Springer, 2022 b

2022

[35] [48]

arXiv preprint arXiv:2005.12872 , year=

End-to-End Object Detection with Transformers , author=. arXiv preprint arXiv:2005.12872 , year=

arXiv 2005

[36] [49]

European Conference on Computer Vision , pages=

Efficient Long-Range Attention Network for Image Super-resolution , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[37] [50]

Wang, Ao and Chen, Hui and Liu, Lihao and Chen, Kai and Lin, Zijia and Han, Jungong and Ding, Guiguang , journal=

[38] [51]

Khanam, Rahima and Hussain, Muhammad , journal=

[39] [52]

Xiao, Yao and Xu, Tingfa and Xin, Yu and Li, Jianan , journal=

[40] [53]

arXiv preprint arXiv:2304.08069 , year=

DETRs Beat YOLOs on Real-time Object Detection , author=. arXiv preprint arXiv:2304.08069 , year=

arXiv

[41] [54]

and Shum, Heung-Yeung , journal=

Zhang, Hao and Li, Feng and Liu, Shilong and Zhang, Lei and Su, Hang and Zhu, Jun and Ni, Lionel M. and Shum, Heung-Yeung , journal=

[42] [55]

Wang, Ao and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang , booktitle=

[43] [56]

Kevin , journal=

Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S. Kevin , journal=

[44] [57]

, booktitle=

Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott and Fu, Cheng-Yang and Berg, Alexander C. , booktitle=. 2016 , publisher=

2016

[45] [58]

Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian , booktitle=. Faster

[46] [59]

Shi, Zican and Hu, Jing and Ren, Jie and Ye, Hengkang and Yuan, Xuyang and Ouyang, Yan and He, Jia and Ji, Bo and Guo, Junyu , journal=

[47] [60]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Feature Pyramid Networks for Object Detection , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

[48] [61]

2023 , month=

Rao, Yongming and Zhao, Wenliang and Zhu, Zheng and Zhou, Jie and Lu, Jiwen , journal=. 2023 , month=

2023

[49] [62]

Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation , journal =

Guoping Xu and Wentao Liao and Xuan Zhang and Chang Li and Xinwei He and Xinglong Wu , keywords =. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation , journal =. 2023 , issn =. doi:https://doi.org/10.1016/j.patcog.2023.109819 , url =

work page doi:10.1016/j.patcog.2023.109819 2023

[50] [63]

arXiv preprint arXiv:2503.18783 , year=

Frequency Dynamic Convolution for Dense Image Prediction , author=. arXiv preprint arXiv:2503.18783 , year=

arXiv

[51] [64]

Adaptive Complex Wavelet Informed Transformer Operator , year=

Li, Xiaotong and Jiao, Licheng and Liu, Fang and Yang, Shuyuan and Zhu, Hao and Liu, Xu and Li, Lingling and Ma, Wenping , journal=. Adaptive Complex Wavelet Informed Transformer Operator , year=

[52] [65]

and Amoyal, Roy and Treister, Eran and Freifeld, Oren , editor =

Finder, Shahaf E. and Amoyal, Roy and Treister, Eran and Freifeld, Oren , editor =. Wavelet Convolutions for Large Receptive Fields , booktitle =. 2025 , publisher =

2025

[53] [66]

Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages=

VisDrone-DET2019: The vision meets drone object detection in image challenge results , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops , pages=

[54] [67]

Proceedings of the European Conference on Computer Vision (ECCV) , pages=

The unmanned aerial vehicle benchmark: Object detection and tracking , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

[55] [68]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Scale match for tiny person detection , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[56] [69]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

DOTA: A large-scale dataset for object detection in aerial images , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[57] [70]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

FcaNet: Frequency Channel Attention Networks , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

[58] [71]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Frequency-aware Feature Fusion for Dense Image Prediction , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

[59] [72]

2026 , eprint=

YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection , author=. 2026 , eprint=

2026

[60] [73]

2016 , eprint=

You Only Look Once: Unified, Real-Time Object Detection , author=. 2016 , eprint=

2016

[61] [74]

Recent advances in small object detection based on deep learning: A review , journal =

Kang Tong and Yiquan Wu and Fei Zhou , keywords =. Recent advances in small object detection based on deep learning: A review , journal =. 2020 , issn =. doi:https://doi.org/10.1016/j.imavis.2020.103910 , url =

work page doi:10.1016/j.imavis.2020.103910 2020

[62] [75]

2018 , eprint=

Path Aggregation Network for Instance Segmentation , author=. 2018 , eprint=

2018

[63] [76]

2016 , eprint=

Multi-Scale Context Aggregation by Dilated Convolutions , author=. 2016 , eprint=

2016

[64] [77]

Proceedings of the European conference on computer vision (ECCV) , pages=

Sod-mtgan: Small object detection via multi-task generative adversarial network , author=. Proceedings of the European conference on computer vision (ECCV) , pages=

[65] [78]

2020 , eprint=

EfficientDet: Scalable and Efficient Object Detection , author=. 2020 , eprint=

2020

[66] [79]

2019 , eprint=

Making Convolutional Networks Shift-Invariant Again , author=. 2019 , eprint=

2019

[67] [80]

2020 , eprint=

Side-Aware Boundary Localization for More Precise Object Detection , author=. 2020 , eprint=

2020

[68] [81]

European Conference on Computer Vision (ECCV) , year =

Microsoft COCO: Common Objects in Context , author =. European Conference on Computer Vision (ECCV) , year =

[69] [82]

European Conference on Computer Vision (ECCV) , year =

Rethinking Features-Fused-Pyramid-Neck for Object Detection , author =. European Conference on Computer Vision (ECCV) , year =

[70] [83]

2024 , eprint =

Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images , author =. 2024 , eprint =

2024

[71] [84]

Freq-DETR: Frequency-aware transformer for real-time small object detection in unmanned aerial vehicle imagery , journal =

Jiayi Chen and Ningzhong Liu and Han Sun and Yu Wang , keywords =. Freq-DETR: Frequency-aware transformer for real-time small object detection in unmanned aerial vehicle imagery , journal =. 2026 , issn =. doi:https://doi.org/10.1016/j.eswa.2025.129710 , url =

work page doi:10.1016/j.eswa.2025.129710 2026

[72] [85]

2020 , eprint=

TIDE: A General Toolbox for Identifying Object Detection Errors , author=. 2020 , eprint=

2020

[73] [86]

RTMDet-R2: An Improved Real-Time Rotated Object Detector

Xiang, Haifeng and Jing, Naifeng and Jiang, Jianfei and Guo, Hongbo and Sheng, Weiguang and Mao, Zhigang and Wang, Qin. RTMDet-R2: An Improved Real-Time Rotated Object Detector. Pattern Recognition and Computer Vision. 2024

2024

[74] [87]

arXiv preprint arXiv:2111.00902 , year =

PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices , author =. arXiv preprint arXiv:2111.00902 , year =

arXiv

[75] [88]

2026 , doi =

Luan, Di and Dong, Yuna and Zhou, Jian and Li, Ang and Xie, Ling and Liu, Hongying and Zhu, Jun , journal =. 2026 , doi =

2026

[76] [89]

2024 , doi =

Peng, Yansong and Li, Hebei and Wu, Peixi and Zhang, Yueyi and Sun, Xiaoyan and Wu, Feng , journal =. 2024 , doi =

2024

[77] [90]

WDFS-DETR: A Transformer-based framework with multi-scale attention for small object detection in UAV Engineering Tasks , journal =

Jinjiang Liu and Yonghua Xie , keywords =. WDFS-DETR: A Transformer-based framework with multi-scale attention for small object detection in UAV Engineering Tasks , journal =. 2025 , issn =. doi:https://doi.org/10.1016/j.rineng.2025.105930 , url =

work page doi:10.1016/j.rineng.2025.105930 2025

[78] [91]

CoRR , volume =

Xiyang Dai and Yinpeng Chen and Bin Xiao and Dongdong Chen and Mengchen Liu and Lu Yuan and Lei Zhang , title =. CoRR , volume =. 2021 , url =. 2106.08322 , timestamp =

arXiv 2021

[79] [92]

CoRR , volume =

Xiang Li and Wenhai Wang and Lijun Wu and Shuo Chen and Xiaolin Hu and Jun Li and Jinhui Tang and Jian Yang , title =. CoRR , volume =. 2020 , url =. 2006.04388 , timestamp =

arXiv 2020

[80] [93]

Scott and Weilin Huang , title =

Chengjian Feng and Yujie Zhong and Yu Gao and Matthew R. Scott and Weilin Huang , title =. CoRR , volume =. 2021 , url =. 2108.07755 , timestamp =

arXiv 2021