DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

Bo Gao; Han Yu; Jingcheng Tong; Xingsheng Chen; Zichen Li

arxiv: 2512.07078 · v4 · pith:HR7GLOADnew · submitted 2025-12-08 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-25 07:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords small object detection · DETR · frequency domain · iterative refinement · feature aggregation · RT-DETR · NEU-DET · VisDrone

The pith

DFIR-DETR fixes uniform attention, norm drift, and high-frequency loss in RT-DETR to raise small-object detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies three specific shortcomings in the RT-DETR detector that hinder small object performance: attention applied uniformly without regard to spatial complexity, norm drift introduced during feature upsampling, and progressive suppression of high-frequency details by repeated spatial convolutions. It responds by building DFIR-DETR with modules that directly target each shortcoming through frequency-domain iterative refinement and dynamic feature aggregation. Results on NEU-DET and VisDrone show mAP50 scores of 92.9 percent and 51.6 percent respectively while using only 11.7 million parameters and 47.2 GFLOPs. A sympathetic reader would care because the work supplies concrete, traceable fixes rather than generic scaling, and it demonstrates the fixes work across industrial defect detection and aerial imagery domains.

Core claim

By tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline—uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on—DFIR-DETR achieves 92.9 percent and 51.6 percent mAP50 on NEU-DET and VisDrone with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

What carries the argument

Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation modules, each explicitly linked to one of the three listed deficiencies in RT-DETR.

If this is right

Small objects in cluttered or low-resolution scenes can be detected more accurately without increasing model capacity.
The same module-to-deficiency tracing method can be applied to other transformer-based detectors that share the RT-DETR backbone and neck structure.
Detection pipelines for industrial inspection and drone imagery can adopt the architecture while staying within tight compute limits.
High-frequency preservation techniques may reduce the need for deeper backbones when the task depends on edge detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the frequency-domain module proves portable, it could be inserted into other vision transformers to protect fine detail without custom redesign.
Dynamic feature aggregation might offer a general alternative to fixed attention patterns in tasks where object scale varies sharply within a single image.
The reported efficiency numbers suggest the approach could support real-time small-object detection on edge hardware once integrated with existing deployment frameworks.

Load-bearing premise

The three listed deficiencies in RT-DETR are the main causes of weak small-object performance and are directly mitigated by the proposed modules.

What would settle it

An ablation study in which removing the frequency-domain refinement or dynamic aggregation module produces no drop in small-object mAP on NEU-DET or VisDrone while the full model still meets the reported parameter and FLOP budget.

Figures

Figures reproduced from arXiv: 2512.07078 by Bo Gao, Han Yu, Jingcheng Tong, Xingsheng Chen, Zichen Li.

**Figure 2.** DCFA block. $Z = \mathrm{SGLU}(H + \mathrm{DKSA}(H))$, $H = X + \phi_{dw}(X)$ (3), where $\phi_{dw}$ denotes a 3 × 3 depthwise separable convolution with batch normalization. The DKSA mechanism subsequently operates on the enhanced features $H$, selectively focusing on defect regions in industrial scenarios through dynamic sparsification strategies while establishing long-range associations between small objects and contexts in remote sensing …

**Figure 3.** DKSA block. … preprocessing, the complete attention computation process can be uniformly expressed as $\mathrm{DKSA}(X) = \phi_{proj}\big(\mathrm{Concat}\big[(V A^{T})_{\mathrm{reshape}},\, X_2\big]\big)$ (5), with $A_{ij} = \begin{cases} \dfrac{\exp(s_{ij})}{\sum_{j' \in T_K^i} \exp(s_{ij'})}, & j \in T_K^i \\ 0, & j \notin T_K^i \end{cases}$ (6). The dynamic Top-K selection mechanism is defined as $K = \lfloor N \cdot \sigma(\mathrm{AvgPool}(\psi(X))) \rfloor$ (7), where $\psi$ represents a gating network composed of two convolutional layers, $\sigma$ denotes the sigmoid fun…
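
The sparsification in Eqs. (6) and (7) of the Figure 3 caption can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: `gate_logit` is a hypothetical scalar standing in for AvgPool(ψ(X)), the clamp to K ≥ 1 is an added assumption, and the function names are illustrative only.

```python
import math


def dynamic_k(n_tokens, gate_logit):
    """K = floor(N * sigmoid(g)); the lower clamp to 1 is an assumption."""
    sig = 1.0 / (1.0 + math.exp(-gate_logit))
    return max(1, math.floor(n_tokens * sig))


def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; all other weights are 0."""
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    peak = max(scores[j] for j in keep)  # subtract the max for numerical stability
    exps = {j: math.exp(scores[j] - peak) for j in keep}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(scores))]


# One query attending over six keys (toy similarity scores s_ij).
scores = [2.0, -1.0, 0.5, 3.0, 0.0, -2.0]
k = dynamic_k(len(scores), gate_logit=0.0)  # sigmoid(0) = 0.5, so K = 3
weights = topk_softmax(scores, k)
```

With a gate output of 0 the sigmoid is 0.5, so half of the six keys are kept; the remaining keys receive exactly zero weight and the kept weights still sum to one.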

**Figure 4.** DFPN block. … amplitude normalization and preserves fine-grained spatial details through dual-path convolution operations, establishing more coherent cross-scale feature representations and significantly enhancing the model’s capability to detect small objects in complex scenarios. DFPN consists of two synergistic components operating on complementary pathways of the feature pyramid. In the top-down pathway…

**Figure 5.** FIRC3 block. … maintaining high-frequency details. The entire transformation process essentially solves a frequency-domain-constrained least squares problem, adaptively balancing contributions of different frequency components and enabling the network to dynamically adjust sensitivity to high-frequency information of small objects. Periodization processing of the frequency domain convolution kernel is achiev…
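
The Figure 5 caption refers to convolution carried out in the frequency domain with a periodized kernel. As a generic illustration of the underlying convolution theorem (not the FIRC3 block itself, whose details are truncated here), the following pure-Python sketch shows that pointwise multiplication of discrete Fourier transforms matches circular convolution with the zero-padded kernel. All function names are hypothetical, and a naive O(n²) DFT is used only to keep the example dependency-free.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a 1-D sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_conv(signal, kernel):
    """Direct circular convolution; kernel must have the same length as signal."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]


def freq_conv(signal, kernel):
    """Same convolution via the frequency domain: multiply spectra, then invert."""
    n = len(signal)
    padded = list(kernel) + [0.0] * (n - len(kernel))  # zero-pad to signal length
    spectrum = [a * b for a, b in zip(dft(signal), dft(padded))]
    return [v.real for v in dft(spectrum, inverse=True)]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 1.5]
kernel = [0.25, 0.5, 0.25]
direct = circular_conv(signal, kernel + [0.0] * (len(signal) - len(kernel)))
via_fft = freq_conv(signal, kernel)
```

The two results agree up to floating-point error, which is why a spatial kernel has to be treated as periodic once the product is taken in the frequency domain.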

**Figure 6.** Spatial distribution of object instances in the VisDrone dataset.

**Figure 7.** Spatial distribution of defect instances in the NEU-DET dataset.

**Figure 8.** Comparison on NEU-DET.

**Figure 10.** Qualitative visualization comparison on the NEU-DET dataset.

Original abstract

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.
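
One simple way to see how upsampling can inflate activation magnitudes: nearest-neighbour upsampling by a factor s copies every value s² times, so the L2 norm of a feature map grows by exactly s. The sketch below demonstrates this on a toy 2 × 2 map, together with a naive 1/s rescaling. It is only an illustration of the effect; the abstract does not say which upsampling operator is analysed or how DFIR-DETR compensates, and all names are hypothetical.

```python
import math


def l2_norm(grid):
    """Frobenius (L2) norm of a 2-D list of floats."""
    return math.sqrt(sum(v * v for row in grid for v in row))


def upsample_nearest(grid, scale=2):
    """Nearest-neighbour upsampling: repeat every value `scale` times per axis."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def rescale(grid, factor):
    """Multiply every activation by a constant factor."""
    return [[v * factor for v in row] for row in grid]


feat = [[1.0, -2.0], [0.5, 3.0]]
up = upsample_nearest(feat, scale=2)   # 4 x 4 map, norm is doubled
compensated = rescale(up, 1.0 / 2)     # naive compensation by 1/scale
```

Here `l2_norm(up)` is twice `l2_norm(feat)`, and the rescaled map returns to the original norm. Other interpolation schemes change the norm by different, content-dependent amounts.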

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DFIR-DETR adds two modules to RT-DETR to target small-object issues, but the abstract leaves the gains hard to attribute to those specific fixes.

The paper's main move is to name three problems in RT-DETR (uniform attention, norm drift during upsampling, and progressive high-frequency loss) and to pair them with two modules: frequency-domain iterative refinement for the frequency issue and dynamic feature aggregation for the attention and norm problems. It reports 92.9% mAP50 on NEU-DET and 51.6% on VisDrone with 11.7M parameters and 47.2 GFLOPs. Those numbers and the efficiency claim are the clearest things to take away right now.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DFIR-DETR, an extension of RT-DETR for small-object detection. It identifies three specific deficiencies in the RT-DETR baseline (uniform attention that ignores spatial complexity, norm drift during upsampling, and progressive high-frequency suppression by spatial convolutions) and introduces frequency-domain iterative refinement plus dynamic feature aggregation modules to address them. On NEU-DET and VisDrone the model is reported to reach 92.9% and 51.6% mAP50 respectively while using 11.7M parameters and 47.2 GFLOPs.

Significance. If the performance gains can be shown to arise from the hypothesized module-level corrections rather than uncontrolled capacity or training differences, the work would supply a concrete, efficiency-aware route to improving high-frequency detail preservation in DETR-style detectors for industrial and aerial imagery.

major comments (2)

Abstract: the claim that uniform attention, norm drift, and high-frequency suppression are the dominant causes of weak small-object performance is asserted without any quantitative diagnostics (attention entropy, per-stage feature-norm statistics, or Fourier spectra of feature maps) that would confirm the deficiencies exist at the claimed severity in the RT-DETR baseline.
Abstract: the reported mAP50 figures are presented without ablation tables, controlled baseline comparisons, or statistical tests that isolate the contribution of each proposed module while holding parameter count and other architectural choices fixed; consequently the causal link between the modules and the observed gains cannot be evaluated.
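
The diagnostics named in the first major comment (attention entropy and Fourier spectra of feature maps) can be computed along the following lines. This is a hedged, pure-Python sketch on toy data, not measurements from the paper: the attention rows and the 1-D step-edge "feature" are invented for illustration, the definition of "high frequency" (above a quarter of the sampling rate) is an assumption, and all function names are hypothetical.

```python
import cmath
import math


def attention_entropy(weights):
    """Shannon entropy of one attention row; log(N) means perfectly uniform."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def high_freq_ratio(signal):
    """Fraction of spectral energy in bins above a quarter of the sampling rate."""
    n = len(signal)
    energy = []
    for k in range(n):
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        energy.append(abs(coeff) ** 2)
    high = sum(e for k, e in enumerate(energy) if min(k, n - k) > n / 4)
    return high / sum(energy)


def smooth(signal):
    """One circular pass of a [0.25, 0.5, 0.25] filter, a stand-in for spatial conv."""
    n = len(signal)
    return [0.25 * signal[i - 1] + 0.5 * signal[i] + 0.25 * signal[(i + 1) % n]
            for i in range(n)]


uniform = [0.25] * 4                 # attention spread evenly over 4 keys
peaked = [0.85, 0.05, 0.05, 0.05]    # attention concentrated on one key
edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # toy step edge

entropy_uniform = attention_entropy(uniform)
entropy_peaked = attention_entropy(peaked)
hf_before = high_freq_ratio(edge)
hf_after = high_freq_ratio(smooth(edge))
```

On this toy data the uniform row reaches the maximum entropy log 4, the peaked row scores lower, and a single smoothing pass reduces the high-frequency energy fraction of the step edge. Reporting the same quantities per stage on the RT-DETR baseline would be one way to test whether the claimed deficiencies are present.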

minor comments (1)

Abstract: baseline RT-DETR mAP50 numbers on the same two datasets are not supplied, preventing immediate assessment of the magnitude of improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the motivation and empirical validation of our proposed modules. We address each major comment below and will revise the manuscript to strengthen the supporting evidence.

Referee: Abstract: the claim that uniform attention, norm drift, and high-frequency suppression are the dominant causes of weak small-object performance is asserted without any quantitative diagnostics (attention entropy, per-stage feature-norm statistics, or Fourier spectra of feature maps) that would confirm the deficiencies exist at the claimed severity in the RT-DETR baseline.

Authors: We acknowledge that the abstract states these deficiencies without accompanying quantitative diagnostics. The design of each module was motivated by observed behaviors during development of the RT-DETR baseline, but explicit metrics such as attention entropy, feature-norm statistics, or Fourier spectra were not reported. In the revised manuscript we will add these diagnostic analyses on the baseline to substantiate the claimed severity of each issue.

Revision: yes

Referee: Abstract: the reported mAP50 figures are presented without ablation tables, controlled baseline comparisons, or statistical tests that isolate the contribution of each proposed module while holding parameter count and other architectural choices fixed; consequently the causal link between the modules and the observed gains cannot be evaluated.

Authors: The current manuscript presents end-to-end results on NEU-DET and VisDrone but does not include module-level ablations with parameter-controlled baselines or statistical significance tests. We agree that such experiments are necessary to establish the contribution of each component. The revised version will incorporate detailed ablation tables that isolate the frequency-domain iterative refinement and dynamic feature aggregation modules while keeping parameter count and training settings fixed.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical architecture claims

The paper introduces DFIR-DETR as an empirical modification of RT-DETR, motivated by three listed deficiencies and validated solely by reported mAP50 numbers on NEU-DET and VisDrone. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The performance figures are measurements, not outputs that reduce to the inputs by construction, so the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5684 in / 1035 out tokens · 21973 ms · 2026-05-25T07:40:02.080769+00:00 · methodology


