arxiv: 2512.07078 · v4 · pith:HR7GLOADnew · submitted 2025-12-08 · 💻 cs.CV · cs.LG

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

Pith reviewed 2026-05-25 07:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords small object detection · DETR · frequency domain · iterative refinement · feature aggregation · RT-DETR · NEU-DET · VisDrone

The pith

DFIR-DETR fixes uniform attention, norm drift, and high-frequency loss in RT-DETR to raise small-object detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies three specific shortcomings in the RT-DETR detector that hinder small object performance: attention applied uniformly without regard to spatial complexity, norm drift introduced during feature upsampling, and progressive suppression of high-frequency details by repeated spatial convolutions. It responds by building DFIR-DETR with modules that directly target each shortcoming through frequency-domain iterative refinement and dynamic feature aggregation. Results on NEU-DET and VisDrone show mAP50 scores of 92.9 percent and 51.6 percent respectively while using only 11.7 million parameters and 47.2 GFLOPs. A sympathetic reader would care because the work supplies concrete, traceable fixes rather than generic scaling, and it demonstrates the fixes work across industrial defect detection and aerial imagery domains.
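
The mAP50 figures quoted above count a prediction as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below is a minimal illustration of that matching rule, not the paper's evaluation code; the function names and the `(x1, y1, x2, y2)` box format are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """Matching rule behind mAP50: IoU of at least 0.5 (hypothetical helper name)."""
    return iou(pred_box, gt_box) >= 0.5


# The same 4-pixel localisation error on a small and on a large box.
small_iou = iou((0, 0, 10, 10), (4, 0, 14, 10))
large_iou = iou((0, 0, 100, 100), (4, 0, 104, 100))
```

A 4-pixel shift drops a 10 × 10 box to an IoU of 60/140 (about 0.43, a miss at the 0.5 threshold), while a 100 × 100 box stays at 9600/10400 (about 0.92). That sensitivity is one reason small-object results are reported separately from general detection accuracy.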

Core claim

By tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline (uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on), DFIR-DETR achieves 92.9 percent mAP50 on NEU-DET and 51.6 percent mAP50 on VisDrone with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

What carries the argument

Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation modules, each explicitly linked to one of the three listed deficiencies in RT-DETR.

If this is right

  • Small objects in cluttered or low-resolution scenes become reliably detectable without increasing model capacity.
  • The same module-to-deficiency tracing method can be applied to other transformer-based detectors that share the RT-DETR backbone and neck structure.
  • Detection pipelines for industrial inspection and drone imagery can adopt the architecture while staying within tight compute limits.
  • High-frequency preservation techniques may reduce the need for deeper backbones when the task depends on edge detail.
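
The high-frequency point can be made concrete with a toy signal. The sketch below is an illustration under stated assumptions, not a measurement on RT-DETR and not the paper's FIRC3 module: it uses a fixed 3-tap low-pass kernel `[0.25, 0.5, 0.25]` with wrap-around padding as a stand-in for stacked spatial convolutions (learned kernels need not be low-pass), and it defines "high frequency" as DFT bins with index above N/4. All names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Plain O(N^2) discrete Fourier transform, enough for a short toy signal."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def smooth_circular(signal):
    """One pass of the 3-tap kernel [0.25, 0.5, 0.25] with wrap-around padding."""
    n = len(signal)
    return [0.25 * signal[(t - 1) % n] + 0.5 * signal[t] + 0.25 * signal[(t + 1) % n]
            for t in range(n)]


def high_freq_fraction(signal):
    """Share of spectral energy in bins whose frequency index exceeds N/4."""
    n = len(signal)
    energy = [abs(c) ** 2 for c in dft(signal)]
    high = sum(e for k, e in enumerate(energy) if min(k, n - k) > n / 4)
    total = sum(energy)
    return high / total if total > 0 else 0.0


signal = [1.0] + [0.0] * 15   # unit impulse: every frequency bin starts with equal energy
fractions = []
for _ in range(4):
    fractions.append(high_freq_fraction(signal))  # record, then filter once more
    signal = smooth_circular(signal)
```

In this toy setting the high-frequency share starts at 7/16 and shrinks with every additional filtering pass, which is the mechanism the abstract attributes to accumulated spatial filtering. Whether trained RT-DETR bottlenecks behave this way is exactly what the referee's request for Fourier spectra would test.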

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the frequency-domain module proves portable, it could be inserted into other vision transformers to protect fine detail without custom redesign.
  • Dynamic feature aggregation might offer a general alternative to fixed attention patterns in tasks where object scale varies sharply within a single image.
  • The reported efficiency numbers suggest the approach could support real-time small-object detection on edge hardware once integrated with existing deployment frameworks.
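
As a rough sanity check on the edge-hardware reading, the reported budget can be turned into back-of-envelope numbers. The sketch below uses only the two figures from the abstract (11.7M parameters, 47.2 GFLOPs per image); the 30 frames-per-second target, the bytes-per-parameter choices, and the helper names are assumptions for illustration. It ignores activation memory, and the abstract does not state the input resolution or FLOP-counting convention behind the 47.2 figure.

```python
PARAMS = 11.7e6           # parameter count reported in the abstract
GFLOPS_PER_IMAGE = 47.2   # compute per forward pass reported in the abstract


def weight_megabytes(n_params, bytes_per_param):
    """Storage for the weights alone, in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6


def sustained_tflops(gflops_per_image, frames_per_second):
    """Sustained throughput needed to process the given frame rate, in TFLOP/s."""
    return gflops_per_image * frames_per_second / 1e3


fp32_mb = weight_megabytes(PARAMS, 4)                   # about 46.8 MB
fp16_mb = weight_megabytes(PARAMS, 2)                   # about 23.4 MB
tflops_30fps = sustained_tflops(GFLOPS_PER_IMAGE, 30)   # about 1.416 TFLOP/s (assumed 30 fps)
```

These are lower bounds on what a deployment would need, since real throughput depends on memory bandwidth and operator support rather than on FLOP counts alone.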

Load-bearing premise

The three listed deficiencies in RT-DETR are the main causes of weak small-object performance and are directly mitigated by the proposed modules.

What would settle it

An ablation study in which removing the frequency-domain refinement or dynamic aggregation module produces no drop in small-object mAP on NEU-DET or VisDrone while the full model still meets the reported parameter and FLOP budget.

Figures

Figures reproduced from arXiv: 2512.07078 by Bo Gao, Han Yu, Jingcheng Tong, Xingsheng Chen, Zichen Li.

Figure 1: Overall architecture of DFIR-DETR. Block labels recoverable from the diagram: DAFB, DCFA, DKSA, SGLU, Concat, Split, Conv 1×1, DW 3×3.
Figure 2: DCFA block.
\[ Z = \mathrm{SGLU}\big(H + \mathrm{DKSA}(H)\big), \qquad H = X + \phi_{dw}(X) \tag{3} \]
where $\phi_{dw}$ denotes a 3 × 3 depthwise separable convolution with batch normalization. The DKSA mechanism subsequently operates on the enhanced features $H$, selectively focusing on defect regions in industrial scenarios through dynamic sparsification strategies while establishing long-range associations between small objects and contexts in remote sensing …
Figure 3: DKSA block.
… preprocessing, the complete attention computation process can be uniformly expressed as:
\[ \mathrm{DKSA}(X) = \phi_{proj}\Big(\mathrm{Concat}\big[(V A^{\top})_{\mathrm{reshape}},\; X_2\big]\Big) \tag{5} \]
\[ A_{ij} = \begin{cases} \dfrac{\exp(s_{ij})}{\sum_{j' \in \mathcal{T}_K^i} \exp(s_{ij'})}, & j \in \mathcal{T}_K^i \\ 0, & j \notin \mathcal{T}_K^i \end{cases} \tag{6} \]
The dynamic Top-K selection mechanism is defined as:
\[ K = \big\lfloor N \cdot \sigma\big(\mathrm{AvgPool}(\psi(X))\big) \big\rfloor \tag{7} \]
where $\psi$ represents a gating network composed of two convolutional layers, $\sigma$ denotes the sigmoid fun…
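
Equations (6) and (7) in the Figure 3 text can be read as a softmax restricted to a content-dependent number of keys. The sketch below is one possible reading in plain Python, not the authors' implementation: `gate_logit` is a scalar stand-in for AvgPool(ψ(X)), the scores are given directly rather than computed from queries and keys, and the floor of one retained key is an added assumption so the softmax stays defined. All function names are hypothetical.

```python
import math


def dynamic_k(n_tokens, gate_logit):
    """Eq. (7): K = floor(N * sigmoid(g)), with g standing in for AvgPool(psi(X))."""
    k = math.floor(n_tokens * (1.0 / (1.0 + math.exp(-gate_logit))))
    return max(1, k)  # assumption: always keep at least one key


def topk_softmax_row(scores, k):
    """Eq. (6): softmax over the k highest-scoring keys, exactly zero elsewhere."""
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    m = max(scores[j] for j in keep)                      # subtract the max for stability
    exps = {j: math.exp(scores[j] - m) for j in keep}
    z = sum(exps.values())
    return [exps[j] / z if j in exps else 0.0 for j in range(len(scores))]


def sparse_attention(score_matrix, gate_logit):
    """Apply the same dynamic K to every query row of a score matrix."""
    k = dynamic_k(len(score_matrix[0]), gate_logit)
    return [topk_softmax_row(row, k) for row in score_matrix]


scores = [[2.0, 0.0, 1.0, -1.0]]
attn = sparse_attention(scores, gate_logit=0.0)   # sigmoid(0) = 0.5, so K = 2 of N = 4
```

With a gate value of zero, half of the keys survive: the two largest scores (indices 0 and 2) share the attention mass and the other two receive exactly zero. A larger gate value keeps more keys, which is how the mechanism could spend more computation on complex regions.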
Figure 4: DFPN block.
… amplitude normalization and preserves fine-grained spatial details through dual-path convolution operations, establishing more coherent cross-scale feature representations and significantly enhancing the model’s capability to detect small objects in complex scenarios. DFPN consists of two synergistic components operating on complementary pathways of the feature pyramid. In the top-down pathway…
Figure 5: FIRC3 block.
… maintaining high-frequency details. The entire transformation process essentially solves a frequency-domain-constrained least squares problem, adaptively balancing contributions of different frequency components and enabling the network to dynamically adjust sensitivity to high-frequency information of small objects. Periodization processing of the frequency domain convolution kernel is achiev…
Figure 6: Spatial distribution of object instances in the VisDrone dataset.
Figure 7: Spatial distribution of defect instances in NEU-DET.
Figure 8: Comparison on NEU-DET.
Figure 10: Qualitative visualization comparison on the NEU-DET dataset.

Original abstract

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.
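
One simple way upsampling can change activation magnitudes, offered here as an illustration rather than as the paper's diagnosis: nearest-neighbour 2× upsampling copies every value into a 2 × 2 block, so the sum of squares grows fourfold and the L2 norm of the map doubles. The sketch below shows that effect on a tiny map, together with a hypothetical rescaling step. It is not the DFPN amplitude normalization, whose definition is not available in the text reproduced on this page.

```python
import math


def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list: each value fills a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]   # duplicate along the width
        out.append(wide)
        out.append(list(wide))                      # duplicate along the height
    return out


def l2_norm(fmap):
    """L2 (Frobenius) norm of a 2-D list."""
    return math.sqrt(sum(v * v for row in fmap for v in row))


def rescale_to_norm(fmap, target_norm):
    """Hypothetical compensation: scale the map so its L2 norm equals target_norm."""
    n = l2_norm(fmap)
    s = target_norm / n if n > 0 else 1.0
    return [[v * s for v in row] for row in fmap]


x = [[1.0, -2.0], [3.0, 0.5]]
up = upsample_nearest_2x(x)
ratio = l2_norm(up) / l2_norm(x)                  # 2: every value now appears four times
compensated = rescale_to_norm(up, l2_norm(x))     # norm matched back to the input
```

The per-element magnitude is unchanged by this kind of upsampling; it is the total norm of the map that grows, which matters when upsampled and lateral features are fused by addition or concatenation.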

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DFIR-DETR, an extension of RT-DETR for small-object detection. It identifies three specific deficiencies in the RT-DETR baseline (uniform attention that ignores spatial complexity, norm drift during upsampling, and progressive high-frequency suppression by spatial convolutions) and introduces frequency-domain iterative refinement plus dynamic feature aggregation modules to address them. On NEU-DET and VisDrone the model is reported to reach 92.9% and 51.6% mAP50 respectively while using 11.7M parameters and 47.2 GFLOPs.

Significance. If the performance gains can be shown to arise from the hypothesized module-level corrections rather than uncontrolled capacity or training differences, the work would supply a concrete, efficiency-aware route to improving high-frequency detail preservation in DETR-style detectors for industrial and aerial imagery.

major comments (2)
  1. Abstract: the claim that uniform attention, norm drift, and high-frequency suppression are the dominant causes of weak small-object performance is asserted without any quantitative diagnostics (attention entropy, per-stage feature-norm statistics, or Fourier spectra of feature maps) that would confirm the deficiencies exist at the claimed severity in the RT-DETR baseline.
  2. Abstract: the reported mAP50 figures are presented without ablation tables, controlled baseline comparisons, or statistical tests that isolate the contribution of each proposed module while holding parameter count and other architectural choices fixed; consequently the causal link between the modules and the observed gains cannot be evaluated.
minor comments (1)
  1. Abstract: baseline RT-DETR mAP50 numbers on the same two datasets are not supplied, preventing immediate assessment of the magnitude of improvement.
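
Of the diagnostics requested in major comment 1, attention entropy is the simplest to state. The sketch below is a minimal illustration, assuming each attention row is a list of non-negative weights that sum to 1; the function names are hypothetical and nothing here is taken from the paper's code. Normalising by log N puts the score in [0, 1], with 1 meaning perfectly uniform attention.

```python
import math


def attention_entropy(weights):
    """Shannon entropy in nats of one attention row (zero weights contribute nothing)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def normalized_entropy(weights):
    """Entropy divided by log(N): 1.0 for uniform attention, 0.0 for one-hot attention."""
    n = len(weights)
    return attention_entropy(weights) / math.log(n) if n > 1 else 0.0


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
uniform_score = normalized_entropy(uniform)   # 1.0
peaked_score = normalized_entropy(peaked)     # well below 1.0
```

Reporting this score per layer and per image region on the RT-DETR baseline would show whether attention is in fact close to uniform where the authors claim it is.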

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the motivation and empirical validation of our proposed modules. We address each major comment below and will revise the manuscript to strengthen the supporting evidence.

  1. Referee: Abstract: the claim that uniform attention, norm drift, and high-frequency suppression are the dominant causes of weak small-object performance is asserted without any quantitative diagnostics (attention entropy, per-stage feature-norm statistics, or Fourier spectra of feature maps) that would confirm the deficiencies exist at the claimed severity in the RT-DETR baseline.

    Authors: We acknowledge that the abstract states these deficiencies without accompanying quantitative diagnostics. The design of each module was motivated by observed behaviors during development of the RT-DETR baseline, but explicit metrics such as attention entropy, feature-norm statistics, or Fourier spectra were not reported. In the revised manuscript we will add these diagnostic analyses on the baseline to substantiate the claimed severity of each issue. revision: yes

  2. Referee: Abstract: the reported mAP50 figures are presented without ablation tables, controlled baseline comparisons, or statistical tests that isolate the contribution of each proposed module while holding parameter count and other architectural choices fixed; consequently the causal link between the modules and the observed gains cannot be evaluated.

    Authors: The current manuscript presents end-to-end results on NEU-DET and VisDrone but does not include module-level ablations with parameter-controlled baselines or statistical significance tests. We agree that such experiments are necessary to establish the contribution of each component. The revised version will incorporate detailed ablation tables that isolate the frequency-domain iterative refinement and dynamic feature aggregation modules while keeping parameter count and training settings fixed. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical architecture claims

The paper introduces DFIR-DETR as an empirical modification of RT-DETR, motivated by three listed deficiencies and validated solely by reported mAP50 numbers on NEU-DET and VisDrone. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The performance figures are measurements, not outputs that reduce to the inputs by construction, so the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted.
