pith. machine review for the scientific record.

arxiv: 2604.07759 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: no theorem link

WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects


Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords ship detection · object detection dataset · small object detection · maritime scenes · Transformer detection · CNN detection · Mamba detection · benchmark evaluation

The pith

A 100k-image ship detection dataset with dense small objects shows Transformers outperforming CNNs and Mamba on accuracy in complex maritime scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs WUTDet as a dataset of 100,576 images containing 381,378 ship instances drawn from ports, anchorages, navigation routes, and berthing areas, under varied conditions including fog, glare, low light, and rain. It then runs 20 baseline detectors spanning CNN, Transformer, and Mamba architectures on this data to measure overall average precision (AP), small-object precision (APs), and inference speed. The evaluation also includes a cross-dataset test set called Ship-GEN that measures how well models trained on WUTDet transfer to other distributions. If the results hold, the dataset supplies a standardized, large-scale resource for systematically comparing detection methods in real-world waterway settings, at a scale and scene diversity prior public ship datasets did not offer.
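To make the reported metrics concrete: AP here is the COCO-style average precision over IoU thresholds 0.50:0.95, and APs restricts it to small objects. A minimal evaluation sketch with pycocotools follows; the file names are hypothetical, since the paper does not state its exact tooling.

```python
# Minimal COCO-style evaluation sketch (assumed tooling, not the paper's code).
# "wutdet_val.json" and "preds.json" are hypothetical file names.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("wutdet_val.json")       # ground truth in COCO annotation format
dt = gt.loadRes("preds.json")      # detections: [{image_id, category_id, bbox, score}, ...]

ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()                     # prints AP, AP50, AP75, APs, APm, APl

ap, ap_small = ev.stats[0], ev.stats[3]   # AP @ IoU .50:.95; APs (area < 32^2 px)
print(f"AP={ap:.3f}  APs={ap_small:.3f}")
```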

Core claim

WUTDet supplies 100,576 images and 381,378 annotated ship instances that cover diverse operational scenarios and imaging conditions. Benchmarks on 20 models reveal that Transformer architectures deliver the highest overall AP and APs scores and adapt best to complex maritime scenes, CNN architectures retain an edge in inference speed for real-time use, and Mamba architectures provide a balanced accuracy-efficiency trade-off. Models trained on WUTDet further demonstrate stronger generalization when tested on the unified Ship-GEN cross-dataset set.
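The speed side of that trade-off, the claimed CNN advantage, is usually reported as latency or frames per second under a fixed input size. Below is a hedged sketch of one common measurement protocol in PyTorch; the batch size, device, and resolution are assumptions, not the paper's stated setup.

```python
# Hedged FPS-measurement sketch; protocol details (batch size, device,
# resolution) are assumptions, not the paper's stated benchmark setup.
import time
import torch

@torch.no_grad()
def measure_fps(model, img_size=640, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):          # warm up CUDA kernels before timing
        model(x)
    torch.cuda.synchronize()         # ensure queued GPU work is finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```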

What carries the argument

The WUTDet dataset itself, defined by its scale, high proportion of dense small ship instances, and coverage of varied maritime scenes and weather, which enables controlled comparison of CNN, Transformer, and Mamba detectors.

If this is right

  • Transformer detectors become the default choice when maximum detection accuracy on small vessels is required.
  • CNN detectors remain the practical option for real-time onboard navigation systems where speed matters most.
  • Mamba detectors serve as a middle path for applications that need both reasonable accuracy and moderate compute.
  • Training on WUTDet improves a model's ability to handle shifts in data distribution across different maritime sources.
  • Future ship-detection papers can use WUTDet as a common benchmark to report comparable results; a minimal training sketch follows this list.
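A sketch of what that common-benchmark usage could look like with the Ultralytics API [41], which ships both the YOLO and RT-DETR [14] baselines the paper illustrates. The "wutdet.yaml" dataset config and the hyperparameters are hypothetical, not the paper's training recipe.

```python
# Hypothetical fine-tuning sketch, not the paper's recipe. "wutdet.yaml"
# would point at WUTDet images/labels in Ultralytics dataset format.
from ultralytics import YOLO, RTDETR

for Model, weights in [(YOLO, "yolov8s.pt"), (RTDETR, "rtdetr-l.pt")]:
    model = Model(weights)                              # pretrained checkpoint
    model.train(data="wutdet.yaml", epochs=100, imgsz=640)
    metrics = model.val()                               # evaluate on the val split
    print(weights, metrics.box.map)                     # COCO-style mAP50-95
```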

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dataset's emphasis on dense small objects could be extended to create similar resources for other domains with crowded small targets, such as aerial vehicle or cell detection.
  • Architectural comparisons performed here suggest that hybrid models combining Transformer attention with Mamba efficiency may further improve the accuracy-speed frontier on this data.
  • Wider adoption of WUTDet would allow researchers to isolate the effect of scene diversity versus sheer scale when studying generalization failures in detection.
  • The cross-dataset Ship-GEN protocol could be reused as a template for measuring domain shift in other perception tasks; see the sketch after this list.
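A sketch of what that reusable protocol amounts to: train each model on one source with an identical recipe, then evaluate it unchanged on the unified cross-dataset pool. Here train_detector and evaluate are hypothetical stand-ins, not functions from the paper's release.

```python
# Hedged sketch of a Ship-GEN-style cross-dataset protocol.
# train_detector() and evaluate() are hypothetical stand-ins.

def cross_dataset_gap(train_sets, generalization_set, train_detector, evaluate):
    """Return each source's in-domain AP and its AP on the unified test pool."""
    results = {}
    for name, train_data in train_sets.items():
        model = train_detector(train_data)          # identical recipe per source
        results[name] = {
            "in_domain_ap": evaluate(model, train_data.val_split),
            "ship_gen_ap": evaluate(model, generalization_set),
        }
    return results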

Load-bearing premise

The manual annotations of all 381k ship instances are accurate and consistent, and the 20 chosen baseline implementations plus evaluation protocols represent each architecture class without hidden biases.
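One concrete way to test the annotation half of this premise would be to double-label a sample and measure box agreement between annotators. The sketch below is an assumed check, not a procedure the paper reports.

```python
# Hypothetical inter-annotator agreement check (not reported in the paper):
# greedily match two annotators' boxes at an IoU threshold, return an F1-style score.
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def box_agreement(boxes_a, boxes_b, thr=0.5):
    """Fraction of boxes matched across the two annotators at IoU >= thr."""
    if not boxes_a and not boxes_b:
        return 1.0                       # both saw an empty image: full agreement
    unmatched = list(boxes_b)
    matched = 0
    for a in boxes_a:
        best = max(unmatched, key=lambda b: iou(a, b), default=None)
        if best is not None and iou(a, best) >= thr:
            matched += 1
            unmatched.remove(best)
    return 2 * matched / (len(boxes_a) + len(boxes_b))
```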

What would settle it

A re-run of the 20 baselines on WUTDet with independently verified annotations that reverses the ranking between Transformer and other architectures, or a test on Ship-GEN where WUTDet-trained models no longer show superior generalization.

Figures

Figures reproduced from arXiv: 2604.07759 by An-An Liu, Junxiong Liang, Mengwei Bao, Ryan Wen Liu, Tianxiang Wang, Xinggang Wang.

Figure 1. Overview of the proposed framework, including the construction of …
Figure 2. WUTDet data collection setup. (a) Onboard data acquisition device …
Figure 3. Image examples in WUTDet from four typical operational scenarios, namely port, anchorage, navigation, and berthing. These scenarios differ …
Figure 4. Image examples in WUTDet under fog, glare, low-lightness, rain, and normal conditions. These diverse imaging conditions provide data support for …
Figure 5. Statistics of object scales in WUTDet. (a) shows the …
Figure 6. Distribution of the number of objects per image in WUTDet. The …
Figure 8. Pipeline of the YOLOv8 architecture [41]. As a representative CNN-based object detection model, YOLOv8 improves detection efficiency and …
Figure 9. Pipeline of the RT-DETR architecture [14]. As a representative Transformer-based object detection model, RT-DETR improves computational …
Figure 10. Pipeline of the Mamba-YOLO architecture [40]. As a representative Mamba-based object detection model, Mamba-YOLO introduces SSM to …
Figure 11. Visualization of detection results of different CNN-based object detection methods on WUTDet. CT denotes the prediction confidence score, whose …
Figure 12. Continuation of Fig. 11. Visualization of detection results of different Transformer- and Mamba-based object detection methods on WUTDet. CT …
Figure 13. Visualization of detection results of different CNN-based object detection methods on WUTDet under various weather and illumination conditions. …
Figure 14. Continuation of Fig. 13. Visualization of detection results of different Transformer- and Mamba-based object detection methods on WUTDet under …
Original abstract

Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low-lightness, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet exhibit stronger generalization under different data distributions. These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: https://github.com/MAPGroup/WUTDet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WUTDet, a dataset of 100,576 images containing 381,378 annotated ship instances across diverse maritime scenarios (ports, anchorages, navigation, berthing) and imaging conditions (fog, glare, low-light, rain). It benchmarks 20 detection models spanning CNN, Transformer, and Mamba architectures, claiming Transformer superiority in overall AP and small-object APs, CNN advantage in inference speed, Mamba balance of accuracy and efficiency, and stronger generalization of WUTDet-trained models on the constructed Ship-GEN cross-dataset test set.

Significance. If the annotations are reliable and the benchmarks are free of hidden biases, WUTDet would provide a valuable large-scale resource for evaluating ship detectors under realistic maritime challenges, particularly dense small objects. The architectural comparisons and public release could inform model selection for real-time navigation systems and support further generalization studies.

major comments (2)
  1. [Dataset construction] No inter-annotator agreement metrics, multiple-reviewer protocol, or quality-control statistics are reported for the 381,378 manual labels. Given that small-object instances under adverse conditions form a core challenge, this omission directly affects the trustworthiness of all AP, APs, and Ship-GEN generalization claims.
  2. [Experimental evaluation] The paper evaluates 20 baselines but provides insufficient detail on data splits, small-object size thresholds, training hyperparameters, whether models were trained from scratch or fine-tuned, and whether any post-hoc selection occurred. These factors are load-bearing for the reported architecture rankings and cross-dataset results.
minor comments (2)
  1. [Implementation details] Clarify in the text whether the 20 baseline implementations follow identical training schedules and post-processing steps or if architecture-specific defaults were used.
  2. The GitHub release should include the precise annotation guidelines and any quality-assurance scripts to allow independent verification.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below and will update the manuscript to incorporate additional details on dataset annotation and experimental setup.

Point-by-point responses
  1. Referee: [Dataset construction] No inter-annotator agreement metrics, multiple-reviewer protocol, or quality-control statistics are reported for the 381,378 manual labels. Given that small-object instances under adverse conditions form a core challenge, this omission directly affects the trustworthiness of all AP, APs, and Ship-GEN generalization claims.

    Authors: We agree that providing details on the annotation process is important for establishing the reliability of the dataset. In the revised manuscript, we will expand the 'Dataset Construction' section to include a description of the annotation protocol. This will cover the use of multiple annotators, the review process, and any quality control measures implemented. While we did not originally compute inter-annotator agreement statistics, we will report the procedures used to ensure label quality. We believe this will strengthen the trustworthiness of our claims. revision: yes

  2. Referee: [Experimental evaluation] The paper evaluates 20 baselines but provides insufficient detail on data splits, small-object size thresholds, training hyperparameters, whether models were trained from scratch or fine-tuned, and whether any post-hoc selection occurred. These factors are load-bearing for the reported architecture rankings and cross-dataset results.

    Authors: We acknowledge the need for more transparency in the experimental setup. In the revised version, we will add a dedicated subsection detailing the data splits (training, validation, and test sets), the definition of small objects (e.g., based on pixel area thresholds), the full list of training hyperparameters for each model, confirmation that models were fine-tuned from pre-trained weights, and assurance that no post-hoc model selection was performed beyond the standard evaluation protocol. This will allow readers to better interpret the architectural comparisons and generalization results. revision: yes
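For reference, the "pixel area thresholds" the response alludes to are most plausibly the COCO convention, which is also what the APs metric implies; a one-function sketch, assuming that convention rather than anything the paper states.

```python
# COCO size buckets (an assumption about the paper's definition of "small").
def size_bucket(box_w, box_h):
    """small < 32^2 <= medium < 96^2 <= large, in pixel area (COCO convention)."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"     # these instances drive the reported APs numbers
    if area < 96 ** 2:
        return "medium"
    return "large"
```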

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with direct benchmarking

Full rationale

This is a dataset-construction and empirical-benchmarking paper. It introduces WUTDet (100k images, 381k instances), evaluates 20 off-the-shelf detectors from three families on it, and reports cross-dataset generalization on Ship-GEN. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. All central claims are direct experimental outcomes (AP, APs, inference speed, generalization gaps) rather than reductions to self-defined quantities or self-citation chains. Standard annotation and evaluation protocols are assumed; any quality issues would affect correctness, not create circular logic. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset construction and benchmarking paper. No mathematical derivations, free parameters, axioms, or invented entities are introduced; claims rest on the quality of image collection, annotation, and standard detection evaluation protocols.

pith-pipeline@v0.9.0 · 5607 in / 1277 out tokens · 88511 ms · 2026-05-10T18:07:58.480130+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    Maritime environment perception based on deep learning

    J. Lin, P. Diekmann, C.-E. Framing, R. Zweigel, and D. Abel, “Maritime environment perception based on deep learning,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 15487–15497, 2022.

  2. [2]

    A guide to image- and video-based small object detection using deep learning: Case study of maritime surveillance

    A. M. Rekavandi, L. Xu, F. Boussaid, A.-K. Seghouane, S. Hoefs, and M. Bennamoun, “A guide to image- and video-based small object detection using deep learning: Case study of maritime surveillance,” IEEE Trans. Intell. Transp. Syst., 2025.

  3. [3]

    Mdd-shipnet: Math-data integrated defogging for fog-occlusion ship detection

    N. Wang, Y. Wang, Y. Feng, and Y. Wei, “Mdd-shipnet: Math-data integrated defogging for fog-occlusion ship detection,” IEEE Trans. Intell. Transp. Syst., vol. 25, no. 10, pp. 15040–15052, 2024.

  4. [4]

    Deep-learning-empowered visual ship detection and tracking: Literature review and future direction

    B. Zhang, J. Liu, R. W. Liu, and Y. Huang, “Deep-learning-empowered visual ship detection and tracking: Literature review and future direction,” Eng. Appl. Artif. Intell., vol. 141, p. 109754, 2025.

  5. [5]

    Aodemar: Attention-aware occlusion detection of vessels for maritime autonomous surface ships

    N. Wang, Y. Wang, Y. Feng, and Y. Wei, “Aodemar: Attention-aware occlusion detection of vessels for maritime autonomous surface ships,” IEEE Trans. Intell. Transp. Syst., vol. 25, no. 10, pp. 13584–13597, 2024.

  6. [6]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Adv. Neural Inform. Process. Syst., vol. 28, 2015.

  7. [7]

    YOLOv3: An Incremental Improvement

    J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.

  8. [8]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.

  9. [9]

    Ssd: Single shot multibox detector

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Proc. ECCV. Springer, 2016, pp. 21–37.

  10. [10]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Adv. Neural Inform. Process. Syst., vol. 30, 2017.

  11. [11]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

  12. [12]

    End-to-end object detection with transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV. Springer, 2020, pp. 213–229.

  13. [13]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.

  14. [14]

    Detrs beat yolos on real-time object detection

    Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, “Detrs beat yolos on real-time object detection,” in Proc. IEEE CVPR, 2024, pp. 16965–16974.

  15. [15]

    Mamba: Linear-time sequence modeling with selective state spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in Proc. COLM, 2024.

  16. [16]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.

  17. [17]

    Seaships: A large-scale precisely annotated dataset for ship detection

    Z. Shao, W. Wu, Z. Wang, W. Du, and C. Li, “Seaships: A large-scale precisely annotated dataset for ship detection,” IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2593–2604, 2018.

  18. [18]

    An image-based benchmark dataset and a novel object detector for water surface object detection

    Z. Zhou, J. Sun, J. Yu, K. Liu, J. Duan, L. Chen, and C. P. Chen, “An image-based benchmark dataset and a novel object detector for water surface object detection,” Front. Neurorob., vol. 15, p. 723336, 2021.

  19. [19]

    Waterscenes: A multi-task 4d radar-camera fusion dataset and benchmarks for autonomous driving on water surfaces

    S. Yao, R. Guan, Z. Wu, Y. Ni, Z. Huang, R. W. Liu, Y. Yue, W. Ding, E. G. Lim, H. Seo et al., “Waterscenes: A multi-task 4d radar-camera fusion dataset and benchmarks for autonomous driving on water surfaces,” IEEE Trans. Intell. Transp. Syst., vol. 25, no. 11, pp. 16584–16598, 2024.

  20. [20]

    Video processing from electro-optical sensors for object detection and tracking in a maritime environment: A survey

    D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, and C. Quek, “Video processing from electro-optical sensors for object detection and tracking in a maritime environment: A survey,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 8, pp. 1993–2016, 2017.

  21. [21]

    Mcships: A large-scale ship dataset for detection and fine-grained categorization in the wild

    Y. Zheng and S. Zhang, “Mcships: A large-scale ship dataset for detection and fine-grained categorization in the wild,” in Proc. IEEE ICME. IEEE, 2020, pp. 1–6.

  22. [22]

    Simuships-a high resolution simulation dataset for ship detection with precise annotations

    M. Raza, H. Prokopova, S. Huseynzade, S. Azimi, and S. Lafond, “Simuships-a high resolution simulation dataset for ship detection with precise annotations,” in Proc. OCEANS. IEEE, 2022, pp. 1–5.

  23. [23]

    Asynchronous trajectory matching-based multimodal maritime data fusion for vessel traffic surveillance in inland waterways

    Y. Guo, R. W. Liu, J. Qu, Y. Lu, F. Zhu, and Y. Lv, “Asynchronous trajectory matching-based multimodal maritime data fusion for vessel traffic surveillance in inland waterways,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 11, pp. 12779–12792, 2023.

  24. [24]

    Marine vessel detection dataset and benchmark for unmanned surface vehicles

    N. Wang, Y. Wang, Y. Wei, B. Han, and Y. Feng, “Marine vessel detection dataset and benchmark for unmanned surface vehicles,” Appl. Ocean Res., vol. 142, p. 103835, 2024.

  25. [25]

    The pascal visual object classes (voc) challenge

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, 2010.

  26. [26]

    Imagenet: A large-scale hierarchical image database

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE CVPR. IEEE, 2009, pp. 248–255.

  27. [27]

    Imagenet large scale visual recognition challenge

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vision, vol. 115, no. 3, pp. 211–252, 2015.

  28. [28]

    Microsoft coco: Common objects in context

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. ECCV. Springer, 2014, pp. 740–755.

  29. [29]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” Int. J. Comput. Vision, vol. 128, no. 7, pp. 1956–1981, 2020.

  30. [30]

    Objects365: A large-scale, high-quality dataset for object detection

    S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proc. IEEE ICCV, 2019, pp. 8430–8439.

  31. [31]

    Fast image-based obstacle detection from unmanned surface vehicles

    M. Kristan, V. S. Kenk, S. Kovačič, and J. Perš, “Fast image-based obstacle detection from unmanned surface vehicles,” IEEE Trans. Cybern., vol. 46, no. 3, pp. 641–654, 2015.

  32. [32]

    Cascade r-cnn: Delving into high quality object detection

    Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proc. IEEE CVPR, 2018, pp. 6154–6162.

  33. [33]

    YOLOX: Exceeding YOLO Series in 2021

    Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.

  34. [34]

    YOLOv12: Attention-Centric Real-Time Object Detectors

    Y. Tian, Q. Ye, and D. Doermann, “Yolov12: Attention-centric real-time object detectors,” arXiv preprint arXiv:2502.12524, 2025.

  35. [35]

    Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception

    M. Lei, S. Li, Y. Wu, H. Hu, Y. Zhou, X. Zheng, G. Ding, S. Du, Z. Wu, and Y. Gao, “Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception,” arXiv preprint arXiv:2506.17733, 2025.

  36. [36]

    Dab-detr: Dynamic anchor boxes are better queries for detr

    S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv preprint arXiv:2201.12329, 2022.

  37. [37]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022.

  38. [38]

    D-fine: Redefine regression task in detrs as fine-grained distribution refinement

    Y. Peng, H. Li, P. Wu, Y. Zhang, X. Sun, and F. Wu, “D-fine: Redefine regression task in detrs as fine-grained distribution refinement,” arXiv preprint arXiv:2410.13842, 2024.

  39. [39]

    Vmamba: Visual state space model

    Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “Vmamba: Visual state space model,” Adv. Neural Inform. Process. Syst., vol. 37, pp. 103031–103063, 2024.

  40. [40]

    Mamba yolo: A simple baseline for object detection with state space model

    Z. Wang, C. Li, H. Xu, X. Zhu, and H. Li, “Mamba yolo: A simple baseline for object detection with state space model,” in Proc. AAAI, vol. 39, no. 8, 2025, pp. 8205–8213.

  41. [41]

    Ultralytics YOLO

    G. Jocher, J. Qiu, and A. Chaurasia, “Ultralytics YOLO,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

  42. [42]

    Yolov6: A single-stage object detection framework for industrial applications

    C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie et al., “Yolov6: A single-stage object detection framework for industrial applications,” arXiv preprint arXiv:2209.02976, 2022.

  43. [43]

    Hyper-yolo: When visual object detection meets hypergraph computation

    Y. Feng, J. Huang, S. Du, S. Ying, J.-H. Yong, Y. Li, G. Ding, R. Ji, and Y. Gao, “Hyper-yolo: When visual object detection meets hypergraph computation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 4, pp. 2388–2401, 2024.

  44. [44]

    Fbrt-yolo: Faster and better for real-time aerial image detection

    Y. Xiao, T. Xu, Y. Xin, and J. Li, “Fbrt-yolo: Faster and better for real-time aerial image detection,” in Proc. AAAI, vol. 39, no. 8, 2025, pp. 8673–8681.

  45. [45]

    Yolo-ms: Rethinking multi-scale representation learning for real-time object detection

    Y. Chen, X. Yuan, J. Wang, R. Wu, X. Li, Q. Hou, and M.-M. Cheng, “Yolo-ms: Rethinking multi-scale representation learning for real-time object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 6, pp. 4240–4252, 2025.

  46. [46]

    Lw-detr: A transformer replacement to yolo for real-time detection

    Q. Chen, X. Su, X. Zhang, J. Wang, J. Chen, Y. Shen, C. Han, Z. Chen, W. Xu, F. Li et al., “Lw-detr: A transformer replacement to yolo for real-time detection,” arXiv preprint arXiv:2406.03459, 2024.

  47. [47]

    Deim: Detr with improved matching for fast convergence

    S. Huang, Z. Lu, X. Cun, Y. Yu, X. Zhou, and X. Shen, “Deim: Detr with improved matching for fast convergence,” in Proc. CVPR, 2025, pp. 15162–15171.

  48. [48]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2016.

  49. [49]

    Mobilemamba: Lightweight multi-receptive visual mamba network

    H. He, J. Zhang, Y. Cai, H. Chen, X. Hu, Z. Gan, Y. Wang, C. Wang, Y. Wu, and L. Xie, “Mobilemamba: Lightweight multi-receptive visual mamba network,” in Proc. CVPR, 2025, pp. 4497–4507.

  50. [50]

    Focal loss for dense object detection

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. ICCV, 2017, pp. 2980–2988.

  51. [51]

    Objects as points

    X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.