YOLOv12: Attention-Centric Real-Time Object Detectors
Pith reviewed 2026-05-13 21:30 UTC · model grok-4.3
The pith
YOLOv12 centers its architecture on attention mechanisms to exceed the accuracy of prior real-time object detectors while keeping inference speeds comparable to CNN-based YOLO models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YOLOv12 is an attention-centric YOLO framework that matches the speed of previous CNN-based models while delivering higher accuracy, surpassing popular real-time detectors such as YOLOv10-N, YOLOv11-N, and RT-DETR variants on standard benchmarks.
What carries the argument
The attention-centric architectural changes in YOLOv12 that enable attention mechanisms to run at CNN-comparable speeds while retaining their modeling advantages.
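For concreteness, here is a minimal window-partitioned self-attention layer in PyTorch. It is a generic pattern from the windowed-attention literature, not YOLOv12's actual module, and is meant only to show how attention cost can be held linear in token count for a fixed window size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSelfAttention(nn.Module):
    """Generic window-partitioned self-attention (illustrative sketch, not
    YOLOv12's module): tokens attend only within w x w windows, so cost grows
    linearly with token count for a fixed window size."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape                      # (batch, height, width, channels)
        w = self.window                           # H and W assumed divisible by w
        # Partition into non-overlapping w x w windows: (B * num_windows, w*w, C).
        x = x.view(B, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        n = x.shape[0]
        qkv = self.qkv(x).reshape(n, w * w, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each (n, heads, w*w, head_dim)
        # SDPA can dispatch to FlashAttention-style kernels on supported GPUs.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(n, w * w, C)
        out = self.proj(out)
        # Undo the window partition back to (B, H, W, C).
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```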
If this is right
- YOLOv12-N reaches 40.6 percent mAP at 1.64 ms inference latency on a T4 GPU, exceeding YOLOv10-N and YOLOv11-N by 2.1 and 1.2 percent mAP respectively.
- YOLOv12-S runs 42 percent faster than RT-DETR-R18 while using 36 percent of the computation and 45 percent of the parameters.
- The accuracy advantage holds across multiple model scales from nano to larger variants.
- Attention mechanisms become viable as the primary backbone for real-time object detection without custom hardware.
Where Pith is reading between the lines
- Designers of other real-time vision systems may shift priority from CNN blocks to attention blocks once speed parity is shown feasible.
- The result suggests that targeted architectural tuning can close the efficiency gap between attention and convolution in latency-sensitive tasks.
- Future work could test whether the same attention-centric pattern transfers to related problems such as real-time instance segmentation or video object tracking.
Load-bearing premise
The specific attention mechanisms and any accompanying optimizations can be implemented to run at speeds matching CNN-based YOLO models on standard hardware.
What would settle it
A side-by-side benchmark on COCO showing YOLOv12-N achieving lower mAP than YOLOv11-N at equal or higher latency on a T4 GPU would falsify the central performance claim.
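A hedged sketch of that settling test, assuming the Ultralytics-style Python API; the checkpoint names below are hypothetical placeholders, and metrics.box.map / metrics.speed are assumed to expose COCO mAP50-95 and per-image timing as that API documents:

```python
# Compare COCO mAP50-95 and per-image inference time for the two nano models.
# Checkpoint names are hypothetical; substitute the actual released weights.
from ultralytics import YOLO

for weights in ("yolo11n.pt", "yolo12n.pt"):
    metrics = YOLO(weights).val(data="coco.yaml", imgsz=640, batch=1, half=True)
    print(weights,
          f"mAP50-95 = {metrics.box.map:.3f}",
          f"inference = {metrics.speed['inference']:.2f} ms/img")
```

If YOLOv12-N came out below YOLOv11-N on mAP at equal or higher latency under this protocol, the central claim would fail.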
original abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces YOLOv12, an attention-centric real-time object detector that replaces or augments CNN components with attention mechanisms while claiming to retain CNN-comparable inference speeds. It reports that YOLOv12-N achieves 40.6% mAP at 1.64 ms latency on a T4 GPU, outperforming YOLOv10-N and YOLOv11-N by 2.1% and 1.2% mAP respectively, with similar advantages across scales and over RT-DETR variants in speed, compute, and parameters.
Significance. If the efficiency claims hold, the result would be significant for real-time detection by showing that attention can deliver measurable accuracy gains without the usual quadratic latency penalty, potentially shifting design paradigms away from pure CNN backbones. The concrete benchmark numbers and cross-family comparisons provide falsifiable predictions that could be directly tested on standard hardware.
major comments (2)
- [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on a T4 GPU for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, a windowed or linear attention formulation, or FlashAttention integration) showing how quadratic costs are avoided; see the FLOPs sketch after this list. Without such an analysis, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.
- [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions. This prevents verifying that the gains stem from the attention-centric changes rather than from training differences.
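To ground the first major comment, a back-of-envelope FLOPs comparison between global and windowed self-attention; the formulas are the standard matmul estimates, not figures taken from the paper:

```python
def attn_matmul_flops(n_tokens: int, dim: int, window_tokens: int | None = None) -> int:
    """Approximate FLOPs of the QK^T and attention-value matmuls in one
    self-attention layer (projections excluded; standard estimate)."""
    if window_tokens is None:
        return 2 * n_tokens**2 * dim           # global attention: quadratic in n
    return 2 * n_tokens * window_tokens * dim  # windowed: linear in n for fixed window

# A 640x640 input downsampled 8x yields an 80x80 = 6400-token feature map.
n, d = 80 * 80, 128
print(attn_matmul_flops(n, d) // attn_matmul_flops(n, d, window_tokens=64))  # -> 100
```

This is the roughly 100x gap the referee asks the authors to account for explicitly.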
minor comments (2)
- [Figure 1] Figure 1 caption and latency table: confirm that all reported latencies use the same T4 GPU, batch size 1, and FP16/INT8 precision to ensure an apples-to-apples comparison; a measurement-protocol sketch follows this list.
- [Section 3] Notation for model scales (N/S/M/L/X): explicitly define how the attention module widths and depths scale with these variants to allow reproduction.
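For the first minor comment, a measurement-protocol sketch in PyTorch: batch size 1, FP16, CUDA-event timing with warmup, as one plausible apples-to-apples harness (not the paper's exact setup):

```python
import torch

def latency_ms(model: torch.nn.Module, imgsz: int = 640, iters: int = 200) -> float:
    """Batch-1 FP16 latency via CUDA events; warmup iterations excluded."""
    model = model.half().eval().cuda()
    x = torch.randn(1, 3, imgsz, imgsz, dtype=torch.half, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        for _ in range(50):              # warmup: stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per image
```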
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our YOLOv12 manuscript. We have revised the paper to incorporate explicit complexity analysis and experimental protocol details, addressing the concerns while preserving the core contributions.
point-by-point responses
-
Referee: [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on T4 for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, windowed/linear attention formulation, or FlashAttention integration) showing how quadratic costs are eliminated; without this, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.
Authors: We agree that an explicit complexity analysis strengthens the validation of our efficiency claims. In the revised manuscript, Section 3 now includes a dedicated complexity-analysis subsection. It details the FLOPs scaling of the attention modules, the windowed and linear attention formulations that achieve linear complexity in token count, and the FlashAttention integration that avoids materializing the quadratic attention matrix. This confirms that the reported 1.64 ms latency on a T4 GPU for YOLOv12-N arises directly from the attention-centric design, without CNN fallbacks or resolution reductions. revision: yes
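As an illustration of the FlashAttention point, PyTorch (2.3 or later) can pin scaled dot-product attention to a FlashAttention-style kernel; this is a generic usage sketch, not the paper's integration. FlashAttention keeps the n x n attention matrix out of off-chip memory, cutting memory traffic to linear, even though the FLOP count remains quadratic in token count:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # assumes PyTorch >= 2.3

# One head-batch of 6400 tokens (an 80x80 feature map), FP16 on GPU.
q = k = v = torch.randn(1, 4, 6400, 32, dtype=torch.half, device="cuda")
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # force the FlashAttention backend
    out = F.scaled_dot_product_attention(q, k, v)
```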
-
Referee: [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions; this prevents verification that the gains are attributable to the attention-centric changes rather than training differences.
Authors: We acknowledge the need for transparent experimental details to ensure fair comparisons. The revised Experiments section now includes an explicit subsection describing the training protocols: all compared models (YOLOv10-N, YOLOv11-N, and the RT-DETR variants) were trained and evaluated with identical schedules, augmentation pipelines, and input resolutions, following their original papers and standard COCO benchmark settings. This confirms that the mAP and speed gains are attributable to YOLOv12's attention-centric architecture; a matched-protocol sketch follows. revision: yes
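A matched-protocol sketch, assuming the Ultralytics-style training API; the model config names are hypothetical placeholders, and the point is only that every run shares one hyperparameter dictionary:

```python
from ultralytics import YOLO

# Identical schedule, resolution, batch, and seed for every compared model;
# config names are hypothetical placeholders for the released model definitions.
shared = dict(data="coco.yaml", imgsz=640, epochs=600, batch=256, seed=0)
for cfg in ("yolo11n.yaml", "yolo12n.yaml"):
    YOLO(cfg).train(**shared)
```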
Circularity Check
No circularity; empirical architecture proposal with benchmark results
full rationale
The paper introduces YOLOv12 as an attention-centric YOLO variant and supports its claims solely through empirical benchmark comparisons (e.g., mAP and latency numbers on a T4 GPU against YOLOv10/YOLOv11 and RT-DETR variants). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on experimental outcomes rather than on conclusions built into the setup, leaving the work checkable against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale definitions (N/S/M/L/X)
axioms (1)
- domain assumption: attention mechanisms have superior modeling capabilities compared with CNNs
Forward citations
Cited by 23 Pith papers
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...
-
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering repor...
-
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
-
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
-
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
-
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
-
Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
-
InsHuman: Towards Natural and Identity-Preserving Human Insertion
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
-
Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
-
StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network
StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M...
-
A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and ...
-
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
-
Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model
YOLOv12 with Otsu thresholding on cell-based segmentation classifies AML cells at 99.3% validation and test accuracy.
-
FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
-
Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
-
Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface
A local multi-agent framework integrates YOLO object detection with Slack-Ollama natural language control entirely on Raspberry Pi hardware.
Reference graph
Works this paper leans on
-
[1]
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
-
[2]
Low-rank bottleneck in multi-head attention models
Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. In International Conference on Machine Learning, pages 864–873. PMLR, 2020.
-
[3]
YOLOv4: Optimal Speed and Accuracy of Object Detection
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
-
[4]
Anomaly detection in autonomous driving: A survey
Daniel Bogdoll, Maximilian Nitsche, and J. Marius Zöllner. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4488–4499, 2022.
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
-
[6]
Albumentations: Fast and flexible image augmentations
Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and flexible image augmentations. Information, 11(2):125, 2020.
-
[7]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
-
[8]
AP-loss for accurate one-stage object detection
Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. AP-loss for accurate one-stage object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3782–3798, 2020.
-
[9]
YOLO-MS: Rethinking multi-scale representation learning for real-time object detection
Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, and Ming-Ming Cheng. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. arXiv preprint arXiv:2308.05480, 2023.
-
[11]
Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
-
[12]
Twins: Revisiting the design of spatial attention in vision transformers
Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021.
-
[13]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
-
[14]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
-
[15]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.
-
[16]
CSWin Transformer: A general vision transformer backbone with cross-shaped windows
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022.
-
[17]
Mobile robot navigation using an object recognition software with RGBD images and the YOLO algorithm
Douglas Henke Dos Reis, Daniel Welfer, Marco Antonio De Souza Leite Cuadros, and Daniel Fernando Tello Gamarra. Mobile robot navigation using an object recognition software with RGBD images and the YOLO algorithm. Applied Artificial Intelligence, 33(14):1290–1305, 2019.
-
[18]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-
[19]
EVA: Exploring the limits of masked visual representation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
-
[21]
EVA-02: A visual representation for neon genesis
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171,
-
[22]
TOOD: Task-aligned one-stage object detection
Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R. Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE Computer Society, 2021.
-
[23]
OTA: Optimal transport assignment for object detection
Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
-
[24]
YOLOv8
Glenn Jocher. YOLOv8. https://github.com/ultralytics/ultralytics/tree/main, 2023.
-
[25]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
-
[26]
Axial attention in multidimensional transformers
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
-
[27]
CCNet: Criss-cross attention for semantic segmentation
Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
-
[28]
YOLOv11
Glenn Jocher. YOLOv11. https://github.com/ultralytics, 2024.
-
[29]
YOLOv5
Glenn Jocher, K. Nishimura, T. Mineeva, and RJAM Vilariño. YOLOv5. https://github.com/ultralytics/yolov5/tree, 2, 2020.
-
[30]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
-
[33]
DN-DETR: Accelerate DETR training by introducing query denoising
Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
-
[34]
A dual weighting label assignment scheme for object detection
Shuai Li, Chenhang He, Ruihuang Li, and Lei Zhang. A dual weighting label assignment scheme for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9387–9396, 2022.
-
[35]
Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection
Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020.
-
[36]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
-
[37]
DAB-DETR: Dynamic anchor boxes are better queries for DETR
Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
-
[38]
VMamba: Visual state space model
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. VMamba: Visual state space model. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
-
[39]
Swin Transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
-
[41]
RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer
Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140, 2024.
-
[42]
Conditional DETR for fast training convergence
Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021.
-
[43]
A ranking-based, balanced loss function unifying classification and localisation in object detection
Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33:15534–15545, 2020.
-
[44]
Rank & sort loss for object detection and instance segmentation
Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Rank & sort loss for object detection and instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3009–3018, 2021.
-
[45]
You only look once: Unified, real-time object detection
J. Redmon. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-
[46]
YOLOv3: An Incremental Improvement
Joseph Redmon. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
-
[47]
YOLO9000: Better, faster, stronger
Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
-
[48]
Generalized intersection over union: A metric and a loss for bounding box regression
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
-
[49]
Efficient attention: Attention with linear complexities
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531–3539, 2021.
-
[50]
Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration
Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
-
[51]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
-
[52]
Going deeper with image transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.
-
[53]
YOLOv10: Real-time end-to-end object detection
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. YOLOv10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024.
-
[55]
Gold-YOLO: Efficient object detector via gather-and-distribute mechanism
Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Yunhe Wang, and Kai Han. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems, 36, 2023.
-
[56]
CSPNet: A new backbone that can enhance learning capability of CNN
Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 390–391, 2020.
-
[57]
Designing network design strategies through gradient path analysis
Chien-Yao Wang, Hong-Yuan Mark Liao, and I-Hau Yeh. Designing network design strategies through gradient path analysis. arXiv preprint arXiv:2211.04800, 2022.
-
[58]
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023.
-
[59]
YOLOv9: Learning what you want to learn using programmable gradient information
Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.
-
[60]
End-to-end object detection with fully convolutional network
Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15849–15858, 2021.
-
[61]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
-
[62]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
-
[63]
Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer
Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
-
[64]
Nyströmformer: A Nyström-based algorithm for approximating self-attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14138–14148, 2021.
-
[65]
Glance-and-gaze vision transformer
Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan L. Yuille, and Wei Shen. Glance-and-gaze vision transformer. Advances in Neural Information Processing Systems, 34:12992–13003, 2021.
-
[66]
mixup: Beyond Empirical Risk Minimization
Hongyi Zhang. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
-
[67]
DETRs beat YOLOs on real-time object detection
Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.
-
[68]
Distance-IoU loss: Faster and better learning for bounding box regression
Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12993–13000, 2020.
-
[69]
IoU loss for 2D/3D object detection
Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU loss for 2D/3D object detection. In 2019 International Conference on 3D Vision (3DV), pages 85–94. IEEE, 2019.
-
[70]
AutoAssign: Differentiable label assignment for dense object detection
Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
-
[71]
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
-
[72]
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.