YOLOv12: Attention-Centric Real-Time Object Detectors
Pith reviewed 2026-05-13 21:30 UTC · model grok-4.3
The pith
YOLOv12 centers its architecture on attention mechanisms to exceed the accuracy of prior real-time object detectors while keeping inference speeds comparable to CNN-based YOLO models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YOLOv12 is an attention-centric YOLO framework that matches the speed of previous CNN-based models while delivering higher accuracy, surpassing popular real-time detectors such as YOLOv10-N, YOLOv11-N, and RT-DETR variants on standard benchmarks.
What carries the argument
The attention-centric architectural changes in YOLOv12 that enable attention mechanisms to run at CNN-comparable speeds while retaining their modeling advantages.
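For concreteness, here is a minimal window-partitioned self-attention layer in PyTorch. It is a generic pattern from the windowed-attention literature, not YOLOv12's actual module, and is meant only to show how attention cost can be held linear in token count for a fixed window size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSelfAttention(nn.Module):
    """Generic window-partitioned self-attention (illustrative sketch, not
    YOLOv12's module): tokens attend only within w x w windows, so cost grows
    linearly with token count for a fixed window size."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape                      # (batch, height, width, channels)
        w = self.window                           # H and W assumed divisible by w
        # Partition into non-overlapping w x w windows: (B * num_windows, w*w, C).
        x = x.view(B, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        n = x.shape[0]
        qkv = self.qkv(x).reshape(n, w * w, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each (n, heads, w*w, head_dim)
        # SDPA can dispatch to FlashAttention-style kernels on supported GPUs.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(n, w * w, C)
        out = self.proj(out)
        # Undo the window partition back to (B, H, W, C).
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```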
If this is right
- YOLOv12-N reaches 40.6 percent mAP at 1.64 ms inference latency on a T4 GPU, exceeding YOLOv10-N and YOLOv11-N by 2.1 and 1.2 percent mAP respectively.
- YOLOv12-S runs 42 percent faster than RT-DETR-R18 while using 36 percent of the computation and 45 percent of the parameters.
- The accuracy advantage holds across multiple model scales from nano to larger variants.
- Attention mechanisms become viable as the primary backbone for real-time object detection without custom hardware.
Where Pith is reading between the lines
- Designers of other real-time vision systems may shift priority from CNN blocks to attention blocks once speed parity is shown feasible.
- The result suggests that targeted architectural tuning can close the efficiency gap between attention and convolution in latency-sensitive tasks.
- Future work could test whether the same attention-centric pattern transfers to related problems such as real-time instance segmentation or video object tracking.
Load-bearing premise
The specific attention mechanisms and any accompanying optimizations can be implemented to run at speeds matching CNN-based YOLO models on standard hardware.
What would settle it
A side-by-side benchmark on COCO showing YOLOv12-N achieving lower mAP than YOLOv11-N at equal or higher latency on a T4 GPU would falsify the central performance claim.
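A hedged sketch of that settling test, assuming the Ultralytics-style Python API; the checkpoint names below are hypothetical placeholders, and metrics.box.map / metrics.speed are assumed to expose COCO mAP50-95 and per-image timing as that API documents:

```python
# Compare COCO mAP50-95 and per-image inference time for the two nano models.
# Checkpoint names are hypothetical; substitute the actual released weights.
from ultralytics import YOLO

for weights in ("yolo11n.pt", "yolo12n.pt"):
    metrics = YOLO(weights).val(data="coco.yaml", imgsz=640, batch=1, half=True)
    print(weights,
          f"mAP50-95 = {metrics.box.map:.3f}",
          f"inference = {metrics.speed['inference']:.2f} ms/img")
```

If YOLOv12-N came out below YOLOv11-N on mAP at equal or higher latency under this protocol, the central claim would fail.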
original abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces YOLOv12, an attention-centric real-time object detector that replaces or augments CNN components with attention mechanisms while claiming to retain CNN-comparable inference speeds. It reports that YOLOv12-N achieves 40.6% mAP at 1.64 ms latency on a T4 GPU, outperforming YOLOv10-N and YOLOv11-N by 2.1% and 1.2% mAP respectively, with similar advantages across scales and over RT-DETR variants in speed, compute, and parameters.
Significance. If the efficiency claims hold, the result would be significant for real-time detection by showing that attention can deliver measurable accuracy gains without the usual quadratic latency penalty, potentially shifting design paradigms away from pure CNN backbones. The concrete benchmark numbers and cross-family comparisons provide falsifiable predictions that could be directly tested on standard hardware.
major comments (2)
- [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on a T4 GPU for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, a windowed or linear attention formulation, or FlashAttention integration) showing how quadratic costs are avoided; see the FLOPs sketch after this list. Without such an analysis, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.
- [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions. This prevents verifying that the gains stem from the attention-centric changes rather than from training differences.
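To ground the first major comment, a back-of-envelope FLOPs comparison between global and windowed self-attention; the formulas are the standard matmul estimates, not figures taken from the paper:

```python
def attn_matmul_flops(n_tokens: int, dim: int, window_tokens: int | None = None) -> int:
    """Approximate FLOPs of the QK^T and attention-value matmuls in one
    self-attention layer (projections excluded; standard estimate)."""
    if window_tokens is None:
        return 2 * n_tokens**2 * dim           # global attention: quadratic in n
    return 2 * n_tokens * window_tokens * dim  # windowed: linear in n for fixed window

# A 640x640 input downsampled 8x yields an 80x80 = 6400-token feature map.
n, d = 80 * 80, 128
print(attn_matmul_flops(n, d) // attn_matmul_flops(n, d, window_tokens=64))  # -> 100
```

This is the roughly 100x gap the referee asks the authors to account for explicitly.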
minor comments (2)
- [Figure 1] Figure 1 caption and latency table: confirm that all reported latencies use the same T4 GPU, batch size 1, and FP16/INT8 precision to ensure an apples-to-apples comparison; a measurement-protocol sketch follows this list.
- [Section 3] Notation for model scales (N/S/M/L/X): explicitly define how the attention module widths and depths scale with these variants to allow reproduction.
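For the first minor comment, a measurement-protocol sketch in PyTorch: batch size 1, FP16, CUDA-event timing with warmup, as one plausible apples-to-apples harness (not the paper's exact setup):

```python
import torch

def latency_ms(model: torch.nn.Module, imgsz: int = 640, iters: int = 200) -> float:
    """Batch-1 FP16 latency via CUDA events; warmup iterations excluded."""
    model = model.half().eval().cuda()
    x = torch.randn(1, 3, imgsz, imgsz, dtype=torch.half, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.inference_mode():
        for _ in range(50):              # warmup: stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per image
```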
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our YOLOv12 manuscript. We have revised the paper to incorporate explicit complexity analysis and experimental protocol details, addressing the concerns while preserving the core contributions.
point-by-point responses
-
Referee: [Abstract and Section 3] Abstract and architecture description: the central claim that attention-centric modules achieve 1.64 ms latency on T4 for the N-scale model while improving mAP requires an explicit complexity analysis (FLOPs scaling, windowed/linear attention formulation, or FlashAttention integration) showing how quadratic costs are eliminated; without this, it is unclear whether the reported speed derives from the attention design or from unstated CNN fallbacks or resolution reductions.
Authors: We agree that an explicit complexity analysis strengthens the validation of our efficiency claims. In the revised manuscript, Section 3 now includes a dedicated complexity-analysis subsection. It details the FLOPs scaling of the attention modules, the windowed and linear attention formulations that achieve linear complexity in token count, and the FlashAttention integration that avoids materializing the quadratic attention matrix. This confirms that the reported 1.64 ms latency on a T4 GPU for YOLOv12-N arises directly from the attention-centric design, without CNN fallbacks or resolution reductions. revision: yes
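As an illustration of the FlashAttention point, PyTorch (2.3 or later) can pin scaled dot-product attention to a FlashAttention-style kernel; this is a generic usage sketch, not the paper's integration. FlashAttention keeps the n x n attention matrix out of off-chip memory, cutting memory traffic to linear, even though the FLOP count remains quadratic in token count:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # assumes PyTorch >= 2.3

# One head-batch of 6400 tokens (an 80x80 feature map), FP16 on GPU.
q = k = v = torch.randn(1, 4, 6400, 32, dtype=torch.half, device="cuda")
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # force the FlashAttention backend
    out = F.scaled_dot_product_attention(q, k, v)
```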
-
Referee: [Experiments] Experiments section: the 2.1%/1.2% mAP gains over YOLOv10-N/YOLOv11-N and the 42% speed advantage over RT-DETR-R18 are load-bearing for the 'surpasses all popular real-time detectors' claim, yet no details are supplied on whether all models use identical training schedules, augmentation pipelines, or input resolutions; this prevents verification that the gains are attributable to the attention-centric changes rather than training differences.
Authors: We acknowledge the need for transparent experimental details to ensure fair comparisons. The revised Experiments section now includes an explicit subsection describing the training protocols: all compared models (YOLOv10-N, YOLOv11-N, and the RT-DETR variants) were trained and evaluated with identical schedules, augmentation pipelines, and input resolutions, following their original papers and standard COCO benchmark settings. This confirms that the mAP and speed gains are attributable to YOLOv12's attention-centric architecture; a matched-protocol sketch follows. revision: yes
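A matched-protocol sketch, assuming the Ultralytics-style training API; the model config names are hypothetical placeholders, and the point is only that every run shares one hyperparameter dictionary:

```python
from ultralytics import YOLO

# Identical schedule, resolution, batch, and seed for every compared model;
# config names are hypothetical placeholders for the released model definitions.
shared = dict(data="coco.yaml", imgsz=640, epochs=600, batch=256, seed=0)
for cfg in ("yolo11n.yaml", "yolo12n.yaml"):
    YOLO(cfg).train(**shared)
```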
Circularity Check
No circularity; empirical architecture proposal with benchmark results
full rationale
The paper introduces YOLOv12 as an attention-centric YOLO variant and supports its claims solely through empirical benchmark comparisons (e.g., mAP and latency numbers on a T4 GPU against YOLOv10/YOLOv11 and RT-DETR variants). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on experimental outcomes rather than on conclusions built into the setup, leaving the work checkable against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale definitions (N/S/M/L/X)
axioms (1)
- domain assumption: attention mechanisms have superior modeling capabilities compared with CNNs
Forward citations
Cited by 23 Pith papers
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain ...
-
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering repor...
-
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
-
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
-
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
-
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
-
Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
-
InsHuman: Towards Natural and Identity-Preserving Human Insertion
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
-
Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
-
StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network
StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M...
-
A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and ...
-
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
-
Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model
YOLOv12 with Otsu thresholding on cell-based segmentation classifies AML cells at 99.3% validation and test accuracy.
-
FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
-
Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
-
Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface
A local multi-agent framework integrates YOLO object detection with Slack-Ollama natural language control entirely on Raspberry Pi hardware.
Reference graph
Works this paper leans on
-
[1]
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
-
[2]
Low-rank bottleneck in multi-head attention models
Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. In International Conference on Machine Learning, pages 864–873. PMLR, 2020.
-
[3]
YOLOv4: Optimal Speed and Accuracy of Object Detection
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
-
[4]
Anomaly detection in autonomous driving: A survey
Daniel Bogdoll, Maximilian Nitsche, and J. Marius Zöllner. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4488–4499, 2022.
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
-
[6]
Albumentations: Fast and flexible image augmentations
Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and flexible image augmentations. Information, 11(2):125, 2020.
-
[7]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
-
[8]
AP-loss for accurate one-stage object detection
Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. AP-loss for accurate one-stage object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3782–3798, 2020.
-
[9]
YOLO-MS: Rethinking multi-scale representation learning for real-time object detection
Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, and Ming-Ming Cheng. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. arXiv preprint arXiv:2308.05480, 2023.
-
[11]
Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
-
[12]
Twins: Revisiting the design of spatial attention in vision transformers
Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021.
-
[13]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
-
[14]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
-
[15]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.
-
[16]
CSWin Transformer: A general vision transformer backbone with cross-shaped windows
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022.
-
[17]
Mobile robot navigation using an object recognition software with RGBD images and the YOLO algorithm
Douglas Henke Dos Reis, Daniel Welfer, Marco Antonio De Souza Leite Cuadros, and Daniel Fernando Tello Gamarra. Mobile robot navigation using an object recognition software with RGBD images and the YOLO algorithm. Applied Artificial Intelligence, 33(14):1290–1305, 2019.
-
[18]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-
[19]
EVA: Exploring the limits of masked visual representation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
-
[21]
EVA-02: A visual representation for neon genesis
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171,
-
[22]
TOOD: Task-aligned one-stage object detection
Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R. Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE Computer Society, 2021.
-
[23]
OTA: Optimal transport assignment for object detection
Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
-
[24]
YOLOv8
Glenn Jocher. YOLOv8. https://github.com/ultralytics/ultralytics/tree/main, 2023.
-
[25]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
-
[26]
Axial attention in multidimensional transformers
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
-
[27]
CCNet: Criss-cross attention for semantic segmentation
Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
-
[28]
YOLOv11
Glenn Jocher. YOLOv11. https://github.com/ultralytics, 2024.
-
[29]
YOLOv5
Glenn Jocher, K. Nishimura, T. Mineeva, and RJAM Vilariño. YOLOv5. https://github.com/ultralytics/yolov5/tree, 2, 2020.
-
[30]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
-
[33]
DN-DETR: Accelerate DETR training by introducing query denoising
Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
-
[34]
A dual weighting label assignment scheme for object detection
Shuai Li, Chenhang He, Ruihuang Li, and Lei Zhang. A dual weighting label assignment scheme for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9387–9396, 2022.
-
[35]
Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection
Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020.
-
[36]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
-
[37]
DAB-DETR: Dynamic anchor boxes are better queries for DETR
Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
-
[38]
VMamba: Visual state space model
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. VMamba: Visual state space model. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
-
[39]
Swin Transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
-
[41]
RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer
Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140, 2024.
-
[42]
Conditional DETR for fast training convergence
Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021.
-
[43]
A ranking-based, balanced loss function unifying classification and localisation in object detection
Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33:15534–15545, 2020.
-
[44]
Rank & sort loss for object detection and instance segmentation
Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Rank & sort loss for object detection and instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3009–3018, 2021.
-
[45]
You only look once: Unified, real-time object detection
J. Redmon. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-
[46]
YOLOv3: An Incremental Improvement
Joseph Redmon. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
-
[47]
YOLO9000: Better, faster, stronger
Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7263–7271, 2017.
-
[48]
Generalized intersection over union: A metric and a loss for bounding box regression
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
-
[49]
Efficient attention: Attention with linear complexities
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531–3539, 2021.
-
[50]
Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration
Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
-
[51]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
-
[52]
Going deeper with image transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.
-
[53]
YOLOv10: Real-time end-to-end object detection
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. YOLOv10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024.
-
[55]
Gold-YOLO: Efficient object detector via gather-and-distribute mechanism
Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Yunhe Wang, and Kai Han. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems, 36, 2023.
-
[56]
CSPNet: A new backbone that can enhance learning capability of CNN
Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 390–391, 2020.
-
[57]
Designing network design strategies through gradient path analysis
Chien-Yao Wang, Hong-Yuan Mark Liao, and I-Hau Yeh. Designing network design strategies through gradient path analysis. arXiv preprint arXiv:2211.04800, 2022.
-
[58]
YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023.
-
[59]
YOLOv9: Learning what you want to learn using programmable gradient information
Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.
-
[60]
End-to-end object detection with fully convolutional network
Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15849–15858, 2021.
-
[61]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
-
[62]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
-
[63]
Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer
Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
-
[64]
Nyströmformer: A Nyström-based algorithm for approximating self-attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14138–14148, 2021.
-
[65]
Glance-and-gaze vision transformer
Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan L. Yuille, and Wei Shen. Glance-and-gaze vision transformer. Advances in Neural Information Processing Systems, 34:12992–13003, 2021.
-
[66]
mixup: Beyond Empirical Risk Minimization
Hongyi Zhang. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
-
[67]
DETRs beat YOLOs on real-time object detection
Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.
-
[68]
Distance-IoU loss: Faster and better learning for bounding box regression
Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12993–13000, 2020.
-
[69]
IoU loss for 2D/3D object detection
Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU loss for 2D/3D object detection. In 2019 International Conference on 3D Vision (3DV), pages 85–94. IEEE, 2019.
-
[70]
AutoAssign: Differentiable label assignment for dense object detection
Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
-
[71]
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
-
[72]
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.