pith. machine review for the scientific record.

arxiv: 2107.08430 · v2 · submitted 2021-07-18 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

YOLOX: Exceeding YOLO Series in 2021

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 10:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords: object detection · YOLO · anchor-free · real-time detection · COCO benchmark · label assignment · decoupled head

The pith

YOLOX turns YOLO detectors anchor-free with a decoupled head and SimOTA assignment to reach higher accuracy at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents YOLOX as an upgrade to the YOLO series that drops anchor-based prediction in favor of an anchor-free approach. It adds a decoupled head that separates classification from box regression and adopts the SimOTA strategy for assigning training labels. These changes yield models that set new accuracy marks on the COCO benchmark while keeping inference fast: the large variant reaches 50.0% AP at 68.9 frames per second on a Tesla V100, beating YOLOv5-L by 1.8 AP points. The same pattern holds from the 0.91M-parameter nano model up through YOLOv3-sized models, and a single YOLOX-L model won the 2021 Streaming Perception Challenge. Developers gain a practical detector family that is easier to train and deploy across edge and server hardware.

Core claim

Switching YOLO to anchor-free detection, adding a decoupled classification-regression head, and replacing prior label assignment with SimOTA yields consistent gains across model scales, reaching 50.0 percent AP on COCO for YOLOX-L at 68.9 FPS on Tesla V100, which exceeds YOLOv5-L by 1.8 percent AP while also topping the CVPR 2021 Streaming Perception Challenge with one model.
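
To make the architectural change concrete, here is a minimal PyTorch sketch of a decoupled, anchor-free head in the spirit of the paper's description: a shared 1x1 stem feeding separate classification and regression branches, with one box and one objectness score per grid cell rather than per anchor. The module structure and names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a decoupled, anchor-free detection head (illustrative,
# not the YOLOX source). Each output cell predicts class scores, a single
# box, and an objectness score; no anchor boxes are involved.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int, width: int = 256):
        super().__init__()
        # Shared 1x1 conv reduces channels, then the branches diverge.
        self.stem = nn.Conv2d(in_channels, width, kernel_size=1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores per cell
        self.box_pred = nn.Conv2d(width, 4, 1)            # anchor-free: one box per cell
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness per cell

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        return self.cls_pred(cls_feat), self.box_pred(reg_feat), self.obj_pred(reg_feat)
```

Keeping classification and regression on separate branches is the point: the two tasks stop competing for the same features, which the paper reports improves both convergence speed and AP.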

What carries the argument

Anchor-free center-point prediction paired with a decoupled head and SimOTA label assignment, which dynamically selects positive samples per ground-truth box via a simplified optimal-transport formulation to improve training stability and final accuracy.
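
A hedged sketch of the dynamic-k matching at the core of SimOTA (simplified from the OTA formulation; the cost matrix, top-10 heuristic, and tie-breaking follow common descriptions of the method, and this is not the authors' code):

```python
# Illustrative SimOTA-style dynamic-k assignment. `cost` is [num_gt, num_cand]
# (classification loss + weighted IoU loss per GT/candidate pair); `ious` is
# the matching [num_gt, num_cand] matrix of pairwise IoUs.
import torch

def simota_assign(cost: torch.Tensor, ious: torch.Tensor) -> torch.Tensor:
    num_gt, num_cand = cost.shape
    matching = torch.zeros_like(cost, dtype=torch.bool)
    # Dynamic k: each GT takes roughly as many positives as its top-10 IoU mass.
    topk_ious, _ = torch.topk(ious, k=min(10, num_cand), dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(dim=1).int(), min=1)
    for g in range(num_gt):
        _, idx = torch.topk(cost[g], k=int(dynamic_ks[g]), largest=False)
        matching[g, idx] = True  # cheapest candidates become positives for GT g
    # A candidate claimed by several GTs keeps only its cheapest match.
    multi = matching.sum(dim=0) > 1
    if multi.any():
        best_gt = cost[:, multi].argmin(dim=0)
        matching[:, multi] = False
        matching[best_gt, multi.nonzero(as_tuple=True)[0]] = True
    return matching
```

The point of the dynamic k is that the number of positives per ground-truth box scales with how many candidates already overlap it well, rather than being fixed a priori, which is what stabilizes training relative to static assignment rules.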

If this is right

  • YOLOX variants deliver better accuracy-speed trade-offs than prior YOLO models at every size from 0.91M parameters upward.
  • A single YOLOX-L model suffices to win streaming perception benchmarks without ensemble methods.
  • The architecture supports direct export to ONNX, TensorRT, NCNN, and OpenVINO for deployment (a minimal export sketch follows this list).
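
For the deployment point above, a minimal export sketch using the standard torch.onnx.export API; the model handle, file name, and 640x640 input are placeholders, and the YOLOX repository ships its own export scripts with more options:

```python
# Hedged sketch of ONNX export via the standard torch.onnx.export API.
# `model` stands in for a loaded detector; paths and shapes are placeholders.
import torch

model.eval()
dummy = torch.randn(1, 3, 640, 640)  # one RGB image at the report's 640x640 resolution
torch.onnx.export(
    model, dummy, "yolox_l.onnx",
    input_names=["images"], output_names=["outputs"],
    opset_version=11,
)
```

The resulting .onnx file is what the TensorRT, NCNN, and OpenVINO toolchains would then consume through their own converters.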

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three changes could raise accuracy in other single-stage detectors that still rely on anchors.
  • In video pipelines the higher frame rate and accuracy together reduce the need for separate tracking modules.
  • Because the gains hold across scales, the approach may generalize to new backbone families without redesigning the head.

Load-bearing premise

The reported accuracy lifts come mainly from the anchor-free shift, decoupled head, and SimOTA rather than from any extra training epochs, data augmentation, or hyperparameter tuning that differs from the YOLOv4 and YOLOv5 baselines.

What would settle it

Train a YOLOv5-L model from scratch using exactly the same data augmentations, optimizer schedule, and hyperparameters reported for YOLOX-L, then measure whether the 1.8 percent AP gap on COCO disappears.
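
A sketch of what that controlled experiment could look like as a harness. The recipe values mirror the schedule the YOLOX report states (300 epochs, SGD with warmup, Mosaic and MixUp, 640x640 input); build_model, train_coco, and evaluate_coco_ap are hypothetical helpers standing in for a real training stack:

```python
# Hypothetical controlled-comparison harness: one shared recipe, two
# architectures. Helper functions are placeholders, not a real API.
SHARED_RECIPE = dict(
    epochs=300,
    optimizer="SGD",
    warmup_epochs=5,
    base_lr=0.01,                      # scaled linearly with batch size
    weight_decay=5e-4,
    augmentations=["mosaic", "mixup"],
    input_size=(640, 640),
)

for arch in ("yolox_l", "yolov5_l"):
    model = build_model(arch)              # hypothetical model factory
    train_coco(model, **SHARED_RECIPE)     # hypothetical trainer
    ap = evaluate_coco_ap(model)           # hypothetical COCO evaluator
    print(f"{arch}: {ap:.1f} AP")
```

If the 1.8 AP gap survives under the pinned recipe, the attribution to the architectural changes holds; if it shrinks, part of the delta belonged to the training protocol.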

read the original abstract

In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector -- YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4-CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces YOLOX, an anchor-free reformulation of the YOLO detector that incorporates a decoupled classification/regression head and the SimOTA label assignment strategy. It reports COCO results across scales, including YOLOX-Nano (0.91M params, 25.3% AP), an improved YOLOv3 (47.3% AP), and YOLOX-L (50.0% AP at 68.9 FPS on V100, exceeding YOLOv5-L by 1.8% AP at comparable parameters), plus first place in the CVPR 2021 Streaming Perception Challenge; code and deployment support (ONNX, TensorRT, NCNN, OpenVINO) are released.

Significance. If the reported gains are attributable to the architectural changes rather than training-protocol differences, the work supplies a practical, high-performance real-time detector that updates the widely used YOLO family with modern components while maintaining strong speed-accuracy trade-offs. The open-source release and deployment tools directly support reproducibility and industrial adoption.

major comments (1)
  1. [Abstract and experimental results] The central claim that YOLOX-L exceeds YOLOv5-L by 1.8% AP rests on comparisons that use a 300-epoch schedule with Mosaic+MixUp augmentation for YOLOX but do not retrain the YOLOv5 architecture under the identical recipe. The internal ablations vary components inside YOLOX only, leaving the fraction of the AP delta due to schedule and hyperparameter differences unquantified and weakening the attribution to the anchor-free design, decoupled head, and SimOTA.
minor comments (2)
  1. [Abstract] State the exact parameter count and FLOPs for YOLOX-L to enable immediate side-by-side comparison with the cited YOLOv4-CSP and YOLOv5-L baselines.
  2. [Methods/experimental setup] Explicitly tabulate the training schedule, augmentation pipeline, and optimizer settings used for YOLOX against those reported in the original YOLOv5 and YOLOv4 papers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental comparisons. We address the major comment point-by-point below and propose targeted revisions to improve clarity on attribution.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The central claim that YOLOX-L exceeds YOLOv5-L by 1.8% AP rests on comparisons that use a 300-epoch schedule with Mosaic+MixUp augmentation for YOLOX but do not retrain the YOLOv5 architecture under the identical recipe. The internal ablations vary components inside YOLOX only, leaving the fraction of the AP delta due to schedule and hyperparameter differences unquantified and weakening the attribution to the anchor-free design, decoupled head, and SimOTA.

    Authors: We agree that the comparison would be stronger with a controlled re-training of YOLOv5-L under the exact 300-epoch Mosaic+MixUp schedule used for YOLOX. The reported YOLOv5-L numbers are taken directly from the official YOLOv5 repository (using its recommended protocol), while our ablations isolate the effect of each YOLOX component (anchor-free, decoupled head, SimOTA) within a fixed training recipe. In the revised version we will (1) explicitly state the training-protocol differences in the experimental section and abstract, (2) add a short paragraph quantifying the contribution of our components via the existing ablations, and (3) include a new row showing a YOLOv3 baseline trained with the same 300-epoch recipe for reference. These changes clarify attribution without requiring a full external re-implementation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results rest on direct experimental comparisons

full rationale

The paper's central claims consist of reported AP and FPS numbers obtained by training modified YOLO architectures (anchor-free, decoupled head, SimOTA label assignment) under a stated 300-epoch schedule with Mosaic+MixUp. These are direct empirical measurements against published baseline numbers for YOLOv5-L, YOLOv4-CSP, etc.; no equations, predictions, or first-principles derivations are presented that reduce to fitted parameters or self-citations by construction. Internal ablations vary components inside the YOLOX recipe but do not create self-referential loops. The result is therefore self-contained against external benchmarks and receives score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical computer-vision paper; it relies on standard dataset and evaluation assumptions without introducing new theoretical entities or free parameters beyond typical deep-learning hyperparameters.

axioms (1)
  • domain assumption: COCO dataset annotations and evaluation protocol are accurate and representative for object-detection performance measurement.
    All reported AP numbers depend on this protocol (a minimal sketch of that protocol follows).
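
For reference, the protocol this axiom points at is the standard pycocotools evaluation loop; a minimal sketch, with annotation and detection file paths as placeholders:

```python
# Standard COCO bbox evaluation via pycocotools; paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")      # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")  # detections in COCO result format
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP, AP50, AP75, and the size-stratified metrics
```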

pith-pipeline@v0.9.0 · 5593 in / 1075 out tokens · 92301 ms · 2026-05-13T10:27:58.064967+00:00 · methodology


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

    cs.CV 2026-05 unverdicted novelty 7.0

    CUTAL scores multi-frame clips for uncertainty and enforces temporal diversity to train transformer MOT models to near full-supervision performance with 50% of the labels.

  2. LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

    cs.CV 2026-05 unverdicted novelty 7.0

    LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.

  3. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  4. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  6. WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

    cs.CV 2026-04 unverdicted novelty 7.0

    WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

  7. A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

    cs.CV 2026-05 unverdicted novelty 6.0

    Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.

  8. CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

    cs.CV 2026-05 unverdicted novelty 6.0

    CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested...

  9. FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 36...

  10. GateMOT: Q-Gated Attention for Dense Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

  11. CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras

    cs.CV 2026-04 unverdicted novelty 6.0

    CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.

  12. Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

    cs.CV 2026-04 unverdicted novelty 6.0

    VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.

  13. Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

    cs.CV 2026-03 conditional novelty 6.0

    Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.

  14. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    cs.CV 2022-03 conditional novelty 6.0

    DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

  15. Portable Active Learning for Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    PAL is a portable active learning method for object detection that uses class-specific logistic classifiers for uncertainty and image-level diversity to select annotation batches, showing better label efficiency than ...

  16. Utility-Aware Progressive Inference over UDP Packet Blocks for Emergency Communications

    eess.SP 2026-05 unverdicted novelty 5.0

    Utility-aware progressive inference on UDP packet blocks enables early hazard recognition, reducing packet budget by 34.2% and decision delay by 1209 ms while retaining 91.5% of full-reception accuracy.

  17. SAMOFT: Robust Multi-Object Tracking via Region and Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    SAMOFT improves multi-object tracking by using SAM segmentation and optical flow for pixel-level motion matching, flexible centroid correction, and training-free motion pattern fixes on top of standard Kalman and ReID...

  18. Time-series Meets Complex Motion Modeling: Robust and Computational-effective Motion Predictor for Multi-object Tracking

    cs.CV 2026-05 unverdicted novelty 5.0

    TCMP achieves SOTA MOT metrics (HOTA 63.4%, IDF1 65.0%, AssA 49.1%) with 0.014x parameters and 0.05x FLOPs of the previous best method by using a simple dilated TCN regressor.

  19. SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maint...

  20. Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

    cs.CV 2026-04 unverdicted novelty 5.0

    HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.

  21. Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

    cs.CL 2026-04 unverdicted novelty 5.0

    Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.

  22. Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning

    cs.RO 2026-05 unverdicted novelty 4.0

    A dual-LLM hierarchical framework for robotic task and motion planning, integrating object detection, achieves 86% success across 24 test scenarios ranging from simple spatial commands to infeasible requests.

  23. Hybrid Visual Telemetry for Bandwidth-Constrained Robotic Vision: A Pilot Study with HEVC Base Video and JPEG ROI Stills

    cs.CV 2026-05 unverdicted novelty 4.0

    A hybrid scheme using HEVC video for continuous awareness plus selective JPEG ROI stills for detail refinement is formalized and experimentally compared to video-only transmission under matched bitrate budgets for rob...

  24. Fast Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation

    cs.CV 2026-04 unverdicted novelty 4.0

    An efficient implementation of a Bayes-optimal filter performs fast 3D multi-camera tracking and pose estimation from 2D inputs while handling intermittent camera disconnections.

  25. InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard

    cs.AR 2026-04 unverdicted novelty 4.0

    InsightBoard integrates synchronized multi-metric plots, correlation analysis, and group fairness indicators into TensorBoard to reveal subgroup disparities that aggregate metrics hide during model training.

  26. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  27. 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

    cs.CV 2026-04 unverdicted novelty 2.0

    The report overviews five maritime computer vision benchmark challenges, their datasets, protocols, quantitative results, and top team approaches from the MaCVi 2026 workshop.

  28. YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection

    cs.CV 2026-04 unverdicted novelty 2.0

    YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 26 Pith papers · 3 internal anchors

  1. [1]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

  2. [2]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

  3. [3]

    You only look one-level feature

    Qiang Chen, Yingming Wang, Tong Yang, Xiangyu Zhang, Jian Cheng, and Jian Sun. You only look one-level feature. In CVPR, 2021.

  4. [4]

    OTA: Optimal transport assignment for object detection

    Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. OTA: Optimal transport assignment for object detection. In CVPR, 2021.

  5. [5]

    Lla: Loss-aware label assignment for dense pedestrian detection

    Zheng Ge, Jianfeng Wang, Xin Huang, Songtao Liu, and Osamu Yoshie. LLA: Loss-aware label assignment for dense pedestrian detection. arXiv preprint arXiv:2101.04307, 2021.

  6. [6]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.

  7. [7]

    YOLOv5

    Glenn Jocher et al. YOLOv5. https://github.com/ultralytics/yolov5, 2021.

  8. [8]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

  9. [9]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

  10. [10]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2018.

  11. [11]

    Pp-yolov2: A practical object detector

    Xin Huang, Xinxin Wang, Wenyu Lv, Xiaying Bai, Xiang Long, Kaipeng Deng, Qingqing Dang, Shumin Han, Qiwen Liu, Xiaoguang Hu, et al. PP-YOLOv2: A practical object detector. arXiv preprint arXiv:2104.10419, 2021.

  12. [12]

    Probabilistic anchor assignment with IoU prediction for object detection

    Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with IoU prediction for object detection. In ECCV, 2020.

  13. [13]

    Parallel feature pyramid network for object detection

    Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In ECCV, 2018.

  14. [14]

    Cornernet: Detecting objects as paired keypoints

    Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.

  15. [15]

    Towards streaming perception

    Mengtian Li, Yuxiong Wang, and Deva Ramanan. Towards streaming perception. In ECCV, 2020.

  16. [16]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.

  17. [17]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

  18. [18]

    Learning spatial fusion for single-shot object detection

    Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019.

  19. [19]

    Path aggregation network for instance segmentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.

  20. [20]

    Path aggregation network for instance segmentation

    Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.

  21. [21]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

  22. [22]

    IQDet: Instance-wise quality distribution sampling for object detection

    Yuchen Ma, Songtao Liu, Zeming Li, and Jian Sun. IQDet: Instance-wise quality distribution sampling for object detection. In CVPR, 2021.

  23. [23]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

  24. [24]

    Yolo9000: Better, faster, stronger

    Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.

  25. [25]

    YOLOv3: An Incremental Improvement

    Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

  26. [26]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

  27. [27]

    Revisiting the sibling head in object detector

    Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In CVPR, 2020.

  28. [28]

    Efficientdet: Scalable and efficient object detection

    Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.

  29. [29]

    Fcos: Fully convolutional one-stage object detection

    Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.

  30. [30]

    Scaled-yolov4: Scaling cross stage partial network

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Scaled-YOLOv4: Scaling cross stage partial network. arXiv preprint arXiv:2011.08036, 2020.

  31. [31]

    Cspnet: A new backbone that can enhance learning capability of cnn

    Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In CVPR Workshops, 2020.

  32. [32]

    End-to-end object detection with fully convolutional network

    Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In CVPR, 2020.

  33. [33]

    End-to-end object detection with fully convolutional network

    Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In CVPR, 2021.

  34. [35]

    Rethinking classification and localization for object detection

    Yue Wu, Yinpeng Chen, Lu Yuan, Zicheng Liu, Lijuan Wang, Hongzhi Li, and Yun Fu. Rethinking classification and localization for object detection. In CVPR, 2020.

  35. [36]

    Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection

    Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.

  36. [37]

    FreeAnchor: Learning to match anchors for visual object detection

    Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In NeurIPS, 2019.

  37. [38]

    Bag of freebies for training object detection neural networks

    Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103, 2019.

  38. [39]

    Object detection made simpler by eliminating heuristic nms

    Qiang Zhou, Chaohui Yu, Chunhua Shen, Zhibin Wang, and Hao Li. Object detection made simpler by eliminating heuristic NMS. arXiv preprint arXiv:2101.11782, 2021.

  39. [40]

    Objects as points

    Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.

  40. [41]

    AutoAssign: Differentiable label assignment for dense object detection

    Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. AutoAssign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.