arxiv: 1804.02767 · v1 · submitted 2018-04-08 · 💻 cs.CV

Recognition: 2 theorem links


YOLOv3: An Incremental Improvement

Joseph Redmon, Ali Farhadi


Pith reviewed 2026-05-13 11:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords object detection, real-time detection, YOLO, convolutional networks, accuracy speed tradeoff, incremental design


The pith

YOLOv3 reaches SSD-level accuracy three times faster through incremental design changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a series of small updates to the YOLO object detection model. These changes aim to boost accuracy without sacrificing the model's speed advantage. Results show the new version matches the accuracy of SSD at 320x320 resolution while running three times faster, and matches RetinaNet on the mAP@50 metric (57.9 versus 57.5) while running 3.8 times faster. A sympathetic reader would care because faster detection opens up more uses in real-time applications like video surveillance and autonomous systems. The work builds on prior YOLO versions by refining the network rather than overhauling it.

Core claim

YOLOv3 incorporates a number of little design changes and trains a new network that is slightly larger but more accurate. At 320 by 320 input it runs in 22 milliseconds with 28.2 mean average precision, matching SSD accuracy at three times the speed. On the .5 IOU metric it reaches 57.9 mAP in 51 milliseconds on a Titan X, close to RetinaNet's 57.5 mAP but 3.8 times faster.
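
The speed ratios quoted above follow from the reported latencies. A minimal sketch of that arithmetic, using only the figures given in the abstract; the function names are illustrative and the FPS conversion assumes one image per forward pass with no batching:

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def frames_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency (assumes no batching)."""
    return 1000.0 / latency_ms


# Latencies as reported in the abstract.
RETINANET_MS = 198.0
YOLOV3_MS = 51.0
YOLOV3_320_MS = 22.0

retinanet_ratio = speedup(RETINANET_MS, YOLOV3_MS)
print(f"YOLOv3 vs RetinaNet: {retinanet_ratio:.2f}x")
print(f"YOLOv3 at 51 ms: {frames_per_second(YOLOV3_MS):.1f} FPS")
print(f"YOLOv3-320 at 22 ms: {frames_per_second(YOLOV3_320_MS):.1f} FPS")
```

The ratio 198 / 51 comes out slightly under 3.9, which the abstract reports as 3.8x.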

What carries the argument

The updated YOLO network architecture with incremental design changes that improve feature handling and prediction accuracy while preserving fast inference.

If this is right

Object detection systems can process more frames per second on standard hardware.
Applications needing real-time performance gain better accuracy options without added latency.
Model refinement techniques prove effective for balancing speed and precision in detection tasks.
Similar incremental updates could extend the usable life of other detector families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These tweaks might transfer to other single-shot detectors to achieve comparable gains.
Testing on edge devices would reveal if the speed benefits hold in constrained environments.
Longer-term, this suggests focusing on optimization over architecture invention for practical gains.

Load-bearing premise

That the measured improvements come primarily from the described design changes rather than from specific training details or evaluation conditions.

What would settle it

Independent reproduction of the training and testing that yields significantly lower accuracy or slower inference times than reported.

Original abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

YOLOv3 is a straightforward set of tweaks that lift accuracy on standard detection benchmarks while keeping the speed edge, with code that makes the numbers checkable.


The main takeaway is that this version refines the prior YOLO model through several small architecture and training adjustments, producing higher mAP at comparable or better inference times. The concrete results are the 320x320 model at 22 ms with 28.2 mAP matching SSD but running three times faster, and the 57.9 mAP@50 in 51 ms on Titan X versus RetinaNet's 57.5 in 198 ms. These are direct measurements on COCO with external baselines, not derived claims.

Referee Report

2 major / 3 minor

Summary. The manuscript presents YOLOv3 as an incremental update to prior YOLO detectors, incorporating design changes including a Darknet-53 backbone, multi-scale feature prediction via a feature-pyramid-like structure, and logistic classifiers for independent class predictions. It reports direct empirical measurements on COCO, claiming that at 320x320 input YOLOv3 reaches 28.2 mAP in 22 ms (matching SSD accuracy at 3x speed) and 57.9 mAP@0.5 in 51 ms on Titan X (comparable to RetinaNet's 57.5 mAP@0.5 in 198 ms, at 3.8x speed). All code is released for verification.
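
The summary mentions logistic classifiers for independent class predictions. A minimal sketch of how that differs from a softmax, in plain Python; the label names and logit values are made up for illustration and are not taken from the paper's code:

```python
import math


def softmax(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_logistic(logits):
    """One sigmoid per class; each score is judged on its own."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits for two overlapping labels plus an unrelated class.
labels = ["person", "woman", "car"]
logits = [3.0, 2.5, -4.0]

soft = softmax(logits)
logi = independent_logistic(logits)

for name, s, l in zip(labels, soft, logi):
    print(f"{name:>6}: softmax={s:.3f}  logistic={l:.3f}")
```

With a 0.5 cutoff, the softmax scores allow only one of the two overlapping labels to pass because they compete for the same probability mass, while the per-class logistic scores let both pass.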

Significance. If the reported timings and mAP values hold under the released code, the work supplies a strong, practical real-time detector baseline that improves the speed-accuracy trade-off over prior single-stage methods. The open-source release and direct external comparisons add substantial value for reproducibility and follow-on research in computer vision.

major comments (2)

[Experiments] Experiments section: no ablation studies isolate the contribution of individual changes (e.g., backbone swap, multi-scale heads, or logistic vs. softmax classification) to the measured mAP gains; without them the central claim that the listed incremental updates are responsible for the accuracy improvements remains correlational.
[Results] Results paragraph on RetinaNet comparison: the 57.9 vs. 57.5 mAP@0.5 numbers are reported, yet the paper does not provide the corresponding mAP@[.5:.95] figures for both models on the same split, weakening the direct performance equivalence claim under the standard COCO metric.
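
The gap between mAP@0.5 and mAP@[.5:.95] raised in the second comment comes down to how strict the box-overlap threshold is. A simplified sketch with a single hypothetical ground-truth box and prediction; real COCO evaluation also ranks detections by confidence and integrates precision over recall, which is omitted here:

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

# Hypothetical ground-truth box and a horizontally shifted prediction.
truth = (0.0, 0.0, 100.0, 100.0)
pred = (20.0, 0.0, 120.0, 100.0)

overlap = iou(truth, pred)
matches = [overlap >= t for t in THRESHOLDS]
print(f"IoU = {overlap:.3f}")
print(f"counts as a match at {sum(matches)} of {len(THRESHOLDS)} thresholds")
```

A detector whose boxes are roughly right but loosely localized can therefore score well at the 0.5 threshold and noticeably lower once the stricter thresholds are averaged in.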

minor comments (3)

[Abstract] Abstract and introduction contain informal phrasing (e.g., 'a bunch of little design changes', 'pretty swell') that should be revised to match journal standards.
[Training] The manuscript would benefit from explicit statements of the exact training schedule, data augmentations, and optimizer settings used to obtain the quoted mAP numbers, even though code is released.
[Architecture] Figure 1 (network diagram) lacks layer-by-layer channel counts or residual-block details, making it harder to verify architectural differences from YOLOv2 without inspecting the code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments on the experimental presentation. We respond to each major comment below.


Referee: [Experiments] Experiments section: no ablation studies isolate the contribution of individual changes (e.g., backbone swap, multi-scale heads, or logistic vs. softmax classification) to the measured mAP gains; without them the central claim that the listed incremental updates are responsible for the accuracy improvements remains correlational.

Authors: We agree that ablation studies would provide stronger causal evidence for the contribution of each design change. This manuscript, however, presents YOLOv3 as a practical, incremental update whose primary goal is to deliver a strong real-time baseline with released code. Each modification is described and motivated by prior work, and the overall system is validated through direct COCO comparisons. We did not run the additional controlled ablations in the original study, and incorporating them would require substantial new training that falls outside the scope of this short incremental paper. We therefore do not plan to add ablation experiments in the revision.
Revision: no

Referee: [Results] Results paragraph on RetinaNet comparison: the 57.9 vs. 57.5 mAP@0.5 numbers are reported, yet the paper does not provide the corresponding mAP@[.5:.95] figures for both models on the same split, weakening the direct performance equivalence claim under the standard COCO metric.

Authors: We acknowledge that reporting mAP@[.5:.95] would allow a fuller comparison under the primary COCO metric. The equivalence statement in the manuscript is explicitly tied to the mAP@0.5 numbers published by the RetinaNet authors, which is the metric they highlighted for that speed-accuracy operating point. In the revision we will add YOLOv3’s mAP@[.5:.95] result for completeness and will clarify that the direct numerical comparison remains under mAP@0.5 because we rely on the originally reported RetinaNet figures rather than re-evaluating their model on an identical split.
Revision: yes
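
For the first comment, the ablation the referee asks for amounts to training one model per combination of the three named changes. A sketch of enumerating that grid; the setting names are hypothetical placeholders and no training is run here:

```python
from itertools import product

# The three changes named in the referee comment; keys and values are
# illustrative labels, not identifiers from the released code.
SWITCHES = {
    "backbone": ("previous", "darknet53"),
    "heads": ("single_scale", "multi_scale"),
    "classifier": ("softmax", "logistic"),
}


def ablation_grid(switches):
    """Return every combination of settings as a list of dicts."""
    names = list(switches)
    return [dict(zip(names, combo)) for combo in product(*switches.values())]


runs = ablation_grid(SWITCHES)
for index, config in enumerate(runs):
    print(index, config)
```

Three binary switches give eight training runs, which illustrates the extra training cost the simulated authors cite when declining the request.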

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarks only


The paper reports direct empirical measurements of accuracy (mAP) and inference speed on standard COCO benchmarks, with comparisons to independently published external models (SSD, RetinaNet). Design changes are described narratively and their effects are measured experimentally rather than derived. No mathematical equations, predictions, or uniqueness theorems appear that could reduce to self-fitted inputs or self-citations by construction. Released code further supports external reproducibility, keeping the central claims independent of any internal circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical evaluation of an updated convolutional network on standard object detection benchmarks; no new mathematical derivations or invented physical entities are introduced.

free parameters (1)

network architecture hyperparameters
Specific layer counts, filter sizes, and training schedule choices that define the 'little bigger' network.

axioms (1)

domain assumption: Standard mAP and mAP@50 metrics on COCO or similar benchmarks are appropriate proxies for detection quality.
Invoked when reporting and comparing accuracy figures.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet

Foundation.DimensionForcing alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: We present some updates to YOLO! We made a bunch of little design changes to make it better.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments
cs.CV 2026-04 unverdicted novelty 7.0

CLIP language prompts guide a new weighted cross-entropy loss (CLIP-CE via AME and FAME) to boost object detection performance in hazy images, outperforming image enhancement baselines on the introduced HazyCOCO dataset.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
cs.CV 2026-04 unverdicted novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
Towards Continuous Sign Language Conversation from Isolated Signs
cs.CV 2026-05 unverdicted novelty 6.0

Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours
cs.CV 2026-05 unverdicted novelty 6.0

FS-FSD regresses frequency-supervised Fourier contours for bridge defects, yielding higher polygon accuracy and better geometric quality than box, mask, or contour baselines on 3,767 UAV images with 42,346 instances.
UniISP: A Unified ISP Framework for Both Human and Machine Vision
cs.CV 2026-05 unverdicted novelty 6.0

UniISP unifies ISP processing with a Hybrid Attention Module and Feature Adapter to produce images that are both visually pleasing for humans and informative for computer vision models.
Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
cs.CV 2026-04 unverdicted novelty 6.0

TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE
cs.CV 2026-04 unverdicted novelty 6.0

IA-CLAHE trains a lightweight network on a differentiable CLAHE extension to predict per-tile clip limits that drive local histograms toward a uniform distribution, delivering zero-shot gains in recognition accuracy a...
DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
cs.CV 2026-04 unverdicted novelty 6.0

DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
cs.CV 2026-03 conditional novelty 6.0

Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
cs.CV 2022-03 conditional novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
YOLOX: Exceeding YOLO Series in 2021
cs.CV 2021-07 accept novelty 6.0

YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
cs.CV 2026-04 unverdicted novelty 5.0

RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
Crowdsourcing of Real-world Image Annotation via Visual Properties
cs.CV 2026-04 unverdicted novelty 5.0

An interactive crowdsourcing method applies visual property constraints and hierarchy-based dynamic questions to produce more consistent image annotations.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception
cs.CV 2026-05 unverdicted novelty 4.0

Generative texture synthesis from StyleGAN2 diversifies 3D pedestrian assets from a single base model, improving robustness in 2D object detection while exposing 3D perception models' sensitivity to geometric domain gaps.
AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes
cs.CV 2026-05 unverdicted novelty 4.0

AMIEOD combines a multi-expert enhancement module with detection-guided regression and selection losses to raise object detection accuracy in low-illumination images.
Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
cs.CV 2026-05 unverdicted novelty 4.0

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
Design and Implementation of BNN-Based Object Detection on FPGA
cs.AR 2026-05 unverdicted novelty 4.0

A BNN-based YOLOv3-tiny-like object detector with 1-bit weights and 8-bit activations is implemented in Verilog on FPGA, achieving 39.6% mAP50 on VOC and 0.999964 correlation with the ONNX model in RTL simulation.
Real-Time Frame- and Event-based Object Detection with Spiking Neural Networks on Edge Neuromorphic Hardware: Design, Deployment and Benchmark
cs.CV 2026-04 unverdicted novelty 4.0

SNNs deployed on Loihi 2 achieve real-time object detection with the lowest dynamic energy per inference and recover 87-100% of ANN accuracy via distillation-aware training.
Learning to count small and clustered objects with application to bacterial colonies
cs.CV 2026-04 unverdicted novelty 4.0

ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
cs.CV 2026-04 unverdicted novelty 4.0

MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality
cs.HC 2026-04 unverdicted novelty 4.0

Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
cs.AI 2026-03 unverdicted novelty 4.0

A survey organizes synthetic data use, digital twin simulation, and domain adaptation techniques for autonomous driving while identifying open challenges like Sim2Real transfer.
Design and Implementation of BNN-Based Object Detection on FPGA
cs.AR 2026-05 unverdicted novelty 3.0

A BNN-based YOLOv3-tiny object detector is implemented on FPGA achieving 39.6% mAP50 on VOC dataset with 0.098 GFLOPs and near-exact match to ONNX model in RTL simulation.
Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems
cs.CV 2026-04 unverdicted novelty 3.0

An enhanced YOLOv8 model with Ghost Module, CBAM, and DCNv2 achieves 95.4% mAP@0.5 on the KITTI dataset for vehicle detection, an 8.97% gain over the baseline.
YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
cs.CV 2026-04 unverdicted novelty 2.0

YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.
YOLOv11: An Overview of the Key Architectural Enhancements
cs.CV 2024-10 unverdicted novelty 1.0

YOLOv11 adds blocks such as C3k2, SPPF, and C2PSA to improve feature extraction, mAP, and efficiency while supporting detection, segmentation, pose, and oriented detection across model sizes.
