YOLOv3: An Incremental Improvement

Ali Farhadi; Joseph Redmon

arxiv: 1804.02767 · v1 · submitted 2018-04-08 · 💻 cs.CV

Pith reviewed 2026-05-13 11:33 UTC · model grok-4.3

keywords: object detection, real-time detection, YOLO, convolutional networks, accuracy speed tradeoff, incremental design

The pith

YOLOv3 reaches SSD-level accuracy three times faster through incremental design changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a series of small updates to the YOLO object detection model. These changes aim to boost accuracy without sacrificing the model's speed advantage. Results show the new version matches the accuracy of SSD at 320x320 resolution while running three times faster, and nearly equals RetinaNet's performance at four times the speed. A sympathetic reader would care because faster detection opens up more uses in real-time applications like video surveillance and autonomous systems. The work builds on prior YOLO versions by refining the network rather than overhauling it.

Core claim

YOLOv3 incorporates a number of little design changes and trains a new network that is slightly larger but more accurate. At 320 by 320 input it runs in 22 milliseconds with 28.2 mean average precision, matching SSD accuracy at three times the speed. On the .5 IOU metric it reaches 57.9 mAP in 51 milliseconds on a Titan X, close to RetinaNet's 57.5 mAP but 3.8 times faster.

What carries the argument

The updated YOLO network architecture with incremental design changes that improve feature handling and prediction accuracy while preserving fast inference.

If this is right

Object detection systems can process more frames per second on standard hardware.
Applications needing real-time performance gain better accuracy options without added latency.
Model refinement techniques prove effective for balancing speed and precision in detection tasks.
Similar incremental updates could extend the usable life of other detector families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These tweaks might transfer to other single-shot detectors to achieve comparable gains.
Testing on edge devices would reveal if the speed benefits hold in constrained environments.
Longer-term, this suggests focusing on optimization over architecture invention for practical gains.

Load-bearing premise

That the measured improvements come primarily from the described design changes rather than from specific training details or evaluation conditions.

What would settle it

Independent reproduction of the training and testing that yields significantly lower accuracy or slower inference times than reported.

Original abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

YOLOv3 is a straightforward set of tweaks that lift accuracy on standard detection benchmarks while keeping the speed edge, with code that makes the numbers checkable.

The main takeaway is that this version refines the prior YOLO model through several small architecture and training adjustments, producing higher mAP at comparable or better inference times. The concrete results are the 320x320 model at 22 ms with 28.2 mAP matching SSD but running three times faster, and the 57.9 mAP@50 in 51 ms on Titan X versus RetinaNet's 57.5 in 198 ms. These are direct measurements on COCO with external baselines, not derived claims.
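
The speed claim in the RetinaNet comparison can be checked with simple arithmetic on the quoted latencies. The sketch below uses only the millisecond figures from the abstract; the function names are illustrative, and the frames-per-second values are a naive conversion that ignores batching, I/O, and hardware differences.

```python
# Latencies quoted in the abstract, in milliseconds per image.
YOLOV3_MS_320 = 22.0        # YOLOv3 at 320x320 (28.2 mAP)
YOLOV3_MS_AP50 = 51.0       # YOLOv3 at the 57.9 mAP@50 operating point
RETINANET_MS_AP50 = 198.0   # RetinaNet at the 57.5 mAP@50 operating point


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def frames_per_second(latency_ms: float) -> float:
    """Naive throughput implied by a per-image latency."""
    return 1000.0 / latency_ms


ratio = speedup(RETINANET_MS_AP50, YOLOV3_MS_AP50)
print(f"RetinaNet / YOLOv3 latency ratio: {ratio:.2f}x")
print(f"YOLOv3 at 51 ms: {frames_per_second(YOLOV3_MS_AP50):.1f} FPS")
print(f"YOLOv3 at 22 ms: {frames_per_second(YOLOV3_MS_320):.1f} FPS")
```

The ratio comes out at roughly 3.88, which agrees with the quoted 3.8x if that figure was truncated rather than rounded.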

Referee Report

2 major / 3 minor

Summary. The manuscript presents YOLOv3 as an incremental update to prior YOLO detectors, incorporating design changes including a Darknet-53 backbone, multi-scale feature prediction via a feature-pyramid-like structure, and logistic classifiers for independent class predictions. It reports direct empirical measurements on COCO, claiming that at 320x320 input YOLOv3 reaches 28.2 mAP in 22 ms (matching SSD accuracy at 3x speed) and 57.9 mAP@0.5 in 51 ms on Titan X (comparable to RetinaNet's 57.5 mAP@0.5 in 198 ms, at 3.8x speed). All code is released for verification.
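
To make the "logistic classifiers for independent class predictions" point concrete, here is a minimal sketch contrasting a softmax over classes with one independent sigmoid per class. The labels, logits, and the 0.5 decision threshold are hypothetical values chosen for illustration; they are not taken from the paper or its code.

```python
import math


def softmax(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def independent_logistic(logits):
    """One independent probability per class; values need not sum to 1."""
    return [sigmoid(x) for x in logits]


# Hypothetical overlapping labels and raw scores for a single box.
labels = ["person", "woman", "dog"]
logits = [3.0, 2.5, -4.0]
THRESHOLD = 0.5  # assumed decision threshold

soft = softmax(logits)
indep = independent_logistic(logits)

softmax_labels = [name for name, p in zip(labels, soft) if p > THRESHOLD]
logistic_labels = [name for name, p in zip(labels, indep) if p > THRESHOLD]

print("softmax keeps:", softmax_labels)
print("independent logistic keeps:", logistic_labels)
```

With these inputs the softmax forces the two overlapping labels to compete, so only one clears the threshold, while the independent logistic outputs let both through. That is the practical difference when a dataset contains labels that are not mutually exclusive.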

Significance. If the reported timings and mAP values hold under the released code, the work supplies a strong, practical real-time detector baseline that improves the speed-accuracy trade-off over prior single-stage methods. The open-source release and direct external comparisons add substantial value for reproducibility and follow-on research in computer vision.

major comments (2)

[Experiments] Experiments section: no ablation studies isolate the contribution of individual changes (e.g., backbone swap, multi-scale heads, or logistic vs. softmax classification) to the measured mAP gains; without them the central claim that the listed incremental updates are responsible for the accuracy improvements remains correlational.
[Results] Results paragraph on RetinaNet comparison: the 57.9 vs. 57.5 mAP@0.5 numbers are reported, yet the paper does not provide the corresponding mAP@[.5:.95] figures for both models on the same split, weakening the direct performance equivalence claim under the standard COCO metric.
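
The distinction between mAP@0.5 and mAP@[.5:.95] in the second comment comes down to how strictly a predicted box must overlap the ground truth. The sketch below computes intersection over union (IOU) for one hypothetical box pair and counts how many of the ten COCO-style thresholds it clears. The box coordinates are invented for illustration, and a real evaluator also handles confidence ranking and one-to-one matching, which are omitted here.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IOU thresholds 0.50, 0.55, ..., 0.95 used by the averaged metric.
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

# Hypothetical boxes: the prediction is shifted 10 px in x and y.
ground_truth = (0.0, 0.0, 100.0, 100.0)
prediction = (10.0, 10.0, 110.0, 110.0)

overlap = iou(ground_truth, prediction)
passed = [t for t in THRESHOLDS if overlap >= t]

print(f"IOU = {overlap:.4f}")
print(f"counts as a match at {len(passed)} of {len(THRESHOLDS)} thresholds: {passed}")
```

A slightly misaligned box like this one is a full match under the 0.5 criterion but fails the stricter thresholds, which is why two detectors with similar mAP@0.5 can still differ on the averaged metric.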

minor comments (3)

[Abstract] Abstract and introduction contain informal phrasing (e.g., 'a bunch of little design changes', 'pretty swell') that should be revised to match journal standards.
[Training] The manuscript would benefit from explicit statements of the exact training schedule, data augmentations, and optimizer settings used to obtain the quoted mAP numbers, even though code is released.
[Architecture] Figure 1 (network diagram) lacks layer-by-layer channel counts or residual-block details, making it harder to verify architectural differences from YOLOv2 without inspecting the code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments on the experimental presentation. We respond to each major comment below.

Referee: [Experiments] Experiments section: no ablation studies isolate the contribution of individual changes (e.g., backbone swap, multi-scale heads, or logistic vs. softmax classification) to the measured mAP gains; without them the central claim that the listed incremental updates are responsible for the accuracy improvements remains correlational.

Authors: We agree that ablation studies would provide stronger causal evidence for the contribution of each design change. This manuscript, however, presents YOLOv3 as a practical, incremental update whose primary goal is to deliver a strong real-time baseline with released code. Each modification is described and motivated by prior work, and the overall system is validated through direct COCO comparisons. We did not run the additional controlled ablations in the original study, and incorporating them would require substantial new training that falls outside the scope of this short incremental paper. We therefore do not plan to add ablation experiments in the revision.
revision: no

Referee: [Results] Results paragraph on RetinaNet comparison: the 57.9 vs. 57.5 mAP@0.5 numbers are reported, yet the paper does not provide the corresponding mAP@[.5:.95] figures for both models on the same split, weakening the direct performance equivalence claim under the standard COCO metric.

Authors: We acknowledge that reporting mAP@[.5:.95] would allow a fuller comparison under the primary COCO metric. The equivalence statement in the manuscript is explicitly tied to the mAP@0.5 numbers published by the RetinaNet authors, which is the metric they highlighted for that speed-accuracy operating point. In the revision we will add YOLOv3’s mAP@[.5:.95] result for completeness and will clarify that the direct numerical comparison remains under mAP@0.5 because we rely on the originally reported RetinaNet figures rather than re-evaluating their model on an identical split.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarks only

The paper reports direct empirical measurements of accuracy (mAP) and inference speed on standard COCO benchmarks, with comparisons to independently published external models (SSD, RetinaNet). Design changes are described narratively and their effects are measured experimentally rather than derived. No mathematical equations, predictions, or uniqueness theorems appear that could reduce to self-fitted inputs or self-citations by construction. Released code further supports external reproducibility, keeping the central claims independent of any internal circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical evaluation of an updated convolutional network on standard object detection benchmarks; no new mathematical derivations or invented physical entities are introduced.

free parameters (1)

network architecture hyperparameters
Specific layer counts, filter sizes, and training schedule choices that define the 'little bigger' network.

axioms (1)

domain assumption: Standard mAP and mAP@50 metrics on COCO or similar benchmarks are appropriate proxies for detection quality.
Invoked when reporting and comparing accuracy figures.
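
For readers unfamiliar with how the metric named in this assumption is built, the sketch below computes average precision (AP) for one class from a ranked list of detections, using all-point interpolation of the precision-recall curve. It is a simplified illustration, not the official COCO evaluator: it assumes each detection has already been marked as a true or false positive at a fixed IOU threshold, and the example flags are made up. mAP is then the mean of such AP values over classes.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one class from detections sorted by descending confidence.

    is_true_positive[i] is True when the i-th ranked detection matched a
    previously unmatched ground-truth box at the chosen IOU threshold.
    """
    if num_ground_truth == 0 or not is_true_positive:
        return 0.0

    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Interpolate: each precision becomes the best precision at this or
    # any later rank, giving a non-increasing envelope.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical ranked detections for one class with 2 ground-truth boxes.
flags = [True, False, True]
ap_example = average_precision(flags, num_ground_truth=2)
print(f"AP = {ap_example:.4f}")
```

In this example the false positive at rank two lowers the precision at full recall to 2/3, so the AP is 0.5 * 1 + 0.5 * 2/3.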

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet"

Foundation.DimensionForcing · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We present some updates to YOLO! We made a bunch of little design changes to make it better."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
cs.RO 2024-03 accept novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection
cs.CV 2026-05 unverdicted novelty 7.0

Introduces a differentiable Fourier coefficient representation for generating robust physical adversarial shapes that evade infrared object detectors with over 88% success at long range.
Language Prompt vs. Image Enhancement: Boosting Object Detection With CLIP in Hazy Environments
cs.CV 2026-04 unverdicted novelty 7.0

CLIP language prompts guide a new weighted cross-entropy loss (CLIP-CE via AME and FAME) to boost object detection performance in hazy images, outperforming image enhancement baselines on the introduced HazyCOCO dataset.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
cs.CV 2026-04 unverdicted novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
cs.AI 2025-11 unverdicted novelty 7.0

Thermally activated clothing with thermochromic dyes and heaters creates dynamic adversarial patterns that evade AI surveillance in visible and infrared modalities while appearing ordinary when inactive.
The Indirect Convolution Algorithm
cs.CV 2019-07 unverdicted novelty 7.0

The Indirect Convolution algorithm avoids im2col by using an indirection buffer, reducing memory overhead proportionally to input channels and outperforming GEMM-based methods by up to 62% for convolutions requiring t...
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
cs.CV 2026-05 unverdicted novelty 6.0

Introduces the first interpretable framework for suicide risk assessment (SRA) from metro station surveillance videos, achieving 83.2% ROC-AUC via person tracking, activity recognition, semantic segmentation, and traj...
Towards Continuous Sign Language Conversation from Isolated Signs
cs.CV 2026-05 unverdicted novelty 6.0

Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours
cs.CV 2026-05 unverdicted novelty 6.0

FS-FSD regresses frequency-supervised Fourier contours for bridge defects, yielding higher polygon accuracy and better geometric quality than box, mask, or contour baselines on 3,767 UAV images with 42,346 instances.
UniISP: A Unified ISP Framework for Both Human and Machine Vision
cs.CV 2026-05 unverdicted novelty 6.0

UniISP unifies ISP processing with a Hybrid Attention Module and Feature Adapter to produce images that are both visually pleasing for humans and informative for computer vision models.
Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
cs.CV 2026-04 unverdicted novelty 6.0

TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE
cs.CV 2026-04 unverdicted novelty 6.0

IA-CLAHE trains a lightweight network on a differentiable CLAHE extension to predict per-tile clip limits that drive local histograms toward a uniform distribution, delivering zero-shot gains in recognition accuracy a...
DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
cs.CV 2026-04 unverdicted novelty 6.0

DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4....
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
cs.CV 2026-03 conditional novelty 6.0

Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
cs.CV 2026-03 unverdicted novelty 6.0

A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.
BicKD: Bilateral Contrastive Knowledge Distillation
cs.LG 2026-02 unverdicted novelty 6.0

BicKD introduces a bilateral contrastive loss in knowledge distillation that strengthens class-wise orthogonality and intra-class consistency in predictive distributions, outperforming prior logit-based methods.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
cs.CV 2022-03 conditional novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
YOLOX: Exceeding YOLO Series in 2021
cs.CV 2021-07 accept novelty 6.0

YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
Reprojection R-CNN: A Fast and Accurate Object Detector for 360° Images
cs.CV 2019-07 unverdicted novelty 6.0

Reprojection R-CNN is a two-stage detector for 360° images combining a distortion-aware spherical RPN on ERP with a reprojection network on perspective projections, reporting higher mAP than prior methods on two new s...
DeFog: Fog Computing Benchmarks
cs.DC 2019-07 unverdicted novelty 6.0

DeFog introduces the first standardized benchmarking suite and metric catalogue for comparing cloud, edge, and hybrid deployments of six edge-conducive applications in fog computing.
DetectFusion: Detecting and Segmenting Both Known and Unknown Dynamic Objects in Real-time SLAM
cs.CV 2019-07 unverdicted novelty 6.0

DetectFusion combines 2D object detection with 3D geometric segmentation to handle both known and unknown dynamic objects in real-time RGB-D SLAM at about 20 FPS.
A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
cs.DC 2019-07 unverdicted novelty 6.0

A unified IR plus ML-based scheduling for CNN inference on multi-vendor integrated GPUs matches or exceeds vendor libraries (up to 1.62x) on image models while supporting more models.
Mixed-Signal Charge-Domain Acceleration of Deep Neural networks through Interleaved Bit-Partitioned Arithmetic
cs.AR 2019-06 unverdicted novelty 6.0

The authors propose BIHIWE, a microarchitecture that accelerates DNN dot-products by bit-partitioning them into spatially parallel low-bitwidth MAC units operating in the charge domain and sharing A/D converters.
On Physical Adversarial Patches for Object Detection
cs.CV 2019-06 unverdicted novelty 6.0

A physical patch suppresses all object detections by YOLOv3 even for distant objects without overlapping them.
A Robust Deep Learning Framework for Prominence Detection through Composite Feature Representations
astro-ph.SR 2026-05 unverdicted novelty 5.0

Composite three-channel preprocessing of SDO/AIA images yields a YOLOv5 prominence detector with mAP@50 of 0.749 and 78% recall that also generalizes to SUVI data.
Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework
cs.CV 2026-05 unverdicted novelty 5.0

JMOF is a new optimization framework for physical adversarial attacks that improves cross-model transferability and enables simultaneous attacks on multiple vision tasks such as object detection and semantic segmentation.
Multimodal Object Detection Under Sparse Forest-Canopy Occlusion
cs.CV 2026-05 unverdicted novelty 5.0

A proof-of-concept multimodal pipeline using LiDAR, visible-thermal fusion, AOS, and fine-tuned YOLOv5 reports mAP of ~0.83 on thermal classes and notes limited LiDAR penetration plus improved visibility from fusion a...
LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
cs.CV 2026-04 unverdicted novelty 5.0

RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
Crowdsourcing of Real-world Image Annotation via Visual Properties
cs.CV 2026-04 unverdicted novelty 5.0

An interactive crowdsourcing method applies visual property constraints and hierarchy-based dynamic questions to produce more consistent image annotations.
SynSpill: Improved Industrial Spill Detection With Synthetic Data
cs.CV 2025-08 conditional novelty 5.0

SynSpill synthetic data enables PEFT of VLMs and boosts YOLO and DETR detectors for industrial spill detection, making their performance comparable after training.
Wan: Open and Advanced Large-Scale Video Generative Models
cs.CV 2025-03 unverdicted novelty 5.0

Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.
YOLOv4: Optimal Speed and Accuracy of Object Detection
cs.CV 2020-04 unverdicted novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
Towards Adversarially Robust Object Detection
cs.CV 2019-07 unverdicted novelty 5.0

Develops a multi-task learning based adversarial training approach to improve robustness of object detectors to adversarial attacks, with experiments on PASCAL-VOC and MS-COCO.
AVDNet: A Small-Sized Vehicle Detection Network for Aerial Visual Data
cs.CV 2019-07 unverdicted novelty 5.0

AVDNet adds ConvRes residual blocks and larger output maps to a one-stage detector for small aerial vehicles, reports mAP gains on VEDAI, DLR-3K, DOTA and new ABD dataset, and introduces RFAV visualization.
Underexposed Image Correction via Hybrid Priors Navigated Deep Propagation
cs.CV 2019-07 unverdicted novelty 5.0

Hybrid-prior deep propagation model for underexposed image correction that integrates physical principles and data distributions to adjust reflectance and illumination.
Cascade RetinaNet: Maintaining Consistency for Single-Stage Object Detection
cs.CV 2019-07 unverdicted novelty 5.0

Cas-RetinaNet improves RetinaNet by 2 AP on MS COCO by training cascade stages on rising IoU thresholds and adding a Feature Consistency Module to align classification confidence with localization accuracy.
Metamorphic Detection of Adversarial Examples in Deep Learning Models With Affine Transformations
cs.CV 2019-07 unverdicted novelty 5.0

The authors propose using metamorphic relations based on distance ratio preserving affine transformations to detect whether an input image is adversarial with high accuracy.
GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection
cs.CV 2026-05 unverdicted novelty 4.0

GSA-YOLO modifies YOLOv8n with structured sparsity via Group Lasso and Sparse Structure Selection plus Adaptive Knowledge Distillation, reporting 189.62 FPS and mAP50:95 gains of 2.4% and 1.8% on HiXray and PIDray datasets.
Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception
cs.CV 2026-05 unverdicted novelty 4.0

Generative texture synthesis from StyleGAN2 diversifies 3D pedestrian assets from a single base model, improving robustness in 2D object detection while exposing 3D perception models' sensitivity to geometric domain gaps.
AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes
cs.CV 2026-05 unverdicted novelty 4.0

AMIEOD combines a multi-expert enhancement module with detection-guided regression and selection losses to raise object detection accuracy in low-illumination images.
Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
cs.CV 2026-05 unverdicted novelty 4.0

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
Design and Implementation of BNN-Based Object Detection on FPGA
cs.AR 2026-05 unverdicted novelty 4.0

A BNN-based YOLOv3-tiny-like object detector with 1-bit weights and 8-bit activations is implemented in Verilog on FPGA, achieving 39.6% mAP50 on VOC and 0.999964 correlation with the ONNX model in RTL simulation.
Real-Time Frame- and Event-based Object Detection with Spiking Neural Networks on Edge Neuromorphic Hardware: Design, Deployment and Benchmark
cs.CV 2026-04 unverdicted novelty 4.0

SNNs deployed on Loihi 2 achieve real-time object detection with the lowest dynamic energy per inference and recover 87-100% of ANN accuracy via distillation-aware training.
Learning to count small and clustered objects with application to bacterial colonies
cs.CV 2026-04 unverdicted novelty 4.0

ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
cs.CV 2026-04 unverdicted novelty 4.0

MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality
cs.HC 2026-04 unverdicted novelty 4.0

Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
cs.AI 2026-03 unverdicted novelty 4.0

A survey organizes synthetic data use, digital twin simulation, and domain adaptation techniques for autonomous driving while identifying open challenges like Sim2Real transfer.
New VVC profiles targeting Feature Coding for Machines
cs.CV 2025-12 unverdicted novelty 4.0

Three lightweight VVC profiles for feature coding achieve up to 2.96% BD-Rate gain and 95.6% encoding speedup while preserving downstream task accuracy under the MPEG-AI FCM framework.
DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection
cs.CV 2025-12 unverdicted novelty 4.0

DFIR-DETR augments RT-DETR with frequency-domain iterative refinement and dynamic feature aggregation, reporting 92.9% mAP50 on NEU-DET and 51.6% on VisDrone at 11.7M parameters and 47.2 GFLOPs.
Human Pose Estimation for Real-World Crowded Scenarios
cs.CV 2019-07 unverdicted novelty 4.0

Pose estimation in crowds improves by 4.7% AP via COCO-based occlusion augmentation, explicit occlusion prediction branches, and an extended JTA dataset with higher density and variety.
Multi-Cue Vehicle Detection for Semantic Video Compression In Georegistered Aerial Videos
cs.CV 2019-07 unverdicted novelty 4.0

A multi-cue pipeline combining deep learning appearance detection and flux tensor spatio-temporal filtering achieves high-precision moving vehicle detection in aerial videos while enabling over 100:1 semantic compression.
Voxel-FPN: multi-scale voxel feature aggregation in 3D object detection from point clouds
cs.CV 2019-06 unverdicted novelty 4.0

Voxel-FPN proposes an encoder-decoder architecture for multi-scale voxel feature aggregation in one-stage 3D object detection from point clouds, reporting competitive speed and accuracy on KITTI-3D.
EdgeLens: Deep Learning based Object Detection in Integrated IoT, Fog and Cloud Computing Environments
cs.DC 2019-06 unverdicted novelty 4.0

EdgeLens is a framework that integrates deep learning object detection with fog-cloud environments to adapt between high-accuracy and low-latency service modes.
Design and Implementation of BNN-Based Object Detection on FPGA
cs.AR 2026-05 unverdicted novelty 3.0

A BNN-based YOLOv3-tiny object detector is implemented on FPGA achieving 39.6% mAP50 on VOC dataset with 0.098 GFLOPs and near-exact match to ONNX model in RTL simulation.
Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems
cs.CV 2026-04 unverdicted novelty 3.0

An enhanced YOLOv8 model with Ghost Module, CBAM, and DCNv2 achieves 95.4% mAP@0.5 on the KITTI dataset for vehicle detection, an 8.97% gain over the baseline.
DeepTEGINN: Deep Learning Based Tools to Extract Graphs from Images of Neural Networks
cs.CV 2019-07 unverdicted novelty 3.0

DeepTEGINN is a deep learning toolbox combining image processing and graph theory to automate graph extraction from brain tissue images as an alternative to manual tracing.
YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection
cs.CV 2026-04 unverdicted novelty 2.0

YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 60 Pith papers

[1]

Wikipedia, Mar 2018

Analogy. Wikipedia, Mar 2018. 1

work page 2018
[2]

Everingham, L

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) chal- lenge. International journal of computer vision , 88(2):303– 338, 2010. 6

work page 2010
[3]

C.-Y . Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017. 3

work page Pith review arXiv 2017
[4]

IQA: Visual Question Answering in Interactive Environments

D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316, 2017. 1

work page Pith review arXiv 2017
[5]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 770–778, 2016. 3

work page 2016
[6]

Huang, V

J. Huang, V . Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y . Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. 3

work page
[7]

Krasin, T

I. Krasin, T. Duerig, N. Alldrin, V . Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V . Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Open- images: A public dataset for large-scale multi-label and multi-class image classiﬁcation. Dataset available from https://github.com/...

work page 2017
[8]

T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 2, 3

work page 2017
[9]

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017. 1, 3, 4

work page Pith review arXiv 2017
[10]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 2

work page 2014
[11]

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 3

work page 2016
[12]

I. Newton. Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd., London, 1687. 1

work page
[13]

J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. 2017. 4

work page 2017
[14]

J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016. 3

work page 2013
[15]

J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6517–6525. IEEE, 2017. 1, 2, 3

work page 2017
[16]

J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018. 4

work page 2018
[17]

S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015. 2

work page Pith review arXiv 2015
[18]

O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2131, 2015. 4

work page 2015
[19]

M. Scott. Smart camera gimbal bot scanlime:027, Dec 2017. 4

work page 2017
[20]

Beyond Skip Connections: Top-Down Modulation for Object Detection

A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016. 3

work page Pith review arXiv 2016
[21]

C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. 2017. 3

Figure 4 (two panels: mAP 50 against execution time in ms, and FPS against mAP 50; YOLOv3 compared with "All the other slow ones"). Zero-axis charts are probably more int...

work page 2017
