YOLOX: Exceeding YOLO Series in 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun · 2021 · cs.CV · arXiv 2107.08430

Canonical reference. 70% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background: 70% of classified citations

abstract

In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector -- YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4-CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.

citation-role summary

background 7 · method 2 · baseline 1

citation-polarity summary

background 7 · use method 2 · baseline 1

representative citing papers

NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research

cs.CV · 2026-05-13 · conditional · novelty 7.0

NERVE is a new 600GB multi-sensor dataset with DVS, RGB-D, and 24/77GHz radar plus baselines showing DVS+77GHz radar fusion improves human detection to 47.5% mAP with sub-1.8m distance error.

Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

CUTAL scores multi-frame clips for uncertainty and enforces temporal diversity to train transformer MOT models to near full-supervision performance with 50% of the labels.

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.

Bounding-Box Trajectories Matter for Video Anomaly Detection

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

TrajVAD shows that bounding-box trajectories modeled via normalizing flows can serve as a primary cue for video anomaly detection, with the trajectory-only variant achieving 87.7% AP on ShanghaiTech and best results on MSAD.

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

cs.CV · 2026-05-19 · conditional · novelty 6.0

Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.

A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking

cs.RO · 2026-05-15 · unverdicted · novelty 6.0

A deterministic queue-based matching algorithm using geometric overlaps and virtual lane discretization enables 99.8% handover success rate for continuous identity persistence in multi-UAV vehicle tracking.

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.

CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested datasets.

FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.

GateMOT: Q-Gated Attention for Dense Object Tracking

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras

cs.CV · 2026-04-18 · unverdicted · novelty 6.0

CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.

Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.

Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection

cs.CV · 2026-03-16 · conditional · novelty 6.0

Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.

Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework

cs.CV · 2026-01-27 · unverdicted · novelty 6.0

Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.

AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization

cs.CV · 2025-03-05 · unverdicted · novelty 6.0

AHCQ-SAM introduces ACNR, HLUQ, CAG, and LNQ quantization techniques that deliver 15.2% mAP gain on 4-bit SAM-B and 14.01% J&F gain on 4-bit SAM2-Tiny versus prior PTQ methods.

Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

cs.CV · 2024-11-13 · unverdicted · novelty 6.0

Dual-head knowledge distillation partitions the linear classifier into separate heads for logit and probability losses to exploit logits without causing classification head collapse.

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

cs.CV · 2022-03-07 · conditional · novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

cs.CV · 2026-05-14 · conditional · novelty 5.0

MR2-ByteTrack maintains high accuracy in video object detection on MCUs by combining multi-resolution processing, ByteTrack for frame linking, and Rescore for confidence aggregation, achieving up to 55% energy savings and real-time performance for both CNN and Transformer models.

Portable Active Learning for Object Detection

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

PAL is a portable active learning method for object detection that uses class-specific logistic classifiers for uncertainty and image-level diversity to select annotation batches, showing better label efficiency than baselines on COCO, VOC, and BDD100K.

Utility-Aware Progressive Inference over UDP Packet Blocks for Emergency Communications

eess.SP · 2026-05-11 · unverdicted · novelty 5.0

Utility-aware progressive inference on UDP packet blocks enables early hazard recognition, reducing packet budget by 34.2% and decision delay by 1209 ms while retaining 91.5% of full-reception accuracy.

citing papers explorer

Showing 40 of 40 citing papers.

NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research cs.CV · 2026-05-13 · conditional · none · ref 26 · internal anchor
NERVE is a new 600GB multi-sensor dataset with DVS, RGB-D, and 24/77GHz radar plus baselines showing DVS+77GHz radar fusion improves human detection to 47.5% mAP with sub-1.8m distance error.
Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking cs.CV · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
CUTAL scores multi-frame clips for uncertainty and enforces temporal diversity to train transformer MOT models to near full-supervision performance with 50% of the labels.
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World cs.CV · 2026-05-06 · unverdicted · none · ref 18 · internal anchor
LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 53 · 3 links · internal anchor
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects cs.CV · 2026-04-09 · unverdicted · none · ref 33 · internal anchor
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
Bounding-Box Trajectories Matter for Video Anomaly Detection cs.CV · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
TrajVAD shows that bounding-box trajectories modeled via normalizing flows can serve as a primary cue for video anomaly detection, with the trajectory-only variant achieving 87.7% AP on ShanghaiTech and best results on MSAD.
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification cs.CV · 2026-05-19 · conditional · none · ref 29 · internal anchor
Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.
SparseSAM: Structured Sparsification of Activations in Segment Anything Models cs.CV · 2026-05-17 · unverdicted · none · ref 8 · internal anchor
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking cs.RO · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
A deterministic queue-based matching algorithm using geometric overlaps and virtual lane discretization enables 99.8% handover success rate for continuous identity persistence in multi-UAV vehicle tracking.
A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline cs.CV · 2026-05-12 · unverdicted · none · ref 60 · internal anchor
Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.
CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking cs.CV · 2026-05-10 · unverdicted · none · ref 22 · internal anchor
CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested datasets.
FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging cs.CV · 2026-04-30 · unverdicted · none · ref 47 · internal anchor
FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 363-image dataset.
GateMOT: Q-Gated Attention for Dense Object Tracking cs.CV · 2026-04-29 · unverdicted · none · ref 24 · internal anchor
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras cs.CV · 2026-04-18 · unverdicted · none · ref 33 · internal anchor
CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CV · 2026-04-13 · unverdicted · none · ref 11 · internal anchor
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection cs.CV · 2026-03-16 · conditional · none · ref 13 · internal anchor
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework cs.CV · 2026-01-27 · unverdicted · none · ref 12 · internal anchor
Presents the first management-oriented multi-modal benchmark GovLA-10K and a vision-language reasoning framework GovLA-Reasoner with a spatially-aware adapter for low-altitude aerial perception.
AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization cs.CV · 2025-03-05 · unverdicted · none · ref 7 · internal anchor
AHCQ-SAM introduces ACNR, HLUQ, CAG, and LNQ quantization techniques that deliver 15.2% mAP gain on 4-bit SAM-B and 14.01% J&F gain on 4-bit SAM2-Tiny versus prior PTQ methods.
Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head cs.CV · 2024-11-13 · unverdicted · none · ref 7 · internal anchor
Dual-head knowledge distillation partitions the linear classifier into separate heads for logit and probability losses to exploit logits without causing classification head collapse.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection cs.CV · 2022-03-07 · conditional · none · ref 12 · internal anchor
DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection cs.CV · 2026-05-20 · unverdicted · none · ref 133 · internal anchor
STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.
MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes cs.CV · 2026-05-14 · conditional · none · ref 19 · internal anchor
MR2-ByteTrack maintains high accuracy in video object detection on MCUs by combining multi-resolution processing, ByteTrack for frame linking, and Rescore for confidence aggregation, achieving up to 55% energy savings and real-time performance for both CNN and Transformer models.
Portable Active Learning for Object Detection cs.CV · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
PAL is a portable active learning method for object detection that uses class-specific logistic classifiers for uncertainty and image-level diversity to select annotation batches, showing better label efficiency than baselines on COCO, VOC, and BDD100K.
Utility-Aware Progressive Inference over UDP Packet Blocks for Emergency Communications eess.SP · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
Utility-aware progressive inference on UDP packet blocks enables early hazard recognition, reducing packet budget by 34.2% and decision delay by 1209 ms while retaining 91.5% of full-reception accuracy.
SAMOFT: Robust Multi-Object Tracking via Region and Flow cs.CV · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
SAMOFT improves multi-object tracking by using SAM segmentation and optical flow for pixel-level motion matching, flexible centroid correction, and training-free motion pattern fixes on top of standard Kalman and ReID baselines.
Time-series Meets Complex Motion Modeling: Robust and Computational-effective Motion Predictor for Multi-object Tracking cs.CV · 2026-05-01 · unverdicted · none · ref 12 · internal anchor
TCMP achieves SOTA MOT metrics (HOTA 63.4%, IDF1 65.0%, AssA 49.1%) with 0.014x parameters and 0.05x FLOPs of the previous best method by using a simple dilated TCN regressor.
SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance cs.CV · 2026-04-15 · unverdicted · none · ref 16 · internal anchor
SocialMirror reconstructs 3D meshes of closely interacting humans from monocular videos using semantic guidance from vision-language models and geometric constraints in a diffusion model to handle occlusions and maintain temporal and spatial consistency.
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking cs.CV · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG cs.CL · 2026-04-13 · unverdicted · none · ref 15 · internal anchor
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
Enhancing Event-based Object Detection with Monocular Normal Maps cs.CV · 2025-08-04 · unverdicted · none · ref 9 · 2 links · internal anchor
NRE-Net adds geometric priors from RGB-derived normal maps to RGB and event data via ADFM and EAFM fusion modules, reporting 3% AP50 gains over dual-modal baselines on driving datasets.
Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification cs.CV · 2025-07-15 · unverdicted · none · ref 40 · internal anchor
A cross-verification strategy using three YOLO models trained on distinct views of a 2134-sample 3D GPR dataset detects road subsurface distress with over 98.6 percent recall on field data.
HunyuanVideo: A Systematic Framework For Large Video Generative Models cs.CV · 2024-12-03 · unverdicted · none · ref 24 · internal anchor
HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.
Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning cs.RO · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
A dual-LLM hierarchical framework for robotic task and motion planning, integrating object detection, achieves 86% success across 24 test scenarios ranging from simple spatial commands to infeasible requests.
Hybrid Visual Telemetry for Bandwidth-Constrained Robotic Vision: A Pilot Study with HEVC Base Video and JPEG ROI Stills cs.CV · 2026-05-03 · unverdicted · none · ref 7 · internal anchor
A hybrid scheme using HEVC video for continuous awareness plus selective JPEG ROI stills for detail refinement is formalized and experimentally compared to video-only transmission under matched bitrate budgets for robotic vision tasks.
Fast Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation cs.CV · 2026-04-16 · unverdicted · none · ref 4 · internal anchor
An efficient implementation of a Bayes-optimal filter performs fast 3D multi-camera tracking and pose estimation from 2D inputs while handling intermittent camera disconnections.
InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard cs.AR · 2026-04-02 · unverdicted · none · ref 5 · internal anchor
InsightBoard integrates synchronized multi-metric plots, correlation analysis, and group fairness indicators into TensorBoard to reveal subgroup disparities that aggregate metrics hide during model training.
World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 24 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview cs.CV · 2026-04-14 · unverdicted · none · ref 10 · internal anchor
The report overviews five maritime computer vision benchmark challenges, their datasets, protocols, quantitative results, and top team approaches from the MaCVi 2026 workshop.
YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection cs.CV · 2026-04-03 · unverdicted · none · ref 13 · internal anchor
YOLOv11 delivers higher mean average precision on standard benchmarks than prior YOLO versions while keeping real-time inference speed through C3K2, SPPF, and C2PSA modules.
Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection cs.CV · 2026-05-21 · unreviewed · ref 57 · internal anchor
