OD3 presents an optimization-free dataset distillation framework for object detection that reports new state-of-the-art accuracy on COCO and VOC at compression ratios from 0.25% to 5%.
hub
Girshick, and Jian Sun
25 Pith papers cite this work. Polarity classification is still indexing.
abstract
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A dual-stream vision transformer with modality-aware gated exchange and bidirectional token exchange fuses RGB, thermal, and event data to improve UAV vehicle detection over dual-modal baselines on a new 10,489-frame dataset.
Presents an end-to-end multitask CNN with FPN, dynamic RoI pooling, and convolutional attention for simultaneous lexicon-free text localization and recognition in complex images.
Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.
TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
PASTA enables patch-agnostic backdoor activation in ViTs via multi-location trigger insertion during training and bi-level optimization, achieving 99.13% average attack success with large gains in visual/attention stealthiness and defense robustness.
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4.1% more parameters.
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
Three lightweight VVC profiles for feature coding achieve up to 2.96% BD-Rate gain and 95.6% encoding speedup while preserving downstream task accuracy under the MPEG-AI FCM framework.
GarmNet jointly localizes garments and detects grasp landmarks on the CloPeMa dataset, reducing localization error by 24.7% when landmark detection is included.
A two-stage weakly supervised pipeline pretrains on auto-generated school labels from sparse points and fine-tunes on only 50 manual examples to achieve strong detection performance in aerial imagery.
A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher model to task-specific students.
KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a pilot test on 459 chromosomes.
ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
Virtual KITTI 2 supplies synthetic clones of real KITTI driving sequences with added weather and camera variants and multi-modal ground-truth annotations for autonomous driving vision research.
YOLOv3 achieves accuracy comparable to SSD and RetinaNet but runs substantially faster, with 28.2 mAP at 320x320 in 22 ms and 57.9 mAP@50 in 51 ms on Titan X.
A ROS-enabled V2I digital twin architecture integrates on-board stereo perception with an Unreal Engine 5 replica to deliver real-time safety monitoring for autonomous vehicles.
YOLO11n achieves the highest mAP@0.5:0.95 of 0.6065 for apple localization, with other detectors showing trade-offs in recall and precision at low confidence thresholds.
A survey of RGB-D object detection from traditional hand-crafted features with machine learning to deep learning techniques.
citing papers explorer
-
OD3: Optimization-free Dataset Distillation for Object Detection
OD3 presents an optimization-free dataset distillation framework for object detection that reports new state-of-the-art accuracy on COCO and VOC at compression ratios from 0.25% to 5%.
-
Tri-Modal Fusion Transformers for UAV-based Object Detection
A dual-stream vision transformer with modality-aware gated exchange and bidirectional token exchange fuses RGB, thermal, and event data to improve UAV vehicle detection over dual-modal baselines on a new 10,489-frame dataset.
-
A Multitask Network for Localization and Recognition of Text in Images
Presents an end-to-end multitask CNN with FPN, dynamic RoI pooling, and convolutional attention for simultaneous lexicon-free text localization and recognition in complex images.
-
Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations
Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.
-
Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
-
PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers
PASTA enables patch-agnostic backdoor activation in ViTs via multi-location trigger insertion during training and bi-level optimization, achieving 99.13% average attack success with large gains in visual/attention stealthiness and defense robustness.
-
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
-
DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
DroneScan-YOLO reaches 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET by combining 1280x1280 input, RPA-Block pruning, MSFD stride-4 branch, and SAL-NWD loss, beating YOLOv8s by 16.6 and 12.3 points with only 4.1% more parameters.
-
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
-
Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
-
New VVC profiles targeting Feature Coding for Machines
Three lightweight VVC profiles for feature coding achieve up to 2.96% BD-Rate gain and 95.6% encoding speedup while preserving downstream task accuracy under the MPEG-AI FCM framework.
-
GarmNet: Improving Global with Local Perception for Robotic Laundry Folding
GarmNet jointly localizes garments and detects grasp landmarks on the CloPeMa dataset, reducing localization error by 24.7% when landmark detection is included.
-
Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning
A two-stage weakly supervised pipeline pretrains on auto-generated school labels from sparse points and fine-tunes on only 50 manual examples to achieve strong detection performance in aerial imagery.
-
Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection
A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher model to task-specific students.
-
KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment
KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a pilot test on 459 chromosomes.
-
Learning to count small and clustered objects with application to bacterial colonies
ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.
-
Virtual KITTI 2
Virtual KITTI 2 supplies synthetic clones of real KITTI driving sequences with added weather and camera variants and multi-modal ground-truth annotations for autonomous driving vision research.
-
YOLOv3: An Incremental Improvement
YOLOv3 achieves accuracy comparable to SSD and RetinaNet but runs substantially faster, with 28.2 mAP at 320x320 in 22 ms and 57.9 mAP@50 in 51 ms on Titan X.
-
CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology
A ROS-enabled V2I digital twin architecture integrates on-board stereo perception with an Unreal Engine 5 replica to deliver real-time safety monitoring for autonomous vehicles.
-
A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery
YOLO11n achieves the highest mAP@0.5:0.95 of 0.6065 for apple localization, with other detectors showing trade-offs in recall and precision at low confidence thresholds.
-
RGB-D image-based Object Detection: from Traditional Methods to Deep Learning Techniques
A survey of RGB-D object detection from traditional hand-crafted features with machine learning to deep learning techniques.
-
AI Driven Soccer Analysis Using Computer Vision
A system combining object detection, segmentation, keypoint prediction, and homography transforms soccer video into real-world player positions and tactical statistics.