MusiCorpus supplies 1,309 pages of real historical handwritten music with transcriptions and annotations, the largest such resource for training optical music recognition systems under realistic conditions.
hub Mixed citations
YOLOv12: Attention-Centric Real-Time Object Detectors
Mixed citation behavior. Most common role is background (44%).
abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect任务.
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
SPL unifies unsupervised and sparsely-supervised 3D object detection via semantic pseudo-labeling that produces bounding boxes and point labels, followed by memory-based prototype learning that mines features from both labeled and unlabeled data.
EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
SoftHGNN introduces differentiable soft hyperedges via learnable prototypes and top-k sparse selection to model high-order visual interactions and improve recognition accuracy.
SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
FMC-DETR proposes a frequency-decoupled fusion framework with WeKat backbone, MDFC coordination, and CPF fusion modules that claims state-of-the-art results on remote sensing object detection benchmarks.
Knowledge distillation from a rigid-invariant 3D point cloud network into a regulated multi-view Transformer yields lower-error, faster wheat spike volume estimates from 2D images.
ATRACT integrates drone video and wearable sensor data with conditional variational autoencoder augmentation to achieve 85.7% accuracy in casualty action classification for remote battlefield triage.
YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and 0.849 mAP50 on the UODM dataset.
citing papers explorer
-
A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation
MusiCorpus supplies 1,309 pages of real historical handwritten music with transcriptions and annotations, the largest such resource for training optical music recognition systems under realistic conditions.
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
-
AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
-
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect任务.
-
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
-
Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
-
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
-
Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning
SPL unifies unsupervised and sparsely-supervised 3D object detection via semantic pseudo-labeling that produces bounding boxes and point labels, followed by memory-based prototype learning that mines features from both labeled and unlabeled data.
-
Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment
EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
-
SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition
SoftHGNN introduces differentiable soft hyperedges via learnable prototypes and top-k sparse selection to model high-order visual interactions and improve recognition accuracy.
-
SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks
SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
-
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
-
Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
-
InsHuman: Towards Natural and Identity-Preserving Human Insertion
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
-
Caries DETR: Tooth Structure-aware Prior and Lesion-aware Dynamic Loss Refinement for DETR Based Caries Detection
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
-
StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network
StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
-
FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection
FMC-DETR proposes a frequency-decoupled fusion framework with WeKat backbone, MDFC coordination, and CPF fusion modules that claims state-of-the-art results on remote sensing object detection benchmarks.
-
3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat
Knowledge distillation from a rigid-invariant 3D point cloud network into a regulated multi-view Transformer yields lower-error, faster wheat spike volume estimates from 2D images.
-
ATRACT: A Trustworthy Robotic Autonomous system to support Casualty Triage
ATRACT integrates drone video and wearable sensor data with conditional variational autoencoder augmentation to achieve 85.7% accuracy in casualty action classification for remote battlefield triage.
-
A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and 0.849 mAP50 on the UODM dataset.
-
Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices
YOLOv11s and RT-DETRv2-R50-M provide the best accuracy-speed trade-off for real-time weed detection on edge UAV systems, with mAP50 up to 79% and low latency.
-
Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model
YOLOv12 with Otsu thresholding on cell-based segmentation classifies AML cells at 99.3% validation and test accuracy.
-
FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection
FSDETR enhances RT-DETR with SHAB, DA-AIFI, and FSFPN blocks to improve small-object detection, reporting 13.9% APS on VisDrone 2019 and 48.95% AP50 on TinyPerson using 14.7M parameters.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
-
WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery
WildfireVLM integrates YOLOv12 object detection on satellite imagery with multimodal LLMs to detect wildfires and produce contextual risk assessments and response recommendations.
-
Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
-
Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface
A local multi-agent framework integrates YOLO object detection with Slack-Ollama natural language control entirely on Raspberry Pi hardware.
- TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
- Population-Scale Advancing Interface Modeling Reveals How Bacterial Swarms Encode Future Spatial Architecture