MusiCorpus supplies 1,309 pages of real historical handwritten music with transcriptions and annotations, the largest such resource for training optical music recognition systems under realistic conditions.
YOLOv12: Attention-Centric Real-Time Object Detectors
Mixed citation behavior. Most common role is background (44%).
abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
representative citing papers
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
Releases the DAPWH dataset of 3,556 wasp images including 1,739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
RefDiffNet is a lightweight input enhancement block that uses reference image comparison to expose PCB defects, delivering up to 18% relative mAP50:95 gains across YOLO, RT-DETR, and Faster R-CNN detectors with 0.004-0.005M extra parameters.
MORI-Seg learns morphology-aware geometric representations from semantic masks to enable instance segmentation without instance-level annotations.
Introduces the SteelDS dataset with 24,297 annotated frames of E40 steel and copper scrap for object detection and instance segmentation to aid industrial sorting.
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect tasks.
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
SPL unifies unsupervised and sparsely-supervised 3D object detection via semantic pseudo-labeling that produces bounding boxes and point labels, followed by memory-based prototype learning that mines features from both labeled and unlabeled data.
EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
SoftHGNN introduces differentiable soft hyperedges via learnable prototypes and top-k sparse selection to model high-order visual interactions and improve recognition accuracy.
SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
Caries-DETR adds tooth-structure query initialization and lesion-aware loss reweighting to DETR, reaching state-of-the-art caries detection on AlphaDent and DentalAI datasets.
StomaD2 integrates diffusion-based image restoration with a specialized rotated detection network to achieve high-accuracy stomatal phenotyping across more than 130 plant species.
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
FMC-DETR proposes a frequency-decoupled fusion framework with WeKat backbone, MDFC coordination, and CPF fusion modules that claims state-of-the-art results on remote sensing object detection benchmarks.
citing papers explorer
- Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment
  EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
- SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition
  SoftHGNN introduces differentiable soft hyperedges via learnable prototypes and top-k sparse selection to model high-order visual interactions and improve recognition accuracy.
- SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks
  SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
- FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection
  FMC-DETR proposes a frequency-decoupled fusion framework with WeKat backbone, MDFC coordination, and CPF fusion modules that claims state-of-the-art results on remote sensing object detection benchmarks.