MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.
hub Mixed citations
YOLOv12: Attention-Centric Real-Time Object Detectors
Mixed citation behavior. Most common role is background (44%).
abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
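The accuracy comparison in the abstract can be unpacked with simple arithmetic. The sketch below assumes the quoted 2.1 and 1.2 gains are absolute mAP points (not relative improvements) and derives the baseline scores they imply. The names `implied_baseline_map`, `YOLOV12_N_MAP`, and `MARGINS` are hypothetical and used only for illustration; the derived values are implied by the abstract, not quoted from it.

```python
# Figures quoted in the abstract (mAP in percent).
YOLOV12_N_MAP = 40.6
# Stated gains of YOLOv12-N over each baseline, assumed to be absolute mAP points.
MARGINS = {"YOLOv10-N": 2.1, "YOLOv11-N": 1.2}


def implied_baseline_map(reference_map, margin):
    """Return the baseline mAP implied by a stated absolute margin."""
    # Round to one decimal to match the precision used in the abstract.
    return round(reference_map - margin, 1)


implied = {
    name: implied_baseline_map(YOLOV12_N_MAP, margin)
    for name, margin in MARGINS.items()
}

for name, value in implied.items():
    print(f"{name}: about {value}% mAP (implied by the stated margin)")
```

Under that assumption, the stated margins put YOLOv10-N and YOLOv11-N roughly one to two points below YOLOv12-N at the nano scale.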
representative citing papers
MusiCorpus supplies 1,309 pages of real historical handwritten music with transcriptions and annotations, the largest such resource for training optical music recognition systems under realistic conditions.
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
EpiSAM introduces neighbor-aware prediction in a prompt-guided transformer for character segmentation on challenging stone inscriptions, plus an expanded annotated dataset.
RefDiffNet is a lightweight input enhancement block that uses reference image comparison to expose PCB defects, delivering up to 18% relative mAP50:95 gains across YOLO, RT-DETR, and Faster R-CNN detectors with 0.004-0.005M extra parameters.
MORI-Seg learns morphology-aware geometric representations from semantic masks to enable instance segmentation without instance-level annotations.
Introduces the SteelDS dataset with 24,297 annotated frames of E40 steel and copper scrap for object detection and instance segmentation to aid industrial sorting.
TRACE improves multi-video event understanding by grounding evidence in structured timelines before visual reasoning, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.
A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect tasks.
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
Scale-Gest creates a runtime-selectable family of tiny-YOLO models with device-calibrated ACE profiles and an ROI gate that cuts per-frame energy by 4x while holding event-level F1 at 0.8-0.9 on a new driving-gesture dataset.
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
SPL unifies unsupervised and sparsely-supervised 3D object detection via semantic pseudo-labeling that produces bounding boxes and point labels, followed by memory-based prototype learning that mines features from both labeled and unlabeled data.
EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
SoftHGNN introduces differentiable soft hyperedges via learnable prototypes and top-k sparse selection to model high-order visual interactions and improve recognition accuracy.
SpikeDet reaches 52.2% AP on COCO 2017 with spiking networks by optimizing firing patterns via MDSNet and SMFM, using half the energy of prior SNN detectors.
TinyFormer adds Parallel Bi-fusion Module and Spatial Semantic Adapter to a YOLO-DETR hybrid, raising small-object AP by 1.6 points to 58.5% on MS COCO while keeping real-time speed.
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
citing papers explorer
- AnyDepth-DETR/-YOLO: Any-depth object detection with a single network
  A single network achieves any-depth object detection by splitting stages into always-executed essential paths and skippable refinement paths, trained via self-distillation on the full and minimal extremes to maintain stage compatibility.
- Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation
  A cooperative humanoid robot fuses camera-based collective perception with V2X messages to detect collision risks at non-line-of-sight intersections and physically stops merging vehicles.
- InsHuman: Towards Natural and Identity-Preserving Human Insertion
  InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.