YOLOv11: An Overview of the Key Architectural Enhancements
Mixed citation behavior. Most common role is background (40%).
abstract
This study presents an architectural analysis of YOLOv11, the latest iteration in the YOLO (You Only Look Once) series of object detection models. We examine the model's architectural innovations, including the introduction of the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling - Fast), and C2PSA (Convolutional block with Parallel Spatial Attention) components, which improve the model's performance in several ways, such as enhanced feature extraction. The paper explores YOLOv11's expanded capabilities across various computer vision tasks, including object detection, instance segmentation, pose estimation, and oriented object detection (OBB). We review the model's performance improvements in terms of mean Average Precision (mAP) and computational efficiency compared to its predecessors, with a focus on the trade-off between parameter count and accuracy. Additionally, the study discusses YOLOv11's versatility across different model sizes, from nano to extra-large, catering to diverse application needs from edge devices to high-performance computing environments. Our research provides insights into YOLOv11's position within the broader landscape of object detection and its potential impact on real-time computer vision applications.
citing papers explorer
-
Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models
CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.
-
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
-
Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation
A task-specific iterative framework for weakly supervised 4D radar scene flow estimation uses instance-aware self-supervised losses from 2D tracking/segmentation and a rigid static loss from odometry to outperform LiDAR-dependent cross-modal and fully supervised methods on the VoD dataset.
-
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
-
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vision demands.
-
FluxShard: Motion-Aware Feature Cache Reuse for Collaborative Video Analytics in Mobile Edge Computing
FluxShard uses per-block motion vectors and a Receptive Field Alignment Principle to manage feature cache reuse in edge-cloud video analytics, delivering 32.6-83.8% lower latency and 14.9-64.0% lower energy than baselines while preserving accuracy.
-
Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
-
What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters
S2-CoT coordinates a Structural Fidelity Adapter in the encoder-decoder with a Semantic Context Adapter in the entropy model to convert potential performance loss into state-of-the-art gains across base codecs while using only a small fraction of parameters.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
-
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
UniSpector organizes visual prompt space with spatial-spectral and contrastive encoders to support open-set defect localization, beating baselines by at least 19.7% AP50b and 15.8% AP50m on the new Inspect Anything benchmark.
-
VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.
-
Gen-n-Val: Agentic Image Data Generation and Validation
Gen-n-Val uses LLM and VLLM agents with Layer Diffusion and TextGrad to generate and validate synthetic instance data, cutting invalid samples from 50% to 7% and improving rare-class performance on LVIS and COCO benchmarks.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.
-
FedADAS: Communication-Efficient Federated Distillation for On-Device Driver Yawn Recognition in Vehicular Networks
FedADAS uses federated distillation to support heterogeneous on-device yawn recognition models across vehicles, delivering up to 9974x lower communication cost than standard federated learning while preserving accuracy under extreme data heterogeneity.
-
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours
FS-FSD regresses frequency-supervised Fourier contours for bridge defects, yielding higher polygon accuracy and better geometric quality than box, mask, or contour baselines on 3,767 UAV images with 42,346 instances.
-
Look Once, Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity
VIBE is a camera-primed hybrid model-based closed-loop learning system for real-time double-directional mmWave beam management in vehicular networks that achieves outage rates as low as 1.1-1.4% and outperforms 5G NR and end-to-end ML baselines.
-
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital and physical evaluations.
-
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect tasks.
-
Transferable Physical-World Adversarial Patches Against Object Detection in Autonomous Driving
AdvAD produces physical-world adversarial patches with improved transferability to unseen object detectors by multi-model optimization, adaptive balancing, and physical variation robustness.
-
ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing
ZoomSpec achieves 78.1 mAP@0.5:0.95 on the SpaceNet dataset by combining log-space STFT, a coarse proposal net, adaptive heterodyne filtering, and dual-domain fine recognition to improve narrowband visibility in wideband spectrum sensing.
-
Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition
UFPR-VeSV is a new real-world dataset for fine-grained vehicle classification and automatic license plate recognition collected from Brazilian police cameras, with benchmarks demonstrating its difficulty and the value of using the two tasks jointly.
-
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
UAVGen generates higher-quality synthetic UAV images via visual prototype conditioning and focal region focus in diffusion models, leading to better object detection accuracy than prior methods.
-
Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation
A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.
-
PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
PEPR reframes learning with privileged event data as predicting latent event features from RGB to improve domain generalization in object detection and segmentation without direct cross-modal alignment.
-
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
-
Edge Assisted Multi-Camera Vehicle Tracking Framework for Real-Time and Scalable Deployment
EASE-MCVT is a distributed edge-assisted multi-camera vehicle tracking framework that achieves real-time performance and competitive accuracy on public datasets through edge processing and server-side optimizations.
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation
Synthetic data augmentation improves instance segmentation performance for chicken carcasses when real annotated data is limited.
-
A Leaf-Level Dataset for Soybean-Cotton Detection and Segmentation
A new leaf-instance dataset for soybean-cotton detection and segmentation collected across growth stages and conditions from commercial farms is presented and validated with YOLOv11.
-
MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes
MR2-ByteTrack maintains high accuracy in video object detection on MCUs by combining multi-resolution processing, ByteTrack for frame linking, and Rescore for confidence aggregation, achieving up to 55% energy savings and real-time performance for both CNN and Transformer models.
-
ERPPO: Entropy Regularization-based Proximal Policy Optimization
ERPPO adds a DSA-based ambiguity estimator to MAPPO and switches between L1 and L2 entropy regularization to improve exploration and stability in non-stationary multi-dimensional observations.
-
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a three-band height-aware BEV encoding of LiDAR data to enable single-pass real-time 3D detection of pedestrians, cars, and cyclists with improved KITTI accuracy.
-
Exploring Clustering Capability of Inpainting Model Embeddings for Pattern-based Individual Identification
An auxiliary inpainting task improves the clustering of embeddings for identifying individual zebrafish from their skin patterns.
-
Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation
Echo-α integrates organ-specific detectors with global visual context via an invoke-and-reason agentic loop, trained on a nine-task curriculum plus sequential RL, to achieve superior grounding (56.73%/43.78% F1@0.5) and diagnosis (74.90%/49.20% accuracy) on cross-center renal and breast ultrasound.
-
Edge-Cloud Collaborative Reconstruction via Structure-Aware Latent Diffusion for Downstream Remote Sensing Perception
SALD decouples remote sensing images into compressed payload plus structural prior at the edge and uses structure-gated diffusion on the cloud to improve super-resolution and downstream detection under extreme bandwidth limits.
-
DocRevive: A Unified Pipeline for Document Text Restoration
A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.
-
A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures
WSA-Net uses partial convolutions, heterogeneous grouping attention, geometric reconstruction, and context anchoring to enhance low-SCR hyperbolic signatures in GPR data, reaching 0.6958 mAP@0.5 at 164 FPS with 2.412M parameters on the RTST dataset.
-
Human Interaction-Aware 3D Reconstruction from a Single Image
HUG3D uses group-instance multi-view diffusion and physics-based optimization to create physically plausible 3D reconstructions of interacting people from a single image.
-
Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
A method combining pretrained YOLO11, YOLOE-26, and Gaze-LLE models detects student gaze targets in collaborative learning videos with F1-score 0.829 without requiring labeled training data.
-
YawDD+: Frame-level Annotations for Accurate Yawn Prediction
YawDD+ frame-level annotations improve yawn classification to 99.34% accuracy and detection to 95.69% mAP on Jetson hardware compared to video-level labels.
-
A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
YOLO-MD improves underwater marine debris detection by adding a Dual-Branch Convolutional Enhanced Self-Attention module, a lightweight shift operation, and SFG-Loss for class imbalance, achieving 0.875 precision and 0.849 mAP50 on the UODM dataset.
-
Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly
An integrated fringe projection and AI pipeline delivers aligned high-accuracy 3D sensing and instance segmentation for autonomous HDD disassembly at 77.7 FPS.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
Real-Time Structural Detection for Indoor Navigation from 3D LiDAR Using Bird's-Eye-View Images
Projecting 3D LiDAR to BEV images and applying YOLO-OBB with spatiotemporal fusion enables reliable real-time structural detection on resource-constrained robots.
-
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating that VLMs are not yet ready to replace task-specific systems.
-
StreakMind: AI detection and analysis of satellite streaks in astronomical images with automated database integration
StreakMind trains a YOLO OBB model on 2335 images to detect satellite streaks in FITS frames with 94% precision and 97% recall, then applies geometric refinement and orbital database matching.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.