End-to-end object detection with transformers
Mixed citation behavior. Most common role is background (67%).
citing papers explorer
- SARES-DEIM: Sparse Mixture-of-Experts Meets DETR for Robust SAR Ship Detection
SARES-DEIM achieves 76.4% mAP50:95 and 93.8% mAP50 on HRSID by routing SAR features through sparse frequency and wavelet experts plus a high-resolution preservation neck, outperforming prior YOLO and SAR detectors.
- Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters
Single-layer two-head Transformers learn sparse XOR with O(polylog(d)) parameters in one gradient step, breaking the Omega(d) parameter bottleneck of FFNNs.
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.
- AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI
AMAR uses a transformer with learnable query embeddings for set-based prediction of concurrent activities from composite Wi-Fi CSI, combined with edge feature extraction and vector quantization for bandwidth-efficient deployment.
- Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.
- Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and pseudo-fake samples.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
- Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
- Learning to Detect and Segment for Open Vocabulary Object Detection
CondHead conditionally parameterizes detection heads on semantic embeddings via aggregated expert and dynamically generated streams to improve generalization for novel categories.
- Predicting the thermodynamics in the chromosphere from the translation of SDO data into the IRIS$^{2}$ inversion results using a visual transformer model
A visual transformer model trained on IRIS inversions predicts chromospheric temperature and density from SDO data with correlations around 0.8 on 80% of test cases.
- RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
- Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
- MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling
MapATM improves lane divider AP by 4.6 and mAP by 2.6 on NuScenes by treating actor trajectories as structural priors for road geometry.
- SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authentic samples.
- A Comparative Study of Modern Object Detectors for Robust Apple Detection in Orchard Imagery
YOLO11n achieves the highest mAP@0.5:0.95 of 0.6065 for apple localization, with other detectors showing trade-offs in recall and precision at low confidence thresholds.