PteroSet is a new strongly annotated dataset of 563 tropical bird recordings (73.62 h) containing 15,372 time-frequency labels for 168 species, released in COCO-style JSON with a binary bird detection baseline.
super hub Mixed citations
Microsoft COCO: common objects in context
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
DelAnyFlow combines a YOLOv11 model trained on the FBIS 22M dataset with post-processing to generate accurate vector field boundaries from multi-resolution satellite imagery, enabling country-scale mapping in hours.
Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.
Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.
MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.
GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.
Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.
DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.
Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
MHSA mitigates hallucinations in LVLMs by training an MLP to steer cross-modal attention, extending detection work to mitigation via attention replacement at inference.
DeFakerOne is a unified foundation model for joint image-level fake image detection and pixel-level localization that reports SOTA results on 39 detection and 9 localization benchmarks.
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
Watermark removal leaves detectable forensic artifacts, so no current method balances attack success, perceptual quality, and undetectability.
HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.
A visual-semantic spatiotemporal framework creates the Street Economic Vitality Index (SEVI) to diagnose urban street economic vitality by parsing streetscapes with AI, standardizing brands via VLM-LLM, and incorporating lagged LBS demand data with Gaussian spillover modeling.
citing papers explorer
-
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
-
BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton
BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.
-
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
-
Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source
DelAnyFlow combines a YOLOv11 model trained on the FBIS 22M dataset with post-processing to generate accurate vector field boundaries from multi-resolution satellite imagery, enabling country-scale mapping in hours.
-
When transformers learn "impossible" languages, what do they learn?
Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.
-
Vesta: A Generalist Embodied Reasoning Model
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
-
Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation
QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.
-
RedactionBench
Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.
-
MDS-DETR: DETR with Masked Duplicate Suppressor
MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.
-
GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.
-
Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.
-
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.
-
Stylistic Attribute Control in Latent Diffusion Models
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
-
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure
EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.
-
Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks
Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.
-
MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
MHSA mitigates hallucinations in LVLMs by training an MLP to steer cross-modal attention, extending detection work to mitigation via attention replacement at inference.
-
Venus-DeFakerOne: Unified Fake Image Detection & Localization
DeFakerOne is a unified foundation model for joint image-level fake image detection and pixel-level localization that reports SOTA results on 39 detection and 9 localization benchmarks.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing
Watermark removal leaves detectable forensic artifacts, so no current method balances attack success, perceptual quality, and undetectability.
-
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.
-
Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
A visual-semantic spatiotemporal framework creates the Street Economic Vitality Index (SEVI) to diagnose urban street economic vitality by parsing streetscapes with AI, standardizing brands via VLM-LLM, and incorporating lagged LBS demand data with Gaussian spillover modeling.