Microsoft COCO: common objects in context

C. Lawrence Zitnick, Deva Ramanan, James Hays, Michael Maire, Pietro Perona, Piotr Dollár + 2 more · 2014 · Lecture Notes in Computer Science · DOI 10.1007/978-3-319-10602-1_48

Mixed citation behavior. Most common role is background (67%).

26 Pith papers citing it

25k external citations · Crossref

Background 67% of classified citations

citation-role summary

background: 3 · dataset: 3

citation-polarity summary

background: 4 · use dataset: 2

authors

C. Lawrence Zitnick Deva Ramanan James Hays Michael Maire Pietro Perona Piotr Dollár Serge Belongie Tsung-Yi Lin

representative citing papers

A strongly annotated passive acoustic dataset for tropical bird monitoring

cs.SD · 2026-05-20 · accept · novelty 7.0 · 2 refs

PteroSet is a new strongly annotated dataset of 563 tropical bird recordings (73.62 h) containing 15,372 time-frequency labels for 168 species, released in COCO-style JSON with a binary bird detection baseline.

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.

BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.

FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

astro-ph.IM · 2026-04-14 · unverdicted · novelty 7.0

FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

cs.CV · 2025-11-17 · unverdicted · novelty 7.0

DelAnyFlow combines a YOLOv11 model trained on the FBIS 22M dataset with post-processing to generate accurate vector field boundaries from multi-resolution satellite imagery, enabling country-scale mapping in hours.

When transformers learn "impossible" languages, what do they learn?

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.

RedactionBench

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.

MDS-DETR: DETR with Masked Duplicate Suppressor

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.

DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.

Stylistic Attribute Control in Latent Diffusion Models

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

cs.NI · 2026-05-01 · unverdicted · novelty 6.0

EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.

Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

cs.CV · 2024-11-29 · unverdicted · novelty 6.0

Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

cs.CV · 2024-03-05 · conditional · novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

MHSA mitigates hallucinations in LVLMs by training an MLP to steer cross-modal attention, extending detection work to mitigation via attention replacement at inference.

Venus-DeFakerOne: Unified Fake Image Detection & Localization

cs.CV · 2026-05-13 · unverdicted · novelty 5.0

DeFakerOne is a unified foundation model for joint image-level fake image detection and pixel-level localization that reports SOTA results on 39 detection and 9 localization benchmarks.

Make Your LVLM KV Cache More Lightweight

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

Watermark removal leaves detectable forensic artifacts, so no current method balances attack success, perceptual quality, and undetectability.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

cs.CY · 2026-04-10 · unverdicted · novelty 5.0

A visual-semantic spatiotemporal framework creates the Street Economic Vitality Index (SEVI) to diagnose urban street economic vitality by parsing streetscapes with AI, standardizing brands via VLM-LLM, and incorporating lagged LBS demand data with Gaussian spillover modeling.

citing papers explorer

Showing 22 of 22 citing papers after filters.

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

cs.CV · 2026-05-11 · unverdicted · none · ref 24

Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

cs.CV · 2026-05-04 · unverdicted · none · ref 48

Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.

BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton

cs.CV · 2026-05-03 · unverdicted · none · ref 25

BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.

FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

astro-ph.IM · 2026-04-14 · unverdicted · none · ref 37

FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

cs.CV · 2025-11-17 · unverdicted · none · ref 3

DelAnyFlow combines a YOLOv11 model trained on the FBIS 22M dataset with post-processing to generate accurate vector field boundaries from multi-resolution satellite imagery, enabling country-scale mapping in hours.

When transformers learn "impossible" languages, what do they learn?

cs.CL · 2026-06-29 · unverdicted · none · ref 159

Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · none · ref 70

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

cs.CV · 2026-06-18 · unverdicted · none · ref 14

QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.

RedactionBench

cs.CL · 2026-06-17 · unverdicted · none · ref 35

Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.

MDS-DETR: DETR with Masked Duplicate Suppressor

cs.CV · 2026-05-22 · unverdicted · none · ref 12

MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

cs.CV · 2026-05-18 · unverdicted · none · ref 3

GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

cs.CV · 2026-05-12 · unverdicted · none · ref 18

Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.

DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

cs.CV · 2026-05-07 · unverdicted · none · ref 8

DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.

Stylistic Attribute Control in Latent Diffusion Models

cs.CV · 2026-05-04 · unverdicted · none · ref 112

A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

cs.NI · 2026-05-01 · unverdicted · none · ref 26

EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.

Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

cs.CV · 2024-11-29 · unverdicted · none · ref 8

Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.

MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

cs.CV · 2026-05-14 · unverdicted · none · ref 17

MHSA mitigates hallucinations in LVLMs by training an MLP to steer cross-modal attention, extending detection work to mitigation via attention replacement at inference.

Venus-DeFakerOne: Unified Fake Image Detection & Localization

cs.CV · 2026-05-13 · unverdicted · none · ref 63

DeFakerOne is a unified foundation model for joint image-level fake image detection and pixel-level localization that reports SOTA results on 39 detection and 9 localization benchmarks.

Make Your LVLM KV Cache More Lightweight

cs.CV · 2026-05-01 · unverdicted · none · ref 54

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

cs.CV · 2026-04-28 · unverdicted · none · ref 18

Watermark removal leaves detectable forensic artifacts, so no current method balances attack success, perceptual quality, and undetectability.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · none · ref 87

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

cs.CY · 2026-04-10 · unverdicted · none · ref 16

A visual-semantic spatiotemporal framework creates the Street Economic Vitality Index (SEVI) to diagnose urban street economic vitality by parsing streetscapes with AI, standardizing brands via VLM-LLM, and incorporating lagged LBS demand data with Gaussian spillover modeling.
