super hub Mixed citations

Microsoft COCO: Common Objects in Context

C. Lawrence Zitnick, Deva Ramanan, James Hays, Michael Maire, Pietro Perona, Piotr Dollár, Serge Belongie, Tsung-Yi Lin · 2014 · Lecture Notes in Computer Science · DOI 10.1007/978-3-319-10602-1_48

Mixed citation behavior. Most common role is background (67%).

21 Pith papers citing it

25k external citations · Crossref

citation-role summary

background 3 · dataset 3

citation-polarity summary

background 4 · use dataset 2

authors

C. Lawrence Zitnick, Deva Ramanan, James Hays, Michael Maire, Pietro Perona, Piotr Dollár, Serge Belongie, Tsung-Yi Lin

representative citing papers

A strongly annotated passive acoustic dataset for tropical bird monitoring

cs.SD · 2026-05-20 · accept · novelty 7.0 · 2 refs

PteroSet is a new strongly annotated dataset of 563 tropical bird recordings (73.62 h) containing 15,372 time-frequency labels for 168 species, released in COCO-style JSON with a binary bird detection baseline.

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.

BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.

FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

astro-ph.IM · 2026-04-14 · unverdicted · novelty 7.0

FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

cs.CV · 2025-11-17 · unverdicted · novelty 7.0

DelAnyFlow combines a YOLOv11 model trained on the FBIS 22M dataset with post-processing to generate accurate vector field boundaries from multi-resolution satellite imagery, enabling country-scale mapping in hours.

MDS-DETR: DETR with Masked Duplicate Suppressor

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

MDS-DETR introduces a masked duplicate suppressor in self-attention to enable one-to-many supervision inside a single decoder, yielding +2.8 mAP over Deformable-DETR on COCO with 5% more training time and outperforming MR.DETR by 0.3 mAP while training 20% faster.

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

GraSP-VL turns frozen VLM embedding length into a controllable semantic granularity interface via a learned shared prefix transform that creates a Semantic Matryoshka structure.

Venus-DeFakerOne: Unified Fake Image Detection & Localization

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.

DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DINORANKCLIP outperforms CLIP and RANKCLIP on fine-grained and out-of-distribution tasks by injecting DINOv3 local structure and using third-order ranking consistency trained on Conceptual Captions 3M.

Stylistic Attribute Control in Latent Diffusion Models

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

cs.NI · 2026-05-01 · unverdicted · novelty 6.0

EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.

Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

cs.CV · 2024-11-29 · unverdicted · novelty 6.0

Four object detection models achieve over 90% average precision detecting excretions in pigsties from thermal images and remain reasonably robust on out-of-distribution data from different barns.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

cs.CV · 2024-03-05 · conditional · novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

Make Your LVLM KV Cache More Lightweight

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

cs.CY · 2026-04-10 · unverdicted · novelty 5.0

A visual-semantic spatiotemporal framework creates the Street Economic Vitality Index (SEVI) to diagnose urban street economic vitality by parsing streetscapes with AI, standardizing brands via VLM-LLM, and incorporating lagged LBS demand data with Gaussian spillover modeling.

Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control

cs.RO · 2026-05-22 · conditional · novelty 4.0

A modified YOLO segmentation model plus sim-trained PPO control yields 84.3% overall success harvesting 281 strawberries in greenhouse trials on a real UR10e manipulator.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

cs.CV · 2026-04-28

citing papers explorer

Showing 21 of 21 citing papers.

A strongly annotated passive acoustic dataset for tropical bird monitoring · cs.SD · 2026-05-20 · accept · none · ref 36 · 2 links
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception · cs.CV · 2026-05-11 · unverdicted · none · ref 24
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting · cs.CV · 2026-05-04 · unverdicted · none · ref 48
BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton · cs.CV · 2026-05-03 · unverdicted · none · ref 25
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation · astro-ph.IM · 2026-04-14 · unverdicted · none · ref 37
Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source · cs.CV · 2025-11-17 · unverdicted · none · ref 3
MDS-DETR: DETR with Masked Duplicate Suppressor · cs.CV · 2026-05-22 · unverdicted · none · ref 12
GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations · cs.CV · 2026-05-18 · unverdicted · none · ref 3
Venus-DeFakerOne: Unified Fake Image Detection & Localization · cs.CV · 2026-05-13 · unverdicted · none · ref 63
Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning · cs.CV · 2026-05-12 · unverdicted · none · ref 18
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency · cs.CV · 2026-05-07 · unverdicted · none · ref 8
Stylistic Attribute Control in Latent Diffusion Models · cs.CV · 2026-05-04 · unverdicted · none · ref 112
EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure · cs.NI · 2026-05-01 · unverdicted · none · ref 26
Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks · cs.CV · 2024-11-29 · unverdicted · none · ref 8
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis · cs.CV · 2024-03-05 · conditional · none · ref 151
Make Your LVLM KV Cache More Lightweight · cs.CV · 2026-05-01 · unverdicted · none · ref 54
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models · cs.CV · 2026-04-20 · unverdicted · none · ref 87
Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics · cs.CY · 2026-04-10 · unverdicted · none · ref 16
Robotic Strawberry Harvesting with Robust Vision and Deep Reinforcement Learning based Sim-to-Real Control · cs.RO · 2026-05-22 · conditional · none · ref 43
Multilingual Vision-Language Models, A Survey · cs.CL · 2025-09-26 · accept · none · ref 86
The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing · cs.CV · 2026-04-28 · unreviewed · ref 18
