Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
16 Pith papers cite this work.
citing papers explorer
- 3D Instruction Ambiguity Detection
  Defines 3D Instruction Ambiguity Detection as a new task, releases the Ambi3D benchmark, shows state-of-the-art 3D LLMs struggle with it, and proposes the AmbiVer framework that gathers multi-view visual evidence to guide VLMs in judging ambiguity.
- ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
  ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
- Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
  Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.
- Scene Change Detection with Vision-Language Representation Learning
  LangSCD fuses VLM-generated text descriptions with visual features and adds geometric-semantic matching to improve scene change detection, while releasing the NYC-CD dataset of 8122 New York City image pairs with multiclass annotations.
- STORM: End-to-End Referring Multi-Object Tracking in Videos
  STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
  InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
- UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
  UniSpector organizes visual prompt space with spatial-spectral and contrastive encoders to support open-set defect localization, beating baselines by at least 19.7% AP50b and 15.8% AP50m on the new Inspect Anything benchmark.
- SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
  SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
- EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis
  EmoVerse is a large open-source dataset enabling interpretable visual emotion analysis via B-A-S triplets, region grounding, and unified CES/DES representations created through an MLLM-driven pipeline.
- PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
  PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
- DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
  DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
- MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
  MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
- Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
  VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSCBench-KITTI-360 for monocular 3D scene completion.
- Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
  HSA-DINO improves open-vocabulary object detection on domain-shifted tasks via hierarchical semantic prompts and dynamic routing while preserving pre-trained generalization.
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
  A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
- OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
  OVOD-Agent models visual reasoning as a weakly Markovian decision process with bandit-driven exploration to create a self-evolving open-vocabulary detector that improves on rare categories in COCO and LVIS.