Semantic-SAM: Segment and Recognize Anything at Any Granularity
8 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Fields: cs.CV (8)
Roles: background (1)
Polarities: background (1)
citing papers explorer
- COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition
  COCOTree is a 21K-image benchmark with 1.8M nodes and an OTQ metric for the new task of open tree-structured visual decomposition.
- Vision Harnessing Agent for Open Ad-hoc Segmentation
  VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
- Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
  VideoRepair detects text-video misalignments via MLLM-generated questions and performs localized, region-preserving refinement to improve alignment in existing T2V diffusion models.
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
  Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
- Amodal SAM: A Unified Amodal Segmentation Framework with Generalization
  Amodal SAM extends SAM with a Spatial Completion Adapter, Target-Aware Occlusion Synthesis for data, and consistency losses to reach SOTA amodal segmentation with strong generalization to new objects and scenes.
- MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
  MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
- Personalization Toolkit: Training Free Personalization of Large Vision Language Models
  Presents a training-free personalization toolkit for LVLMs that extracts features via vision foundation models, applies RAG for instance retrieval, and uses visual prompting for multi-concept adaptation on images and videos, claiming SOTA results on a new real-world benchmark.
- UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
  UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.