SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
5 Pith papers cite this work.
fields: cs.CV (5)
verdicts: UNVERDICTED (5)
roles: background (1)
polarities: support (1)
citing papers explorer
- Vision Harnessing Agent for Open Ad-hoc Segmentation
  VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
- Best Segmentation Buddies for Image-Shape Correspondence
  The work defines Best Segmentation Buddies as vertices on a 3D shape whose nearest image pixel under distilled features falls inside a given 2D segment, then uses the same features to segment the shape in 3D.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
  SAM 3 can be applied training-free to remote sensing open-vocabulary segmentation and change detection by fusing its semantic and instance heads and filtering with presence scores.
- TeD-Loc: Text Distillation for Weakly Supervised Object Localization
  TeD-Loc improves weakly supervised object localization by distilling CLIP text embeddings to patch embeddings through contrastive alignment plus a localization-guided classifier and QR orthogonalization of text embeddings.