VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
hub
Grounding dino 1.5: Advance the” edge” of open-set object detection
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.
DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
Vision foundation model embeddings with density modeling outperform state-of-the-art methods for unsupervised semantic and covariate shift detection in autonomous driving inputs.
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
-
Vision Harnessing Agent for Open Ad-hoc Segmentation
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
-
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing
MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
-
Unify Robot Actions in Camera Frame
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
-
Inferring Dynamic Physical Properties from Video Foundation Models
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.
-
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
-
Qwen2.5-VL Technical Report
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
-
Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving
Vision foundation model embeddings with density modeling outperform state-of-the-art methods for unsupervised semantic and covariate shift detection in autonomous driving inputs.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.