VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
An improved baseline for reasoning segmentation with large language model
18 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
baseline 1
polarities
baseline 1
representative citing papers
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
InstanceControl uses VLMs to auto-generate instance masks from text and visual conditions, with adaptive refinement, to enable controllable multi-object image generation without manual labeling.
CCRC is a dual-chain framework using change-aware attention in an MLLM for the new ICCS task of joint change captioning and segmentation, reporting SOTA results on benchmarks.
The work introduces the MTRS task, the MTRefSeg-21K benchmark of 21K image-text-mask triplets, and the MTRefSeg-R1 LVLM baseline, which outperforms standard models via two-stage change-aware training.
InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA, and introduces a new 7,662-class benchmark.
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
A video-only speech-guided system for skull-base surgery segments and tracks instruments to deliver 2.32 mm tool-tip accuracy and rapid 3D model registration.
SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.
Dynamic-dLLM achieves over 3x average inference speedup on dLLMs like LLaDA-8B via adaptive cache budgets and decoding thresholds while preserving benchmark performance.
GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.
B-GRTO pre-trains a segmentation tool via bootstrapped group relative optimization on GRPO rollouts, yielding substantial gains over plain GRPO on referring segmentation benchmarks.
citing papers explorer
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
- An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
The work introduces the MTRS task, the MTRefSeg-21K benchmark of 21K image-text-mask triplets, and the MTRefSeg-R1 LVLM baseline, which outperforms standard models via two-stage change-aware training.
- WOW-Seg: A Word-free Open World Segmentation Model
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA, and introduces a new 7,662-class benchmark.
- Mitigating Object Hallucinations via Sentence-Level Early Intervention
SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.