The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
Villa: Video reasoning segmentation with large language model
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
-
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
-
Weakly-Supervised Referring Video Object Segmentation through Text Supervision
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
-
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
-
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
-
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.