Tracking with human-intent reasoning
7 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- FeVOS: Foresight Expression Video Object Segmentation
  Introduces the FeVOS task, a 968-clip dataset with foresight expressions, and an MLLM model FeVOS-R1 trained via SFT then RL that reports SOTA on the new task plus generalization to prior RVOS benchmarks.
- SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
  SetCon achieves state-of-the-art open-ended referring segmentation by using LVLM-generated set-level concepts for joint mask decoding, with gains increasing for multi-target cases on image and video benchmarks.
- Learning to Track Instance from Single Nature Language Description
  Tracker is a self-supervised VL tracker that uses a Dynamic Token Aggregation Module to learn instance tracking from single language descriptions in unlabeled videos and outperforms prior self-supervised methods.
- Online Reasoning Video Object Segmentation
  The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
- Event-Aware Instructed Assistant for Referring Video Segmentation
  EVIS decomposes videos into text-related events via learnable queries and hybrid object-pixel learning to improve referring video segmentation.
- From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
  The survey formalizes MLLM perception as a unified vision-language capability and traces its evolution via a new five-stage taxonomy while outlining future challenges.
- RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
  RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.