Tracking with human-intent reasoning

Zhu, J · 2023 · arXiv 2312.17448

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · baseline 1

citation-polarity summary

background 2 · baseline 1

representative citing papers

FeVOS: Foresight Expression Video Object Segmentation

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

Introduces the FeVOS task, a 968-clip dataset with foresight expressions, and an MLLM model FeVOS-R1 trained via SFT then RL that reports SOTA on the new task plus generalization to prior RVOS benchmarks.

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

SetCon achieves state-of-the-art open-ended referring segmentation by using LVLM-generated set-level concepts for joint mask decoding, with gains increasing for multi-target cases on image and video benchmarks.

Learning to Track Instance from Single Nature Language Description

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Tracker is a self-supervised VL tracker that uses a Dynamic Token Aggregation Module to learn instance tracking from single language descriptions in unlabeled videos and outperforms prior self-supervised methods.

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

Event-Aware Instructed Assistant for Referring Video Segmentation

cs.CV · 2026-06-25 · unverdicted · novelty 5.0

EVIS decomposes videos into text-related events via learnable queries and hybrid object-pixel learning to improve referring video segmentation.

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

cs.CL · 2026-06-24 · unverdicted · novelty 5.0

The survey formalizes MLLM perception as a unified vision-language capability and traces its evolution via a new five-stage taxonomy while outlining future challenges.

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

citing papers explorer

FeVOS: Foresight Expression Video Object Segmentation

cs.CV · 2026-06-24 · unverdicted · none · ref 49

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

cs.CV · 2026-05-19 · unverdicted · none · ref 73

Learning to Track Instance from Single Nature Language Description

cs.CV · 2026-05-08 · unverdicted · none · ref 65

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · none · ref 57

Event-Aware Instructed Assistant for Referring Video Segmentation

cs.CV · 2026-06-25 · unverdicted · none · ref 63

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

cs.CL · 2026-06-24 · unverdicted · none · ref 88

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

cs.CV · 2026-05-08 · unverdicted · none · ref 35
