Omni-RGPT: Unifying image and video region-level understanding via token marks
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.MM (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt converts videos into query-conditioned marked versions via a linguistic-parsing and open-vocabulary segmentation bridge that embeds instance masks, semantic markers, and frame indices to improve Vid-LLM temporal grounding.