PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.
Samtok: Representing any mask with two words.arXiv preprint arXiv:2601.16093, 2026
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 3years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.
citing papers explorer
-
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.