IF-VidCap: Can video caption models follow instructions?
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.
Citing papers explorer
- VCIFBench: Evaluating Complex Instruction Following for Video Understanding
  VCIFBench provides 306 test instructions, a 540-pair DPO dataset, and a conflict diagnostic set to evaluate complex constraint satisfaction in video MLLMs. The benchmark proves challenging, and DPO training helps.
- Watch, Remember, Reason: Human-View Video Understanding with MLLMs
  This survey frames video MLLM research through a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.