Video-of-thought: Step-by-step video reasoning from perception to cognition

Video-of-thought: Step-by-step video reasoning from perception to cognition · 2026 · arXiv 2501.03230

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Introduces AVTrack dataset for audio-visual tracking in challenging human-centric scenes, demonstrating performance drops in existing methods.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

Act2See: Emergent Active Visual Perception for Video Reasoning

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

CREDiT applies counterfactual reasoning via structural causal models to decompose video representations into causal and non-causal parts for more reliable VideoQA on datasets like NExT-GQA and SportsQA.

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

SG-PVR introduces plan-and-verify reasoning grounded in spatio-temporal scene graphs to address verification gaps and implicit evidence in existing T2V reward models.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

cs.CV · 2026-04-16 · unverdicted · novelty 5.0 · 2 refs

Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

cs.CV · 2026-06-05 · unverdicted · novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.

citing papers explorer

Showing 10 of 10 citing papers after filters.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models cs.CV · 2026-04-19 · unverdicted · none · ref 12
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes cs.CV · 2026-06-01 · unverdicted · none · ref 74
Introduces AVTrack dataset for audio-visual tracking in challenging human-centric scenes, demonstrating performance drops in existing methods.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 14
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
Act2See: Emergent Active Visual Perception for Video Reasoning cs.CV · 2026-05-03 · unverdicted · none · ref 9
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging cs.CV · 2026-04-13 · unverdicted · none · ref 6
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 17
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA cs.CV · 2026-06-08 · unverdicted · none · ref 34
CREDiT applies counterfactual reasoning via structural causal models to decompose video representations into causal and non-causal parts for more reliable VideoQA on datasets like NExT-GQA and SportsQA.
Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding cs.CV · 2026-06-10 · unverdicted · none · ref 1
SG-PVR introduces plan-and-verify reasoning grounded in spatio-temporal scene graphs to address verification gaps and implicit evidence in existing T2V reward models.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding cs.CV · 2026-04-16 · unverdicted · none · ref 35 · 2 links
Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.
Watch, Remember, Reason: Human-View Video Understanding with MLLMs cs.CV · 2026-06-05 · unverdicted · none · ref 178
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.

Video-of-thought: Step-by-step video reasoning from perception to cognition

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer