Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, Shanghang Zhang · 2025 · arXiv 2506.08817

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

OneVLA: A Unified Framework for Embodied Tasks

cs.RO · 2026-05-31 · unverdicted · novelty 6.0

OneVLA is a unified VLA model using a shared action head and multi-stage progressive training with CoT fine-tuning that reports state-of-the-art results on both navigation and manipulation in simulation and real-world settings.

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

citing papers explorer

OneVLA: A Unified Framework for Embodied Tasks · cs.RO · 2026-05-31 · unverdicted · none · ref 59
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking · cs.CV · 2026-02-19 · unverdicted · none · ref 72
