Video Reasoning without Training

Ankita Nayak; Deepak Sridhar; Harris Teague; Jeya Pradha Jeyaraj; Kartikeya Bhardwaj; Nuno Vasconcelos

arxiv: 2510.17045 · v2 · pith:GKB7Y23Vnew · submitted 2025-10-19 · cs.CV · cs.AI · cs.LG

classification: cs.CV, cs.AI, cs.LG

keywords: models, reasoning, model, entropy, thinking, training, v-reason, video

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., they avoid excessive randomness while exploring or thinking through an answer). We then use these novel, theoretically grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. The controller is guided by an entropy-based objective that tunes the model's behavior directly at inference, without any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project page: https://deepaksridhar.github.io/vreason.github.io/
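
To make the entropy signal mentioned in the abstract concrete, the following is a minimal sketch of how the Shannon entropy of a next-token distribution could be computed at each decoding step, and how the step of peak entropy and the final entropy could be read off the resulting trace. It is an illustration only, not the authors' implementation: the function names `token_entropy`, `entropy_trace`, and `peak_step` are hypothetical, the logits are toy values, and the V-Reason controller and its value-cache optimization are not shown.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trace(step_logits):
    """Entropy at every decoding step (hypothetical helper)."""
    return [token_entropy(logits) for logits in step_logits]


def peak_step(trace):
    """Index of the decoding step with the highest entropy."""
    return max(range(len(trace)), key=trace.__getitem__)


# Toy logits over a 4-token vocabulary for three decoding steps (made-up values).
toy_logits = [
    [10.0, 0.0, 0.0, 0.0],  # confident step: low entropy
    [0.0, 0.0, 0.0, 0.0],   # uniform step: maximum entropy, log(4)
    [5.0, 0.0, 0.0, 0.0],   # fairly confident final step
]
trace = entropy_trace(toy_logits)
print("peak at step", peak_step(trace), "final entropy", trace[-1])
```

In this toy trace the entropy peaks at the uniform middle step and ends lower, which mirrors in miniature the "later peak, lower final entropy" pattern the abstract describes. With a real model, the per-step logits would come from the decoder's output layer.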

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV · 2026-04 · conditional · novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...