Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

Farshad Khorrami; Francesco Croce; Nicolas Flammarion; Prashanth Krishnamurthy; Sara Ghazanfari; Siddharth Garg

arxiv: 2506.00318 · v2 · submitted 2025-05-31 · 💻 cs.CV

Pith reviewed 2026-05-19 12:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video LLMs · frame-aware reasoning · multimodal models · temporal grounding · video understanding · synthetic data · chain-of-thought · single-stage inference

The pith

Fine-tuning video LLMs on traces that explicitly name key frames enables single-stage reasoning and raises accuracy on video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that video large language models can achieve better understanding by learning to reason in a single stage while explicitly naming the specific frames that support their answers. Earlier methods either chained multiple complex steps for frame selection or skipped detailed temporal links, leading to inconsistencies. The authors create a dataset of questions paired with frame-referenced reasoning traces drawn from both real and synthetic videos, then fine-tune existing video LLMs on it. The resulting models generate more accurate frame identifications in their traces and show steady gains on standard video benchmarks. Notably, training solely on synthetic examples still lifts real-world test scores.

Core claim

The central claim is that by fine-tuning video LLMs on a dataset of chain-of-frames data consisting of questions, answers, and reasoning traces with explicit references to relevant frames, the models learn to produce single-stage reasoning that accurately identifies key frames, thereby improving performance across multiple video understanding benchmarks without needing auxiliary modules.

What carries the argument

Chain-of-frames (CoF) reasoning traces, which embed explicit references to video frames within a unified natural language reasoning process.
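As a rough illustration of what a frame-referenced trace might look like to downstream tooling, the sketch below extracts the cited frame indices from a trace. It assumes frames are cited as "Frame N", following the 'Frame 1', 'Frame 2' wording quoted from the paper elsewhere on this page. The example trace and the helper name `extract_frame_refs` are hypothetical and are not taken from COF-DATA or the authors' code.

```python
import re

# Assumption: frames are cited in the trace as "Frame <number>".
FRAME_REF = re.compile(r"\bFrame\s+(\d+)\b", re.IGNORECASE)


def extract_frame_refs(trace: str) -> list[int]:
    """Return the frame indices cited in a trace, in order of first mention."""
    seen: list[int] = []
    for match in FRAME_REF.finditer(trace):
        idx = int(match.group(1))
        if idx not in seen:
            seen.append(idx)
    return seen


# Hypothetical trace, written only for this illustration.
trace = (
    "In Frame 3 the person picks up a cup. "
    "By Frame 7 the cup is on the shelf, so it was moved after Frame 3."
)
print(extract_frame_refs(trace))  # [3, 7]
```

Keeping the references in plain text means no separate frame-selection output is needed; the indices can be recovered from the single generated trace.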

If this is right

The fine-tuned models accurately identify key frames in their reasoning traces to answer questions.
Performance improves consistently on various video understanding benchmarks.
Synthetic data by itself delivers a notable increase in model accuracy on real benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This single-stage method could simplify deployment of video understanding systems by eliminating the need for separate frame selection tools.
The success with synthetic data suggests that controlled generated videos might serve as efficient training resources for real-world temporal tasks.
Extending the idea to other sequence data, such as time-series or audio, might yield similar grounding benefits.

Load-bearing premise

The frame-grounded reasoning traces used in training truly capture the minimal necessary frames for correct answers without introducing biases or misalignments that models could exploit.

What would settle it

A direct test would be to check whether the fine-tuned models' reasoning traces reference incorrect or irrelevant frames on new videos, or whether benchmark scores fail to rise after training on the chain-of-frames data.
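One way such a check could be scored, sketched under stated assumptions: given the frames a model cites and a set of annotated key frames, compute precision and recall with an optional tolerance in frames. The function name, the tolerance parameter, and the example numbers are all hypothetical; the paper's own evaluation protocol may differ.

```python
def frame_grounding_scores(
    cited: set[int], key_frames: set[int], tolerance: int = 0
) -> tuple[float, float]:
    """Precision and recall of cited frames against annotated key frames.

    A cited frame counts as correct if it lies within `tolerance` frames
    of any key frame (an assumed matching rule, not the paper's).
    """
    if not cited or not key_frames:
        return 0.0, 0.0
    hits = {c for c in cited if any(abs(c - k) <= tolerance for k in key_frames)}
    covered = {k for k in key_frames if any(abs(c - k) <= tolerance for c in cited)}
    return len(hits) / len(cited), len(covered) / len(key_frames)


# Hypothetical example: the model cites frames 3, 7, 12; annotators marked 3 and 8.
precision, recall = frame_grounding_scores({3, 7, 12}, {3, 8}, tolerance=1)
print(round(precision, 3), recall)  # 0.667 1.0
```

Low precision on new videos would indicate that traces cite irrelevant frames; low recall would indicate that needed frames are missed.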

Original abstract

Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chains-of-thoughts (CoT) about the content of input images and videos. For video inputs, prior works use complex multi-step pipelines that extract and include relevant frames from videos in the CoT, or produce simpler single-stage reasoning traces at the expense of poor temporal grounding. Here, we propose the first video LLMs with single-stage reasoning that includes explicit references to relevant frames, thereby reducing temporal inconsistencies in the reasoning process. Our approach is simple, unified, and self-contained, employing a single-stage inference to handle complex video understanding tasks without relying on auxiliary modules for frame selection or caption generation. For this, we first create COF-DATA, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces from both natural and synthetic videos, spanning various topics and tasks. Our models, obtained fine-tuning video LLMs on this chain-of-frames (CoF) data, generate reasoning traces that accurately identify key frames to answer given questions. In turn, this consistently improves performance across multiple video understanding benchmarks. Surprisingly, we find that synthetic data alone, despite being out-of-distribution with respect to these real-world benchmarks, provides a significant boost in model accuracy. Code is available at https://github.com/SaraGhazanfari/CoF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Single-stage frame-referenced reasoning for video LLMs via COF-DATA shows consistent benchmark gains, but the quality of the generated traces is not clearly validated.

The main thing to know is that this paper gives video LLMs a single-stage reasoning format that inserts explicit frame references into the trace, and fine-tuning on their COF-DATA dataset produces steady improvements on several video benchmarks, with an extra lift from synthetic data alone. That combination of single-pass inference plus frame grounding is the concrete step forward. Prior work either split the process into separate frame-selection stages or used reasoning that stayed ungrounded in time, so keeping everything in one forward pass while still naming specific frames is a practical difference. They also built and released a dataset spanning natural and synthetic videos, which is useful on its own, and the code is public. The synthetic-only result is worth paying attention to because it works on real benchmarks despite the distribution shift.

On the soft spots, the central assumption is that the frame-grounded traces in COF-DATA actually pick the right minimal set of frames without systematic errors or biases. The abstract and available details do not show explicit validation metrics or error analysis on how those traces were created, whether manually or automatically. If the frame indices or timing are off in the training data, the observed gains could come from generic reasoning improvements or simply more data rather than genuine temporal grounding. That risk is higher with the synthetic portion.

This paper is aimed at groups already working on multimodal video models and temporal reasoning. Readers who want a straightforward way to add frame awareness without extra modules will get the most out of the dataset and the empirical setup. It has a clear method, released code, and reproducible claims, so it deserves a serious referee rather than a desk reject. I would send it out for peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Chain-of-Frames (CoF), a single-stage reasoning approach for video multimodal LLMs that incorporates explicit references to relevant frames in the reasoning trace. It constructs COF-DATA, a large dataset of questions, answers, and frame-grounded reasoning traces derived from both natural and synthetic videos, then fine-tunes existing video LLMs on this data. The resulting models are claimed to produce accurate frame identifications that reduce temporal inconsistencies and yield consistent gains across video understanding benchmarks; a key observation is that synthetic data alone delivers substantial improvements despite being out-of-distribution relative to the evaluation sets.

Significance. If the central claims are substantiated, the work provides a simple, unified alternative to multi-step pipelines for frame selection and captioning in video LLMs. The public code release and the counter-intuitive synthetic-data result are clear strengths that could influence data-efficient training practices. However, the significance is limited by the absence of direct evidence that benchmark gains arise specifically from improved frame grounding rather than generic effects of additional training data or reasoning supervision.

major comments (3)

[Section 3] COF-DATA construction (Section 3): The manuscript states that the automatically or manually generated frame-grounded reasoning traces 'accurately identify key frames,' yet no validation metrics, inter-annotator agreement scores, or error analysis on frame-index accuracy or temporal alignment are reported. This verification is load-bearing for the claim that fine-tuned models learn genuine frame-aware reasoning rather than spurious associations.
[Section 4] Experimental evaluation (Section 4): Benchmark improvements are reported without ablations that compare CoF fine-tuning against equivalent volumes of non-frame-grounded reasoning data or standard CoT fine-tuning. Without such controls it is unclear whether gains derive from explicit frame references or from increased reasoning supervision and data scale.
[Section 4.3] Synthetic data analysis (Section 4.3): The surprising result that synthetic data alone boosts real-world benchmark accuracy is presented without analysis of potential label artifacts, frame-selection biases, or distribution-shift diagnostics. This omission weakens the interpretation that the gains reflect robust frame-aware generalization.
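The matched control requested in the second major comment could be approximated as sketched below: keep every question and answer, but replace the explicit frame references in each trace so that data volume and trace length stay roughly constant. This is a hypothetical construction for illustration only. The field names `question`, `answer`, and `trace`, the "Frame N" citation pattern, and the replacement phrase are assumptions, and a real control would likely need paraphrasing rather than plain substitution.

```python
import re

# Assumption: frames are cited as "Frame <number>" in the traces.
FRAME_REF = re.compile(r"\bFrame\s+\d+\b", re.IGNORECASE)


def strip_frame_refs(trace: str) -> str:
    """Replace explicit frame citations with a generic, ungrounded phrase."""
    return FRAME_REF.sub("the video", trace)


def matched_control(examples: list[dict]) -> list[dict]:
    """Same questions, answers, and example count, but ungrounded traces."""
    return [{**ex, "trace": strip_frame_refs(ex["trace"])} for ex in examples]


# Hypothetical training example, not taken from COF-DATA.
examples = [
    {"question": "What is lifted?", "answer": "A cup", "trace": "In Frame 3 the cup is lifted."}
]
control = matched_control(examples)
print(control[0]["trace"])  # In the video the cup is lifted.
```

Fine-tuning once on `examples` and once on `control` would give the kind of like-for-like comparison the comment asks for.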

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise comparison table contrasting CoF with prior multi-step and single-stage video CoT methods.
[Section 3] Notation for frame indices within reasoning traces should be illustrated with at least one concrete example in the main text or a figure caption.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects that can strengthen the presentation of our work on Chain-of-Frames. We address each major comment point by point below, clarifying our approach and committing to revisions that directly incorporate the suggested improvements.

Referee: [Section 3] COF-DATA construction (Section 3): The manuscript states that the automatically or manually generated frame-grounded reasoning traces 'accurately identify key frames,' yet no validation metrics, inter-annotator agreement scores, or error analysis on frame-index accuracy or temporal alignment are reported. This verification is load-bearing for the claim that fine-tuned models learn genuine frame-aware reasoning rather than spurious associations.

Authors: We agree that explicit quantitative validation of the frame-grounded reasoning traces is essential to support the claim of genuine frame-aware reasoning. The COF-DATA traces were produced through a combination of automated pipelines (leveraging existing video annotations and LLM-assisted generation) and targeted manual review for a subset of examples. In the revised manuscript, we will add a new subsection under Section 3 that reports: (i) frame-index accuracy on a held-out manually verified sample of 500 traces, (ii) inter-annotator agreement (Cohen's kappa) for the manually annotated portion, and (iii) a brief error analysis categorizing common temporal misalignment issues. These additions will provide direct evidence that the traces reflect accurate key-frame identification rather than spurious patterns.

Revision: yes

Referee: [Section 4] Experimental evaluation (Section 4): Benchmark improvements are reported without ablations that compare CoF fine-tuning against equivalent volumes of non-frame-grounded reasoning data or standard CoT fine-tuning. Without such controls it is unclear whether gains derive from explicit frame references or from increased reasoning supervision and data scale.

Authors: We acknowledge that isolating the contribution of explicit frame references requires controlled comparisons. While the current results already include comparisons against base models and standard instruction tuning, we agree that additional ablations are needed. In the revised manuscript, we will expand Section 4 with two new ablation studies: (1) fine-tuning on an equivalent volume of standard CoT data without any frame references, and (2) fine-tuning on non-frame-grounded reasoning traces of matched length and complexity. These controls will demonstrate that the observed benchmark gains are specifically attributable to the frame-aware component of CoF rather than generic increases in reasoning supervision or data scale.

Revision: yes

Referee: [Section 4.3] Synthetic data analysis (Section 4.3): The surprising result that synthetic data alone boosts real-world benchmark accuracy is presented without analysis of potential label artifacts, frame-selection biases, or distribution-shift diagnostics. This omission weakens the interpretation that the gains reflect robust frame-aware generalization.

Authors: We appreciate the referee's emphasis on rigorously interpreting the synthetic-data result. Although the out-of-distribution nature of the synthetic videos makes the performance lift particularly striking, we agree that further diagnostics are warranted. In the revised Section 4.3, we will add: (i) qualitative examples illustrating frame-selection patterns on real versus synthetic data, (ii) an analysis checking for obvious label artifacts (e.g., by comparing answer distributions), and (iii) a short discussion of distribution-shift metrics (such as feature-space distances between synthetic and real video embeddings). These additions will bolster the interpretation that the gains arise from improved frame-aware reasoning rather than dataset-specific artifacts.

Revision: yes
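For readers unfamiliar with the agreement statistic promised in the first response, here is a minimal sketch of Cohen's kappa for two annotators. The labels are hypothetical (1 means "the trace cites the right frames", 0 means it does not) and are not data from the paper; the function follows the standard definition, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on eight traces.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.467
```

Here the annotators agree on 6 of 8 traces (0.75), but chance agreement is 0.53125, so the corrected score is noticeably lower than the raw agreement rate.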

Circularity Check

0 steps flagged

No circularity: standard dataset construction and fine-tuning pipeline

The paper constructs a new dataset COF-DATA of questions, answers and frame-grounded reasoning traces from natural and synthetic videos, then fine-tunes video LLMs on this data to produce single-stage reasoning with explicit frame references. This is an empirical supervised fine-tuning workflow whose central claim (benchmark gains from the new data) rests on the independent creation and use of COF-DATA rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce outputs to inputs by construction; the approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the quality and coverage of the newly created COF-DATA dataset and the assumption that fine-tuning on frame-grounded traces transfers to benchmark improvements. No explicit free parameters, axioms, or invented entities are detailed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat orbit and embed_strictMono · tag: unclear
Paper passage: We propose chain-of-frames (CoF), a new frame-aware chain-of-thought reasoning approach for video LLMs that integrates temporal information directly into the CoT structure... explicit references to frames... 'Frame 1', 'Frame 2'

IndisputableMonolith/Foundation/AlphaDerivationExplicit.lean · alphaInverseRS formula · tag: unclear
Paper passage: synthetic data alone... provides a significant boost... CoF-InternVL3-8B achieves higher accuracy than... GPT-4o and Gemini 1.5 Pro

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
cs.CV · 2026-05 · unverdicted · novelty 7.0
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

Act2See: Emergent Active Visual Perception for Video Reasoning
cs.CV · 2026-05 · unverdicted · novelty 7.0
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
cs.CV · 2026-04 · unverdicted · novelty 6.0
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
cs.CV · 2026-02 · unverdicted · novelty 6.0
GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

Swift Sampling: Selecting Temporal Surprises via Taylor Series
cs.CV · 2026-05 · unverdicted · novelty 5.0
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.