How does longer temporal context enhance multimodal narrative video processing in the brain?

Anant Khandelwal; Bapi S. Raju; Manish Gupta; Prachi Jindal; Subba Reddy Oota; Tanmoy Chakraborty

arxiv: 2602.07570 · v2 · pith:7AMHPBDOnew · submitted 2026-02-07 · 🧬 q-bio.NC · cs.AI · cs.CV · cs.LG

Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty

Pith reviewed 2026-05-21 13:34 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.AI · cs.CV · cs.LG

keywords fMRI · multimodal large language models · temporal context · narrative video · brain alignment · movie watching · cortical hierarchy · video models

The pith

Longer video clips substantially improve brain alignment for multimodal large language models but not for unimodal video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how the length of temporal context in video clips shapes alignment between model features and human brain activity recorded via fMRI during movie viewing. Extending clips from 3 to 24 seconds markedly raises alignment scores for multimodal models that combine vision and language, while pure video models gain little. Short windows match activity in perceptual and early language brain areas, whereas longer windows better match higher-order integrative regions; the same short-to-long progression appears across successive layers of the multimodal models. Different narrative prompts produce distinct, region-specific alignment patterns that shift with clip length in higher brain areas.
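To make "length of temporal context" concrete, here is a minimal sketch of one way a clip window could be attached to each fMRI volume. The windowing rule (each clip ends at the volume's acquisition time and reaches back by the clip duration) and the repetition time used in the example are illustrative assumptions; this page does not describe the paper's actual scheme. The name `clip_windows` is hypothetical.

```python
def clip_windows(n_trs, tr_seconds, clip_seconds):
    """Return (start, end) times in seconds of the clip that ends at each fMRI volume.

    Assumption: the clip for volume i ends when that volume is acquired and
    extends back clip_seconds, truncated at the start of the movie.
    """
    windows = []
    for i in range(n_trs):
        end = (i + 1) * tr_seconds
        start = max(0.0, end - clip_seconds)
        windows.append((start, end))
    return windows


# Compare the shortest and longest context lengths mentioned in the paper.
# The 1.5 s repetition time is a placeholder, not a value from the paper.
for clip_seconds in (3.0, 24.0):
    print(clip_seconds, clip_windows(4, 1.5, clip_seconds))
```

Under this rule, early volumes see truncated clips for long context lengths, which is one reason a real pipeline might discard the first seconds of each movie.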

Core claim

Increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Experiments with four narrative-task prompts show that they elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions.
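As a rough illustration of what "clip-level tuning" could mean operationally, the sketch below picks, for each brain region, the clip length with the highest alignment score. The score table is made-up toy data, the region names are placeholders, and the intermediate clip lengths (6 and 12 s) are assumed; only the 3 s and 24 s endpoints come from the paper. The same argmax could be taken over model layers to express a layer-to-cortex mapping.

```python
import numpy as np


def preferred_index(scores, axis_values):
    """For each region (column of scores), return the axis value with the highest score."""
    best = np.argmax(scores, axis=0)
    return [axis_values[i] for i in best]


clip_lengths = [3, 6, 12, 24]  # seconds; 6 and 12 are assumed intermediate values
regions = ["early_visual", "higher_order"]  # placeholder region names

# Toy alignment scores, rows = clip lengths, columns = regions (NOT results from the paper).
toy_scores = np.array([
    [0.30, 0.10],
    [0.28, 0.15],
    [0.25, 0.20],
    [0.22, 0.26],
])

tuning = dict(zip(regions, preferred_index(toy_scores, clip_lengths)))
print(tuning)
```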

What carries the argument

Alignment between fMRI signals from naturalistic movie watching and model features extracted at varying clip durations (3-24 s) and network layers.
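The referee report further down this page describes the alignment metric as Pearson correlation or ridge regression between layer activations and fMRI signals. The following is a minimal sketch of that general recipe on synthetic data: fit ridge weights from features to voxel responses, then score held-out predictions with Pearson correlation. The function names, the regularization strength, and the data are assumptions for illustration and are not the authors' pipeline.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights: (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def pearson_per_voxel(Y_true, Y_pred):
    """Pearson r between measured and predicted time courses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


def alignment_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit on the training split, return held-out Pearson r per voxel."""
    W = ridge_fit(X_train, Y_train, alpha)
    return pearson_per_voxel(Y_test, X_test @ W)


# Synthetic stand-ins: rows are time points, X columns are model features,
# Y columns are voxels. None of this is data from the paper.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 100, 10, 5
W_true = rng.normal(size=(n_feat, n_vox))
X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_vox))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_vox))

scores = alignment_score(X_train, Y_train, X_test, Y_test)
print(scores.shape)
```

Running the same function once per clip duration and per layer would yield the kind of score grid the paper compares across regions.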

If this is right

Multimodal models capture long-timescale narrative information in a manner that parallels higher-order cortical regions.
Short temporal windows suffice for alignment with perceptual and early language brain areas.
Narrative prompts modulate context-length effects in integrative regions in a task-dependent way.
Long-form narrative movies provide a testbed for probing temporal integration in long-context multimodal models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models trained with extended video contexts may more closely emulate human narrative comprehension over longer timescales.
The observed layer-to-cortex mapping could help identify which model components correspond to specific stages of narrative processing.
The differential benefit for multimodal over unimodal models points to vision-language integration as a key beneficiary of longer temporal context.
These alignment patterns might guide targeted experiments on temporal integration in other cognitive tasks such as memory retrieval.

Load-bearing premise

The chosen alignment metrics between model features and fMRI signals specifically reflect the brain's dynamic use of temporal narrative context rather than other correlated factors such as overall signal strength or model capacity.

What would settle it

No measurable rise in alignment scores for multimodal models when clip duration increases, after controlling for signal strength and model capacity in a fresh fMRI dataset of movie viewing.

Original abstract

Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3-24 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, experiments with four narrative-task prompts show that they elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Our work positions long-form narrative movies as a principled testbed for studying long-timescale temporal integration in long-context MLLMs and its relationship to cortical responses during narrative comprehension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Longer clips improve brain alignment far more for MLLMs than for unimodal video models, but the gains may stem from basic signal stability rather than true use of narrative context.

The main thing to know is that extending clip length from 3 to 24 seconds boosts how well multimodal large language models match fMRI signals from movie viewers, while unimodal video models show little improvement. Shorter windows line up with early perceptual and language areas, longer ones with higher integrative regions, and the models show a matching layer-to-cortex pattern. Narrative prompts also shift the alignment in region-specific ways.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the effects of temporal context length (3-24s video clips) and narrative-task prompting on brain-model alignment using fMRI data from participants watching full-length movies. It claims that increasing clip duration substantially boosts alignment for multimodal large language models (MLLMs) but yields little gain for unimodal video models; shorter windows align preferentially with perceptual and early language regions while longer windows align with higher-order integrative regions, a pattern mirrored by layer-to-cortex hierarchies in MLLMs; and that four narrative-task prompts produce task-specific, region-dependent alignment patterns with context-dependent shifts in higher-order regions.

Significance. If the empirical results prove robust, the work offers a useful naturalistic testbed for linking long-timescale temporal integration in the brain to long-context capabilities in MLLMs. The reported dissociation between multimodal and unimodal models, together with the layer-cortex correspondence, could inform both computational models of narrative comprehension and the design of AI systems that better match human cortical dynamics.

major comments (2)

[Results section on clip-duration effects] The central claim that longer clips specifically enhance dynamic narrative-context integration rests on alignment metrics (Pearson correlation or ridge regression between layer activations and fMRI signals) whose improvements with clip length could arise from increased feature variance, reduced noise, or greater total input information rather than temporal integration mechanisms. No controls that equate total information content across clip lengths, normalize feature norms, or ablate the language component of MLLMs are described, leaving open the possibility that the MLLM-vs-unimodal dissociation and the region-specific shifts reflect capacity or signal-to-noise scaling instead.
[Methods and Results on alignment computation and hierarchy analysis] The reported layer-to-cortex hierarchy and preferential alignment of longer windows with higher-order regions would be strengthened by explicit tests that the alignment gains survive after regressing out overall feature strength or clip-length-dependent statistics; without such isolation the hierarchy could be an artifact of how longer inputs affect the stability of the similarity measures.
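One of the controls the referee asks for, normalizing feature norms across clip lengths, can be sketched as follows. This is a hypothetical preprocessing step and not something the paper is reported to have done: each time point's feature vector is rescaled to unit L2 norm so that longer clips cannot gain alignment merely by producing larger activations. The arrays are synthetic and the helper name is an assumption.

```python
import numpy as np


def l2_normalize_rows(X, eps=1e-12):
    """Scale each row (one time point's feature vector) to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


# Synthetic features whose overall magnitude grows with clip length
# (an assumed confound for illustration, not a measured effect).
rng = np.random.default_rng(1)
features_3s = rng.normal(scale=1.0, size=(50, 8))
features_24s = rng.normal(scale=4.0, size=(50, 8))

norm_3s = l2_normalize_rows(features_3s)
norm_24s = l2_normalize_rows(features_24s)

# Before normalization the 24 s features are larger on average; afterwards
# both sets have unit-norm rows, so magnitude alone cannot drive a difference.
print(np.linalg.norm(features_24s, axis=1).mean() > np.linalg.norm(features_3s, axis=1).mean())
```

Normalizing rows removes magnitude but not dimensionality or variance structure, so it addresses only one of the alternative explanations the referee lists.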

minor comments (2)

[Abstract] The abstract states that experiments used 'four narrative-task prompts' but does not list or briefly characterize them; adding one sentence describing the prompts would improve readability.
[Figure legends and Results] All figures reporting alignment scores should explicitly state participant numbers, error-bar definitions (e.g., SEM across subjects or bootstrap), and whether statistical tests correct for multiple comparisons across regions or layers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. These points help us clarify potential alternative explanations for our findings on temporal context effects. We respond to each major comment below and indicate the revisions we will incorporate.

Referee: [Results section on clip-duration effects] The central claim that longer clips specifically enhance dynamic narrative-context integration rests on alignment metrics (Pearson correlation or ridge regression between layer activations and fMRI signals) whose improvements with clip length could arise from increased feature variance, reduced noise, or greater total input information rather than temporal integration mechanisms. No controls that equate total information content across clip lengths, normalize feature norms, or ablate the language component of MLLMs are described, leaving open the possibility that the MLLM-vs-unimodal dissociation and the region-specific shifts reflect capacity or signal-to-noise scaling instead.

Authors: We agree that without explicit controls, alternative accounts based on feature variance, noise reduction, or total input information cannot be fully ruled out. However, the dissociation between MLLMs (which show large gains) and unimodal video models (which show little gain) on identical longer clips provides evidence against a purely capacity- or information-scaling explanation, as both model families receive the same extended inputs. The region-specific shifts toward higher-order areas further align with narrative integration rather than generic scaling. In revision we will add supplementary analyses that normalize feature norms across clip lengths and include a dedicated discussion of these alternative explanations; we will also note the limitation regarding full language-component ablation and discuss it using available model variants.
revision: partial

Referee: [Methods and Results on alignment computation and hierarchy analysis] The reported layer-to-cortex hierarchy and preferential alignment of longer windows with higher-order regions would be strengthened by explicit tests that the alignment gains survive after regressing out overall feature strength or clip-length-dependent statistics; without such isolation the hierarchy could be an artifact of how longer inputs affect the stability of the similarity measures.

Authors: We concur that regressing out overall feature strength and clip-length-dependent statistics would strengthen the hierarchy claims. In the revised manuscript we will add these explicit control analyses to the Methods and Results sections, demonstrating that the layer-to-cortex correspondence and the preferential alignment of longer windows with higher-order regions remain after such regression. This will help isolate the contribution of temporal integration from potential artifacts in similarity-measure stability.
revision: yes
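The "regressing out" control discussed in this exchange can be illustrated with a partial correlation: remove the least-squares fit of a nuisance variable from both the model prediction and the measured signal, then correlate the residuals. The sketch below is a generic version under assumed names; the nuisance variable here is a synthetic stand-in for something like per-time-point feature norm, and the data are constructed so that the raw correlation is driven entirely by that nuisance.

```python
import numpy as np


def residualize(y, confounds):
    """Remove the least-squares fit of confounds (plus an intercept) from y."""
    Z = np.column_stack([np.ones(len(y)), confounds])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta


def partial_corr(a, b, confounds):
    """Pearson r between a and b after regressing confounds out of both."""
    ra = residualize(a, confounds)
    rb = residualize(b, confounds)
    return float(np.corrcoef(ra, rb)[0, 1])


# Synthetic example: prediction and signal share only the nuisance component.
rng = np.random.default_rng(2)
n = 500
feature_norm = rng.normal(size=n)  # hypothetical nuisance regressor
pred = feature_norm + 0.5 * rng.normal(size=n)
bold = feature_norm + 0.5 * rng.normal(size=n)

raw_r = float(np.corrcoef(pred, bold)[0, 1])
partial_r = partial_corr(pred, bold, feature_norm)
print(round(raw_r, 2), round(partial_r, 2))
```

If an alignment gain with longer clips survived this kind of residualization, that would count against the signal-stability explanation; if it vanished, as in this constructed example, the nuisance variable would be the more economical account.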

Circularity Check

0 steps flagged

No significant circularity: empirical brain-model alignment comparisons are independent of fitted parameters or self-referential definitions.

The paper reports experimental results from fMRI recordings during movie viewing, computing alignment scores between model layer activations and brain signals across clip durations (3-24s) and prompt conditions. No equations, derivations, or parameter-fitting steps are described that would reduce reported alignment gains to quantities defined by the paper's own inputs or self-citations. The central findings rely on direct, externally measurable correlations between independent data sources (neural recordings and pretrained model features), with contrasts between MLLMs and unimodal models serving as controls. This structure is self-contained against external benchmarks and contains no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard domain assumptions of brain-model alignment studies without introducing new free parameters, axioms beyond common neuroimaging practice, or invented entities.

axioms (1)

Domain assumption: fMRI BOLD signals can be aligned with model-derived features using standard similarity metrics to index shared representational content.
Invoked implicitly when reporting improved alignment with longer clips; this is a background assumption of the field rather than a paper-specific postulate.
