Time Blindness: Why Video-Language Models Can't See What Humans Can?

Mohamed Elhoseiny; Mukul Ranjan; Ujjwal Upadhyay; Zhiqiang Shen

arxiv: 2505.24867 · v2 · submitted 2025-05-30 · 💻 cs.CV · cs.AI

Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny

Pith reviewed 2026-05-19 12:10 UTC · model grok-4.3

keywords SpookyBench · video-language models · temporal reasoning · spatial bias · benchmark · human vs machine · noise sequences · temporal patterns

The pith

Video-language models score zero on tasks where humans read patterns from sequences of noise-like frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SpookyBench to show that current video-language models cannot recognize information carried only by the order of noise-like frames, while humans do so with high accuracy. This reveals that models lean almost entirely on spatial details inside single frames and do not learn to read timing cues. The gap grows when spatial signals are weakened, and it appears in models of many sizes. A sympathetic reader would care because many real video tasks depend on subtle timing rather than clear pictures in each frame. The authors argue that new designs are needed to handle temporal sequences on their own.

Core claim

We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Humans recognize shapes, text, and patterns in these sequences with over 98 percent accuracy, but state-of-the-art VLMs achieve 0 percent accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. When trained on datasets with low spatial signal-to-noise ratios, temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning.

What carries the argument

SpookyBench, a benchmark of noise-like frame sequences that carries all task information through temporal order alone and removes usable spatial content per frame.
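
To make the idea of "information carried by temporal order alone" concrete, here is a minimal sketch of one possible construction. It is an assumption for illustration, not the authors' generation procedure, which this page does not describe. Foreground pixels keep their initial random value in every frame, while background pixels are redrawn independently, so each single frame is uniform i.i.d. binary noise whatever the hidden shape is. The names `make_sequence`, `decode`, and `MASK` are hypothetical.

```python
import random

def make_sequence(mask, n_frames, seed=0):
    """Build binary noise frames that hide `mask` in time only.

    Foreground pixels (mask == 1) keep their initial random value in every
    frame; background pixels are redrawn independently in every frame.
    """
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    first = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    frames = [first]
    for _ in range(n_frames - 1):
        frame = [[first[r][c] if mask[r][c] else rng.randint(0, 1)
                  for c in range(w)] for r in range(h)]
        frames.append(frame)
    return frames

def decode(frames):
    """Recover the mask: a pixel counts as foreground if it never changes."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[int(all(f[r][c] == frames[0][r][c] for f in frames))
             for c in range(w)] for r in range(h)]

# Hypothetical 5x5 plus-shaped target.
MASK = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

frames = make_sequence(MASK, n_frames=32)
recovered = decode(frames)
```

With 32 frames, a given background pixel stays unchanged throughout with probability 2^-31, so `recovered` matches `MASK` in practice. A reader that only inspects frames one at a time has nothing to work with, which is the property the benchmark is claimed to have.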

If this is right

VLMs will continue to fail on any video task whose solution depends only on frame timing once spatial cues are removed.
Training on low spatial signal-to-noise data will cause faster loss of temporal capability in models than in humans.
New model architectures or training methods must be developed that separate spatial feature use from temporal processing.
The observed weakness is not limited to one model size or family but appears across current designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Performance on SpookyBench could serve as a diagnostic for whether a video model will handle long-term timing in noisy real-world footage.
Adding explicit temporal-only training stages might reduce the spatial bias seen here.
The same construction could be applied to audio or other time-series modalities to test for analogous limitations.

Load-bearing premise

The generated frames are truly noise-like and contain no residual spatial information or artifacts that models could use instead of temporal order.
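
One way to probe this premise is to compare a single-frame statistic with a temporal one. The sketch below is a hedged illustration on a hand-written toy sequence, not a test taken from the paper; `mean_inside_outside`, `flip_rate`, and the 2x2 data are hypothetical. If the premise holds, the inside/outside means of any one frame should not separate the hidden region, while the per-pixel flip rate across frames should.

```python
def mean_inside_outside(frame, mask):
    """Mean pixel value inside and outside the mask for ONE frame."""
    coords = [(r, c) for r in range(len(mask)) for c in range(len(mask[0]))]
    inside = [frame[r][c] for r, c in coords if mask[r][c]]
    outside = [frame[r][c] for r, c in coords if not mask[r][c]]
    return sum(inside) / len(inside), sum(outside) / len(outside)

def flip_rate(frames):
    """Per-pixel fraction of consecutive frame pairs in which the pixel changes."""
    h, w = len(frames[0]), len(frames[0][0])
    pairs = len(frames) - 1
    return [[sum(frames[t][r][c] != frames[t + 1][r][c] for t in range(pairs)) / pairs
             for c in range(w)] for r in range(h)]

# Toy data: the diagonal pixels are "foreground" and never change,
# the off-diagonal pixels flip on every frame.
toy_mask = [[1, 0],
            [0, 1]]
toy_frames = [
    [[1, 0], [1, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [1, 0]],
    [[1, 1], [0, 0]],
]

per_frame_means = [mean_inside_outside(f, toy_mask) for f in toy_frames]
rates = flip_rate(toy_frames)
# per_frame_means -> (0.5, 0.5) for every frame: no single-frame gap
# rates -> [[0.0, 1.0], [1.0, 0.0]]: the mask shows up only across time
```

A real leakage audit would need many frames, richer per-frame features, and a trained single-frame classifier as the baseline; this only shows the shape of the comparison.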

What would settle it

A controlled regeneration of SpookyBench that confirms zero spatial leakage, followed by VLMs scoring above chance on the tasks.

Original abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code have been made available on our project website: https://timeblindness.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SpookyBench, a benchmark in which information is encoded solely in the temporal sequences of noise-like frames. It claims that humans recognize shapes, text, and patterns in these sequences with over 98% accuracy, while state-of-the-art VLMs achieve 0% accuracy, attributing the gap to VLMs' over-reliance on frame-level spatial features and inability to extract temporal cues. The work further asserts that training on low spatial SNR datasets causes temporal understanding to degrade more rapidly in models than in humans, with the limitation persisting across scales and architectures, and releases the dataset and code to promote research in temporal pattern recognition.

Significance. If the SpookyBench frames are confirmed to contain no residual spatial information or artifacts, the reported human-VLM performance gap would provide compelling evidence of a fundamental limitation in current video-language models' capacity for pure temporal reasoning. This could meaningfully influence future VLM architectures for applications involving subtle temporal signals. The public release of the dataset and code is a clear strength that supports reproducibility and community follow-up work.

major comments (2)

[Abstract] The central claim that the 0% VLM accuracy demonstrates an 'inability to extract meaning from temporal cues' rests on the assertion that 'information is encoded solely in temporal sequences of noise-like frames.' No details are supplied on the noise model, frame generation procedure, statistical tests confirming absence of spatial structure, single-frame baselines, or controls for generation artifacts. This is load-bearing for the interpretation of the performance gap.
[Abstract] The additional claim that 'temporal understanding of models degrades more rapidly than human perception' under low spatial SNR training is stated without any description of the training protocol, datasets used, evaluation metrics, or quantitative results, preventing assessment of whether this finding supports the broader temporal-blindness argument.
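
As an illustration of the kind of statistical test for spatial structure that the first comment asks for, a mask-free check could measure how strongly each pixel correlates with its right-hand neighbour within a single frame. This is a sketch under assumed conventions (frames as lists of rows of numbers), not a procedure from the paper; `horizontal_neighbor_corr` is a hypothetical helper.

```python
import math

def horizontal_neighbor_corr(frame):
    """Pearson correlation between each pixel and its right-hand neighbour."""
    xs = [row[c] for row in frame for c in range(len(row) - 1)]
    ys = [row[c + 1] for row in frame for c in range(len(row) - 1)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: correlation is undefined, reported as 0
    return cov / math.sqrt(vx * vy)

# Two structured toy frames, for contrast with what noise should give.
blocky = [[0, 0, 1, 1],
          [0, 0, 1, 1]]        # neighbours tend to agree
checkerboard = [[0, 1, 0, 1],
                [1, 0, 1, 0]]  # neighbours always disagree

blocky_corr = horizontal_neighbor_corr(blocky)              # 0.5
checkerboard_corr = horizontal_neighbor_corr(checkerboard)  # -1.0
```

For frames that are truly independent noise, this value should sit near zero on every frame, within sampling error. A consistent offset would point to the residual spatial structure the comment is concerned about.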

minor comments (1)

[Abstract] The abstract could include a brief statement of benchmark scale (number of sequences or examples) to contextualize the reported accuracy figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and the opportunity to clarify the methodological aspects of our work. We respond to each major comment below.

Point-by-point responses

Referee: [Abstract] The central claim that the 0% VLM accuracy demonstrates an 'inability to extract meaning from temporal cues' rests on the assertion that 'information is encoded solely in temporal sequences of noise-like frames.' No details are supplied on the noise model, frame generation procedure, statistical tests confirming absence of spatial structure, single-frame baselines, or controls for generation artifacts. This is load-bearing for the interpretation of the performance gap.

Authors: We agree with the referee that the abstract lacks these supporting details, which are crucial for the claim. We will revise the manuscript to include a brief description of the noise model, frame generation procedure, statistical tests for absence of spatial structure, single-frame baselines, and artifact controls, either directly in the abstract or in an added methods summary. This will allow readers to better assess the validity of the temporal encoding assertion.

revision: yes

Referee: [Abstract] The additional claim that 'temporal understanding of models degrades more rapidly than human perception' under low spatial SNR training is stated without any description of the training protocol, datasets used, evaluation metrics, or quantitative results, preventing assessment of whether this finding supports the broader temporal-blindness argument.

Authors: We acknowledge that the abstract does not describe the training protocol or provide quantitative results for the degradation claim. We will revise the abstract to include a summary of the low spatial SNR training setup, the datasets and metrics used, and the key quantitative findings comparing model and human degradation rates. This revision will better support the broader argument regarding temporal understanding.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark results with no derivation chain or self-referential reduction

Full rationale

The paper introduces SpookyBench and reports direct empirical measurements (0% VLM accuracy vs. >98% human accuracy) on a new dataset where information is claimed to reside solely in temporal sequences. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-definitions. The abstract contains no self-citations, ansatzes, or uniqueness theorems. The central claim is an experimental observation on the released benchmark, which is externally falsifiable and does not rely on any load-bearing step that collapses to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the benchmark isolates temporal information without exploitable spatial signals and that standard VLM training regimes produce the observed degradation in low-SNR settings.

axioms (1)

Domain assumption: SpookyBench frames contain no spatial information usable by VLMs.
Stated in the abstract as "information is encoded solely in temporal sequences of noise-like frames."

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The TIME Machine: On The Power of Motion for Efficient Perception
cs.CV · 2026-05 · unverdicted · novelty 6.0

TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.

From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
cs.AI · 2026-04 · unverdicted · novelty 6.0

EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

Why Do Vision Language Models Struggle To Recognize Human Emotions?
cs.CV · 2026-04 · unverdicted · novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
cs.CV · 2025-09 · unverdicted · novelty 5.0

Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual ...

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
cs.CV · 2026-04 · unverdicted · novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.