Time Blindness: Why Video-Language Models Can't See What Humans Can?
Pith reviewed 2026-05-19 12:10 UTC · model grok-4.3
The pith
Video-language models score zero on tasks where humans read patterns from sequences of noise-like frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Humans recognize shapes, text, and patterns in these sequences with over 98 percent accuracy, but state-of-the-art VLMs achieve 0 percent accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. When trained on datasets with low spatial signal-to-noise ratios, temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning.
What carries the argument
SpookyBench, a benchmark of noise-like frame sequences that carries all task information through temporal order alone and removes usable spatial content per frame.
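One way to picture "information carried by temporal order alone" is the toy construction below. It is a minimal sketch under stated assumptions, not the authors' confirmed generation procedure: the function name `make_temporal_only_clip`, the opposite-direction drift, and the 32x32 square mask are all hypothetical choices made for illustration. Every frame is built from independent binary noise, so a single frame has no shape in it; the shape appears only because noise inside the mask drifts one way while noise outside drifts the other.

```python
import numpy as np

def make_temporal_only_clip(mask, num_frames=16, seed=0):
    """Return a (T, H, W) uint8 clip of binary noise.

    Assumption for illustration only: pixels inside `mask` show a noise
    field that drifts toward higher row indices, pixels outside show an
    independent noise field that drifts the opposite way.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    fg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)  # foreground noise
    bg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)  # background noise
    frames = np.empty((num_frames, h, w), dtype=np.uint8)
    for t in range(num_frames):
        moved_fg = np.roll(fg, shift=t, axis=0)   # one row per frame, downward
        moved_bg = np.roll(bg, shift=-t, axis=0)  # one row per frame, upward
        frames[t] = np.where(mask, moved_fg, moved_bg)
    return frames

# Hypothetical shape: a filled square in the middle of a 32x32 frame.
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
clip = make_temporal_only_clip(mask)
print(clip.shape, clip.dtype)
```

Because each pixel of each frame is drawn from one of two independent noise fields, any single frame is statistically indistinguishable from plain noise; only comparing consecutive frames reveals where the square is.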
If this is right
- VLMs will continue to fail on any video task whose solution depends only on frame timing once spatial cues are removed.
- Training on low spatial signal-to-noise data will cause faster loss of temporal capability in models than in humans.
- New model architectures or training methods must be developed that separate spatial feature use from temporal processing.
- The observed weakness is not limited to one model size or family but appears across current designs.
Where Pith is reading between the lines
- Performance on SpookyBench could serve as a diagnostic for whether a video model will handle long-term timing in noisy real-world footage.
- Adding explicit temporal-only training stages might reduce the spatial bias seen here.
- The same construction could be applied to audio or other time-series modalities to test for analogous limitations.
Load-bearing premise
The generated frames are truly noise-like and contain no residual spatial information or artifacts that models could use instead of temporal order.
What would settle it
A controlled regeneration of SpookyBench that confirms zero spatial leakage, followed by a re-evaluation of the VLMs. Scores that remain at 0 percent would support the claim; scores above chance would contradict it.
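A leakage check of this kind could look roughly like the sketch below. It assumes a clip stored as a `(T, H, W)` array and a known ground-truth mask; the helper names `single_frame_gap` and `motion_agreement`, and the toy clip itself, are hypothetical and do not come from the paper or its released code. The first helper is a crude single-frame baseline (does any one frame differ in mean intensity inside versus outside the shape?), and the second measures how well a frame predicts the next one under a one-row shift, with the reversed clip as an order control.

```python
import numpy as np

def single_frame_gap(frames, mask):
    """Largest per-frame |mean inside mask - mean outside mask|."""
    inside = frames[:, mask].mean(axis=1)
    outside = frames[:, ~mask].mean(axis=1)
    return float(np.abs(inside - outside).max())

def motion_agreement(frames, mask, shift=1):
    """Fraction of masked pixels where frame t+1 equals frame t moved by `shift` rows."""
    predicted = np.roll(frames[:-1], shift=shift, axis=1)
    return float((frames[1:] == predicted)[:, mask].mean())

# Toy clip (an assumption for illustration): noise inside a square drifts
# down one row per frame, independent noise outside drifts up.
rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
fg = rng.integers(0, 2, size=(32, 32))
bg = rng.integers(0, 2, size=(32, 32))
clip = np.stack([
    np.where(mask, np.roll(fg, t, axis=0), np.roll(bg, -t, axis=0))
    for t in range(16)
])

print("single-frame gap:", single_frame_gap(clip, mask))
print("agreement, original order:", motion_agreement(clip, mask))
print("agreement, reversed order:", motion_agreement(clip[::-1], mask))
```

A small single-frame gap together with high agreement in the original order and near-chance agreement in the reversed order is what a temporal-only encoding should produce. A mean-intensity comparison is only a first-order test; it would not rule out higher-order spatial artifacts such as compression patterns.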
original abstract
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code have been made available on our project website: https://timeblindness.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpookyBench, a benchmark in which information is encoded solely in the temporal sequences of noise-like frames. It claims that humans recognize shapes, text, and patterns in these sequences with over 98% accuracy, while state-of-the-art VLMs achieve 0% accuracy, attributing the gap to VLMs' over-reliance on frame-level spatial features and inability to extract temporal cues. The work further asserts that training on low spatial SNR datasets causes temporal understanding to degrade more rapidly in models than in humans, with the limitation persisting across scales and architectures, and releases the dataset and code to promote research in temporal pattern recognition.
Significance. If the SpookyBench frames are confirmed to contain no residual spatial information or artifacts, the reported human-VLM performance gap would provide compelling evidence of a fundamental limitation in current video-language models' capacity for pure temporal reasoning. This could meaningfully influence future VLM architectures for applications involving subtle temporal signals. The public release of the dataset and code is a clear strength that supports reproducibility and community follow-up work.
major comments (2)
- [Abstract] The central claim that the 0% VLM accuracy demonstrates an 'inability to extract meaning from temporal cues' rests on the assertion that 'information is encoded solely in temporal sequences of noise-like frames.' No details are supplied on the noise model, frame generation procedure, statistical tests confirming absence of spatial structure, single-frame baselines, or controls for generation artifacts. This is load-bearing for the interpretation of the performance gap.
- [Abstract] The additional claim that 'temporal understanding of models degrades more rapidly than human perception' under low spatial SNR training is stated without any description of the training protocol, datasets used, evaluation metrics, or quantitative results, preventing assessment of whether this finding supports the broader temporal-blindness argument.
minor comments (1)
- [Abstract] The abstract could include a brief statement of benchmark scale (number of sequences or examples) to contextualize the reported accuracy figures.
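To see why benchmark scale matters for reading a 0% score, the sketch below computes a one-sided upper confidence bound on the true accuracy when a model gets zero answers right out of `n_trials`. The function name and the example sizes are hypothetical; the paper's actual number of sequences is not stated in the material above. The bound follows from solving (1 - p)^n = alpha for p, which is approximately 3/n at 95% confidence.

```python
def zero_success_upper_bound(n_trials, confidence=0.95):
    """One-sided upper bound on true accuracy after 0 correct answers in n_trials."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    alpha = 1.0 - confidence
    # With true accuracy p, the chance of 0 successes is (1 - p) ** n_trials.
    # Setting that equal to alpha and solving for p gives the bound.
    return 1.0 - alpha ** (1.0 / n_trials)

# 100 and 1000 are illustrative sizes, not figures from the paper.
for n in (100, 1000):
    print(n, round(zero_success_upper_bound(n), 4))
```

The larger the benchmark, the tighter the bound, so a stated sequence count would let readers judge how firmly "0% accuracy" excludes a small but nonzero success rate.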
Simulated Author's Rebuttal
We appreciate the referee's detailed review and the opportunity to clarify the methodological aspects of our work. We respond to each major comment below.
point-by-point responses
- Referee: [Abstract] The central claim that the 0% VLM accuracy demonstrates an 'inability to extract meaning from temporal cues' rests on the assertion that 'information is encoded solely in temporal sequences of noise-like frames.' No details are supplied on the noise model, frame generation procedure, statistical tests confirming absence of spatial structure, single-frame baselines, or controls for generation artifacts. This is load-bearing for the interpretation of the performance gap.
  Authors: We agree with the referee that the abstract lacks these supporting details, which are crucial for the claim. We will revise the manuscript to include a brief description of the noise model, frame generation procedure, statistical tests for absence of spatial structure, single-frame baselines, and artifact controls, either directly in the abstract or in an added methods summary. This will allow readers to better assess the validity of the temporal encoding assertion.
  Revision: yes
- Referee: [Abstract] The additional claim that 'temporal understanding of models degrades more rapidly than human perception' under low spatial SNR training is stated without any description of the training protocol, datasets used, evaluation metrics, or quantitative results, preventing assessment of whether this finding supports the broader temporal-blindness argument.
  Authors: We acknowledge that the abstract does not describe the training protocol or provide quantitative results for the degradation claim. We will revise the abstract to include a summary of the low spatial SNR training setup, the datasets and metrics used, and the key quantitative findings comparing model and human degradation rates. This revision will better support the broader argument regarding temporal understanding.
  Revision: yes
Circularity Check
Empirical benchmark results with no derivation chain or self-referential reduction
full rationale
The paper introduces SpookyBench and reports direct empirical measurements (0% VLM accuracy vs. >98% human accuracy) on a new dataset where information is claimed to reside solely in temporal sequences. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-definitions. The abstract contains no self-citations, ansatzes, or uniqueness theorems. The central claim is an experimental observation on the released benchmark, which is externally falsifiable and does not rely on any load-bearing step that collapses to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SpookyBench frames contain no spatial information usable by VLMs
Forward citations
Cited by 5 Pith papers
- The TIME Machine: On The Power of Motion for Efficient Perception
  TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
- From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
  EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
- Why Do Vision Language Models Struggle To Recognize Human Emotions?
  VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...
- Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
  Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual ...
- Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.