Unsupervised predictive coding models may explain visual brain representation
Pith reviewed 2026-05-25 11:54 UTC · model grok-4.3
The pith
Unsupervised predictive coding models trained on video frames may outperform supervised image classifiers when predicting visual brain activity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deep predictive coding networks learn to predict future sensory states without supervision. Using the PredNet implementation, the paper compares representations from layers trained on video prediction to brain activity via RSA on the Algonauts fMRI and MEG data. In contrast to previous literature, the results indicate that such unsupervised models may outperform supervised image classification baselines, with the best submission achieving average noise-normalized scores of 16.67% on the fMRI track and 27.67% on the MEG track.
What carries the argument
PredNet, a deep predictive coding network trained unsupervised to predict video frames, with its layer activations compared to brain data through representational similarity analysis.
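The RSA comparison described here can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the dissimilarity measure (1 minus Pearson correlation), the use of Spearman correlation to compare matrices, and all array shapes and data are assumptions chosen for the example.

```python
import numpy as np

def rdm(activations):
    """Representational dissimilarity matrix: 1 - Pearson r between stimulus rows."""
    return 1.0 - np.corrcoef(activations)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def rsa_score(model_rdm, brain_rdm):
    """Spearman correlation between the upper triangles of two RDMs."""
    a = average_ranks(upper_triangle(model_rdm))
    b = average_ranks(upper_triangle(brain_rdm))
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical data: 10 stimuli, 64 model features, 30 simulated voxels.
rng = np.random.default_rng(0)
layer_activations = rng.normal(size=(10, 64))
brain_patterns = layer_activations @ rng.normal(size=(64, 30))

print(rsa_score(rdm(layer_activations), rdm(brain_patterns)))
```

Because only the rank order of pairwise dissimilarities enters the score, model layers and brain recordings of very different dimensionality can be compared directly.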
If this is right
- Predictive coding via frame prediction yields brain-like representations without supervision.
- Unsupervised learning on temporal video data can surpass supervised learning on static images for modeling visual cortex.
- RSA scores on Algonauts tracks provide a quantitative measure favoring predictive models over supervised baselines.
- This approach offers a way to generate hypotheses about brain computation using only unlabeled video data.
Where Pith is reading between the lines
- If video prediction is key, models using dynamic natural scenes may better capture brain activity than static image models.
- The finding suggests testing predictive coding in other sensory modalities or brain areas beyond visual cortex.
- Future work could examine whether prediction errors in these models correlate with specific neural signals measured in fMRI or MEG.
- It raises the possibility that brain visual processing relies on temporal prediction mechanisms similar to those in PredNet.
Load-bearing premise
The specific PredNet architecture, its training videos, and the RSA comparison method on the Algonauts dataset provide a fair and general test of whether predictive coding representations explain visual brain activity better than supervised alternatives.
What would settle it
Re-running the RSA comparison after training the same PredNet architecture on static images rather than video sequences and finding equivalent or higher scores would falsify the claimed advantage of unsupervised video prediction.
Deep predictive coding networks are neuroscience-inspired unsupervised learning models that learn to predict future sensory states. We build upon the PredNet implementation by Lotter, Kreiman, and Cox (2016) to investigate if predictive coding representations are useful to predict brain activity in the visual cortex. We use representational similarity analysis (RSA) to compare PredNet representations to functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) data from the Algonauts Project. In contrast to previous findings in the literature (Khaligh-Razavi & Kriegeskorte, 2014), we report empirical data suggesting that unsupervised models trained to predict frames of videos may outperform supervised image classification baselines. Our best submission achieves an average noise normalized score of 16.67% and 27.67% on the fMRI and MEG tracks of the Algonauts Challenge.
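The abstract does not spell out how the noise-normalized score is computed. One common reading, shown below as an assumption rather than the challenge's confirmed formula, is the mean squared model-to-brain RDM correlation across subjects, divided by a noise ceiling and expressed as a percentage. The function name and the input values are hypothetical.

```python
def noise_normalized_score(per_subject_correlations, noise_ceiling_r2):
    """Mean squared correlation across subjects, as a percentage of the noise ceiling.

    Assumption: the ceiling is given in the same squared-correlation units.
    """
    if noise_ceiling_r2 <= 0:
        raise ValueError("noise ceiling must be positive")
    squared = [r * r for r in per_subject_correlations]
    mean_r2 = sum(squared) / len(squared)
    return 100.0 * mean_r2 / noise_ceiling_r2

# Hypothetical values, not taken from the paper.
score = noise_normalized_score([0.2, 0.3], noise_ceiling_r2=0.5)
print(f"{score:.2f}%")
```

Under this reading, a score of 100% would mean the model explains as much RDM variance as the measurement noise allows, so percentages well below 100 leave most explainable variance unaccounted for.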
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that unsupervised predictive coding models, specifically a PredNet implementation trained to predict frames in videos, outperform supervised image classification baselines when their representations are compared via representational similarity analysis (RSA) to fMRI and MEG recordings from the Algonauts dataset; the best submission yields noise-normalized scores of 16.67% (fMRI) and 27.67% (MEG), contrary to prior results such as Khaligh-Razavi & Kriegeskorte (2014).
Significance. If the comparison controls for training-data differences and the reported scores are reproducible, the result would be significant because it supplies empirical evidence that temporal prediction objectives can yield representations more aligned with visual cortex activity than static supervised classification, directly engaging the Algonauts challenge and offering a testable alternative to the supervised-model dominance reported in earlier literature.
major comments (2)
- [Abstract] Abstract: the reported numerical scores (16.67% fMRI, 27.67% MEG) are presented without any description of training procedure, layer selection for RSA, statistical testing, or controls for dataset differences between video prediction and image classification corpora; these omissions prevent evaluation of whether the outperformance claim is supported.
- [Abstract] Abstract (contrast to Khaligh-Razavi & Kriegeskorte 2014): the central claim that unsupervised video-frame prediction outperforms supervised image-classification baselines rests on an unmatched training-data distribution (video vs. static images); no evidence is supplied that the supervised baselines were retrained on the identical video corpus or that layer-wise feature statistics were equalized, so any observed RSA gap could arise from stimulus-domain mismatch rather than the predictive-coding objective.
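The statistical testing requested in the first comment could, for example, take the form of a stimulus-label permutation test on the RDM correlation. The sketch below is one such test under stated assumptions: it uses Pearson correlation for brevity, Euclidean-distance RDMs, and synthetic data. None of these choices are taken from the manuscript.

```python
import numpy as np

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def rdm_permutation_test(model_rdm, brain_rdm, n_permutations=999, seed=0):
    """One-sided test: is the model-brain RDM correlation above chance?

    Stimulus labels of the model RDM are shuffled (rows and columns together)
    to build the null distribution.
    """
    rng = np.random.default_rng(seed)
    brain_vec = upper_triangle(brain_rdm)
    observed = np.corrcoef(upper_triangle(model_rdm), brain_vec)[0, 1]
    exceed = 0
    for _ in range(n_permutations):
        order = rng.permutation(model_rdm.shape[0])
        shuffled = model_rdm[np.ix_(order, order)]
        null_r = np.corrcoef(upper_triangle(shuffled), brain_vec)[0, 1]
        exceed += int(null_r >= observed)
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_permutations + 1)
    return float(observed), float(p_value)

# Hypothetical data: a model RDM and a noisy copy standing in for a brain RDM.
rng = np.random.default_rng(1)
points = rng.normal(size=(12, 5))
model_rdm = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
noise = rng.normal(scale=0.1, size=model_rdm.shape)
brain_rdm = model_rdm + (noise + noise.T) / 2
np.fill_diagonal(brain_rdm, 0.0)

observed_r, p_value = rdm_permutation_test(model_rdm, brain_rdm)
print(observed_r, p_value)
```

Shuffling rows and columns together preserves each RDM's internal structure while breaking the stimulus correspondence between model and brain, which is the null hypothesis of interest. A test of this kind addresses whether a score is above chance; it does not address the domain-mismatch concern raised in the second comment.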
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below, indicating where revisions will be made to the manuscript.
- Referee: [Abstract] Abstract: the reported numerical scores (16.67% fMRI, 27.67% MEG) are presented without any description of training procedure, layer selection for RSA, statistical testing, or controls for dataset differences between video prediction and image classification corpora; these omissions prevent evaluation of whether the outperformance claim is supported.
Authors: We agree that the abstract is concise and omits key methodological details. In the revised manuscript we will expand the abstract to briefly describe the PredNet training procedure on video frames, the specific layers used for RSA, and reference the statistical testing and dataset controls detailed in the methods section.
Revision: yes
- Referee: [Abstract] Abstract (contrast to Khaligh-Razavi & Kriegeskorte 2014): the central claim that unsupervised video-frame prediction outperforms supervised image-classification baselines rests on an unmatched training-data distribution (video vs. static images); no evidence is supplied that the supervised baselines were retrained on the identical video corpus or that layer-wise feature statistics were equalized, so any observed RSA gap could arise from stimulus-domain mismatch rather than the predictive-coding objective.
Authors: We acknowledge that the supervised baselines are standard ImageNet-trained models and were not retrained on the video corpus used for PredNet, nor were layer-wise statistics explicitly equalized. This leaves open the possibility that the RSA difference arises partly from training-data domain rather than the predictive objective alone. In revision we will add explicit discussion of this limitation while noting that the reported result still shows a video-prediction model achieving higher alignment than typical supervised image-classification models; we will also clarify that isolating the objective would require additional matched-data experiments.
Revision: partial
Circularity Check
No circularity: empirical RSA comparison is externally benchmarked
The paper reports direct empirical results: PredNet (from Lotter et al. 2016) is trained on video frames, layer activations are extracted, and RSA scores are computed against held-out Algonauts fMRI/MEG recordings, then compared to supervised image-classification baselines. The headline numbers (16.67% fMRI, 27.67% MEG) are measured quantities on external data; no equation or derivation equates them to quantities defined by the authors' own parameters. The contrast to Khaligh-Razavi & Kriegeskorte (2014) is a literature citation, not a self-citation chain or uniqueness theorem. The central claim therefore rests on independent, falsifiable measurements rather than on any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Representational similarity analysis (RSA) provides a valid and unbiased measure for comparing model layer activations to brain activity patterns.
- domain assumption: The Algonauts fMRI and MEG datasets are representative of visual cortex responses to natural images.