arxiv: 1907.00441 · v1 · pith:PBO7OLTJnew · submitted 2019-06-30 · 🧬 q-bio.NC · cs.LG · stat.ML

Unsupervised predictive coding models may explain visual brain representation

Pith reviewed 2026-05-25 11:54 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.LG · stat.ML
keywords unsupervised learning · predictive coding · PredNet · visual cortex · fMRI · MEG · representational similarity analysis · Algonauts

The pith

Unsupervised predictive coding models trained on video frames may outperform supervised image classifiers when predicting visual brain activity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether unsupervised predictive coding networks can explain visual brain representations by training PredNet to predict future video frames and comparing its internal representations to fMRI and MEG recordings using representational similarity analysis. It finds that these unsupervised models achieve higher noise-normalized scores than supervised image classification models on the Algonauts dataset, with the top submission reaching 16.67 percent on fMRI and 27.67 percent on MEG. A sympathetic reader would care if this holds because it supports the idea that the brain learns visual representations through prediction rather than explicit classification tasks. This challenges earlier results favoring supervised models and suggests video prediction as a promising unsupervised approach for modeling the visual system.

Core claim

Deep predictive coding networks learn to predict future sensory states without supervision. Using the PredNet implementation, the paper compares representations from layers trained on video prediction to brain activity via RSA on Algonauts fMRI and MEG data. In contrast to previous literature, the results indicate that such unsupervised models may outperform supervised image classification baselines, with the best submission reaching average noise-normalized scores of 16.67% (fMRI) and 27.67% (MEG).

What carries the argument

PredNet, a deep predictive coding network trained unsupervised to predict video frames, with its layer activations compared to brain data through representational similarity analysis.
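
The comparison step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a correlation-distance representational dissimilarity matrix (RDM) and a Spearman comparison of the RDM upper triangles, and it uses random arrays (`model_acts` and `brain_resp`, both hypothetical) in place of PredNet activations and Algonauts recordings.

```python
import numpy as np

def rdm(features):
    """RDM: 1 - Pearson correlation between rows (one row per stimulus)."""
    return 1.0 - np.corrcoef(features)

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, flattened to a vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def rank(values):
    """Ordinal ranks; ties are not averaged, which is fine for continuous data."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def rsa_score(model_features, brain_features):
    """Spearman correlation between the two RDMs' upper triangles."""
    a = upper_triangle(rdm(model_features))
    b = upper_triangle(rdm(brain_features))
    return np.corrcoef(rank(a), rank(b))[0, 1]

# Hypothetical stand-ins: 20 stimuli, 64 model units, 30 voxels or sensors.
rng = np.random.default_rng(0)
n_stimuli = 20
model_acts = rng.normal(size=(n_stimuli, 64))
readout = rng.normal(size=(64, 30))
brain_resp = model_acts @ readout + 0.1 * rng.normal(size=(n_stimuli, 30))

score = rsa_score(model_acts, brain_resp)
self_score = rsa_score(model_acts, model_acts)  # a model compared with itself
```

Because only the RDMs are compared, the model layer and the brain recording may have different dimensionality; all that must match is the set of stimuli and their order.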

If this is right

  • Predictive coding via frame prediction yields brain-like representations without supervision.
  • Unsupervised learning on temporal video data can surpass supervised learning on static images for modeling visual cortex.
  • RSA scores on Algonauts tracks provide a quantitative measure favoring predictive models over supervised baselines.
  • This approach offers a way to generate hypotheses about brain computation using only unlabeled video data.
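
For readers unfamiliar with the metric, a noise-normalized score can be read as a model-to-brain correlation rescaled by how well the data could be explained at best. The sketch below assumes the score is the squared RDM correlation divided by a noise ceiling and expressed in percent, averaged over regions; that formula, the function names, and the example numbers are illustrative assumptions, not definitions or values taken from the paper.

```python
def noise_normalized_score(correlation, noise_ceiling):
    """Squared correlation as a percentage of the noise ceiling (assumed definition)."""
    if noise_ceiling <= 0:
        raise ValueError("noise_ceiling must be positive")
    return 100.0 * correlation ** 2 / noise_ceiling

def average_score(correlations, noise_ceilings):
    """Mean of the per-region noise-normalized scores."""
    scores = [noise_normalized_score(r, c)
              for r, c in zip(correlations, noise_ceilings)]
    return sum(scores) / len(scores)

# Hypothetical per-region values, chosen only to show the arithmetic.
example_correlations = [0.2, 0.1]
example_ceilings = [0.5, 0.25]
example_average = average_score(example_correlations, example_ceilings)
```

Under this reading, a score well below 100% means that much of the explainable structure in the recordings is still unaccounted for, even for the best-scoring model.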

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If video prediction is key, models using dynamic natural scenes may better capture brain activity than static image models.
  • The finding suggests testing predictive coding in other sensory modalities or brain areas beyond visual cortex.
  • Future work could examine whether prediction errors in these models correlate with specific neural signals measured in fMRI or MEG.
  • It raises the possibility that brain visual processing relies on temporal prediction mechanisms similar to those in PredNet.

Load-bearing premise

The specific PredNet architecture, its training videos, and the RSA comparison method on the Algonauts dataset provide a fair and general test of whether predictive coding representations explain visual brain activity better than supervised alternatives.

What would settle it

Re-running the RSA comparison after training the same PredNet architecture on static images rather than video sequences and finding equivalent or higher scores would falsify the claimed advantage of unsupervised video prediction.

Original abstract

Deep predictive coding networks are neuroscience-inspired unsupervised learning models that learn to predict future sensory states. We build upon the PredNet implementation by Lotter, Kreiman, and Cox (2016) to investigate if predictive coding representations are useful to predict brain activity in the visual cortex. We use representational similarity analysis (RSA) to compare PredNet representations to functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) data from the Algonauts Project. In contrast to previous findings in the literature (Khaligh-Razavi & Kriegeskorte, 2014), we report empirical data suggesting that unsupervised models trained to predict frames of videos may outperform supervised image classification baselines. Our best submission achieves an average noise normalized score of 16.67% and 27.67% on the fMRI and MEG tracks of the Algonauts Challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that unsupervised predictive coding models, specifically a PredNet implementation trained to predict video frames, may outperform supervised image classification baselines when their representations are compared via representational similarity analysis (RSA) to fMRI and MEG recordings from the Algonauts dataset. The best submission yields average noise-normalized scores of 16.67% (fMRI) and 27.67% (MEG), in contrast to prior results such as Khaligh-Razavi & Kriegeskorte (2014).

Significance. If the comparison controls for training-data differences and the reported scores are reproducible, the result would be significant because it supplies empirical evidence that temporal prediction objectives can yield representations more aligned with visual cortex activity than static supervised classification, directly engaging the Algonauts challenge and offering a testable alternative to the supervised-model dominance reported in earlier literature.

major comments (2)
  1. [Abstract] Abstract: the reported numerical scores (16.67% fMRI, 27.67% MEG) are presented without any description of training procedure, layer selection for RSA, statistical testing, or controls for dataset differences between video prediction and image classification corpora; these omissions prevent evaluation of whether the outperformance claim is supported.
  2. [Abstract] Abstract (contrast to Khaligh-Razavi & Kriegeskorte 2014): the central claim that unsupervised video-frame prediction outperforms supervised image-classification baselines rests on an unmatched training-data distribution (video vs. static images); no evidence is supplied that the supervised baselines were retrained on the identical video corpus or that layer-wise feature statistics were equalized, so any observed RSA gap could arise from stimulus-domain mismatch rather than the predictive-coding objective.
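
One way to supply the statistical testing asked for in comment 1 is a stimulus-label permutation test on the RDM correlation. The sketch below is an assumption about what such a test could look like, not a procedure described in the manuscript. For brevity it uses Pearson correlation between RDM upper triangles, and it uses random data (`model_rdm`, `brain_rdm`, both hypothetical) in place of real model and brain RDMs.

```python
import numpy as np

def upper(matrix):
    """Off-diagonal upper-triangle entries, flattened to a vector."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def rdm_correlation(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    return np.corrcoef(upper(rdm_a), upper(rdm_b))[0, 1]

def permutation_p_value(model_rdm, brain_rdm, n_permutations=1000, seed=0):
    """One-sided p-value from shuffling the stimulus order of the model RDM."""
    rng = np.random.default_rng(seed)
    observed = rdm_correlation(model_rdm, brain_rdm)
    n = model_rdm.shape[0]
    at_least_as_large = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        # Permute rows and columns together so the RDM stays symmetric.
        shuffled = model_rdm[np.ix_(perm, perm)]
        if rdm_correlation(shuffled, brain_rdm) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)

# Hypothetical data: a model RDM and a noisy copy standing in for a brain RDM.
rng = np.random.default_rng(1)
features = rng.normal(size=(12, 40))
model_rdm = 1.0 - np.corrcoef(features)
noise = rng.normal(scale=0.01, size=(12, 12))
brain_rdm = model_rdm + (noise + noise.T) / 2

observed, p_value = permutation_p_value(model_rdm, brain_rdm, n_permutations=500)
```

A test of this kind only addresses whether a single model's alignment exceeds chance; it does not by itself resolve the training-data confound raised in comment 2.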

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported numerical scores (16.67% fMRI, 27.67% MEG) are presented without any description of training procedure, layer selection for RSA, statistical testing, or controls for dataset differences between video prediction and image classification corpora; these omissions prevent evaluation of whether the outperformance claim is supported.

    Authors: We agree that the abstract is concise and omits key methodological details. In the revised manuscript we will expand the abstract to briefly describe the PredNet training procedure on video frames, the specific layers used for RSA, and reference the statistical testing and dataset controls detailed in the methods section. revision: yes

  2. Referee: [Abstract] Abstract (contrast to Khaligh-Razavi & Kriegeskorte 2014): the central claim that unsupervised video-frame prediction outperforms supervised image-classification baselines rests on an unmatched training-data distribution (video vs. static images); no evidence is supplied that the supervised baselines were retrained on the identical video corpus or that layer-wise feature statistics were equalized, so any observed RSA gap could arise from stimulus-domain mismatch rather than the predictive-coding objective.

    Authors: We acknowledge that the supervised baselines are standard ImageNet-trained models and were not retrained on the video corpus used for PredNet, nor were layer-wise statistics explicitly equalized. This leaves open the possibility that the RSA difference arises partly from training-data domain rather than the predictive objective alone. In revision we will add explicit discussion of this limitation while noting that the reported result still shows a video-prediction model achieving higher alignment than typical supervised image-classification models; we will also clarify that isolating the objective would require additional matched-data experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical RSA comparison is externally benchmarked

full rationale

The paper reports direct empirical results: PredNet (from Lotter et al. 2016) is trained on video frames, layer activations are extracted, and RSA scores are computed against held-out Algonauts fMRI/MEG recordings, then compared to supervised image-classification baselines. The headline numbers (16.67% fMRI, 27.67% MEG) are measured quantities on external data; no equation or derivation equates them to quantities defined by the authors' own parameters. The contrast to Khaligh-Razavi & Kriegeskorte (2014) is a literature citation, not a self-citation chain or uniqueness theorem. The central claim therefore rests on independent, falsifiable measurements rather than on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the validity of RSA as a proxy for representational similarity between artificial networks and biological cortex, the assumption that video-frame prediction is the relevant unsupervised objective, and the representativeness of the Algonauts dataset and PredNet implementation. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Representational similarity analysis (RSA) provides a valid and unbiased measure for comparing model layer activations to brain activity patterns.
    Invoked when the paper uses RSA scores to conclude that unsupervised models may explain visual brain representation.
  • domain assumption: The Algonauts fMRI and MEG datasets are representative of visual cortex responses to natural images.
    Required to generalize from the challenge scores to broader claims about brain representation.

pith-pipeline@v0.9.0 · 5675 in / 1402 out tokens · 38396 ms · 2026-05-25T11:54:36.141509+00:00 · methodology
