Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

David Cox; Gabriel Kreiman; William Lotter

arxiv: 1605.08104 · v5 · pith:JK5S5EX4new · submitted 2016-05-25 · cs.LG · cs.AI · cs.CV · cs.NE · q-bio.NC

classification cs.LG cs.AI cs.CV cs.NE q-bio.NC

keywords learning, networks, learn, unsupervised, movement, network, object, prediction

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
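
The idea that each layer makes a local prediction and forwards only the deviation can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the per-layer predictions are supplied by hand rather than learned, and the convolutional and recurrent machinery of an actual network is omitted. Representing the signed error as two rectified channels (positive and negative) is one simple way to keep activations non-negative.

```python
# Illustrative sketch only: names and structure are assumptions,
# not code from the paper.

def relu(x):
    """Rectify a scalar: negative values become 0.0."""
    return x if x > 0.0 else 0.0


def prediction_error(actual, predicted):
    """Return the deviation of `actual` from `predicted` as two rectified
    channels: positive errors (actual > predicted), then negative errors."""
    pos = [relu(a - p) for a, p in zip(actual, predicted)]
    neg = [relu(p - a) for a, p in zip(actual, predicted)]
    return pos + neg


def forward_errors(frame, layer_predictions):
    """Pass a signal up a stack of layers, forwarding only prediction errors.

    `layer_predictions[i]` is layer i's (hypothetical, hand-supplied) local
    prediction of its own input. Returns the error emitted by each layer.
    """
    errors = []
    signal = frame
    for predicted in layer_predictions:
        err = prediction_error(signal, predicted)
        errors.append(err)
        signal = err  # the next layer sees only the deviation
    return errors


if __name__ == "__main__":
    frame = [0.5, 0.2]
    # Layer 0 predicts the first value exactly and misses the second;
    # layer 1 predicts "no error" for all four error channels.
    predictions = [[0.5, 0.0], [0.0, 0.0, 0.0, 0.0]]
    for i, err in enumerate(forward_errors(frame, predictions)):
        print(f"layer {i} error: {err}")
```

When a layer's prediction matches its input exactly, the error it forwards is all zeros, so higher layers receive a signal only where lower layers failed to predict.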

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models
cs.CV 2025-10 unverdicted novelty 7.0

A time-reversed reconstruction method couples visual language models with constrained diffusion to generate past scene frames from current thermal traces in controlled scenarios.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
cs.CL 2026-03 unverdicted novelty 6.0

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
cs.CV 2025-12 unverdicted novelty 6.0

DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

Prediction horizon shapes representations in predictive learning
cs.LG 2025-11 unverdicted novelty 6.0

Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.

Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Order Matters: Shuffling Sequence Generation for Video Prediction
cs.CV 2019-07 unverdicted novelty 6.0

SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.

Do vision models perceive illusory motion in static images like humans?
cs.CV 2026-04 unverdicted novelty 4.0

Most optical flow models do not generate flow fields matching human perception of the Rotating Snakes illusion, but a dual-channel recurrent model does during simulated saccades.

Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers
eess.IV 2024-10 unverdicted novelty 4.0

Online RNNs (RTRL, SnAp-1) beat linear filters and transformers at medium-to-long horizon forecasting of PCA respiratory motion weights in two cine-MRI datasets, yielding sub-1.4 mm and sub-2.8 mm geometric errors.