Deep predictive coding networks for video prediction and unsupervised learning

William Lotter, Gabriel Kreiman, David Cox · 2016 · cs.LG · arXiv 1605.08104

8 Pith papers cite this work. Polarity classification is still indexing.


abstract

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
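The local-prediction idea described in the abstract can be illustrated with a toy sketch. This is not the authors' PredNet implementation. It assumes a single layer, a hypothetical `previous_frame_predictor` that simply predicts the last frame, and an error signal formed by splitting the deviation into rectified positive and negative parts. All function names here are illustrative. Only the error, not the raw frame, is passed on.

```python
def prediction_error(actual, predicted):
    """Split the deviation into rectified positive and negative parts."""
    pos = [max(a - p, 0.0) for a, p in zip(actual, predicted)]
    neg = [max(p - a, 0.0) for a, p in zip(actual, predicted)]
    return pos, neg


def previous_frame_predictor(prev_frame):
    """Hypothetical stand-in predictor: assume the next frame equals the last one."""
    return list(prev_frame)


def forward_errors(frames):
    """For each frame after the first, compute the error that would be forwarded."""
    errors = []
    for t in range(1, len(frames)):
        predicted = previous_frame_predictor(frames[t - 1])
        errors.append(prediction_error(frames[t], predicted))
    return errors


# Toy "video": one bright pixel moving one step right per frame on a 4-pixel strip.
frames = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
errors = forward_errors(frames)
pos, neg = errors[0]
print(pos)  # where the pixel appeared unexpectedly
print(neg)  # where the predicted pixel was missing
```

In this sketch a perfectly predicted frame yields all-zero errors, so only unexpected changes would propagate to a subsequent layer.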

representative citing papers

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

cs.CV · 2025-10-06 · unverdicted · novelty 7.0

A time-reversed reconstruction method couples visual language models with constrained diffusion to generate past scene frames from current thermal traces in controlled scenarios.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

cs.CL · 2026-03-12 · unverdicted · novelty 6.0

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

cs.CV · 2025-12-29 · unverdicted · novelty 6.0

DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

Prediction horizon shapes representations in predictive learning

cs.LG · 2025-11-12 · unverdicted · novelty 6.0

Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Order Matters: Shuffling Sequence Generation for Video Prediction

cs.CV · 2019-07-20 · unverdicted · novelty 6.0

SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.

Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers

eess.IV · 2024-10-08 · unverdicted · novelty 4.0

Online RNNs (RTRL, SnAp-1) beat linear filters and transformers at medium-to-long horizon forecasting of PCA respiratory motion weights in two cine-MRI datasets, yielding sub-1.4 mm and sub-2.8 mm geometric errors.

Do vision models perceive illusory motion in static images like humans?

cs.CV · 2026-04-10 · unverdicted · novelty 4.0

Most optical flow models do not generate flow fields matching human perception of the Rotating Snakes illusion, but a dual-channel recurrent model does during simulated saccades.

citing papers explorer

Showing 8 of 8 citing papers.

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

cs.CV · 2025-10-06 · unverdicted · none · ref 13 · internal anchor

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

cs.CL · 2026-03-12 · unverdicted · none · ref 22 · internal anchor

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

cs.CV · 2025-12-29 · unverdicted · none · ref 51 · internal anchor

Prediction horizon shapes representations in predictive learning

cs.LG · 2025-11-12 · unverdicted · none · ref 4 · internal anchor

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · none · ref 163 · internal anchor

Order Matters: Shuffling Sequence Generation for Video Prediction

cs.CV · 2019-07-20 · unverdicted · none · ref 24 · internal anchor

Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers

eess.IV · 2024-10-08 · unverdicted · none · ref 30 · internal anchor

Do vision models perceive illusory motion in static images like humans?

cs.CV · 2026-04-10 · unverdicted · none · ref 26
