Unsupervised Learning for Physical Interaction through Video Prediction

Chelsea Finn; Ian Goodfellow; Sergey Levine

arXiv 1605.07157 v4 · pith:WGXXCQVE · submitted 2016-05-23 · cs.LG cs.AI cs.CV cs.RO

keywords learning · motion · objects · prediction · video · object · physical · accurate

Abstract

A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action. Our experiments show that our proposed method produces more accurate video predictions both quantitatively and qualitatively, when compared to prior methods.
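The phrase "predicting a distribution over pixel motion from previous frames" can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' architecture: it assumes each output pixel has a softmax distribution over a small K x K window of displacements and forms the next frame as the expected value of the previous frame under that distribution. The function name `advect_frame` and the `motion_logits` input are hypothetical; in an action-conditioned model these scores would presumably come from a network fed with previous frames and the robot's action, whereas here they are supplied by hand.

```python
import numpy as np


def advect_frame(prev_frame, motion_logits):
    """Form a next frame as an expectation over per-pixel displacements.

    prev_frame:    (H, W) grayscale image.
    motion_logits: (H, W, K, K) unnormalized scores, K odd. Entry
                   [y, x, dy, dx] scores the hypothesis that output pixel
                   (y, x) takes its value from the previous-frame pixel at
                   offset (dy - K//2, dx - K//2).
    Both inputs are hypothetical stand-ins for a learned model's outputs.
    """
    H, W = prev_frame.shape
    K = motion_logits.shape[2]
    r = K // 2

    # Softmax over the K*K candidate displacements of each pixel.
    flat = motion_logits.reshape(H, W, K * K)
    flat = flat - flat.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=-1, keepdims=True)
    probs = probs.reshape(H, W, K, K)

    # Pad so every displacement stays in bounds (edge values are repeated).
    padded = np.pad(prev_frame, r, mode="edge")

    # Expected pixel value under the displacement distribution.
    next_frame = np.zeros((H, W), dtype=float)
    for dy in range(K):
        for dx in range(K):
            next_frame += probs[:, :, dy, dx] * padded[dy:dy + H, dx:dx + W]
    return next_frame


if __name__ == "__main__":
    frame = np.zeros((5, 5))
    frame[2, 2] = 1.0                 # a single bright "object" pixel
    logits = np.zeros((5, 5, 3, 3))
    logits[:, :, 1, 0] = 50.0         # nearly all mass on "copy from the left neighbor"
    moved = advect_frame(frame, logits)
    print(np.round(moved, 3))         # the bright pixel shifts one column to the right
```

Because the output is built by moving existing pixel values rather than generating appearance from scratch, a sketch like this shows why a motion-based predictor can be partially invariant to what the objects look like.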

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robot Critics that Sweat the Small Stuff
cs.RO 2026-06 unverdicted novelty 6.0
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.

Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers
cs.RO 2026-05 unverdicted novelty 6.0
A JAX-based differentiable reachability primitive for continuous- and discrete-time NN dynamics and controllers that supports certified training and sampling-based MPC with gradient refinement.

Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
cs.RO 2023-04 unverdicted novelty 5.0
Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Frame forecasting in cine MRI using the PCA respiratory motion model: comparing recurrent neural networks trained online and transformers
eess.IV 2024-10 unverdicted novelty 4.0
Online RNNs (RTRL, SnAp-1) beat linear filters and transformers at medium-to-long horizon forecasting of PCA respiratory motion weights in two cine-MRI datasets, yielding sub-1.4 mm and sub-2.8 mm geometric errors.

Neural Embedding for Physical Manipulations
cs.LG 2019-07 unverdicted novelty 4.0
Generative model with normalized pairwise distance constraint discovers output space topologies from sparse data and outperforms GANs and VAEs by avoiding mode collapse.