What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Andrea Vedaldi; Christian Rupprecht; Gabrijel Boduljak; Iro Laina; Laurynas Karazija

arxiv: 2509.21592 · v2 · pith:4E5GCDSOnew · submitted 2025-09-25 · 💻 cs.CV · cs.AI · cs.LG

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

classification 💻 cs.CV cs.AI cs.LG

keywords motion, generators, data, forecasting, generating, image, object, pixels

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The TIME Machine: On The Power of Motion for Efficient Perception
cs.CV 2026-05 unverdicted novelty 6.0

TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.