Learning Visual Predictive Models of Physics for Playing Billiards

Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, Jitendra Malik

classification 💻 cs.CV

keywords actions, agent, plan, visual, model, novel, ball, billiards

The ability to plan and execute goal-specific actions in varied, unexpected settings is a central requirement of intelligent agents. In this paper, we explore how an agent can be equipped with an internal model of the dynamics of the external world, and how it can use this model to plan novel actions by running multiple internal simulations ("visual imagination"). Our models directly process raw visual input, and use a novel object-centric prediction formulation based on visual glimpses centered on objects (fixations) to enforce translational invariance of the learned physical laws. The agent gathers training data through random interaction with a collection of different environments, and the resulting model can then be used to plan goal-directed actions in novel environments that the agent has not seen before. We demonstrate that our agent can accurately plan actions for playing a simulated billiards game, which requires pushing a ball into a target position or into collision with another ball.
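
The idea of planning by running multiple internal simulations can be illustrated with a minimal sketch. This is not the paper's method: the abstract describes a model learned from raw visual input, whereas the hypothetical `toy_dynamics` function below is a hand-written point-mass rollout with friction that merely stands in for such a model. The planner uses plain random shooting (sample candidate pushes, simulate each, keep the best), which is an assumption chosen for simplicity. All names and parameters (`toy_dynamics`, `plan_push`, `friction`, `n_candidates`) are illustrative.

```python
import math
import random


def toy_dynamics(pos, vel, steps=20, friction=0.9):
    """Hypothetical stand-in for a learned forward model.

    Rolls a point-mass ball forward for `steps` time steps,
    scaling the velocity by `friction` after each step.
    """
    x, y = pos
    vx, vy = vel
    for _ in range(steps):
        x, y = x + vx, y + vy
        vx, vy = vx * friction, vy * friction
    return (x, y)


def plan_push(pos, target, n_candidates=2000, max_speed=2.0, seed=0):
    """Random-shooting planner (an assumption, not the paper's procedure).

    Samples candidate initial velocities, simulates each one with the
    internal model, and returns the push whose predicted final position
    is closest to the target, together with that predicted distance.
    """
    rng = random.Random(seed)
    best_action, best_dist = None, math.inf
    for _ in range(n_candidates):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        speed = rng.uniform(0.0, max_speed)
        action = (speed * math.cos(angle), speed * math.sin(angle))
        predicted_end = toy_dynamics(pos, action)
        dist = math.dist(predicted_end, target)
        if dist < best_dist:
            best_action, best_dist = action, dist
    return best_action, best_dist


if __name__ == "__main__":
    action, dist = plan_push(pos=(0.0, 0.0), target=(5.0, 3.0))
    print("chosen push:", action)
    print("predicted distance to target:", dist)
```

In this sketch the quality of the plan depends entirely on how well the internal model predicts the outcome of each push, which is why the abstract emphasizes learning a model that generalizes to environments not seen during training.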

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...