Factored Latent Action World Models

Amy Zhang; Chang Shi; Jiaheng Hu; Kevin Rohling; Peter Stone; Roberto Martín-Martín; Zizhao Wang

arxiv: 2602.16229 · v2 · pith:5GLSSDV3new · submitted 2026-02-18 · 💻 cs.LG

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

classification 💻 cs.LG

keywords latent · action · models · dynamics · factored · learning · video · action-free


Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.
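The factored structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the scene has already been decomposed into K factor vectors (the abstract does not say how factors are extracted), and the names `infer_latent_actions`, `predict_next_factors`, `W_inv`, and `W_fwd`, the dimensions, and the untrained linear maps are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): K factors, D-dim factor state, A-dim latent action.
K, D, A = 3, 8, 2

# Hypothetical, untrained linear maps shared across factors.
W_inv = rng.normal(scale=0.1, size=(2 * D, A))  # inverse dynamics: (z_t, z_next) -> action
W_fwd = rng.normal(scale=0.1, size=(D + A, D))  # forward dynamics: (z_t, action) -> change in z


def infer_latent_actions(z_t, z_next):
    """Per-factor inverse dynamics: one latent action per factor, shape (K, A)."""
    pair = np.concatenate([z_t, z_next], axis=-1)  # (K, 2D)
    return np.tanh(pair @ W_inv)


def predict_next_factors(z_t, actions):
    """Per-factor forward dynamics: each factor predicts its own next value, shape (K, D)."""
    inp = np.concatenate([z_t, actions], axis=-1)  # (K, D + A)
    return z_t + inp @ W_fwd


# Random stand-ins for factor states at consecutive video frames.
z_t = rng.normal(size=(K, D))
z_next = rng.normal(size=(K, D))

actions = infer_latent_actions(z_t, z_next)
z_pred = predict_next_factors(z_t, actions)

# Reconstruction error that a training loop would minimize (no training is done here).
loss = float(np.mean((z_pred - z_next) ** 2))
print(actions.shape, z_pred.shape)
```

Because every factor is processed row by row, changing one entity's next state changes only that entity's latent action. A monolithic model, by contrast, would flatten all K factors into a single vector and infer one action for the whole scene.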



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent State Design for World Models under Sufficiency Constraints
cs.AI · 2026-05 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.