Why Latent Actions Fail, and How to Prevent It

· 2026 · cs.CV · arXiv 2605.20223

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but also exogenous state such as background clutter. Because the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates the problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from the future observation; and (2) learning in a representation space that focuses on endogenous components is key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.
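The first insight in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's model: it assumes identity dynamics, a two-dimensional observation (one endogenous coordinate moved by a binary action, one exogenous coordinate moved by Gaussian noise), and a one-dimensional latent. It also relies on the standard fact that an optimal rank-1 linear reconstruction of centered data is its top principal component. The names `top_direction`, `action_recovery`, and `exo_scale` are hypothetical.

```python
import numpy as np


def top_direction(deltas):
    """Return the top principal direction of the centered frame differences."""
    centered = deltas - deltas.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def action_recovery(exo_scale, n=5000, seed=0):
    """Absolute correlation between a 1-D linear latent and the true action.

    Assumption (toy setup, not the paper's): o_{t+1} - o_t has an endogenous
    coordinate equal to the action (+1 or -1) and an exogenous coordinate
    equal to exo_scale * Gaussian noise, independent of the action.
    """
    rng = np.random.default_rng(seed)
    actions = rng.choice([-1.0, 1.0], size=n)
    exo_change = exo_scale * rng.standard_normal(n)
    deltas = np.stack([actions, exo_change], axis=1)

    # 1-D latent that minimizes linear reconstruction error of the differences.
    direction = top_direction(deltas)
    latent = (deltas - deltas.mean(axis=0)) @ direction

    return abs(np.corrcoef(latent, actions)[0, 1])


if __name__ == "__main__":
    for scale in (0.1, 3.0):
        print(f"exo_scale={scale}: |corr(latent, action)| = "
              f"{action_recovery(scale):.3f}")
```

Under these assumptions, the correlation is close to 1 when the exogenous change is small relative to the action effect and close to 0 when it dominates, because the reconstruction objective spends the latent's capacity on whichever source of change has the most variance.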

representative citing papers

CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.

citing papers explorer

CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization

cs.RO · 2026-06-02 · unverdicted · none · ref 28 · internal anchor
