pith. sign in

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it
abstract

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence to learning latent action on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.

citation-role summary

background 1

citation-polarity summary

fields

cs.CV 2 cs.RO 1

years

2026 3

verdicts

UNVERDICTED 3

roles

background 1

polarities

unclear 1

representative citing papers

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

DiLA: Disentangled Latent Action World Models

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.

Why Latent Actions Fail, and How to Prevent It

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce consistency across noise.

citing papers explorer

Showing 3 of 3 citing papers.

  • DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 64 · internal anchor

    DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

  • DiLA: Disentangled Latent Action World Models cs.CV · 2026-05-15 · unverdicted · none · ref 19 · internal anchor

    DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.

  • Why Latent Actions Fail, and How to Prevent It cs.CV · 2026-05-13 · unverdicted · none · ref 25 · internal anchor

    Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce consistency across noise.