SCAR: Self-Supervised Continuous Action Representation Learning

Biwei Huang; Fan Feng; Haofei Lu; Hongjia Liu; Minghao Fu; Xinyue Wang

arxiv: 2605.16412 · v1 · pith:JJJYBWFQnew · submitted 2026-05-13 · 💻 cs.RO · cs.CV

Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang

Pith reviewed 2026-05-20 20:32 UTC · model grok-4.3


keywords: action representation learning · self-supervised learning · world models · cross-embodiment transfer · inverse dynamics · forward dynamics · robot learning


The pith

A joint inverse-forward model learns latent actions from images that transfer better across robot bodies than raw motor commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that action representations can be extracted as a distinct factor capturing only controllable change, separate from how any particular body or environment produces that change. SCAR does this by running an inverse dynamics model to pull latent actions out of observation pairs and a forward dynamics model to predict the next state from those latents, all on top of a pretrained visual backbone. A Gaussian prior keeps the latents from becoming arbitrary visual codes, while adversarial training pushes embodiment and environment details out of the representation. If the claim holds, world models could be conditioned on a shared action language that works across different robots and tasks even when training data is scarce.

Core claim

SCAR is a joint inverse-forward dynamics framework built on a pretrained generative backbone. An inverse dynamics model infers latent actions from pairs of latent observations, while a forward dynamics model predicts future dynamics conditioned on these latents. The latent action posterior is regularized toward a standard Gaussian prior to limit arbitrary visual encoding, and adversarial invariance suppresses embodiment- and environment-specific nuisance factors. This produces a unified latent action representation that serves as a stronger conditioning interface for world modeling than raw actions, leading to improved cross-embodiment low-data adaptation and cross-task transfer on the Procg

What carries the argument

The joint inverse-forward dynamics model that infers latent actions from observation pairs and conditions future predictions on them, regularized by a Gaussian prior on the action posterior plus adversarial invariance training.
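The Gaussian regularizer mentioned above has a well-known closed form when the posterior is a diagonal Gaussian. The following is a minimal sketch under that assumption; the function names `kl_to_standard_normal` and `sample_latent_action` are hypothetical and are not taken from the paper, which may parameterize its posterior differently.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sample_latent_action(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical posterior parameters for a 2-dimensional latent action.
rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = sample_latent_action(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print("sampled latent action:", z)
print("KL penalty:", kl)
```

The penalty is zero only when the posterior equals the prior, so any information the inverse dynamics model writes into the latent action has to pay for itself through better forward prediction.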

If this is right

The latent actions provide a stronger conditioning signal for predicting future observations in world models than raw embodiment-specific actions.
World models conditioned on these latents adapt to new robot bodies using less data than models that use raw actions.
Cross-task transfer improves because the representations focus on controllable change rather than task- or body-specific details.
Action can be treated as a shared representational factor that decouples control from actuation across embodiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of controllable change from embodiment might be applied to non-visual inputs such as proprioception or language instructions.
If the latent actions truly isolate control, they could support direct transfer of policies learned in one embodiment to another without additional fine-tuning.
This framing suggests that future world-model work should treat action inference as an explicit disentanglement step rather than an implicit byproduct of next-frame prediction.

Load-bearing premise

Regularizing the latent action posterior to a standard Gaussian and applying adversarial invariance will reliably remove embodiment-specific and environment-specific factors while keeping the information needed to control changes.
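The figure captions on this page name a gradient reversal layer (GRL) as the adversarial mechanism. As a rough illustration of that idea only, the sketch below writes the forward and backward passes by hand: the layer is the identity on the way forward and flips the sign of the gradient on the way back, so a classifier trained to detect the embodiment pushes the encoder to hide it. The class name and the `strength` parameter are assumptions for illustration, not the paper's implementation.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -strength in the backward pass."""

    def __init__(self, strength):
        # Assumed hyperparameter: how strongly the reversed gradient is weighted.
        self.strength = strength

    def forward(self, latent):
        # The embodiment classifier sees the latent action unchanged.
        return list(latent)

    def backward(self, grad_from_classifier):
        # The encoder receives the negated classifier gradient, so a step that
        # helps the classifier becomes a step that hurts it.
        return [-self.strength * g for g in grad_from_classifier]


grl = GradientReversal(strength=0.5)
print(grl.forward([0.3, -1.2]))
print(grl.backward([1.0, -2.0]))
```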

What would settle it

If a linear classifier trained on the learned latent actions can still predict which embodiment or environment produced them at above-chance accuracy, that would show the nuisance factors were not suppressed.
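The probing test described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming two embodiments and a binary logistic-regression probe; `make_latents`, the `leak` parameter, and all numbers are made up for the example and do not come from the paper.

```python
import math
import random


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def make_latents(n, leak, rng):
    """Synthetic 2-D latents; `leak` shifts dimension 0 by embodiment label."""
    latents, labels = [], []
    for i in range(n):
        y = i % 2  # two balanced embodiments
        latents.append([rng.gauss(0.0, 1.0) + leak * (2 * y - 1), rng.gauss(0.0, 1.0)])
        labels.append(y)
    return latents, labels


def train_linear_probe(latents, labels, epochs=200, lr=0.1):
    """Binary logistic regression fitted by full-batch gradient descent."""
    dim, n = len(latents[0]), len(latents)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(latents, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, latents, labels):
    hits = sum(
        (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
        for x, y in zip(latents, labels)
    )
    return hits / len(labels)


rng = random.Random(0)
results = {}
for leak in (0.0, 3.0):
    train_x, train_y = make_latents(200, leak, rng)
    test_x, test_y = make_latents(200, leak, rng)
    w, b = train_linear_probe(train_x, train_y)
    results[leak] = probe_accuracy(w, b, test_x, test_y)
    print(f"leak={leak}: held-out probe accuracy = {results[leak]:.2f} (chance = 0.50)")
```

Held-out accuracy near chance is consistent with suppressed embodiment information, although a linear probe cannot rule out information that is only nonlinearly decodable.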

Figures

Figures reproduced from arXiv: 2605.16412 by Biwei Huang, Fan Feng, Haofei Lu, Hongjia Liu, Minghao Fu, Xinyue Wang.

**Figure 1.** Cross-embodiment action transfer. Latent actions from a source ALOHA trajectory are applied to a target Franka context. KL limits visual shortcuts, and GRL reduces embodiment leakage; KL+GRL best preserves the target embodiment while transferring the source motion structure. Recent latent-action models [13–17] learn action-like representations from visual transitions following an inverse-forward dynamics f…

**Figure 2.** Overview of SCAR. During training, a frozen Wan encoder maps video frames into latent space; the IDM infers stochastic latent actions, and the FDM predicts future latent dynamics conditioned on them. KL constrains latent capacity and GRL suppresses embodiment-specific information. At inference time, an action-to-latent controller maps raw commands and visual context to latent actions for controllable predi…

**Figure 3.** Qualitative zero-shot cross-task generation on the target embodiment. Under task shift, the …

**Figure 5.** Performance trends across target-embodiment data sizes (m = 10, 50, 100) on Robotwin target and transfer tasks. SCAR-kl-grl consistently outperforms raw-action baselines across data regimes. … provide a more effective world-model interface than embodiment-specific commands. Second, KL and GRL regularization further improve the shared latent baseline, with SCAR-kl-grl performing best across all Procgen and Ro…

Original abstract

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin dataset show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SCAR pairs an inverse and forward dynamics model with Gaussian regularization and adversarial invariance to learn action latents that transfer better across embodiments than raw actions.

The main thing here is that SCAR learns unified action representations from visual transitions by running an inverse dynamics model to pull latents out of observation pairs and a forward dynamics model to predict the next state from those latents. Gaussian regularization keeps the latents from turning into generic visual encodings, and adversarial training tries to strip out embodiment and environment specifics so the space captures controllable change instead. Experiments on Procgen and Robotwin are said to show that conditioning world models on these latents beats raw actions for low-data cross-embodiment adaptation and cross-task transfer.

Referee Report

2 major / 3 minor

Summary. The paper proposes SCAR, a joint inverse-forward dynamics framework for learning unified continuous action representations from visual transitions across embodiments. Built on a pretrained generative backbone, an IDM infers latent actions from observation pairs while an FDM predicts future states conditioned on them; regularization to a standard Gaussian prior and adversarial invariance are used to suppress embodiment- and environment-specific factors while preserving controllable dynamics. Experiments on Procgen and Robotwin datasets indicate that the resulting latent action representations improve world-model conditioning, yielding better cross-embodiment low-data adaptation and cross-task transfer than raw embodiment-specific actions.

Significance. If the empirical claims hold, the work offers a concrete mechanism for learning transferable action spaces that decouple controllable change from actuator specifics, which would be a useful interface for scalable world models in robotics. The self-supervised joint IDM-FDM construction and explicit regularization strategy are technically coherent and address a recognized bottleneck in cross-embodiment generalization.

major comments (2)

[§4.3, Table 2] The cross-embodiment low-data adaptation results report mean success rates but omit per-seed standard deviations and statistical significance tests; without these, it is difficult to judge whether the reported gains over raw-action baselines are robust or could be explained by training variance.
[§3.2, Eq. (7)] The combined objective weights the adversarial invariance term against the Gaussian KL term, yet no sensitivity analysis or ablation on the relative weighting is provided; because the central claim that embodiment-specific factors are suppressed while controllable information is retained depends on this balance, the lack of such analysis weakens the transferability argument.
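The statistics requested in the first major comment can be computed along these lines. This is a minimal sketch, assuming both methods are evaluated on the same seeds so that a paired test applies; the success rates below are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for a paired t-test on per-seed scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed success rates (illustrative only).
latent_action = [0.62, 0.58, 0.65, 0.60, 0.61]
raw_action = [0.55, 0.54, 0.57, 0.56, 0.53]

for name, scores in (("latent", latent_action), ("raw", raw_action)):
    print(f"{name}: mean={statistics.mean(scores):.3f} std={statistics.stdev(scores):.3f}")

t_stat, dof = paired_t_statistic(latent_action, raw_action)
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Converting the statistic to a p-value needs the Student t distribution, which the Python standard library does not provide; a statistics package or a lookup table would be required for that step.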

minor comments (3)

[Figure 3] Figure 3 caption does not specify the exact number of training episodes used in the low-data regime, making it hard to reproduce the adaptation curves.
[§3.1] The description of the pretrained generative backbone (VAE or diffusion model) is referenced only by citation; a brief architectural summary in §3.1 would improve self-contained readability.
[Appendix A] Hyperparameter values for the adversarial discriminator learning rate and the Gaussian prior variance are listed in the appendix but not cross-referenced in the main text, which could be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive recommendation for minor revision. We have carefully considered the comments and revised the manuscript accordingly to improve the robustness and clarity of our experimental results and analysis.

Referee: [§4.3, Table 2] The cross-embodiment low-data adaptation results report mean success rates but omit per-seed standard deviations and statistical significance tests; without these, it is difficult to judge whether the reported gains over raw-action baselines are robust or could be explained by training variance.

Authors: We agree with this observation. In the revised version of the manuscript, we have updated Table 2 to report both the mean success rates and the corresponding per-seed standard deviations. Furthermore, we have conducted statistical significance tests (paired t-tests) between our method and the raw-action baselines, and included the p-values in the table to demonstrate that the observed improvements are statistically significant.

Revision: yes

Referee: [§3.2, Eq. (7)] The combined objective weights the adversarial invariance term against the Gaussian KL term, yet no sensitivity analysis or ablation on the relative weighting is provided; because the central claim that embodiment-specific factors are suppressed while controllable information is retained depends on this balance, the lack of such analysis weakens the transferability argument.

Authors: We appreciate this point, as the balance between these terms is indeed crucial for the desired properties of the latent action space. To address this, we have added a sensitivity analysis in the supplementary material, where we vary the weighting coefficients for the adversarial invariance loss and the KL divergence term over a range of values and report the resulting performance on cross-embodiment transfer tasks. The results indicate that our chosen weights yield near-optimal performance, and moderate variations do not significantly degrade the transferability. We have also included a short discussion in Section 3.2 explaining the rationale behind the selected weights.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes SCAR as a joint IDM-FDM architecture on a pretrained backbone, with latent actions regularized to a standard Gaussian prior and trained with adversarial invariance. These are standard techniques whose effectiveness is assessed via downstream experiments on Procgen and Robotwin showing improved cross-embodiment transfer. No derivation step reduces by construction to its inputs, no self-citation is load-bearing for a uniqueness claim, and no fitted parameter is relabeled as a prediction. The central claim rests on empirical comparison of conditioning interfaces rather than tautological re-expression of the training objective.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that a pretrained generative backbone yields usable latent observations and that the chosen regularizations achieve decoupling without introducing new fitted parameters beyond standard training.

free parameters (1)

Gaussian prior variance for latent actions
Regularization of latent action posterior toward standard Gaussian prior to limit arbitrary visual encoding.

axioms (1)

domain assumption: Pretrained generative backbone produces latent observations suitable for dynamics modeling.
Method is built on a pretrained generative backbone.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "regularize the latent action posterior toward a standard Gaussian prior ... and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "KL regularization to limit the information capacity of the latent action posterior"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 10 internal anchors

[1] Shaun Gallagher. How the body shapes the mind. Clarendon Press, 2006.
[2] Yochai Ataria, Shogo Tanaka, and Shaun Gallagher. Body schema and body image: New directions. Oxford University Press, 2021.
[3] Davide Sattin, Chiara Parma, Christian Lunetta, Aida Zulueta, Jacopo Lanzone, Luca Giani, Marta Vassallo, Mario Picozzi, and Eugenio Agostino Parati. An overview of the body schema and body image: theoretical models, methodological settings and pitfalls for rehabilitation of persons with neurological disorders. Brain Sciences, 13(10):1410, 2023.
[4] Matej Hoffmann, Hugo Marques, Alejandro Arieta, Hidenobu Sumioka, Max Lungarella, and Rolf Pfeifer. Body schema in robotics: a review. IEEE Transactions on Autonomous Mental Development, 2(4):304–324, 2010.
[5] Angelo Maravita and Atsushi Iriki. Tools for the body (schema). Trends in Cognitive Sciences, 8(2):79–86, 2004.
[6] Lucilla Cardinali, Francesca Frassinetti, Claudio Brozzoli, Christian Urquizar, Alice C Roy, and Alessandro Farnè. Tool-use induces morphological updating of the body schema. Current Biology, 19(12):R478–R479, 2009.
[7] Thierry Chaminade, Andrew N Meltzoff, and Jean Decety. An fMRI study of imitation: action representation and body schema. Neuropsychologia, 43(1):115–127, 2005.
[8] Svenja Caspers, Karl Zilles, Angela R Laird, and Simon B Eickhoff. ALE meta-analysis of action observation and imitation in the human brain. Neuroimage, 50(3):1148–1167, 2010.
[9] M Van Elk, HT Van Schie, and H Bekkering. Imitation of hand and tool actions is effector-independent. Experimental Brain Research, 214(4):539–547, 2011.
[10] Markku Suomalainen, Yiannis Karayiannidis, and Ville Kyrki. A survey of robot manipulation in contact. Robotics and Autonomous Systems, 156:104224, 2022.
[11] Kevin M Lynch and Frank C Park. Modern robotics. Cambridge University Press, 2017.
[12] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
[13] Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations.
[14] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations.
[15] Alexander Nikulin, Ilya Zisman, Denis Tarasov, Lyubaykin Nikita, Andrei Polubarov, Igor Kiselev, and Vladislav Kurenkov. Latent action learning requires supervision in the presence of distractors. In Forty-second International Conference on Machine Learning.
[16] Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. Clam: Continuous latent action models for robot learning from unlabeled demonstrations. arXiv preprint arXiv:2505.04999, 2025.
[17] Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230, 2026.
[18] Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn? In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[19] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[20] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
[21] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, pages 2048–2056, 2020.
[22] Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.
[23] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.
[24] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. In Forty-second International Conference on Machine Learning.
[25] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026.
[26] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.
[27] Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation. arXiv preprint arXiv:2512.13644, 2025.
[28] Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada. Diwa: Diffusion policy adaptation with world models. In 9th Annual Conference on Robot Learning.
[29] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025.
[30] Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, et al. Rise: Self-improving robot policy with compositional world model. arXiv preprint arXiv:2602.11075, 2026.
[31] Youngjoon Jeong, Junha Chun, and Taesup Kim. Learning to act robustly with view-invariant latent actions. arXiv preprint arXiv:2601.02994, 2026.
[32] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
[33] Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild. arXiv preprint arXiv:2602.21736, 2026.
[34] Sébastien Lachapelle. On the identifiability of latent action policies. arXiv preprint arXiv:2510.01337, 2025.
[35] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[36] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[37] Erik Bauer, Elvis Nava, and Robert K Katzschmann. Latent action diffusion for cross-embodiment manipulation. arXiv preprint arXiv:2506.14608, 2025.
[38] Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.
[39] Zan Li, Jiahui Chen, Yuan Chai, Xiaoze Jiang, Xiaohua Qi, Zhiheng Qin, Runbin Zhou, Shun Zuo, Guangchao Hao, Kefeng Wang, et al. Unidex: Rethinking search inverted indexing with unified semantic modeling. arXiv preprint arXiv:2509.24632, 2025.
[40] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE.
[41] Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704, 2025.
[42] Tianyu Wang, Dwait Bhatt, Xiaolong Wang, and Nikolay Atanasov. Cross-embodiment robot manipulation skill transfer using latent space alignment. arXiv preprint arXiv:2406.01968.
[43] Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E Vogt, Randall Balestriero, Wieland Brendel, and David Klindt. Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869, 2024.
[44] Ronald Aylmer Fisher. Dispersion on a sphere. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 217(1130):295–305, 1953.
[45] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.

A Proofs. Assumption 1 (Data-generating process). (i) The latent action space Z is a dz-dimensional topological m...

[1] [1]

Clarendon press, 2006

Shaun Gallagher.How the body shapes the mind. Clarendon press, 2006. 1

work page 2006

[2] [2]

Oxford University Press, 2021

Yochai Ataria, Shogo Tanaka, and Shaun Gallagher.Body schema and body image: New directions. Oxford University Press, 2021

work page 2021

[3] [3]

Davide Sattin, Chiara Parma, Christian Lunetta, Aida Zulueta, Jacopo Lanzone, Luca Giani, Marta Vassallo, Mario Picozzi, and Eugenio Agostino Parati. An overview of the body schema and body image: theoretical models, methodological settings and pitfalls for rehabilitation of persons with neurological disorders.Brain Sciences, 13(10):1410, 2023

work page 2023

[4] [4]

Body schema in robotics: a review.IEEE Transactions on Autonomous Mental Development, 2(4):304–324, 2010

Matej Hoffmann, Hugo Marques, Alejandro Arieta, Hidenobu Sumioka, Max Lungarella, and Rolf Pfeifer. Body schema in robotics: a review.IEEE Transactions on Autonomous Mental Development, 2(4):304–324, 2010. 1

work page 2010

[5] Angelo Maravita and Atsushi Iriki. Tools for the body (schema). Trends in cognitive sciences, 8(2):79–86, 2004.

[6] Lucilla Cardinali, Francesca Frassinetti, Claudio Brozzoli, Christian Urquizar, Alice C Roy, and Alessandro Farnè. Tool-use induces morphological updating of the body schema. Current biology, 19(12):R478–R479, 2009.

[7] Thierry Chaminade, Andrew N Meltzoff, and Jean Decety. An fMRI study of imitation: action representation and body schema. Neuropsychologia, 43(1):115–127, 2005.

[8] Svenja Caspers, Karl Zilles, Angela R Laird, and Simon B Eickhoff. ALE meta-analysis of action observation and imitation in the human brain. Neuroimage, 50(3):1148–1167, 2010.

[9] M Van Elk, HT Van Schie, and H Bekkering. Imitation of hand and tool actions is effector-independent. Experimental brain research, 214(4):539–547, 2011.

[10] Markku Suomalainen, Yiannis Karayiannidis, and Ville Kyrki. A survey of robot manipulation in contact. Robotics and Autonomous Systems, 156:104224, 2022.

[11] Kevin M Lynch and Frank C Park. Modern robotics. Cambridge University Press, 2017.

[12] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[13] Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations.

[14] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations.

[15] Alexander Nikulin, Ilya Zisman, Denis Tarasov, Lyubaykin Nikita, Andrei Polubarov, Igor Kiselev, and Vladislav Kurenkov. Latent action learning requires supervision in the presence of distractors. In Forty-second International Conference on Machine Learning.

[16] Anthony Liang, Pavel Czempin, Matthew Hong, Yutai Zhou, Erdem Biyik, and Stephen Tu. Clam: Continuous latent action models for robot learning from unlabeled demonstrations. arXiv preprint arXiv:2505.04999, 2025.

[17] Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230, 2026.

[18] Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn? In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[19] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[20] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.

[21] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, pages 2048–2056, 2020.

[22] Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.

[23] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.

[24] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. In Forty-second International Conference on Machine Learning.

[25] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026.

[26] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.

[27] Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexterous manipulation. arXiv preprint arXiv:2512.13644, 2025.

[28] Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada. Diwa: Diffusion policy adaptation with world models. In 9th Annual Conference on Robot Learning.

[29] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025.

[30] Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, et al. Rise: Self-improving robot policy with compositional world model. arXiv preprint arXiv:2602.11075, 2026.

[31] Youngjoon Jeong, Junha Chun, and Taesup Kim. Learning to act robustly with view-invariant latent actions. arXiv preprint arXiv:2601.02994, 2026.

[32] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.

[33] Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild. arXiv preprint arXiv:2602.21736, 2026.

[34] Sébastien Lachapelle. On the identifiability of latent action policies. arXiv preprint arXiv:2510.01337, 2025.

[35] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[36] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.

[37] Erik Bauer, Elvis Nava, and Robert K Katzschmann. Latent action diffusion for cross-embodiment manipulation. arXiv preprint arXiv:2506.14608, 2025.

[38] Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.

[39] Zan Li, Jiahui Chen, Yuan Chai, Xiaoze Jiang, Xiaohua Qi, Zhiheng Qin, Runbin Zhou, Shun Zuo, Guangchao Hao, Kefeng Wang, et al. Unidex: Rethinking search inverted indexing with unified semantic modeling. arXiv preprint arXiv:2509.24632, 2025.

[40] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE.

[41] Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704, 2025.

[42] Tianyu Wang, Dwait Bhatt, Xiaolong Wang, and Nikolay Atanasov. Cross-embodiment robot manipulation skill transfer using latent space alignment. arXiv preprint arXiv:2406.01968.

[43] Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E Vogt, Randall Balestriero, Wieland Brendel, and David Klindt. Cross-entropy is all you need to invert the data generating process. arXiv preprint arXiv:2410.21869, 2024.

[44] Ronald Aylmer Fisher. Dispersion on a sphere. Proceedings of the royal society of London. Series A. Mathematical and physical sciences, 217(1130):295–305, 1953.

[45] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.

A Proofs

Assumption 1 (Data-generating process). (i) The latent action space $Z$ is a $d_z$-dimensional topological m...