Motus: A Unified Latent Action World Model
Pith reviewed 2026-05-12 18:39 UTC · model grok-4.3
The pith
A unified latent action world model combines understanding, generation, and control to enhance robotic task performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose Motus, a unified latent action world model that integrates understanding, world modeling, and control. It uses a Mixture-of-Transformer architecture with three experts and a flexible scheduler to switch between modeling modes, extracts latent actions from optical flow, and is trained with a three-phase pipeline on a six-layer data pyramid. A single model can therefore act as a world model, a vision-language-action model, an inverse dynamics model, a video generation model, or a video-action joint prediction model, and the authors report better performance on robotic tasks than fragmented approaches.
What carries the argument
Mixture-of-Transformer experts for understanding, video generation, and action, paired with optical flow-derived latent actions and a three-phase training pipeline with data pyramid.
Load-bearing premise
The gains observed are due to the unified architecture and training rather than differences in model size, data quantity, or implementation details.
What would settle it
Running the same benchmarks with a version that uses separate models for each expert or mode but matches the total compute and data used would show whether unification is necessary for the reported benefits.
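One way to read the outcome of such a matched comparison is a two-proportion z-test on task success counts. The sketch below is illustrative only: the function name and the counts are hypothetical and are not taken from the paper, and the test assumes independent episodes and enough trials for the normal approximation to hold.

```python
import math

def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Return (rate difference, z statistic, two-sided p-value) for A versus B."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis that both systems are equal.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        raise ValueError("standard error is zero; both rates are degenerate")
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Hypothetical counts: unified model (A) versus compute-matched separate models (B).
diff, z, p = two_proportion_z(successes_a=80, trials_a=100,
                              successes_b=65, trials_b=100)
print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```

A small p-value under matched compute and data would support the unification claim; a large one would suggest the reported gains could come from scale instead.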
Original abstract
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Motus, a unified latent action world model for embodied agents. It introduces a Mixture-of-Transformer (MoT) architecture integrating three experts (understanding, video generation, action), a UniDiffuser-style scheduler for switching between modes (world models, VLA, inverse dynamics, video generation, joint prediction), optical-flow-based latent actions, and a three-phase training pipeline with a six-layer data pyramid for large-scale action pretraining. The central empirical claim is that this unified approach yields superior performance over SOTA baselines: +15% over X-VLA and +45% over Pi0.5 in simulation, and +11–48% in real-world scenarios.
Significance. If the reported gains are shown to stem from the unified MoT + latent-action + multi-phase design rather than unmatched data scale or pretraining volume, the work would demonstrate a practical path toward consolidating fragmented embodied capabilities into a single model that can leverage heterogeneous motion data, with potential downstream benefits for robotic task learning.
major comments (3)
- [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.
- [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested.
- [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.
minor comments (2)
- [Abstract] Abstract: missing space before parenthesis in 'real-world scenarios(improved by +11~48%)'.
- [§3.2] Notation: 'UniDiffuser-style scheduler' is referenced repeatedly but never given an explicit equation or pseudocode; a short formal definition would improve reproducibility.
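As a rough illustration of what such a formal definition might cover, the sketch below follows the general UniDiffuser idea of giving each modality its own diffusion timestep: a conditioning modality is held at timestep 0 (clean), a marginalized modality is held at the maximum timestep (pure noise), and the modality being generated follows the denoising step. The mode table, the value of `T_MAX`, and the function names are assumptions made for illustration; the manuscript does not specify them, so this is not the authors' implementation.

```python
import random

T_MAX = 1000  # hypothetical number of diffusion steps

# How each mode treats the video and action streams (assumed mapping):
# "clean"  = conditioning input, timestep 0
# "noise"  = marginalized out, timestep T_MAX
# "sample" = modality being generated, follows the current denoising step
MODES = {
    "world_model":      {"video": "sample", "action": "clean"},
    "inverse_dynamics": {"video": "clean",  "action": "sample"},
    "vla":              {"video": "noise",  "action": "sample"},
    "video_generation": {"video": "sample", "action": "noise"},
    "joint_prediction": {"video": "sample", "action": "sample"},
}

def inference_timesteps(mode, t):
    """Return (t_video, t_action) for the given mode at denoising step t."""
    def resolve(kind):
        if kind == "clean":
            return 0
        if kind == "noise":
            return T_MAX
        return t
    rule = MODES[mode]
    return resolve(rule["video"]), resolve(rule["action"])

def training_timesteps(rng):
    """Sample independent timesteps per modality, as in UniDiffuser-style training."""
    return rng.randint(0, T_MAX), rng.randint(0, T_MAX)

for mode in MODES:
    print(mode, inference_timesteps(mode, t=400))
print("training pair:", training_timesteps(random.Random(0)))
```

Under this reading, one network trained on independently sampled timestep pairs covers all five modes, and switching modes at inference only changes which timesteps are pinned.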
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that stronger empirical validation is needed to support the claims of unification benefits. We will make revisions to address the concerns about experimental rigor, including adding error bars, ablations, and analysis of the routing mechanism. Our point-by-point responses follow.
-
Referee: [§4 Experiments] §4 Experiments (and abstract): the headline performance claims (+15% over X-VLA, +45% over Pi0.5 in simulation; +11–48% real-world) are presented without matched-scale or matched-data controls against the cited baselines, without error bars, and without ablations that isolate the MoT experts, UniDiffuser scheduler, or optical-flow latent actions from capacity or data-volume effects; this directly undermines attribution of gains to the unification.
Authors: We acknowledge the validity of this concern. In the revised manuscript, we will report error bars based on at least three independent runs for all key metrics. We will also include ablation experiments that isolate the contributions of the MoT architecture, the UniDiffuser-style scheduler, and the optical-flow-based latent actions by comparing against variants without these components. For matched data controls, we will add a detailed comparison of the training datasets and scales used in our work versus the baselines, noting that our six-layer data pyramid enables leveraging a broader set of motion data. However, fully retraining the baselines on our exact data distribution is beyond our current computational resources, so we will explicitly discuss this as a limitation while providing the available controls. revision: partial
-
Referee: [§3.2 MoT Architecture] §3.2 MoT Architecture: the Mixture-of-Transformer expert routing is described as integrating the three modalities, yet the manuscript supplies no analysis of how routing weights are optimized or whether they introduce task-specific free parameters that could trade off performance across modes (understanding vs. generation vs. action), leaving the 'unified without trade-offs' claim untested.
Authors: We agree that empirical analysis of the routing is essential. We will augment Section 3.2 with details on the routing optimization process, including the loss terms that encourage balanced expert utilization. Additionally, we will provide new experiments showing the distribution of routing weights for different tasks and modes, as well as performance comparisons when using learned routing versus fixed or uniform routing. These results will demonstrate that the MoT does not incur significant trade-offs across understanding, generation, and action capabilities. revision: yes
-
Referee: [§3.3 Training Pipeline] §3.3 Training Pipeline and §3.4 Latent Actions: the three-phase recipe and optical-flow 'pixel-level delta action' extraction are central to the large-scale pretraining claim, but no ablation or sensitivity analysis is provided showing that removing the data pyramid or the optical-flow prior measurably harms downstream task performance; without these, the necessity of the full pipeline cannot be evaluated.
Authors: We recognize that ablations are required to validate the pipeline design. In the revised version, we will add sensitivity analyses and ablations in Section 4: specifically, results from training without the full data pyramid (using only subsets of the layers) and without the optical-flow prior (using raw action labels instead). These will be evaluated on the simulation and real-world benchmarks to quantify the performance degradation, thereby supporting the necessity of the proposed three-phase training and latent action extraction. revision: yes
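The error bars and ablation comparisons promised in these responses could be reported along the following lines. This is a minimal sketch: the variant names and success rates are hypothetical placeholders, not numbers from the paper, and three runs per variant matches the minimum the authors commit to.

```python
import statistics

def summarize(rates):
    """Return (mean, sample standard deviation, standard error) over independent runs."""
    mean = statistics.mean(rates)
    std = statistics.stdev(rates)   # sample standard deviation (n - 1 denominator)
    sem = std / len(rates) ** 0.5   # standard error of the mean
    return mean, std, sem

# Hypothetical success rates from three independent runs per variant.
runs = {
    "full model": [0.82, 0.79, 0.85],
    "no optical-flow prior": [0.70, 0.74, 0.69],
}

summary = {name: summarize(rates) for name, rates in runs.items()}
for name, (mean, std, sem) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f} (SEM {sem:.3f}, n={len(runs[name])})")
```

With only three runs the standard deviation is itself a noisy estimate, so reporting the individual run values alongside the summary would make the ablations easier to judge.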
Circularity Check
No circularity: empirical architecture validated by external benchmarks
The paper proposes Motus as an engineering combination of MoT experts, UniDiffuser scheduler, optical-flow latent actions, and a three-phase training pipeline with data pyramid. All load-bearing claims are performance numbers obtained from held-out simulation and real-robot evaluations against external baselines (X-VLA, Pi0.5). No equations, uniqueness theorems, or first-principles derivations appear; nothing reduces by construction to a fitted parameter or self-citation. Self-citations, if present, are not invoked to justify the central result. The reported gains may or may not be attributable to unification versus scale, but that is a question of experimental controls, not circularity in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- expert routing weights in MoT
- UniDiffuser-style scheduler parameters
axioms (2)
- domain assumption: Optical flow provides a sufficient pixel-level representation of latent actions for downstream control
- domain assumption: Pretrained general models can be integrated via MoT without losing their individual capabilities
invented entities (2)
-
Mixture-of-Transformer (MoT)
no independent evidence
-
latent action from optical flow
no independent evidence
Forward citations
Cited by 35 Pith papers
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models
Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
-
JailWAM: Jailbreaking World Action Models in Robot Control
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...
-
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understand- ing, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report.a...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.CoRR, abs/2409.16283, 2024. 1
work page internal anchor Pith review arXiv 2024
-
[5]
H-rdt: Human ma- nipulation enhanced bimanual robotic manipulation, 2025
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human ma- nipulation enhanced bimanual robotic manipulation, 2025. 1
work page 2025
-
[6]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foun- dation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Zero-shot robotic manipu- lation with pretrained image-editing diffusion models,
Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero- shot robotic manipulation with pretrained image-editing dif- fusion models.CoRR, abs/2310.10639, 2023. 1
-
[8]
In 9th Annual Conference on Robot Learning, 2025
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.\π0.5: a vision- language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025. 1, 3, 4, 6
work page 2025
-
[9]
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y . Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando...
-
[10]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipula- tion platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for en- hanced motion generation in video models.arXiv preprint arXiv:2502.02492, 2025. 3
-
[13]
Deep compres- sion autoencoder for efficient high-resolution diffusion mod- els
Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, and Song Han. Deep compres- sion autoencoder for efficient high-resolution diffusion mod- els. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 5
work page 2025
-
[14]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data gen- erator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 6, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos.arXiv preprint arXiv:2412.04445,
-
[16]
Action-free reasoning for policy generalization
Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. InICRA 2025 Workshop on Foundation Models and Neuro- Symbolic AI for Robotics, 2025. 3
work page 2025
-
[17]
Jeremy A Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, and Animesh Garg. Amplify: Ac- 9 tionless motion priors for robot learning from videos.arXiv preprint arXiv:2506.14198, 2025. 3
-
[18]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pre- training.arXiv preprint arXiv:2505.14683, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learn- ing universal policies via text-guided video generation.Ad- vances in neural information processing systems, 36:9156– 9172, 2023. 1, 3
work page 2023
-
[20]
Imitating latent policies from observation
Ashley Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Isbell. Imitating latent policies from observation. In International conference on machine learning, pages 1755–
-
[21]
Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025. 1, 3
-
[22]
Adaworld: Learning adaptable world models with latent actions
Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. InForty-second International Conference on Machine Learning, 2025. 3
work page 2025
-
[23]
beta-V AE: Learning basic visual con- cepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-V AE: Learning basic visual con- cepts with a constrained variational framework. InInterna- tional Conference on Learning Representations, 2017. 3
work page 2017
-
[24]
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapu- rapu, and Jian Zhang. Egodex: Learning dexterous manip- ulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025. 6, 5
work page internal anchor Pith review arXiv 2025
-
[25]
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.CoRR, abs/2412.14803, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning. 1
-
[27]
OpenVLA: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Ben- jamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learn...
work page 2024
-
[28]
Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.CoRR, abs/2503.00200, 2025. 1
work page internal anchor Pith review arXiv 2025
-
[29]
Dual diffusion for unified image generation and understanding
Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025. 3
work page 2025
-
[30]
Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models
Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. InICLR 2025 Workshop on World Models: Under- standing, Modelling and Scaling, 2025. 3
work page 2025
-
[31]
Rdt-1b: a diffusion foundation model for bimanual manipula- tion
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipula- tion. InThe Thirteenth International Conference on Learning Representations. 1, 4, 6, 5
-
[32]
Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiang- miao Pang. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025. 1, 3
-
[33]
Cesar, Xi- angyang Ji, and Xu-Cheng Yin
Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xi- angyang Ji, and Xu-Cheng Yin. Dpflow: Adaptive opti- cal flow estimation with a dual-pyramid framework. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 17810–17820. Computer Vision Foundation / IEEE, 2025. 5
work page 2025
-
[34]
Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, and Vladislav Kurenkov. Latent action learning requires supervision in the presence of distractors.arXiv preprint arXiv:2502.00379,
-
[35]
Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, and Junjun He. Unimedvl: Unifying medical multimodal...
work page 2025
-
[36]
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alexander Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew E. Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Ant...
-
[37]
Oleh Rybkin, Karl Pertsch, Andrew Jaegle, Konstantinos G. Derpanis, and Kostas Daniilidis. Learning what you can do before doing anything. In International Conference on Learning Representations, 2019. 3
-
[38]
Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations, 2024. 3
-
[39]
Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation, 2025. 1, 5
-
[40]
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 3
-
[41]
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. CoRR, abs/2412.15109, 2024. 1
-
[42]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...
-
[43]
Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 4
-
[44]
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,
-
[45]
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024. 3
-
[46]
Yiqi Wang, Mrinal Verghese, and Jeff Schneider. Latent policy steering with embodiment-agnostic pretrained world models. arXiv preprint arXiv:2507.13340, 2025. 3
-
[47]
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025. 3
-
[48]
Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024. 6, 5
-
[49]
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025. 3
-
[50]
Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning. arXiv preprint arXiv:2505.17006, 2025. 3
-
[51]
Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Tra-moe: Learning trajectory prediction model from multiple domains for adaptive policy conditioning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6960–6970, 2025. 3
-
[52]
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809,
-
[53]
Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 1
-
[54]
Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, and Jun Zhu. Shapellm-omni: A native multimodal llm for 3d generation and understanding. arXiv preprint arXiv:2506.01853, 2025. 3
-
[55]
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations, 2025. 3
-
[56]
Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, and Pieter Abbeel. Video2policy: Scaling up manipulation tasks in simulation through internet videos. CoRR, abs/2502.09886, 2025. 1
-
[57]
Chengbo Yuan, Rui Zhou, Mengzhen Liu, Yingdong Hu, Shengjie Wang, Li Yi, Shanghang Zhang, Chuan Wen, and Yang Gao. Motiontrans: Human VR data enable motion-level learning for robotic manipulation policies. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025. 3
-
[58]
Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn? arXiv preprint arXiv:2506.15691, 2025. 3
-
[59]
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. 3
-
[60]
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 1, 3, 4, 6
-
[61]
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, et al. Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models. arXiv preprint arXiv:2508.18269, 2025. 3
-
[62]
Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, pages 61885–61896. PMLR, 2024. 1, 3
-
[63]
Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. arXiv preprint arXiv:2501.14729, 2025. 3
-
[64]
Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. 1, 3, 4
-
[65]
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 1
Motus: A Unified Latent Action World Model Supplementary Material
7. Training and Inference of the Unified Model
In this section, we analyze the training and inference procedures of the unified model, from both theoretical and experimental perspectives.
7.1. Theoretical Analysis
During each training iteration, given $o^0_{t:t+k}$ and $a^0_{t:t+k}$, Motus samples different timesteps $\tau_o$, $\tau_a$ and noise $\epsilon_o$, $\epsilon_a$ for them respectivel...
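The sampling step described in the theoretical analysis can be illustrated with a minimal sketch. It only shows the idea of drawing a separate timestep and a separate noise sample for the observation chunk and the action chunk. The function name `sample_noisy_pair`, the use of flat Python lists in place of tensors, the uniform timestep distribution, and the linear interpolation between clean data and noise are all assumptions made for illustration; the source text is truncated before it states the actual noising rule.

```python
import random


def sample_noisy_pair(obs, act, rng):
    """Noise an observation chunk and an action chunk independently.

    obs, act: flat lists of floats standing in for the clean chunks
    o^0_{t:t+k} and a^0_{t:t+k} (hypothetical stand-ins for tensors).
    """
    # Independent timesteps, one per modality (assumed uniform on [0, 1)).
    tau_o = rng.random()
    tau_a = rng.random()

    # Independent Gaussian noise, one sample per element of each modality.
    eps_o = [rng.gauss(0.0, 1.0) for _ in obs]
    eps_a = [rng.gauss(0.0, 1.0) for _ in act]

    # ASSUMPTION: linear interpolation between clean data and noise.
    # The paper's actual noising rule is not given in this excerpt.
    noisy_obs = [(1.0 - tau_o) * x + tau_o * e for x, e in zip(obs, eps_o)]
    noisy_act = [(1.0 - tau_a) * x + tau_a * e for x, e in zip(act, eps_a)]

    return {
        "tau_o": tau_o, "tau_a": tau_a,
        "eps_o": eps_o, "eps_a": eps_a,
        "noisy_obs": noisy_obs, "noisy_act": noisy_act,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    batch = sample_noisy_pair([0.2, 0.4, 0.6], [1.0, -1.0], rng)
    print(batch["tau_o"], batch["tau_a"])
```

Drawing the two timesteps separately means that, within one training example, one modality can be noised heavily while the other stays close to clean, which is the property a UniDiffuser-style scheduler relies on to switch between modeling modes.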
8. More Experimental Results
8.1. Overall Comparison on RoboTwin 2.0 Simulation Data with More Baselines
Tab. 14 shows the evaluation results on RoboTwin 2.0 Simulation, presenting the performance of Motus and other baselines on all 50 tasks under both clean scenes and randomized scenes.
8.2. Other Benchmarks
LIBERO-Long. LIBERO-Long is the long-horizon ...
9. Implementation Details
9.1. Model Architecture
Tab. 11 provides the key hyperparameter settings for the Motus model architecture.
[Figure: real-world task panels: Grind Coffee Beans With Grinder (AC-One); Touch Instructed Keyboard (AC-One); Brew Coffee using Coffee Maker (AC-One); Place Green Cube Into Plate (AC-One); Pour Water from Kettle to Flowers (AC-One); Get Water from Water Dispenser ...]