Being-H0.7: A Latent World-Action Model from Egocentric Videos
Pith reviewed 2026-05-09 20:43 UTC · model grok-4.3
The pith
Being-H0.7 trains robot policies to reason about future states by aligning latent representations from current observations with those derived from future frames, then discards the future branch at deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface. A deployable prior branch infers latent states from the current context alone, while a training-only posterior branch replaces the queries with embeddings computed from future observations. Joint alignment of the two branches in latent space causes the prior branch to internalize future-aware, action-useful structure. At test time the posterior branch is removed entirely and no visual rollout is performed, yielding a policy that combines the benefits of world models with the efficiency of direct VLA policies.
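The dual-branch mechanism can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the branches are plain linear maps, the alignment loss is a simple mean-squared error with the posterior treated as a fixed target, and every name and dimension (`prior_latents`, `posterior_latents`, `N_QUERIES`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LAT, N_QUERIES = 16, 8, 4  # toy sizes, not taken from the paper

# Hypothetical linear "branches": prior sees current obs, posterior sees future obs.
W_prior = rng.normal(scale=0.1, size=(D_OBS, D_LAT))
W_post = rng.normal(scale=0.1, size=(D_OBS, D_LAT))

def prior_latents(obs_t, W):
    # Deployable branch: latent states inferred from the current context only.
    return obs_t @ W

def posterior_latents(obs_future, W):
    # Training-only branch: embeddings computed from future observations.
    return obs_future @ W

def alignment_loss(z_prior, z_post):
    # Joint alignment in latent space; posterior acts as the (stop-gradient) target.
    return float(np.mean((z_prior - z_post) ** 2))

obs_t = rng.normal(size=(N_QUERIES, D_OBS))
obs_future = rng.normal(size=(N_QUERIES, D_OBS))

z_post = posterior_latents(obs_future, W_post)
loss_before = alignment_loss(prior_latents(obs_t, W_prior), z_post)

# One gradient step on the prior weights toward the future-derived target.
grad = 2.0 * obs_t.T @ (prior_latents(obs_t, W_prior) - z_post) / (N_QUERIES * D_LAT)
W_prior -= 0.05 * grad
loss_after = alignment_loss(prior_latents(obs_t, W_prior), z_post)
assert loss_after < loss_before  # the prior moves toward the future-derived latents

# At deployment the posterior branch is dropped: only prior_latents is ever called.
```

The point the sketch makes is structural: nothing computed at test time touches `obs_future`, so discarding the posterior branch costs no inference-time work.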
What carries the argument
Learnable latent queries placed between perception and action, trained by joint alignment of a current-context prior branch and a future-observation posterior branch.
If this is right
- Robot policies gain predictive information about contacts, dynamics, and task progress without incurring the runtime cost of pixel-space video generation.
- The same architecture remains fully deployable as a direct VLA because the posterior branch and any visual rollout are discarded after training.
- Sparse action supervision can be supplemented by latent-space future alignment instead of requiring dense future-frame prediction.
- The method scales to both simulation benchmarks and diverse real-world egocentric video tasks while preserving inference speed.
Where Pith is reading between the lines
- The same latent-query alignment could be inserted into other multimodal control pipelines where future context is available only at training time.
- If the alignment generalizes across robot embodiments, the learned representations might transfer more readily than pixel-based world models.
- Longer prediction horizons could be tested by extending the posterior branch to multiple future steps and measuring whether the prior branch continues to improve.
- The approach reduces dependence on high-fidelity video synthesis, which may lower data and compute requirements for training generalist policies.
Load-bearing premise
That matching the prior branch's latent outputs to the posterior branch's future-derived embeddings will reliably embed genuine future dynamics and action utility rather than superficial training-distribution statistics.
What would settle it
An ablation that removes the posterior branch or the alignment loss, then measures whether performance on contact-rich or long-horizon tasks falls to the level of a plain VLA baseline, would directly test whether the dual-branch design supplies the claimed future-aware representations.
Original abstract
Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Being-H0.7, a latent world-action model for robot control that augments VLA policies with future-aware reasoning. It inserts learnable latent queries as a reasoning interface and trains them via a dual-branch setup: a deployable prior branch that processes only current observations and a training-only posterior branch that incorporates future observations. Joint alignment of the branches in latent space is claimed to produce action-useful representations of dynamics and task progress from current context alone, enabling inference without visual rollouts or pixel prediction. Experiments are reported to show state-of-the-art or comparable results on six simulation benchmarks plus diverse real-world tasks.
Significance. If the central mechanism holds, the work offers a practical middle ground between costly world-model rollouts and shortcut-prone direct VLAs, potentially improving efficiency and generalization in generalist robot policies. The absence of pixel-space prediction at inference is a clear deployability advantage over prior video-based approaches.
major comments (2)
- [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning.
- [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.
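The "contrastive future discrimination" floated in the first major comment could take the standard InfoNCE form. The sketch below is an assumption about what such an auxiliary loss might look like, not anything taken from the manuscript; each prior latent must pick out its own future embedding against the other futures in the batch, which a prior that only copies marginal statistics cannot do.

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(z_prior, z_future, temperature=0.1):
    """InfoNCE over a batch: the diagonal pairs are positives, every
    other future in the batch serves as a negative."""
    # Cosine similarities between all prior/future pairs.
    zp = z_prior / np.linalg.norm(z_prior, axis=1, keepdims=True)
    zf = z_future / np.linalg.norm(z_future, axis=1, keepdims=True)
    logits = (zp @ zf.T) / temperature
    # Numerically stable softmax cross-entropy with the diagonal as targets.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

batch, dim = 8, 8
z_future = rng.normal(size=(batch, dim))
# Perfectly aligned priors give a near-zero loss; unrelated priors hover
# around log(batch), so the loss discriminates genuine future information.
aligned = info_nce(z_future.copy(), z_future)
random_ = info_nce(rng.normal(size=(batch, dim)), z_future)
assert aligned < random_
```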
minor comments (2)
- [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labeling of the prior versus posterior paths and the alignment loss to make the inference-time deployment clearer.
- [Method section] Notation for the latent queries (e.g., how they are initialized and updated) should be defined consistently in the text and equations to avoid ambiguity when describing the joint training objective.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the dual-branch alignment and experimental presentation. We address each major comment below and have revised the manuscript to incorporate additional ablations and clarifications that strengthen the attribution of performance gains to the future-aware latent reasoning.
Point-by-point responses
- Referee: [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning.
Authors: We appreciate the referee's point that the alignment must demonstrably induce future-aware structure rather than allow trivial copying of marginal statistics. The core mechanism relies on the posterior branch providing future-informed embeddings that the prior must match from current observations alone; this forces the latent queries to encode predictive, action-relevant dynamics because the alignment objective is computed in a shared latent space where superficial visual or action marginals alone cannot fully bridge the information gap. Nevertheless, we acknowledge that explicit auxiliary losses (such as contrastive future discrimination) could provide further safeguards. In the revised manuscript we have added a targeted ablation that removes the alignment loss entirely while retaining the latent queries, showing a clear performance drop across benchmarks. This result, together with the updated description in Section 3, supports that the gains arise from the future-aware alignment rather than marginal copying. revision: yes
- Referee: [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.
Authors: We agree that isolating the contribution of the dual-branch alignment is essential for rigorous validation. The revised manuscript now includes updated result tables with two new ablations: (1) training only the prior branch without any posterior or alignment (equivalent to a standard VLA with latent queries but no future information), and (2) full model with the alignment loss removed. These variants are reported alongside the original baselines on all six simulation benchmarks and the real-world tasks. The results show consistent degradation when the alignment is ablated, directly attributing the reported gains to the future-informed design rather than other implementation details. revision: yes
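The two new variants the rebuttal describes can be pinned down as a small factor grid. The configuration keys below are illustrative, not the paper's actual settings; the point is only that each variant toggles one factor so the alignment's contribution is isolated.

```python
# Hypothetical ablation grid mirroring the rebuttal's two added variants;
# key names are illustrative, not the paper's real configuration schema.
ABLATIONS = {
    "full": {"posterior_branch": True, "alignment_loss": True},
    "no_alignment": {"posterior_branch": True, "alignment_loss": False},
    "prior_only": {"posterior_branch": False, "alignment_loss": False},
}

def variant_config(name):
    # A real harness would build and train the model from this config;
    # here we only check that the grid varies one factor at a time.
    return ABLATIONS[name]

assert variant_config("prior_only")["posterior_branch"] is False
assert variant_config("no_alignment")["alignment_loss"] is False
assert variant_config("full")["alignment_loss"] is True
```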
Circularity Check
No circularity: dual-branch alignment is an explicit training objective, not a definitional reduction
Full rationale
The paper describes a concrete training procedure: learnable latent queries are aligned between a prior branch (current observations only) and a posterior branch (future observations) via a joint alignment loss. At inference the posterior is discarded. This is a standard auxiliary-supervision setup and does not reduce any claimed result to its own inputs by construction, nor does it rely on a fitted parameter renamed as a prediction or on a self-citation chain for its justification. The assertion that the alignment produces future-aware representations is presented as an empirical outcome verified on benchmarks rather than a tautology. No load-bearing step matches any of the enumerated circularity patterns.
Forward citations
Cited by 3 Pith papers
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation. Uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action. Trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...
- HumanNet: Scaling Human-centric Video Learning to One Million Hours. A 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
Reference graph
Works this paper leans on
- [1] RT-1: Robotics Transformer for Real-World Control at Scale. Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. arXiv preprint arXiv:2212.06817, 2022.
- [2] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
- [3] OpenVLA: An Open-Source Vision-Language-Action Model. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. arXiv preprint arXiv:2406.09246, 2024.
- [4] π0: A Vision-Language-Action Flow Model for General Robot Control. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. arXiv preprint arXiv:2410.24164, 2024.
- [5] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. J. Bjorck, F. Castaneda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. (NVIDIA). arXiv preprint arXiv:2503.14734, 2025.
- [6] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos. Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. arXiv preprint arXiv:2507.15597, 2025.
- [7] Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization. Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. arXiv preprint arXiv:2601.12993, 2026.
- [8] Advancing Open-Source World Models. Robbyant Team: Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. arXiv preprint arXiv:26...
- [9] Genie: Generative Interactive Environments. Jake Bruce, Michael D. Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. In Forty-first International Conference on Machine Learning, 2024.
- [10] Diffusion for World Modeling: Visual Details Matter in Atari. Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [11] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling. Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. arXiv preprint arXiv:2512.14614, 2025.
- [12] World Action Models are Zero-Shot Policies. Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. arXiv preprint arXiv:2602.15922, 2026.
- [13] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. arXiv preprint arXiv:2601.16163, 2026.
- [14] Causal World Modeling for Robot Control. Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. arXiv preprint arXiv:2601.21998, 2026.
- [15] Fast-WAM: Do World Action Models Need Test-Time Future Imagination? Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. arXiv preprint arXiv:2603.16666, 2026.
- [16] Wan: Open and Advanced Large-Scale Video Generative Models. Team Wan: Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. arXiv preprint arXiv:2503.20314, 2025.
- [17] CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. arXiv preprint arXiv:2408.06072, 2024.
- [18] Cosmos World Foundation Model Platform for Physical AI. Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. arXiv preprint arXiv:2501.03575, 2025.
- [19] Robot Learning of Shifting Objects for Grasping in Cluttered Environments. Lars Berscheid, Pascal Meißner, and Torsten Kröger. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 612–618. IEEE, 2019.
- [20] RoboNet: Large-Scale Multi-Robot Learning. Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. arXiv preprint arXiv:1910.11215, 2019.
- [21] RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot. Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. arXiv preprint arXiv:2307.00595, 2023.
- [22] On Bringing Robots Home. Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. arXiv preprint arXiv:2311.16098, 2023.
- [23] Planning with Diffusion for Flexible Behavior Synthesis. Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. arXiv preprint arXiv:2205.09991, 2022.
- [24] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
- [26] Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization. Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, et al. arXiv preprint arXiv:2602.09722, 2026.
- [27] PaliGemma 2: A Family of Versatile VLMs for Transfer. Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. arXiv preprint arXiv:2412.03555, 2024.
- [28] Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models. Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. arXiv preprint arXiv:2501.14818, 2025.
- [29] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. arXiv preprint arXiv:2409.12191, 2024.
- [30] From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities. Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu. In The Thirteenth International Conference on Learning Representations, 2025.
- [31] Unified Multimodal Understanding via Byte-Pair Visual Encoding. Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, and Zongqing Lu. arXiv preprint arXiv:2506.23639, 2025.
- [32] OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data. Luo Hao, Yue Zihao, Zhang Wanpeng, Feng Yicheng, Zheng Sipeng, Ye Deheng, and Lu Zongqing. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [33] VideoOrion: Tokenizing Object Dynamics in Videos. Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, and Zongqing Lu. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20401–20412, 2025.
- [34] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model. Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. arXiv preprint arXiv:2501.15830, 2025.
- [35] Denoising Diffusion Probabilistic Models. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [36] Flow Matching for Generative Modeling. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. arXiv preprint arXiv:2210.02747, 2022.
- [37] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. arXiv preprint arXiv:2411.19650, 2024.
- [38] RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation. Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. arXiv preprint arXiv:2410.07864, 2024.
- [39] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies. Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. arXiv preprint arXiv:2508.20072, 2025.
- [40] π0.5: A Vision-Language-Action Model with Open-World Generalization. Physical Intelligence: Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. arXiv preprint arXiv:2504.16054, 2025.
- [41] DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control. Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. arXiv preprint arXiv:2502.05855, 2025.
- [42] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. arXiv preprint arXiv:2502.20900, 2025.
- [43] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [44] Robotic Control via Embodied Chain-of-Thought Reasoning. Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. arXiv preprint arXiv:2407.08693, 2024.
- [45] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning. Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. arXiv preprint arXiv:2505.11917, 2025.
- [46]
- [47] Spatial Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Model. Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. arXiv preprint arXiv:2510.12276, 2025.
- [48] Mobile Robot Manipulation Using Pure Object Detection. Brent Griffin. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 561–571, 2023.
- [49] CURL: Contrastive Unsupervised Representations for Reinforcement Learning. Michael Laskin, Aravind Srinivas, and Pieter Abbeel. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020.
- [50] Using Geometry to Detect Grasp Poses in 3D Point Clouds. Andreas ten Pas and Robert Platt. In Robotics Research: Volume 1, pages 307–324. Springer, 2017.
- [51] MolmoAct: Action Reasoning Models that Can Reason in Space. Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. arXiv preprint arXiv:2508.07917, 2025.
- [52] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. arXiv preprint arXiv:2412.14803, 2024.
- [53] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. arXiv preprint arXiv:2512.15692, 2025.
- [54] Vidar: Embodied Video Diffusion Model for Generalist Manipulation. Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. arXiv preprint arXiv:2507.12898, 2025.
- [55] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation. Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. arXiv preprint arXiv:2508.05635, 2025.
- [56] Unified Video Action Model. Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. arXiv preprint arXiv:2503.00200, 2025.
- [57] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. arXiv preprint arXiv:2504.02792, 2025.
- [58] Video Generators Are Robot Policies. Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. arXiv preprint arXiv:2508.00795, 2025.
- [59] Motus: A Unified Latent Action World Model. Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. arXiv preprint arXiv:2512.13030, 2025.
- [60] VideoVLA: Video Generators Can Be Generalizable Robot Manipulators. Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. arXiv preprint arXiv:2512.06963, 2025.
- [61] Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer. Hao Luo and Zongqing Lu. In International Conference on Learning Representations, 2025.
- [62] ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning. Aleksandar Vujinovic and Aleksandar Kovacevic. arXiv preprint arXiv:2501.14622, 2025.
- [63] FLARE: Robot Learning with Implicit World Modeling. Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loïc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. In Annual Conference on Robot Lear..., 2025.
-
[64]
Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026
-
[65]
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. In Annual Conference on Neural Information Processing Systems, 2025
-
[66]
Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, et al. Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026
-
[67]
Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. Frappe: Infusing world modeling into generalist policies via multiple future representation alignment. arXiv preprint arXiv:2602.17259, 2026
-
[68]
Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, and Zongqing Lu. Conservative offline robot policy learning via posterior-transition reweighting. arXiv preprint arXiv:2603.16542, 2026
-
[69]
Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
-
[70]
Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World guidance: World modeling in condition space for action generation. arXiv preprint arXiv:2602.22010, 2026
-
[71]
Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024
-
[72]
Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, and Shuran Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint arXiv:2505.21864, 2025
-
[73]
Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024
-
[74]
Gen Robot. 10kh-realomin-opendata, 2025
-
[75]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019
-
[76]
Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023
-
[77]
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022
-
[78]
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...
-
[79]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018
-
[80]
Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025
-
[81]
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning (CoRL), 2022