pith. machine review for the scientific record.

arxiv: 2605.00078 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV · cs.LG

Recognition: unknown

Being-H0.7: A Latent World-Action Model from Egocentric Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:43 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG

keywords visual-language-action models · latent world-action models · robot control · future prediction · egocentric videos · latent queries · dual-branch alignment · deployable policies

The pith

Being-H0.7 trains robot policies to reason about future states by aligning latent representations from current observations with those derived from future frames, then discards the future branch at deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard visual-language-action models for robots often learn shortcut mappings because action labels are sparse and do not force the model to understand how the world will change. This paper shows that a compact set of learnable latent queries can be trained to carry future-aware structure by matching a current-only prior branch to a future-informed posterior branch during training. The result is a policy that runs exactly like a direct VLA at inference time yet benefits from predictive information about dynamics and task progress. If the alignment succeeds, the approach delivers the predictive power of world models while avoiding the cost of generating or processing future video frames. Experiments across simulation suites and real robot tasks indicate the method reaches state-of-the-art or comparable success rates.

Core claim

Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface. A deployable prior branch infers latent states from the current context alone, while a training-only posterior branch replaces the queries with embeddings computed from future observations. Joint alignment of the two branches in latent space causes the prior branch to internalize future-aware, action-useful structure. At test time the posterior branch is removed entirely and no visual rollout is performed, yielding a policy that combines the benefits of world models with the efficiency of direct VLA policies.
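
To make the training-versus-inference asymmetry concrete, here is a minimal PyTorch sketch of the dual-branch idea over a generic transformer encoder. Everything here is an illustrative assumption: the class name, dimensions, and an MSE alignment loss with a stop-gradient stand in for the paper's actual MoT sequence packing, hidden-state alignment, and lightweight regularization.

    # Hedged sketch of the dual-branch latent world-action idea.
    # Names, sizes, and the MSE alignment target are illustrative
    # assumptions, not the paper's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentWorldActionSketch(nn.Module):
        def __init__(self, d_model=512, n_queries=16, action_dim=7):
            super().__init__()
            # Learnable latent queries inserted between perception and action.
            self.latent_queries = nn.Parameter(torch.randn(n_queries, d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            self.future_proj = nn.Linear(d_model, d_model)  # future obs -> query slots
            self.action_head = nn.Linear(d_model, action_dim)

        def forward(self, ctx_tokens, future_tokens=None):
            B, n = ctx_tokens.size(0), self.latent_queries.size(0)
            # Prior branch (deployable): current context + learnable queries.
            q = self.latent_queries.expand(B, -1, -1)
            prior = self.backbone(torch.cat([ctx_tokens, q], dim=1))[:, -n:]
            actions = self.action_head(prior)
            if future_tokens is None:       # inference: no posterior, no rollout
                return actions, None
            # Posterior branch (training only): queries replaced by embeddings
            # computed from future observations (assumes >= n future tokens).
            fq = self.future_proj(future_tokens)[:, :n]
            post = self.backbone(torch.cat([ctx_tokens, fq], dim=1))[:, -n:]
            # Alignment pulls prior latents toward the future-informed latents.
            align_loss = F.mse_loss(prior, post.detach())
            return actions, align_loss

At deployment the model is simply called with future_tokens=None, so the forward pass is identical to a direct VLA policy.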

What carries the argument

Learnable latent queries placed between perception and action, trained by joint alignment of a current-context prior branch and a future-observation posterior branch.

If this is right

  • Robot policies gain predictive information about contacts, dynamics, and task progress without incurring the runtime cost of pixel-space video generation.
  • The same architecture remains fully deployable as a direct VLA because the posterior branch and any visual rollout are discarded after training.
  • Sparse action supervision can be supplemented by latent-space future alignment instead of requiring dense future-frame prediction.
  • The method scales to both simulation benchmarks and diverse real-world egocentric video tasks while preserving inference speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same latent-query alignment could be inserted into other multimodal control pipelines where future context is available only at training time.
  • If the alignment generalizes across robot embodiments, the learned representations might transfer more readily than pixel-based world models.
  • Longer prediction horizons could be tested by extending the posterior branch to multiple future steps and measuring whether the prior branch continues to improve.
  • The approach reduces dependence on high-fidelity video synthesis, which may lower data and compute requirements for training generalist policies.

Load-bearing premise

That matching the prior branch's latent outputs to the posterior branch's future-derived embeddings will reliably embed genuine future dynamics and action utility rather than superficial training-distribution statistics.

What would settle it

An ablation that removes the posterior branch or the alignment loss and measures whether performance on contact-rich or long-horizon tasks falls to the level of a plain VLA baseline would directly test whether the dual-branch design supplies the claimed future-aware representations.
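
A hedged sketch of how that settling experiment could be wired, reusing the LatentWorldActionSketch from the earlier code sketch; the configuration names, batch keys, and the behavior-cloning loss are placeholders, not the paper's training recipe.

    # Hypothetical ablation grid for the experiment described above.
    import torch.nn.functional as F

    ABLATIONS = {
        "full":      dict(use_posterior=True,  align_weight=1.0),
        "no_align":  dict(use_posterior=True,  align_weight=0.0),  # posterior kept, loss off
        "plain_vla": dict(use_posterior=False, align_weight=0.0),  # prior branch only
    }

    def train_step(model, batch, cfg):
        # model: a LatentWorldActionSketch; batch keys are placeholders.
        actions, align_loss = model(
            batch["ctx_tokens"],
            batch["future_tokens"] if cfg["use_posterior"] else None,
        )
        loss = F.mse_loss(actions, batch["target_actions"])  # behavior cloning term
        if align_loss is not None:
            loss = loss + cfg["align_weight"] * align_loss
        return loss

If success rates for no_align and plain_vla collapse toward each other on contact-rich and long-horizon suites while full stays ahead, the dual-branch design is doing the work it is credited with.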

Figures

Figures reproduced from arXiv: 2605.00078 by Chaoyi Xu, Haiweng Xu, Hao Luo, Sipeng Zheng, Wanpeng Zhang, Yicheng Feng, Yuhui Fu, Ziheng Xi, Zongqing Lu.

Figure 1. Being-H0.7 at a glance. We build a Latent World-Action Model that differs from VLAs and WAMs. A latent reasoning space is introduced via a set of latent queries in the prior branch, and is further endowed with world modeling by the joint alignment with a future-aware posterior branch. Pretrained on large-scale egocentric videos, Being-H0.7 achieves strong performance across diverse robot tasks.

Figure 2. Latent reasoning and latent world-action model. Left: learnable latent queries are inserted to form a latent reasoning space that progressively organizes intermediate hidden states and guides action generation through propagation. Right: through joint alignment between the dual-branch design, the model learns to reason with future information at inference time, turning into a latent world-action model.

Figure 3. Being-H0.7 Architecture. We pack the prior and posterior branches into a single MoT sequence with shared context, where the two branches are optimized simultaneously. The posterior branch replaces latent queries with future embeddings, and the two branches are coupled by hidden-state alignment and lightweight regularization. A dual-branch attention mask is applied to isolate prior and posterior branches…

Figure 4. Overview of the real-world embodiments used in this evaluation.

Figure 5. Visual overview of the 12 real-robot evaluation tasks. The figure shows the task scenes used in our real-world evaluation across PND Adam-U, Unitree G1, and Franka FR3, covering the five ability-oriented suites. Motion Reasoning tasks emphasize trajectory anticipation, relative velocity, and contact timing. Long Horizon tasks stress subgoal memory and sequential consistency…

Figure 6. Suite-level real-robot success rates (%). Comparison of Being-H0.7, Being-H0.5, π0.5, and Fast-WAM on the five ability-oriented task suites. Each task is evaluated over 20 blind trials, and each suite score is averaged over all tasks carrying the corresponding suite tag.

Figure 7. Visualization of the Latent Reasoning.

Figure 8. Inference cost measured in the real-world deployment stack.
Original abstract

Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Being-H0.7, a latent world-action model for robot control that augments VLA policies with future-aware reasoning. It inserts learnable latent queries as a reasoning interface and trains them via a dual-branch setup: a deployable prior branch that processes only current observations and a training-only posterior branch that incorporates future observations. Joint alignment of the branches in latent space is claimed to produce action-useful representations of dynamics and task progress from current context alone, enabling inference without visual rollouts or pixel prediction. Experiments are reported to show state-of-the-art or comparable results on six simulation benchmarks plus diverse real-world tasks.

Significance. If the central mechanism holds, the work offers a practical middle ground between costly world-model rollouts and shortcut-prone direct VLAs, potentially improving efficiency and generalization in generalist robot policies. The absence of pixel-space prediction at inference is a clear deployability advantage over prior video-based approaches.

major comments (2)
  1. [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning. (A hedged sketch of one such contrastive objective follows the minor comments.)
  2. [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.
minor comments (2)
  1. [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labeling of the prior versus posterior paths and the alignment loss to make the inference-time deployment clearer.
  2. [Method section] Notation for the latent queries (e.g., how they are initialized and updated) should be defined consistently in the text and equations to avoid ambiguity when describing the joint training objective.
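
As a concrete illustration of the safeguard proposed in major comment 1, here is a minimal sketch of a contrastive future-discrimination loss: the prior latents must identify which future embedding in the batch belongs to their own trajectory. The function name, mean pooling, and temperature are illustrative assumptions, not anything specified by the paper.

    # Hedged sketch of contrastive future discrimination (InfoNCE).
    import torch
    import torch.nn.functional as F

    def future_infonce(prior_latents, future_latents, temperature=0.07):
        # prior_latents, future_latents: (B, n_queries, d), pooled to (B, d).
        z_p = F.normalize(prior_latents.mean(dim=1), dim=-1)
        z_f = F.normalize(future_latents.mean(dim=1), dim=-1)
        logits = z_p @ z_f.t() / temperature                    # (B, B) similarities
        targets = torch.arange(z_p.size(0), device=z_p.device)  # positives on diagonal
        # Symmetric InfoNCE: prior -> future and future -> prior.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

A loss of this shape cannot be minimized by copying marginal statistics, since every trajectory in the batch shares those marginals; only trajectory-specific predictive structure separates the positives from the negatives.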

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the dual-branch alignment and experimental presentation. We address each major comment below and have revised the manuscript to incorporate additional ablations and clarifications that strengthen the attribution of performance gains to the future-aware latent reasoning.

Point-by-point responses
  1. Referee: [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning.

    Authors: We appreciate the referee's point that the alignment must demonstrably induce future-aware structure rather than allow trivial copying of marginal statistics. The core mechanism relies on the posterior branch providing future-informed embeddings that the prior must match from current observations alone; this forces the latent queries to encode predictive, action-relevant dynamics because the alignment objective is computed in a shared latent space where superficial visual or action marginals alone cannot fully bridge the information gap. Nevertheless, we acknowledge that explicit auxiliary losses (such as contrastive future discrimination) could provide further safeguards. In the revised manuscript we have added a targeted ablation that removes the alignment loss entirely while retaining the latent queries, showing a clear performance drop across benchmarks. This result, together with the updated description in Section 3, supports that the gains arise from the future-aware alignment rather than marginal copying. revision: yes

  2. Referee: [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.

    Authors: We agree that isolating the contribution of the dual-branch alignment is essential for rigorous validation. The revised manuscript now includes updated result tables with two new ablations: (1) training only the prior branch without any posterior or alignment (equivalent to a standard VLA with latent queries but no future information), and (2) full model with the alignment loss removed. These variants are reported alongside the original baselines on all six simulation benchmarks and the real-world tasks. The results show consistent degradation when the alignment is ablated, directly attributing the reported gains to the future-informed design rather than other implementation details. revision: yes

Circularity Check

0 steps flagged

No circularity: dual-branch alignment is an explicit training objective, not a definitional reduction

Full rationale

The paper describes a concrete training procedure: learnable latent queries are aligned between a prior branch (current observations only) and a posterior branch (future observations) via a joint alignment loss. At inference the posterior is discarded. This is a standard auxiliary-supervision setup and does not reduce any claimed result to its own inputs by construction, nor does it rely on a fitted parameter renamed as a prediction or on a self-citation chain for its justification. The assertion that the alignment produces future-aware representations is presented as an empirical outcome verified on benchmarks rather than a tautology. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that latent-space alignment between current-context and future-observation branches will induce useful predictive structure; no additional free parameters, axioms, or invented entities beyond standard neural-network training are introduced in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1200 out tokens · 19761 ms · 2026-05-09T20:43:33.103786+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  2. Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  3. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

Reference graph

Works this paper leans on

123 extracted references · 89 canonical work pages · cited by 3 Pith papers · 40 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  3. [3]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, J Bjorck, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    Being-h0: vision-language-action pretraining from large-scale human videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025

  7. [7]

    Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

  8. [8]

    Advancing open-source world models

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models. arXiv preprint arXiv:26...

  9. [9]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  10. [10]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  11. [11]

    Worldplay: Towards long-term geometric consistency for real-time interactive world modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  12. [12]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  13. [13]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  14. [14]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  15. [15]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  16. [16]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  17. [17]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  18. [18]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025

  19. [19]

    Robot learning of shifting objects for grasping in cluttered environments

    Lars Berscheid, Pascal Meißner, and Torsten Kröger. Robot learning of shifting objects for grasping in cluttered environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 612–618. IEEE, 2019

  20. [20]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  21. [21]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595, 2023

  22. [22]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

  23. [23]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  24. [24]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  25. [26]

    Rethinking visual-language-action model scaling: Alignment, mixture, and regularization

    Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, et al. Rethinking visual-language-action model scaling: Alignment, mixture, and regularization. arXiv preprint arXiv:2602.09722, 2026

  26. [27]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

  27. [28]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025

  28. [29]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  29. [30]

    From pixels to tokens: Byte-pair encoding on quantized visual modalities

    Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu. From pixels to tokens: Byte-pair encoding on quantized visual modalities. In The Thirteenth International Conference on Learning Representations, 2025

  30. [31]

    Unified multimodal understanding via byte-pair visual encoding

    Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, and Zongqing Lu. Unified multimodal understanding via byte-pair visual encoding. arXiv preprint arXiv:2506.23639, 2025

  31. [32]

    OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data

    Luo Hao, Yue Zihao, Zhang Wanpeng, Feng Yicheng, Zheng Sipeng, Ye Deheng, and Lu Zongqing. OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  32. [33]

    Videoorion: Tokenizing object dynamics in videos

    Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, and Zongqing Lu. Videoorion: Tokenizing object dynamics in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20401–20412, 2025

  33. [34]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  34. [35]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  35. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  36. [37]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  37. [38]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  38. [39]

    Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025

  39. [40]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  40. [41]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

  41. [42]

    Dexgraspvla: A vision-language-action framework towards general dexterous grasping

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping. arXiv preprint arXiv:2502.20900, 2025

  42. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  43. [44]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  44. [45]

    Onetwovla: A unified vision-language-action model with adaptive reasoning

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

  45. [46]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729, 2025

  46. [47]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  47. [48]

    Mobile robot manipulation using pure object detection

    Brent Griffin. Mobile robot manipulation using pure object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 561–571, 2023

  48. [49]

    Curl: Contrastive unsupervised representations for reinforcement learning

    Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020

  49. [50]

    Using geometry to detect grasp poses in 3d point clouds

    Andreas Ten Pas and Robert Platt. Using geometry to detect grasp poses in 3d point clouds. In Robotics Research: Volume 1, pages 307–324. Springer, 2017

  50. [51]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

  51. [52]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  52. [53]

    mimic-video: Video-action models for generalizable robot control beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas. arXiv preprint arXiv:2512.15692, 2025

  53. [54]

    Vidar: Embodied video diffusion model for generalist manipulation

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025

  54. [55]

    Genie Envisioner: A unified world foundation platform for robotic manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025

  55. [56]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025

  56. [57]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025

  57. [58]

    Video generators are robot policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025

  58. [59]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  59. [60]

    Videovla: Video generators can be generalizable robot manipulators

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025

  60. [61]

    Learning video-conditioned policy on unlabelled data with joint embedding predictive transformer

    Hao Luo and Zongqing Lu. Learning video-conditioned policy on unlabelled data with joint embedding predictive transformer. In International Conference on Learning Representations, 2025

  61. [62]

    Act-jepa: Novel joint-embedding predictive architecture for efficient policy representation learning

    Aleksandar Vujinovic and Aleksandar Kovacevic. Act-jepa: Novel joint-embedding predictive architecture for efficient policy representation learning. arXiv preprint arXiv:2501.14622, 2025

  62. [63]

    FLARE: Robot learning with implicit world modeling

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loïc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling. In Annual Conference on Robot Lear...

  63. [64]

    Vla-jepa: Enhancing vision-language-action model with latent world model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  64. [65]

    DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. In Annual Conference on Neural Information Processing Systems, 2025

  65. [66]

    Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model

    Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, et al. Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026

  66. [67]

    Frappe: Infusing world modeling into generalist policies via multiple future representation alignment

    Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. Frappe: Infusing world modeling into generalist policies via multiple future representation alignment. arXiv preprint arXiv:2602.17259, 2026

  67. [68]

    Conservative offline robot policy learning via posterior-transition reweighting

    Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, and Zongqing Lu. Conservative offline robot policy learning via posterior-transition reweighting. arXiv preprint arXiv:2603.16542, 2026

  68. [69]

    Joint-aligned latent action: Towards scalable vla pretraining in the wild

    Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  69. [70]

    World Guidance: World modeling in condition space for action generation

    Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World guidance: World modeling in condition space for action generation. arXiv preprint arXiv:2602.22010, 2026

  70. [71]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  71. [72]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation

    Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, and Shuran Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint arXiv:2505.21864, 2025

  72. [73]

    Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers

    Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024

  73. [74]

    10kh-realomin-opendata

    Gen Robot. 10kh-realomin-opendata, 2025

  74. [75]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019

  75. [76]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023

  76. [77]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  77. [78]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  78. [79]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018

  79. [80]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  80. [81]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning (CoRL), 2022

Showing first 80 references.