pith. machine review for the scientific record.

arxiv: 2605.00078 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV · cs.LG

Recognition: unknown

Being-H0.7: A Latent World-Action Model from Egocentric Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:43 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG

keywords visual-language-action models · latent world-action models · robot control · future prediction · egocentric videos · latent queries · dual-branch alignment · deployable policies

The pith

Being-H0.7 trains robot policies to reason about future states by aligning latent representations from current observations with those derived from future frames, then discards the future branch at deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard visual-language-action models for robots often learn shortcut mappings because action labels are sparse and do not force the model to understand how the world will change. This paper shows that a compact set of learnable latent queries can be trained to carry future-aware structure by matching a current-only prior branch to a future-informed posterior branch during training. The result is a policy that runs exactly like a direct VLA at inference time yet benefits from predictive information about dynamics and task progress. If the alignment succeeds, the approach delivers the predictive power of world models while avoiding the cost of generating or processing future video frames. Experiments across simulation suites and real robot tasks indicate the method reaches state-of-the-art or comparable success rates.

Core claim

Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface. A deployable prior branch infers latent states from the current context alone, while a training-only posterior branch replaces the queries with embeddings computed from future observations. Joint alignment of the two branches in latent space causes the prior branch to internalize future-aware, action-useful structure. At test time the posterior branch is removed entirely and no visual rollout is performed, yielding a policy that combines the benefits of world models with the efficiency of direct VLA policies.
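
To make the training-versus-inference asymmetry concrete, here is a minimal PyTorch sketch of the dual-branch idea over a generic transformer encoder. Everything here is an illustrative assumption: the class name, dimensions, and an MSE alignment loss with a stop-gradient stand in for the paper's actual MoT sequence packing, hidden-state alignment, and lightweight regularization.

    # Hedged sketch of the dual-branch latent world-action idea.
    # Names, sizes, and the MSE alignment target are illustrative
    # assumptions, not the paper's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentWorldActionSketch(nn.Module):
        def __init__(self, d_model=512, n_queries=16, action_dim=7):
            super().__init__()
            # Learnable latent queries inserted between perception and action.
            self.latent_queries = nn.Parameter(torch.randn(n_queries, d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            self.future_proj = nn.Linear(d_model, d_model)  # future obs -> query slots
            self.action_head = nn.Linear(d_model, action_dim)

        def forward(self, ctx_tokens, future_tokens=None):
            B, n = ctx_tokens.size(0), self.latent_queries.size(0)
            # Prior branch (deployable): current context + learnable queries.
            q = self.latent_queries.expand(B, -1, -1)
            prior = self.backbone(torch.cat([ctx_tokens, q], dim=1))[:, -n:]
            actions = self.action_head(prior)
            if future_tokens is None:       # inference: no posterior, no rollout
                return actions, None
            # Posterior branch (training only): queries replaced by embeddings
            # computed from future observations (assumes >= n future tokens).
            fq = self.future_proj(future_tokens)[:, :n]
            post = self.backbone(torch.cat([ctx_tokens, fq], dim=1))[:, -n:]
            # Alignment pulls prior latents toward the future-informed latents.
            align_loss = F.mse_loss(prior, post.detach())
            return actions, align_loss

At deployment the model is simply called with future_tokens=None, so the forward pass is identical to a direct VLA policy.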

What carries the argument

Learnable latent queries placed between perception and action, trained by joint alignment of a current-context prior branch and a future-observation posterior branch.

If this is right

  • Robot policies gain predictive information about contacts, dynamics, and task progress without incurring the runtime cost of pixel-space video generation.
  • The same architecture remains fully deployable as a direct VLA because the posterior branch and any visual rollout are discarded after training.
  • Sparse action supervision can be supplemented by latent-space future alignment instead of requiring dense future-frame prediction.
  • The method scales to both simulation benchmarks and diverse real-world egocentric video tasks while preserving inference speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same latent-query alignment could be inserted into other multimodal control pipelines where future context is available only at training time.
  • If the alignment generalizes across robot embodiments, the learned representations might transfer more readily than pixel-based world models.
  • Longer prediction horizons could be tested by extending the posterior branch to multiple future steps and measuring whether the prior branch continues to improve.
  • The approach reduces dependence on high-fidelity video synthesis, which may lower data and compute requirements for training generalist policies.

Load-bearing premise

That matching the prior branch's latent outputs to the posterior branch's future-derived embeddings will reliably embed genuine future dynamics and action utility rather than superficial training-distribution statistics.

What would settle it

An ablation that removes the posterior branch or the alignment loss and measures whether performance on contact-rich or long-horizon tasks falls to the level of a plain VLA baseline would directly test whether the dual-branch design supplies the claimed future-aware representations.
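
A hedged sketch of how that settling experiment could be wired, reusing the LatentWorldActionSketch from the earlier code sketch; the configuration names, batch keys, and the behavior-cloning loss are placeholders, not the paper's training recipe.

    # Hypothetical ablation grid for the experiment described above.
    import torch.nn.functional as F

    ABLATIONS = {
        "full":      dict(use_posterior=True,  align_weight=1.0),
        "no_align":  dict(use_posterior=True,  align_weight=0.0),  # posterior kept, loss off
        "plain_vla": dict(use_posterior=False, align_weight=0.0),  # prior branch only
    }

    def train_step(model, batch, cfg):
        # model: a LatentWorldActionSketch; batch keys are placeholders.
        actions, align_loss = model(
            batch["ctx_tokens"],
            batch["future_tokens"] if cfg["use_posterior"] else None,
        )
        loss = F.mse_loss(actions, batch["target_actions"])  # behavior cloning term
        if align_loss is not None:
            loss = loss + cfg["align_weight"] * align_loss
        return loss

If success rates for no_align and plain_vla collapse toward each other on contact-rich and long-horizon suites while full stays ahead, the dual-branch design is doing the work it is credited with.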

Figures

Figures reproduced from arXiv: 2605.00078 by Chaoyi Xu, Haiweng Xu, Hao Luo, Sipeng Zheng, Wanpeng Zhang, Yicheng Feng, Yuhui Fu, Ziheng Xi, Zongqing Lu.

Figure 1. Being-H0.7 at a glance. We build a Latent World-Action Model that differs from VLAs and WAMs. A latent reasoning space is introduced via a set of latent queries in the prior branch, and is further endowed with world modeling by the joint alignment with a future-aware posterior branch. Pretrained on large-scale egocentric videos, Being-H0.7 achieves strong performance across diverse robot tasks.

Figure 2. Latent reasoning and latent world-action model. Left: learnable latent queries are inserted to form a latent reasoning space that progressively organizes intermediate hidden states and guides action generation through propagation. Right: through joint alignment between the dual-branch design, the model learns to reason with future information at inference time, turning into a latent world-action model.

Figure 3. Being-H0.7 Architecture. We pack the prior and posterior branches into a single MoT sequence with shared context, where the two branches are optimized simultaneously. The posterior branch replaces latent queries with future embeddings, and the two branches are coupled by hidden-state alignment and lightweight regularization. A dual-branch attention mask is applied to isolate prior and posterior branches…

Figure 4. Overview of the real-world embodiments used in this evaluation.

Figure 5. Visual overview of the 12 real-robot evaluation tasks. The figure shows the task scenes used in our real-world evaluation across PND Adam-U, Unitree G1, and Franka FR3, covering the five ability-oriented suites. Motion Reasoning tasks emphasize trajectory anticipation, relative velocity, and contact timing. Long Horizon tasks stress subgoal memory and sequential consistency…

Figure 6. Suite-level real-robot success rates (%). Comparison of Being-H0.7, Being-H0.5, π0.5, and Fast-WAM on the five ability-oriented task suites. Each task is evaluated over 20 blind trials, and each suite score is averaged over all tasks carrying the corresponding suite tag.

Figure 7. Visualization of the Latent Reasoning.

Figure 8. Inference cost measured in the real-world deployment stack.
Original abstract

Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Being-H0.7, a latent world-action model for robot control that augments VLA policies with future-aware reasoning. It inserts learnable latent queries as a reasoning interface and trains them via a dual-branch setup: a deployable prior branch that processes only current observations and a training-only posterior branch that incorporates future observations. Joint alignment of the branches in latent space is claimed to produce action-useful representations of dynamics and task progress from current context alone, enabling inference without visual rollouts or pixel prediction. Experiments are reported to show state-of-the-art or comparable results on six simulation benchmarks plus diverse real-world tasks.

Significance. If the central mechanism holds, the work offers a practical middle ground between costly world-model rollouts and shortcut-prone direct VLAs, potentially improving efficiency and generalization in generalist robot policies. The absence of pixel-space prediction at inference is a clear deployability advantage over prior video-based approaches.

major comments (2)
  1. [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning. (A hedged sketch of one such contrastive objective follows the minor comments.)
  2. [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.
minor comments (2)
  1. [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labeling of the prior versus posterior paths and the alignment loss to make the inference-time deployment clearer.
  2. [Method section] Notation for the latent queries (e.g., how they are initialized and updated) should be defined consistently in the text and equations to avoid ambiguity when describing the joint training objective.
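
As a concrete illustration of the safeguard proposed in major comment 1, here is a minimal sketch of a contrastive future-discrimination loss: the prior latents must identify which future embedding in the batch belongs to their own trajectory. The function name, mean pooling, and temperature are illustrative assumptions, not anything specified by the paper.

    # Hedged sketch of contrastive future discrimination (InfoNCE).
    import torch
    import torch.nn.functional as F

    def future_infonce(prior_latents, future_latents, temperature=0.07):
        # prior_latents, future_latents: (B, n_queries, d), pooled to (B, d).
        z_p = F.normalize(prior_latents.mean(dim=1), dim=-1)
        z_f = F.normalize(future_latents.mean(dim=1), dim=-1)
        logits = z_p @ z_f.t() / temperature                    # (B, B) similarities
        targets = torch.arange(z_p.size(0), device=z_p.device)  # positives on diagonal
        # Symmetric InfoNCE: prior -> future and future -> prior.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

A loss of this shape cannot be minimized by copying marginal statistics, since every trajectory in the batch shares those marginals; only trajectory-specific predictive structure separates the positives from the negatives.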

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the dual-branch alignment and experimental presentation. We address each major comment below and have revised the manuscript to incorporate additional ablations and clarifications that strengthen the attribution of performance gains to the future-aware latent reasoning.

Point-by-point responses
  1. Referee: [Training procedure / dual-branch alignment] The dual-branch alignment (described in the training procedure) is load-bearing for the claim that the prior learns future-aware structure rather than superficial statistics. The manuscript provides no auxiliary losses (e.g., explicit dynamics or action prediction from the latent queries, or contrastive future discrimination) that would block the prior from simply copying marginal visual or action statistics present in the posterior embeddings. Without such safeguards or targeted ablations, the reported performance gains cannot be confidently attributed to causal reasoning.

    Authors: We appreciate the referee's point that the alignment must demonstrably induce future-aware structure rather than allow trivial copying of marginal statistics. The core mechanism relies on the posterior branch providing future-informed embeddings that the prior must match from current observations alone; this forces the latent queries to encode predictive, action-relevant dynamics because the alignment objective is computed in a shared latent space where superficial visual or action marginals alone cannot fully bridge the information gap. Nevertheless, we acknowledge that explicit auxiliary losses (such as contrastive future discrimination) could provide further safeguards. In the revised manuscript we have added a targeted ablation that removes the alignment loss entirely while retaining the latent queries, showing a clear performance drop across benchmarks. This result, together with the updated description in Section 3, supports that the gains arise from the future-aware alignment rather than marginal copying. revision: yes

  2. Referee: [Experiments and results tables] Performance claims (six simulation benchmarks and real-world tasks) rest on quantitative results that are not accompanied by ablations isolating the contribution of the latent alignment versus a standard VLA baseline. Tables reporting success rates or returns should include a direct comparison with the posterior branch removed or with the alignment loss ablated; the current presentation leaves open whether gains arise from the future-informed design or from other implementation choices.

    Authors: We agree that isolating the contribution of the dual-branch alignment is essential for rigorous validation. The revised manuscript now includes updated result tables with two new ablations: (1) training only the prior branch without any posterior or alignment (equivalent to a standard VLA with latent queries but no future information), and (2) full model with the alignment loss removed. These variants are reported alongside the original baselines on all six simulation benchmarks and the real-world tasks. The results show consistent degradation when the alignment is ablated, directly attributing the reported gains to the future-informed design rather than other implementation details. revision: yes

Circularity Check

0 steps flagged

No circularity: dual-branch alignment is an explicit training objective, not a definitional reduction

Full rationale

The paper describes a concrete training procedure: learnable latent queries are aligned between a prior branch (current observations only) and a posterior branch (future observations) via a joint alignment loss. At inference the posterior is discarded. This is a standard auxiliary-supervision setup and does not reduce any claimed result to its own inputs by construction, nor does it rely on a fitted parameter renamed as a prediction or on a self-citation chain for its justification. The assertion that the alignment produces future-aware representations is presented as an empirical outcome verified on benchmarks rather than a tautology. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that latent-space alignment between current-context and future-observation branches will induce useful predictive structure; no additional free parameters, axioms, or invented entities beyond standard neural-network training are introduced in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1200 out tokens · 19761 ms · 2026-05-09T20:43:33.103786+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  2. Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...

  3. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

Reference graph

Works this paper leans on

123 extracted references · 89 canonical work pages · cited by 3 Pith papers · 40 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  3. [3]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, J Bjorck, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    Being-h0: vision-language-action pretraining from large-scale human videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025

  7. [7]

    Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

  8. [8]

    Advancing open-source world models

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models. arXiv preprint arXiv:26...

  9. [9]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  10. [10]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  11. [11]

    Worldplay: Towards long-term geometric consistency for real-time interactive world modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  12. [12]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  13. [13]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  14. [14]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  15. [15]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  16. [16]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  17. [17]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  18. [18]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025

  19. [19]

    Robot learning of shifting objects for grasping in cluttered environments

    Lars Berscheid, Pascal Meißner, and Torsten Kröger. Robot learning of shifting objects for grasping in cluttered environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 612–618. IEEE, 2019

  20. [20]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  21. [21]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595, 2023

  22. [22]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

  23. [23]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  24. [24]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  25. [26]

    Rethinking visual-language-action model scaling: Alignment, mixture, and regularization

    Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, et al. Rethinking visual-language-action model scaling: Alignment, mixture, and regularization. arXiv preprint arXiv:2602.09722, 2026

  26. [27]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

  27. [28]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025

  28. [29]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  29. [30]

    From pixels to tokens: Byte-pair encoding on quantized visual modalities

    Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu. From pixels to tokens: Byte-pair encoding on quantized visual modalities. In The Thirteenth International Conference on Learning Representations, 2025

  30. [31]

    Unified multimodal understanding via byte-pair visual encoding

    Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, and Zongqing Lu. Unified multimodal understanding via byte-pair visual encoding. arXiv preprint arXiv:2506.23639, 2025

  31. [32]

    OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data

    Luo Hao, Yue Zihao, Zhang Wanpeng, Feng Yicheng, Zheng Sipeng, Ye Deheng, and Lu Zongqing. OpenMMEgo: Enhancing egocentric understanding for LMMs with open weights and data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  32. [33]

    Videoorion: Tokenizing object dynamics in videos

    Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Hao Luo, Zihao Yue, and Zongqing Lu. Videoorion: Tokenizing object dynamics in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20401–20412, 2025

  33. [34]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  34. [35]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  35. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  36. [37]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  37. [38]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  38. [39]

    Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025

  39. [40]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  40. [41]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

  41. [42]

    Dexgraspvla: A vision-language-action framework towards general dexterous grasping

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping. arXiv preprint arXiv:2502.20900, 2025

  42. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  43. [44]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  44. [45]

    Onetwovla: A unified vision-language-action model with adaptive reasoning

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

  45. [46]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729, 2025

  46. [47]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  47. [48]

    Mobile robot manipulation using pure object detection

    Brent Griffin. Mobile robot manipulation using pure object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 561–571, 2023

  48. [49]

    Curl: Contrastive unsupervised representations for reinforcement learning

    Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020

  49. [50]

    Using geometry to detect grasp poses in 3d point clouds

    Andreas Ten Pas and Robert Platt. Using geometry to detect grasp poses in 3d point clouds. In Robotics Research: Volume 1, pages 307–324. Springer, 2017

  50. [51]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

  51. [52]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  52. [53]

    mimic-video: Video-action models for generalizable robot control beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas. arXiv preprint arXiv:2512.15692, 2025

  53. [54]

    Vidar: Embodied video diffusion model for generalist manipulation

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025

  54. [55]

    Genie Envisioner: A unified world foundation platform for robotic manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025

  55. [56]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025

  56. [57]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025

  57. [58]

    Video generators are robot policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025

  58. [59]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  59. [60]

    Videovla: Video generators can be generalizable robot manipulators

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025

  60. [61]

    Learning video-conditioned policy on unlabelled data with joint embedding predictive transformer

    Hao Luo and Zongqing Lu. Learning video-conditioned policy on unlabelled data with joint embedding predictive transformer. In International Conference on Learning Representations, 2025

  61. [62]

    Act-jepa: Novel joint-embedding predictive architecture for efficient policy representation learning

    Aleksandar Vujinovic and Aleksandar Kovacevic. Act-jepa: Novel joint-embedding predictive architecture for efficient policy representation learning. arXiv preprint arXiv:2501.14622, 2025

  62. [63]

    FLARE: Robot learning with implicit world modeling

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loïc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling. In Annual Conference on Robot Lear...

  63. [64]

    Vla-jepa: Enhancing vision-language-action model with latent world model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  64. [65]

    DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. In Annual Conference on Neural Information Processing Systems, 2025

  65. [66]

    Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model

    Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, et al. Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026

  66. [67]

    Frappe: Infusing world modeling into generalist policies via multiple future representation alignment

    Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. Frappe: Infusing world modeling into generalist policies via multiple future representation alignment. arXiv preprint arXiv:2602.17259, 2026

  67. [68]

    Conservative offline robot policy learning via posterior-transition reweighting

    Wanpeng Zhang, Hao Luo, Sipeng Zheng, Yicheng Feng, Haiweng Xu, Ziheng Xi, Chaoyi Xu, Haoqi Yuan, and Zongqing Lu. Conservative offline robot policy learning via posterior-transition reweighting. arXiv preprint arXiv:2603.16542, 2026

  68. [69]

    Joint-aligned latent action: Towards scalable vla pretraining in the wild

    Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu. Joint-aligned latent action: Towards scalable vla pretraining in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  69. [70]

    World Guidance: World modeling in condition space for action generation

    Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World guidance: World modeling in condition space for action generation. arXiv preprint arXiv:2602.22010, 2026

  70. [71]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  71. [72]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation

    Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, and Shuran Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint arXiv:2505.21864, 2025

  72. [73]

    Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers

    Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024

  73. [74]

    10kh-realomin-opendata

    Gen Robot. 10kh-realomin-opendata, 2025

  74. [75]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019

  75. [76]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023

  76. [77]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  77. [78]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  78. [79]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018

  79. [80]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025

  80. [81]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning (CoRL), 2022

Showing first 80 references.