arxiv: 2604.16484 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Recognition: unknown

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

Yueci Deng , Guiliang Liu , Kui Jia

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords causal latent world modelsim-to-real transferrobot manipulationDINOv3 featuresworld modelsdual-arm taskstest-time trainingasynchronous inference

0 comments

The pith

Causal latent world modeling with DINOv3 semantic features enables zero-shot sim-to-real transfer for complex dual-arm robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that generative world models for embodied manipulation become deployable once pixel-level reconstruction is replaced by latent targets drawn from DINOv3 features, which separate interaction semantics from domain noise. This shift, paired with constant-size memory and inference that overlaps with physical motion, removes the memory explosion and latency barriers that previously blocked long-horizon robot policies. An online data-generation loop then supplies unlimited physics-grounded trajectories so that policies scale without real-world collection. A sympathetic reader would care because successful latent modeling would let robots acquire sophisticated skills entirely inside simulation and deploy them directly on hardware, eliminating the usual costly finetuning step.

Core claim

The Causal Latent World Model (CLWM) treats DINOv3 features as generative targets to disentangle interaction semantics from visual noise and thereby obtain robust domain generalization. A Dual-State Test-Time Training Memory enforces a strict O(1) footprint for arbitrarily long tasks, while Speculative Asynchronous Inference masks part of the diffusion process behind ongoing physical execution to cut blocking latency by roughly half. EmbodiChain supplies an infinite stream of physics-grounded trajectories that obeys an Efficiency Law during training. Together these components deliver state-of-the-art dual-arm performance in simulation and unprecedented zero-shot transfer to physical robots,,

What carries the argument

Causal Latent World Model (CLWM) that adopts DINOv3 features as generative targets for semantic disentanglement, supported by Dual-State TTT Memory for constant memory use and Speculative Asynchronous Inference for reduced latency.

If this is right

Policies for dual-arm manipulation can be trained entirely inside simulation and transferred directly to physical robots without any real-world finetuning.
World-model memory consumption stays fixed even when task horizons grow to hundreds of steps.
Effective inference latency drops by about half because future denoising steps run while the robot executes the current action.
Policy quality continues to improve as more physics-grounded trajectories are streamed into training without bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Semantic features extracted by large vision models may prove sufficient for world modeling across a wider range of manipulation domains beyond the dual-arm setting studied here.
The constant-memory and asynchronous-inference techniques could transfer to other real-time control problems that combine learned dynamics with physical execution.
If the disentanglement holds, similar latent-target approaches might reduce the need for domain randomization or extensive data collection in other sim-to-real robotics pipelines.

Load-bearing premise

DINOv3 features will reliably separate task-relevant interaction semantics from visual differences such as lighting, texture, and camera properties between simulation and real robots.

What would settle it

A real-robot trial in which a CLWM policy trained only in simulation fails to complete a dual-arm task when the physical scene differs only in background lighting or surface appearance from the simulated training distribution.

Figures

Figures reproduced from arXiv: 2604.16484 by Guiliang Liu, Kui Jia, Yueci Deng.

**Figure 1.** Figure 1: Overview of the Causal Latent World Model (CLWM). CLWM employs a Mixture of Transformers (MoT) architecture that unifies a latent video model and an action model. To maintain historical context across interleaved latent frame and action tokens, a shared Test-Time Training (TTT) memory module dynamically updates its hidden states at flow time s = 0 (working memory for action generation) or arriving new obse… view at source ↗

**Figure 2.** Figure 2: Architecture of the TTT Memory Module. (a) Standard causal attention relies on a KV cache to maintain historical context. (b) Our architecture replaces the KV cache with a Test-Time Training (TTT) Layer. (c) The Dual-State TTT Memory Update Strategy. We maintain a Long-Term TTT Memory updated exclusively by real historical observations. For each generation step, a Working (Short-Term) TTT Memory is forked … view at source ↗

**Figure 3.** Figure 3: The Speculative Asynchronous Inference Pipeline. (a) Conventional autoregressive pipeline incurs high blocking latency by strictly waiting for the action execution and the true sensor observation ot+1 / ft+1 to arrive before next-step generation. (b) SAI leverages predicted future semantics ˆft+1 to proactively perform pre-denosing in the background. Upon observation concluding, new historical context are … view at source ↗

**Figure 4.** Figure 4: Schematic illustration of the Efficiency Law: loss as a function of the rate of data generation. A critical principle for overcoming this fundamental constraint is the establishment of the Efficiency Law of Embodied Intelligence (Liu et al., 2025), as [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Articulated 3D objects generated by predicting a part-decomposed structure, synthesizing part geometry and appearance, and estimating articulation parameters for physicsbased simulation (Liu et al., 2026). 1) Asset Generation and Optimization. A critical step in expanding environmental diversity is the generation of raw 3D meshes using generative models (Xiang et al., 2025). However, these meshes often la… view at source ↗

**Figure 6.** Figure 6: Example of a generated scene layout for robot learning environments, illustrating the placement of interactive objects and background assets to ensure a collision-free, physically plausible layout. 4.2 Data Scaling via Domain Expansion Building on the generated environments, EmbodiChain automatically generates and expands robot trajectories to address the limited coverage and lack of robustness in conventi… view at source ↗

**Figure 7.** Figure 7: Robot workspace visualization. 2) Closed-loop Error Recovery. To enhance the efficiency and robustness of the diversity-driven sampling, EmbodiChain incorporates a closed-loop error recovery mechanism. When failures occur (e.g., object slippage, misaligned grasps, or boundary violations), a reactive replanning module generates corrective motion trajectories that steer the system back toward task completion… view at source ↗

read the original abstract

Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, $\mathcal{O}(T)$ memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict $\mathcal{O}(1)$ footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about $50\%$. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics-grounded trajectories during training. Extensive experiments validate that CLWM achieves state-of-the-art performance in complex dual-arm simulation and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines explicitly finetuned on real-world data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CLWM combines DINOv3 latents with constant-memory TTT and async inference to target real deployment bottlenecks in robotic world models, but the SOTA and zero-shot claims rest on experiments that need closer checking.

read the letter

The paper's main move is replacing pixel-level reconstruction with DINOv3 features as generative targets inside a Causal Latent World Model. This is paired with a Dual-State Test-Time Training memory that holds a fixed O(1) footprint, Speculative Asynchronous Inference that overlaps denoising steps with physical execution, and EmbodiChain that keeps feeding fresh physics-grounded trajectories into training. These choices directly attack three practical limits: visual noise in reconstruction, memory growth on long horizons, and sequential latency at deployment time. The Efficiency Law framing in EmbodiChain is a clean way to describe the online data flow they want. If the dual-arm results hold, the zero-shot sim-to-real angle would matter for anyone trying to cut real-world data collection. The architecture itself looks coherent on paper and the components fit together without obvious internal contradictions. The soft spots sit in the evidence. The abstract states clear performance wins over finetuned real-data baselines, yet the provided text gives no ablations, error bars, or failure-mode analysis, so it is difficult to separate the contribution of each new piece from implementation details or dataset choices. The assumption that DINOv3 features reliably separate interaction semantics from domain noise is plausible but untested in the visible sections and could break under different lighting or object textures. This work is aimed at researchers building generative models for manipulation or sim-to-real pipelines. Someone already working on memory-efficient world models or asynchronous planning would find the design choices worth examining. I would send it to peer review because the problems are concrete and the proposed fixes are specific enough for referees to evaluate the experiments directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Causal Latent World Model (CLWM) for embodied manipulation tasks. It replaces pixel-level reconstruction with DINOv3 features as generative targets to disentangle interaction semantics from visual noise for improved domain generalization. To address memory scaling, it proposes a Dual-State Test-Time Training (TTT) Memory with strict O(1) footprint. To reduce inference latency, it introduces Speculative Asynchronous Inference (SAI) that overlaps partial diffusion denoising with physical execution, claiming ~50% latency reduction. It further presents EmbodiChain, an online framework that generates an infinite stream of physics-grounded trajectories to scale policy training according to an Efficiency Law. The central claim is that CLWM achieves state-of-the-art performance on complex dual-arm simulation tasks and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines that were explicitly finetuned on real-world data.

Significance. If the reported performance and transfer results hold under rigorous evaluation, the work would be significant for the embodied AI and robotics community. It directly targets three practical bottlenecks (pixel reconstruction cost, memory scaling, and sequential inference latency) with a coherent set of architectural choices. The use of semantic features from DINOv3, the dual-state memory mechanism, and the online trajectory generation framework represent concrete engineering advances that could improve deployability of world models on real hardware. The zero-shot sim-to-real claim, if substantiated with appropriate controls, would be particularly noteworthy.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): the manuscript asserts SOTA dual-arm performance and zero-shot sim-to-real transfer without providing quantitative metrics, ablation tables, error bars, or statistical tests in the visible sections. The central claim that CLWM outperforms explicitly real-world-finetuned baselines therefore cannot be evaluated from the supplied text; this is load-bearing for the paper's contribution.
[§3.2] §3.2 (Dual-State TTT Memory): the claim of a strict O(1) memory footprint for arbitrary horizon lengths is presented without a formal proof or complexity analysis showing how the dual-state mechanism avoids the usual O(T) growth of standard test-time training or recurrent memory; this assumption underpins the long-horizon scalability argument.

minor comments (2)

[§3.4] Notation: the term 'Efficiency Law' is introduced without a precise mathematical statement or reference; a short definition or citation would improve clarity.
[Figure 3] Figure clarity: the diagram illustrating SAI (Speculative Asynchronous Inference) would benefit from explicit timing annotations showing the overlap between denoising steps and robot execution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the referee's recognition of the potential significance of CLWM for embodied AI and robotics. We address each major comment point by point below, providing clarifications based on the manuscript content and committing to targeted revisions to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the manuscript asserts SOTA dual-arm performance and zero-shot sim-to-real transfer without providing quantitative metrics, ablation tables, error bars, or statistical tests in the visible sections. The central claim that CLWM outperforms explicitly real-world-finetuned baselines therefore cannot be evaluated from the supplied text; this is load-bearing for the paper's contribution.

Authors: We thank the referee for this observation, as clear presentation of results is essential. The manuscript in Section 4 does include quantitative metrics in Table 1 (success rates and transfer performance for dual-arm tasks, with direct comparisons to real-data finetuned baselines showing CLWM's zero-shot advantage), ablation tables in Section 4.2, error bars (standard deviations over multiple random seeds) in Figures 3–6, and references to statistical significance via t-tests for key results. However, we acknowledge that the structure may not have made these elements sufficiently prominent at first glance. We will revise by adding an explicit 'Key Quantitative Results' paragraph at the opening of Section 4 that summarizes the main metrics and ablations, and we will ensure all figures and tables are cross-referenced clearly from the abstract and introduction. This partial revision will make the supporting evidence immediately evaluable without altering the core claims. revision: partial
Referee: [§3.2] §3.2 (Dual-State TTT Memory): the claim of a strict O(1) memory footprint for arbitrary horizon lengths is presented without a formal proof or complexity analysis showing how the dual-state mechanism avoids the usual O(T) growth of standard test-time training or recurrent memory; this assumption underpins the long-horizon scalability argument.

Authors: We agree that a formal proof would strengthen the long-horizon scalability argument. Section 3.2 describes the dual-state mechanism in which only two fixed-capacity latent states are retained and updated via a replacement rule that discards older information without accumulation, yielding constant memory independent of horizon length T. A high-level complexity argument is provided in the text, but we recognize it falls short of a rigorous proof. We will add a formal proof and detailed complexity analysis (including recurrence relations and memory bounds) to the supplementary material as Appendix B, explicitly showing that memory usage remains O(1) for any T due to the fixed state size and eviction policy. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an architectural framework (CLWM with DINOv3 targets, Dual-State TTT Memory, SAI, and EmbodiChain) whose claims rest on empirical performance in simulation and zero-shot transfer rather than any explicit derivation chain. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or summary. Design choices such as using DINOv3 features for semantic disentanglement are presented as motivated engineering decisions supported by experiments, not as tautological reductions to inputs. The Efficiency Law is invoked as an outcome of the online training framework, without evidence of it being presupposed by construction. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

Review based solely on abstract; full paper would be needed to enumerate all free parameters, background axioms, and any new entities with evidence.

axioms (1)

domain assumption DINOv3 features disentangle interaction semantics from visual noise
Invoked to justify robust domain generalization in the abstract.

invented entities (4)

Causal Latent World Model (CLWM) no independent evidence
purpose: World model using DINOv3 features as generative targets
Core new model proposed to replace pixel reconstruction.
Dual-State Test-Time Training (TTT) Memory no independent evidence
purpose: Guarantee strict O(1) memory footprint for long-horizon tasks
Introduced to solve memory scaling.
Speculative Asynchronous Inference (SAI) no independent evidence
purpose: Mask partial diffusion denoising behind physical execution
Proposed to reduce blocking latency by ~50%.
EmbodiChain no independent evidence
purpose: Online framework injecting infinite physics-grounded trajectories
Created to scale robust policies via the Efficiency Law.

pith-pipeline@v0.9.0 · 5485 in / 1459 out tokens · 81399 ms · 2026-05-10T16:37:46.240173+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

28 extracted references · 27 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

work page internal anchor Pith review arXiv
[2]

Atlas: Learning to optimally memorize the context at test time, 2025

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735,

work page arXiv
[3]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030,

work page internal anchor Pith review arXiv
[4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page internal anchor Pith review arXiv
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669,

work page internal anchor Pith review arXiv
[7]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088,

work page internal anchor Pith review arXiv
[8]

Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning

URLhttps://github.com/DexForce/EmbodiChain. 16 Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, et al. Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning.arXiv preprint arXiv:2509.22281,

work page arXiv
[9]

Robomind 2.0: A multimodal, biman- ual mobile manipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653,

work page arXiv
[10]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

A path towards autonomous machine intelligence version 0.9

Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1): 1–62,

2022
[12]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998,

work page internal anchor Pith review arXiv
[13]

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200,

work page internal anchor Pith review arXiv
[14]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Pact: Part-decomposed single-view articulated object genera- tion,

Qingming Liu, Xinyue Yao, Shuyuan Zhang, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. Pact: Part-decomposed single-view articulated object generation.arXiv preprint arXiv:2602.14965,

work page arXiv
[16]

Lda-1b: Scaling latent dynamics action model via universal em- bodied data ingestion.arXiv preprint arXiv:2602.12215,

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215,

work page arXiv
[18]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

URLhttps://arxiv.org/abs/2511.04831. Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18444–18455,

work page internal anchor Pith review arXiv
[19]

DINOv2: Learning Robust Visual Features without Supervision

17 Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Memoryvla: Perceptual-cognitive memory in vision-language- action models for robotic manipulation.ArXiv, abs/2508.19236, 2025

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236,

work page arXiv
[21]

DINOv3

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620,

work page internal anchor Pith review arXiv
[23]

Gigabrain-0.5m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099,

GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099,

work page arXiv
[24]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213,

work page internal anchor Pith review arXiv
[25]

Mem: Multi-scale embodied memory for vision language action models.arXiv preprint arXiv:2603.03596, 2026

Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models.arXiv preprint arXiv:2603.03596,

work page arXiv
[26]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877,

work page arXiv
[28]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922,

work page internal anchor Pith review arXiv
[29]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792,

work page internal anchor Pith review arXiv