pith. machine review for the scientific record.

arxiv: 2604.26848 · v2 · submitted 2026-04-29 · 💻 cs.RO

Recognition: unknown

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 11:22 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · world modeling · diffusion models · spatial-temporal prediction · action generation · bimanual tasks · geometry-aware attention · vision-language-action

The pith

STARRY couples future scene prediction directly to action outputs by jointly denoising spatial-temporal latents and controls in one diffusion process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action policies often separate predictive world models from action execution, leading to poor performance when robots must coordinate precise movements over time and space. STARRY addresses this by training a single diffusion model to denoise both future visual latents and actions together, while using Geometry-Aware Selective Attention Modulation to turn predicted depth and gripper geometry into weights that focus the action head on relevant image tokens. The method reports 93.82 percent average success across 50 bimanual tasks in clean simulation and lifts real-world success from 42.5 percent to 70.8 percent against a strong baseline. A reader should care because tighter integration of prediction and control could reduce the need for separate planners and make robots more reliable on tasks that currently require hand-crafted geometric reasoning.
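
The page reproduces none of the paper's equations or code, so the following is only a rough sketch of what turning predicted geometry into attention weights could look like in practice: back-project per-token depth to 3D points, weight each image token by its distance to the predicted end-effector position, and add those weights to the action head's cross-attention logits. Every name, shape, and function here is an assumption for illustration, not the authors' GASAM implementation.

# Hypothetical sketch of geometry-aware attention weighting (not the paper's code).
# Assumes token_xyz are per-token 3D positions back-projected from predicted depth.
import torch
import torch.nn.functional as F

def geometry_weights(token_xyz, ee_pos, temperature=0.1):
    # token_xyz: (B, N, 3) 3D position per image token; ee_pos: (B, 3) end-effector position.
    dist = torch.norm(token_xyz - ee_pos[:, None, :], dim=-1)  # (B, N) distance to gripper
    return torch.softmax(-dist / temperature, dim=-1)          # nearby tokens get large weight

def modulated_cross_attention(action_q, image_kv, weights):
    # Cross-attention from action queries to image tokens, with the geometry
    # weights added to the logits as a log-space bias.
    d = action_q.shape[-1]
    logits = action_q @ image_kv.transpose(-1, -2) / d ** 0.5  # (B, A, N)
    logits = logits + torch.log(weights + 1e-6)[:, None, :]    # bias toward nearby tokens
    attn = F.softmax(logits, dim=-1)
    return attn @ image_kv                                      # (B, A, d)

If the modulation matters as much as the ablation question below suggests, the interesting design choice is whether the geometry bias enters the logits (as sketched) or rescales attention outputs after the softmax.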

Core claim

STARRY shows that a unified diffusion process can simultaneously generate future spatial-temporal latents and corresponding actions, with Geometry-Aware Selective Attention Modulation converting predicted depth and end-effector geometry into token-specific attention weights that align 2D observations with 3D control commands. This produces policies that reach 93.82 percent and 93.30 percent average success under clean and randomized conditions on RoboTwin 2.0 across fifty bimanual tasks and raise real-world average success from 42.5 percent to 70.8 percent relative to a prior vision-language-action policy.
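
The page gives no formal objective, but in standard denoising-diffusion notation a unified loss of the kind this claim describes would plausibly take the form $\mathcal{L}(\theta) = \mathbb{E}_{t,\,(z_0, a_0),\,\epsilon}\big[\lVert \epsilon - \epsilon_\theta([z_t; a_t],\, c,\, t) \rVert_2^2\big]$, where $z$ are future spatial-temporal latents, $a$ is the action chunk, $c$ is the visual-language conditioning, and both streams share one noising schedule $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for $x \in \{z, a\}$. This is an illustrative reconstruction, not the paper's stated formulation.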

What carries the argument

A single diffusion process that jointly denoises future spatial-temporal latents and actions, combined with Geometry-Aware Selective Attention Modulation that turns predicted depth and end-effector geometry into token-aligned attention weights for action generation.
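
As a concrete (and hypothetical) rendering of that claim, a single training step might noise latents and actions with the same schedule and predict both noises with one network. The sketch below assumes matching feature dimensions so the two token streams can be concatenated, which may differ from the paper's actual architecture.

# Hypothetical training-step sketch for joint latent/action denoising (not the paper's code).
import torch
import torch.nn.functional as F

def joint_denoise_step(model, z0, a0, cond, alphas_bar):
    # z0: (B, Nz, D) clean future spatial-temporal latents; a0: (B, Na, D) clean action tokens.
    B = z0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (B,), device=z0.device)  # shared timestep
    ab = alphas_bar[t].view(B, 1, 1)
    x0 = torch.cat([z0, a0], dim=1)               # one token sequence for both streams
    eps = torch.randn_like(x0)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # same noising schedule for latents and actions
    eps_hat = model(xt, cond, t)                  # one network denoises both jointly
    return F.mse_loss(eps_hat, eps)               # single loss couples prediction and control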

If this is right

  • Bimanual tasks that require simultaneous spatial coordination become solvable without separate motion planners.
  • Policy performance remains high when visual conditions are randomized, suggesting robustness to real-world variation.
  • Simulation-to-real transfer improves because the same latent space serves both prediction and control.
  • Longer-horizon manipulation sequences can be executed with fewer compounding errors from decoupled prediction and action modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-denoising structure could be tested on single-arm or mobile manipulation benchmarks to check whether the gains generalize beyond bimanual settings.
  • If the geometry modulation proves critical, similar selective attention mechanisms might be added to other diffusion-based policies without retraining the entire world model.
  • Replacing the current visual encoder with a larger foundation model could further improve the quality of the predicted latents that feed into action generation.

Load-bearing premise

The assumption that jointly denoising future spatial-temporal latents and actions in one diffusion process plus geometry-aware attention modulation will reliably bridge 2D visual tokens to 3D metric control and produce the measured gains on spatially demanding tasks.

What would settle it

An ablation that removes the joint denoising step or the Geometry-Aware Selective Attention Modulation and measures whether average success on the fifty randomized bimanual tasks falls substantially below the reported 93.30 percent.
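
A minimal way to organize that test, with purely illustrative flag names, would be a small grid over the two components, re-running the randomized RoboTwin 2.0 suite for each configuration:

# Hypothetical ablation grid (flag names are illustrative, not from the paper).
ablations = [
    {"joint_denoising": True,  "gasam": True},   # full model, reported 93.30% randomized
    {"joint_denoising": False, "gasam": True},   # decouple prediction from action denoising
    {"joint_denoising": True,  "gasam": False},  # drop geometry-aware attention modulation
]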

Figures

Figures reproduced from arXiv: 2604.26848 by Bin Yu, Chi Harold Liu, Cong Huang, Hao Wu, Kai Chen, Yukun Shi, Yurun Jin, Yuxuan Tian.

Figure 1
Figure 1. Overview of STARRY. The policy contains four modules: Understanding Expert, Spatial-Temporal (ST) World Model, Geometry Expert, and Action Expert. The ST World Model predicts future spatial-temporal latent variables, while the Geometry Expert and GASAM generate geometry-aware weights to selectively modulate the action branch. view at source ↗
Figure 2
Figure 2. Construction of spatial-temporal inputs for the… view at source ↗
Figure 3
Figure 3. Geometry-aware weight construction and modulation in GASAM. Rows (a)–(c) show… view at source ↗
Figure 4
Figure 4. Real-world task settings. We evaluate three representative two-stage bimanual manipulation… view at source ↗
Figure 5
Figure 5. Qualitative comparison in real-world experiments. STARRY and… view at source ↗
Figure 6
Figure 6. Visualization of future spatial-temporal latents on RoboTwin 2.0. Each row corresponds to… view at source ↗
Figure 7
Figure 7. Visualization of future spatial-temporal latents in real-world manipulation. The decoded… view at source ↗
read the original abstract

Robotic manipulation requires reasoning about future spatial-temporal interactions and geometric constraints, yet existing Vision-Language-Action (VLA) policies often leave predictive representation weakly coupled with action execution, causing failures in tasks requiring precise spatial-temporal coordination. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction and action generation by jointly denoising future spatial-temporal latents and actions through a unified diffusion process. To bridge 2D visual tokens and 3D metric control, STARRY introduces Geometry-Aware Selective Attention Modulation (GASAM), which converts predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings across 50 bimanual tasks. Real-world experiments show that STARRY improves average success from 42.5% to 70.8% compared with $\pi_{0.5}$. These results demonstrate the effectiveness of action-centric spatial-temporal world modeling for spatially and temporally demanding robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes STARRY, a world-model-enhanced action-generation policy for robotic manipulation that aligns spatial-temporal prediction and action generation by jointly denoising future spatial-temporal latents and actions through a unified diffusion process. It introduces Geometry-Aware Selective Attention Modulation (GASAM) to bridge 2D visual tokens and 3D metric control by converting predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On the RoboTwin 2.0 benchmark, STARRY reports 93.82% / 93.30% average success under Clean and Randomized settings across 50 bimanual tasks, and real-world experiments show improvement from 42.5% to 70.8% average success compared with the π_{0.5} baseline.

Significance. If the empirical claims hold under detailed scrutiny, this work could meaningfully advance vision-language-action policies by strengthening the coupling between predictive world modeling and action execution, particularly for bimanual tasks with high spatial-temporal demands. GASAM offers a concrete mechanism for incorporating geometric information into attention, which may prove useful beyond the specific diffusion setup.

major comments (2)
  1. [Abstract] Abstract: The performance numbers (93.82%/93.30% on RoboTwin 2.0 and real-world lift from 42.5% to 70.8%) are stated without any accompanying experimental details, baselines, number of trials, error bars, statistical tests, or ablation studies. This information is load-bearing for the central claim that the unified diffusion process and GASAM produce the reported gains.
  2. [Proposed Approach] Method description: The joint denoising of future spatial-temporal latents and actions via a unified diffusion process, together with the conversion of depth and end-effector geometry into token-aligned weights in GASAM, is described only at a high level. No equations, pseudocode, or implementation specifics are supplied to show how 2D tokens are mapped to 3D metric control or how the modulation is applied during denoising.
minor comments (1)
  1. [Abstract] The baseline notation π_{0.5} is used without definition or citation in the abstract; it should be clarified in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to strengthen the abstract with additional experimental context and to provide more detailed equations and implementation specifics for the proposed approach.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The performance numbers (93.82%/93.30% on RoboTwin 2.0 and real-world lift from 42.5% to 70.8%) are stated without any accompanying experimental details, baselines, number of trials, error bars, statistical tests, or ablation studies. This information is load-bearing for the central claim that the unified diffusion process and GASAM produce the reported gains.

    Authors: We agree that the abstract would be improved by including more context on the experimental setup. In the revised version, we will expand the abstract to note the evaluation across 50 bimanual tasks on RoboTwin 2.0 under Clean and Randomized settings, the comparison against the π_{0.5} baseline, and that full results including error bars, statistical significance, and ablations appear in the Experiments section. This addresses the concern while respecting abstract length constraints. revision: yes

  2. Referee: [Proposed Approach] Method description: The joint denoising of future spatial-temporal latents and actions via a unified diffusion process, together with the conversion of depth and end-effector geometry into token-aligned weights in GASAM, is described only at a high level. No equations, pseudocode, or implementation specifics are supplied to show how 2D tokens are mapped to 3D metric control or how the modulation is applied during denoising.

    Authors: We acknowledge the description is currently high-level. We will revise the Proposed Approach section to include the full mathematical formulation of the unified diffusion objective for joint denoising of spatial-temporal latents and actions, along with pseudocode for GASAM. This will explicitly detail the mapping from predicted depth and end-effector geometry to token-aligned attention weights and how modulation is applied at each denoising step, improving reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The abstract describes STARRY as a world-model-enhanced policy using a unified diffusion process for joint denoising of spatial-temporal latents and actions, plus GASAM for bridging 2D tokens to 3D control, and reports empirical success rates (93.82%/93.30% on RoboTwin 2.0, real-world lift from 42.5% to 70.8%). No equations, fitted parameters presented as predictions, self-citations, or ansatzes are visible in the provided text. The performance claims are experimental outcomes rather than derivations that reduce to inputs by construction. The central design choices are presented as independent contributions leading to measured gains, with no load-bearing step that is self-definitional or renames a known result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract introduces GASAM and a unified diffusion process but states no explicit free parameters, axioms, or other background assumptions. The performance claims rest on the unverified effectiveness of these components.

invented entities (1)
  • Geometry-Aware Selective Attention Modulation (GASAM) no independent evidence
    purpose: Converts predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation to bridge 2D visual tokens and 3D metric control.
    New module presented in the abstract to address the 2D-to-3D bridging problem.

pith-pipeline@v0.9.0 · 5507 in / 1271 out tokens · 93027 ms · 2026-05-07T11:22:40.151804+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 10 internal anchors

  1. [1]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances. InProceedings of the Conference on Robot Learning (CoRL), 2022

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. InProceedings of the International Conference on Learning Representations (ICLR), 2024

  4. [4]

    Revisiting feature prediction for learning visual representations from video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In Transactions on Machine Learning Research (TMLR), 2024

  5. [5]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  6. [6]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In Proceedings of the 9th Conference on Robot Learning (CoRL), Proceedings of Machine Learning Research, 2025

  7. [7]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. InProceedings of Robotics: Science and Systems (RSS), 2025

  8. [8]

    Rt-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. InProceedings of Robotics: Science and Systems (RSS), 2023

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Jiahao Cen, Zhenyu Yu, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  10. [10]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalabl...

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2024

  12. [12]

    Wow: Towards a world-omniscient world model through embodied interaction

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, et al. Wow: Towards a world-omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

  13. [13]

    Vision language action models in robotic manipulation

    M. U. Din et al. Vision language action models in robotic manipulation. arXiv preprint arXiv:2507.10672, 2025

  14. [14]

    Embodiedmae: A unified 3d multi-modal representation for robot manipulation

    Zibin Dong, Fei Ni, Yifu Yuan, Yinchuan Li, and Jianye Hao. Embodiedmae: A unified 3d multi-modal representation for robot manipulation. arXiv preprint arXiv:2505.10105, 2025

  15. [15]

    Dreamdojo: A generalist robot world model from large-scale human videos

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026

  16. [16]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  17. [17]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  18. [18]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  19. [19]

    Internvla-a1: Unifying understanding, generation and action for robotic manipulation

    InternVLA-A1 Team. Internvla-a1: Unifying understanding, generation and action for robotic manipulation. arXiv preprint arXiv:2601.02456, 2026

  20. [20]

    Vima: General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. InProceedings of the International Conference on Machine Learning (ICML), 2023

  21. [21]

    Droid: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Yuke Zhu, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InProceedings of Robotics: Science and Systems (RSS), 2024

  22. [22]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  23. [23]

    A comprehensive survey on world models for embodied AI

    Xinqing Li, Xin He, Le Zhang, and Yun Liu. A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732, 2025

  24. [24]

    Flow matching for generative modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  25. [25]

    F1: A vision-language-action model bridging understanding and generation to actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions. arXiv preprint arXiv:2509.06951, 2025

  26. [26]

    Robomimic: A benchmark for robot learning from demonstration

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. Robomimic: A benchmark for robot learning from demonstration. In Proceedings of the Conference on Robot Learning (CoRL), 2021

  27. [27]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

  28. [28]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  29. [29]

    GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

    Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, and Li Jiang. Geopredict: Leveraging predictive kinematics and 3d gaussian geometry for precise vla manipulation. arXiv preprint arXiv:2512.16811, 2025

  30. [30]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  31. [31]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

  32. [32]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  33. [33]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, et al. Bridgedata v2: A dataset for robot learning at scale. InProceedings of the Conference on Robot Learning (CoRL), 2023

  34. [34]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  35. [35]

    Gigaworld-0: World models as data engine to empower embodied AI

    Angen Ye, Boyuan Wang, et al. Gigaworld-0: World models as data engine to empower embodied ai. arXiv preprint arXiv:2511.19861, 2025

  36. [36]

    HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

    Xumin Yu et al. Hy-embodied-0.5: Embodied foundation models for real-world agents.arXiv preprint arXiv:2604.07430, 2026

  37. [37]

    Up-vla: A unified understanding and prediction model for embodied agent

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent. InProceedings of the International Conference on Machine Learning (ICML), 2025

  38. [38]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    X. Zhang et al. Egodex: Learning dexterous manipulation from egocentric human demonstrations. arXiv preprint arXiv:2505.11709, 2025

  39. [39]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), 2023

  40. [40]

    Learning 4d embodied world models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Learning 4d embodied world models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  41. [41]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  42. [42]

    Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. InProceedings of Robotics: Science and Systems (RSS), 2025

  43. [43]

    Wmpo: World model-based policy optimization for vision-language-action models

    F. Zhu et al. Wmpo: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025

  44. [44]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of the Conference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023