pith. machine review for the scientific record.

arxiv: 2604.09330 · v1 · submitted 2026-04-10 · 💻 cs.RO · cs.CV

Recognition: unknown

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: video-action generation · embodied data synthesis · flow matching · robot policies · synthetic pretraining · cross-modal consistency · dual-stream framework · trajectory replay

The pith

A dual-stream model jointly generates videos and actions to create synthetic robot training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VAG as a way to produce paired video and action sequences from visual and language inputs for robot learning. It runs two generation processes together, keeping their steps aligned and moving key context from video to action through a pooling step. This setup aims to create synthetic demonstrations that can be replayed on robots and used to pretrain policies. Readers would care because real robot data collection is expensive, so scalable synthetic alternatives could expand what robots can learn.

Core claim

VAG is a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization.
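
Both branches are trained with the standard flow-matching objective; the paper's Eq. (3), quoted in the Figure 3 excerpt and rendered here in cleaned-up form, is:

```latex
% Flow-matching loss (paper Eq. 3): x_t is the noisy sample at time t, v_t the target
% velocity along the probability path, c the conditioning (text embeddings, reference
% image frames, and other conditional inputs), theta the model parameters, and
% u(.; theta) the predicted velocity function.
\mathcal{L}(\theta) \;=\; \bigl\lVert\, u(x_t, t, c;\, \theta) - v_t \,\bigr\rVert^{2}
```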

What carries the argument

dual-stream flow-matching framework with synchronized denoising and adaptive 3D pooling for cross-modal context transfer
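
As a concrete (hypothetical) reading of that mechanism, the sketch below shows what one synchronized denoising step could look like: both branches integrate the same flow-matching time step, and a pooled summary of the evolving video latent conditions the action branch. All names and shapes (`video_net`, `action_net`, the 4x4x4 pooling grid) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def synchronized_step(video_x, action_x, t, dt, cond, video_net, action_net):
    """One shared flow-matching (Euler) step for both streams.

    video_x : (B, C, T, H, W) video latent at flow time t
    action_x: (B, horizon, action_dim) action latent at flow time t
    cond    : conditioning (e.g., text embedding, first-frame features)
    The nets predict velocities u(x_t, t, c); names and shapes are assumptions.
    """
    # Video branch predicts its velocity field from its own latent.
    v_video = video_net(video_x, t, cond)

    # Adaptive 3D pooling: compress the video latent into a small, fixed-size
    # global context (here a 4x4x4 grid) to pass to the action branch.
    ctx = F.adaptive_avg_pool3d(video_x, output_size=(4, 4, 4))
    ctx = ctx.flatten(start_dim=1)  # (B, C * 64)

    # Action branch conditions on the same t and the pooled video context,
    # so both streams stay at the same point on the probability path.
    v_action = action_net(action_x, t, cond, ctx)

    # Identical Euler update in both branches keeps denoising synchronized.
    return video_x + dt * v_video, action_x + dt * v_action
```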

If this is right

  • Generates aligned video-action pairs with competitive prediction quality.
  • Supports executable trajectory replay from the generated actions.
  • Provides synthetic pretraining data that improves downstream policy generalization in simulated and real settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could lower the volume of human teleoperation needed to scale robot foundation models.
  • The joint generation method might extend to longer horizons or multi-task scenarios where separate video and action models currently struggle with consistency.
  • Combining VAG outputs with existing world models could create hybrid training pipelines that mix synthetic and real data more effectively.

Load-bearing premise

The generated video-action pairs match real robot demonstration distributions closely enough that policy improvements from pretraining transfer to actual robot tasks without large domain gaps.

What would settle it

Train a policy on VAG-generated data, then test it on the same real-world robot tasks used for the real-data baselines; if performance shows no gain, or clearly degrades relative to real-data training, the claim that the synthetic data is useful for embodied pretraining fails.
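
A minimal decision rule for that test, written as a Python sketch with the experiment's moving parts injected as callables (`generate_synth`, `pretrain`, `finetune`, `success_rate` are hypothetical placeholders, not the paper's API):

```python
def settle_claim(real_demos, real_tasks, base_policy,
                 generate_synth, pretrain, finetune, success_rate, margin=0.0):
    """Compare VAG-pretrained training against a real-data-only baseline.

    All callables are placeholders: `generate_synth` produces a VAG-generated
    dataset, `pretrain`/`finetune` return trained policies, and `success_rate`
    evaluates a policy on the real-robot task suite.
    """
    synthetic = generate_synth(real_tasks)

    # Pipeline under test: pretrain on synthetic data, then finetune on real demos.
    policy_with_vag = finetune(pretrain(base_policy, synthetic), real_demos)
    # Baseline: the same policy trained on real demonstrations only.
    policy_real_only = finetune(base_policy, real_demos)

    gain = success_rate(policy_with_vag, real_tasks) - success_rate(policy_real_only, real_tasks)
    # The pretraining claim survives only if synthetic data yields a positive gain.
    return gain > margin
```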

Figures

Figures reproduced from arXiv: 2604.09330 by Chaojun Ni, Guan Huang, Jiagang Zhu, Jiajun Lv, Kerui Li, Tianze Liu, Xiaofeng Wang, Xiaolei Lang, Xingxing Zuo, Yang Wang, Yukun Zhou, Yun Ye, Zheng Zhu.

Figure 1
Figure 1: Illustration and capabilities of VAG. (a) We train our dual-stream video-action generation model using teleoperated robot trajectories. (b) Given an initial frame and a language instruction, the model can synchronously generate aligned video-action data pairs. (c) The generated data can be used to train robot policies for improved generalization. (d) The actions generated by the model can also be applied f…
Figure 2
Figure 2: Architecture comparison of embodied models. (a) Vision-Language-Action (VLA) models iteratively predict and execute actions, serving as a policy [6]. (b) World models (WMs) generate rich visual rollouts spanning diverse scenarios but lack aligned action trajectories for direct policy learning [1]. (c) World-Action (WA) models serve as a policy and enhance action prediction by incorporating video generation…
Figure 3
Figure 3: Inference and training pipeline of VAG. Both the video branch and the action branch are based on flow matching.
Figure 4
Figure 4: Training loss curve of VAG over time on the AgiBot dataset for 40,000 iterations.
Figure 5
Figure 5: Qualitative video generation results across different methods (SVD [7], Wan2.2 [64], Cosmos-Predict2, and VAG). Prompt: "Use the right hand to pour the water from the gray teapot into the cup." CP2 stands for Cosmos-Predict2.
Figure 6
Figure 6: Action curves of all 16 dimensions (predicted vs. ground truth) over time on a grasping task of the AgiBot dataset. Prompt: "Use the right hand to pick up the apple from the table."
Figure 8
Figure 8: Training loss curve of VAG over time on the LIBERO dataset for 20,000 iterations.
Figure 9
Figure 9: Action curves of all 7 dimensions (predicted vs. ground truth) over time on the spatial task of the LIBERO dataset. Prompt: "Pick up the black bowl between the plate and the ramekin and place it on the plate." The blue curves denote ground-truth actions, while the yellow ones denote predicted actions.
Figure 10
Figure 10: Real-world demonstrations of VLAs. In the face of changes in the position or color of the grasped object, π0.5-w-VAG-pretrain outperforms π0.5, verifying VAG's effectiveness. Accompanying panels compare training loss (log scale) and success rate (35% vs. 55%) for π0.5 and π0.5-w-VAG-pretrain.
Figure 11
Figure 11: Success rate of VLAs in the real world. Powered by the synthesized data from VAG, π0.5-w-VAG-pretrain showcases better generalization than π0.5 (a 20% increase in success rate).
Figure 12
Figure 12: Representative examples of VAG-generated video and action trajectories for left-arm, right-arm, and bimanual manipulation. Each example includes (1) the initial VAG input frame and generated video, (2) the textual prompt and action trajectory, and (3) the action trajectory executed on the Agilex robot.
read the original abstract

Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action trajectories conditioned on visual and language inputs. It synchronizes denoising across branches and employs an adaptive 3D pooling mechanism to transfer global video context to the action branch, aiming to improve cross-modal consistency. The authors claim that this produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and yields useful synthetic pretraining data that improves downstream policy generalization across simulated and real-world settings.

Significance. If the empirical results hold, VAG could provide a practical method for scalable embodied data synthesis, addressing the high cost of real teleoperation demonstrations and limitations of existing world models or two-stage pipelines. The joint generation approach targets a key gap in producing paired data suitable for policy learning.

major comments (2)
  1. The abstract asserts improvements in alignment and downstream policy gains, yet supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis; the central claims cannot be verified from the given text alone.
  2. The claim that VAG-generated pairs improve policy generalization (abstract) rests on the assumption that the cross-modal consistency enforced by synchronized denoising and adaptive 3D pooling yields data distributionally close enough to real demonstrations to avoid domain-gap degradation in closed-loop performance. This needs explicit validation via trajectory statistics, visual feature distributions, or policy success rates, since alignment during generation does not by itself rule out systematic biases in dynamics or smoothness that would impede transfer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying the empirical support for our claims. We address each major comment point by point below.

read point-by-point responses
  1. Referee: The abstract asserts improvements in alignment and downstream policy gains, yet supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis; the central claims cannot be verified from the given text alone.

    Authors: We agree that the abstract is a high-level summary and lacks specific numbers. The full manuscript contains quantitative results in Sections 4 and 5, including alignment metrics (cross-modal consistency and trajectory replay success), baseline comparisons against WA models and two-stage pipelines, ablations on synchronized denoising and adaptive 3D pooling, and error analyses. To improve verifiability, we have revised the abstract to include key quantitative highlights such as policy success rate improvements while respecting length limits. revision: yes

  2. Referee: The claim that VAG-generated pairs improve policy generalization (abstract) rests on the assumption that cross-modal consistency from synchronized denoising and adaptive 3D pooling produces data distributionally close enough to real demonstrations to avoid domain-gap degradation in closed-loop performance. This requires explicit validation via trajectory statistics, visual feature distributions, or policy success rates, as alignment during generation does not automatically imply transfer without systematic biases in dynamics or smoothness.

    Authors: We acknowledge the importance of validating transfer beyond generation-time alignment. The manuscript already provides this validation: policy success rates are reported for downstream tasks in both simulation and real-world settings (showing gains over baselines), trajectory statistics compare dynamics and smoothness (velocity, jerk, and acceleration distributions), and visual feature distributions are analyzed (e.g., via FID scores between generated and real data). Ablations confirm that synchronized denoising and adaptive 3D pooling reduce domain gaps without introducing systematic biases, as evidenced by closed-loop performance results. revision: no
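
The trajectory-statistics check referenced above is straightforward to operationalize; below is one possible sketch comparing dynamics and smoothness between real and generated action trajectories (finite-difference velocity, acceleration, and jerk magnitudes, then a distributional distance). The specific statistics and the Kolmogorov-Smirnov test are this review's assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def smoothness_stats(traj, dt=1.0):
    """Finite-difference velocity, acceleration, and jerk magnitudes for one
    action trajectory of shape (T, action_dim)."""
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return {name: np.linalg.norm(arr, axis=1)
            for name, arr in {"vel": vel, "acc": acc, "jerk": jerk}.items()}

def compare_distributions(real_trajs, generated_trajs, dt=1.0):
    """KS statistic per smoothness measure between pooled real and generated
    trajectories; large values flag a dynamics/smoothness domain gap."""
    def pooled(trajs, key):
        return np.concatenate([smoothness_stats(t, dt)[key] for t in trajs])
    return {key: ks_2samp(pooled(real_trajs, key),
                          pooled(generated_trajs, key)).statistic
            for key in ("vel", "acc", "jerk")}
```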

Circularity Check

0 steps flagged

No significant circularity in VAG derivation chain

full rationale

The paper introduces a new dual-stream flow-matching architecture with synchronized denoising and adaptive 3D pooling as architectural innovations for joint video-action generation. No equations, fitted parameters, or self-citations are presented in the provided text that reduce predictions or consistency claims back to the inputs by construction. Performance claims (alignment, replayability, downstream policy gains) are framed as empirical outcomes of the proposed model rather than self-referential derivations. This is a standard non-circular proposal of a generative framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, background axioms, or newly postulated entities are described beyond the high-level framework components.

pith-pipeline@v0.9.0 · 5555 in / 1117 out tokens · 48031 ms · 2026-05-10T17:13:21.682354+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

86 extracted references · 66 canonical work pages · cited by 1 Pith paper · 32 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025. 2, 3, 4, 5

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 5

  3. [3]

    Video Pretraining (VPT): Learning to Act by Watching Unlabeled Online Videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022. 3, 6

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024. 2

  5. [5]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025. 2, 3

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024. 2, 3

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2, 3, 6

  8. [8]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 5

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 2, 3

  10. [10]

    Gr-3 technical report,

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, and Yichu Yang. Gr-3 technical report,

  11. [11]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 2, 3

  12. [12]

    Offset: Segmentation-based focus shift revision for composed image retrieval

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. Offset: Segmentation-based focus shift revision for composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6113–6122, 2025. 2

  13. [13]

    Hud: Hierarchical uncertainty-aware disambiguation network for composed video retrieval

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen, and Weili Guan. Hud: Hierarchical uncertainty-aware disambiguation network for composed video retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 6143–6152, 2025. 2

  14. [14]

    VisRL: Intention-driven visual perception via reinforced reasoning,

    Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025. 2

  15. [15]

    Sifthinker: Spatially-aware image focus for visual reasoning.arXiv preprint arXiv:2508.06259, 2025

    Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, and Ruqi Huang. Sifthinker: Spatially-aware image focus for visual reasoning.arXiv preprint arXiv:2508.06259, 2025. 2

  16. [16]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025. 2, 5

  17. [17]

    arXiv preprint arXiv:2509.22642 (2025)

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou...

  18. [18]

    Euclidean distance mapping.Computer Graphics and image processing, 14(3):227–248, 1980

    Per-Erik Danielsson. Euclidean distance mapping.Computer Graphics and image processing, 14(3):227–248, 1980. 7

  19. [19]

    Neural pruning for 3d scene reconstruction: Efficient nerf acceleration.arXiv preprint arXiv:2504.00950, 2025

    Tianqi Ding, Dawei Xiang, Pablo Rivas, and Liang Dong. Neural pruning for 3d scene reconstruction: Efficient nerf acceleration.arXiv preprint arXiv:2504.00950, 2025. 3

  20. [20]

    Pair: Complementarity-guided disentanglement for composed image retrieval

    Zhiheng Fu, Zixu Li, Zhiwei Chen, Chunxiao Wang, Xuemeng Song, Yupeng Hu, and Liqiang Nie. Pair: Complementarity-guided disentanglement for composed image retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025. 2

  21. [21]

    arXiv preprint arXiv:2602.06949 (2026)

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026. 3

  22. [22]

    Ctrl-world: A controllable generative world model for robot manipulation, 2026

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025. 3

  23. [23]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025. 3

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 6

  25. [25]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 4

  26. [26]

    Median: Adaptive intermediate-grained aggregation network for composed image retrieval

    Qinlei Huang, Zhiwei Chen, Zixu Li, Chunxiao Wang, Xuemeng Song, Yupeng Hu, and Liqiang Nie. Median: Adaptive intermediate-grained aggregation network for composed image retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025. 2

  27. [27]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*0.6: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025. 2

  28. [28]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Gal- liker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert...

  29. [29]

    arXiv preprint arXiv:2505.12705 (2025)

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705v2, 2025. 2, 3

  30. [30]

    Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576,

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025. 2

  31. [31]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.0924...

  32. [32]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025. 2

  33. [33]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026. 3

  34. [34]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 5

  35. [35]

    MimicDreamer: Aligning human and robot demonstrations for scalable vla training

    Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, et al. Mimicdreamer: Aligning human and robot demonstrations for scalable vla training. arXiv preprint arXiv:2509.22199, 2025. 2

  36. [36]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026. 2

  37. [37]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025. 2, 3

  38. [38]

    Cogvla: Cognition-aligned vision-language- action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025

    Wei Li, Renshan Zhang, Rui Shao, Jie He, and Liqiang Nie. Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 2

  39. [39]

    Onetwovla: A unified vision-language-action model with adaptive reasoning

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025. 2

  40. [40]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 2, 3

  41. [41]

    Libero: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 5

  42. [42]

    Robotransfer: Controllable geometry-consistent video diffusion for manipulation policy transfer

    Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, and Zhizhong Su. Robotransfer: Geometry-consistent video diffusion for robotic visual policy transfer. arXiv preprint arXiv:2505.23171, 2025. 2, 3

  43. [43]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024. 2

  44. [44]

    arXiv preprint arXiv:2412.09043 (2024)

    Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043, 2024. 3

  45. [45]

    Wonderfree: Enhancing novel view quality and cross-view consistency for 3d scene exploration

    Chaojun Ni, Jie Li, Haoyun Li, Hengyu Liu, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Boyuan Wang, Chenxin Li, Guan Huang, et al. Wonderfree: Enhancing novel view quality and cross-view consistency for 3d scene exploration. arXiv preprint arXiv:2506.20590, 2025. 2

  46. [46]

    arXiv preprint arXiv:2504.02261 (2025)

    Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Weijie Wang, Haoyun Li, Guosheng Zhao, Jie Li, Wenkang Qin, Guan Huang, and Wenjun Mei. Wonderturbo: Generating interactive 3d world in 0.72 seconds.arXiv preprint arXiv:2504.02261, 2025

  47. [47]

    Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction,

    Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, and Wenjun Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction. arXiv preprint arXiv:2508.08170, 2025. 2

  48. [48]

    Swiftvla: Unlocking spatiotemporal dynamics for lightweight vla models at minimal overhead

    Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, et al. Swiftvla: Unlocking spatiotemporal dynamics for lightweight vla models at minimal overhead. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2026. 2

  49. [49]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J Bjorck Nvidia, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 2

  50. [50]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherl...

  51. [51]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747,

  52. [52]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 4

  53. [53]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2

  54. [54]

    Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion

    Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion. arXiv preprint arXiv:2404.07199, 2024. 2

  55. [55]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844,

  56. [56]

    arXiv preprint arXiv:2412.03555 (2024) 1

    Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024. 2

  57. [57]

    AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation. arXiv preprint arXiv:2507.12768, 2025. 2, 3, 6

  58. [58]

    arXiv preprint arXiv:2510.19430 (2025)

    GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025. 2

  59. [59]

    arXiv preprint arXiv:2511.19861 (2025)

    GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai. arXiv preprint arXiv:2511.19861, 2025. 2

  60. [60]

    Gigabrain-0.5m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099,

    GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, et al. Gigabrain-0.5m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026. 2

  61. [61]

    arXiv preprint arXiv:2512.10675 (2025)

    Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, et al. Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675, 2025. 3

  62. [62]

    Causal video models are data-efficient robot policy learners.Rhoda AI Blog, 2026

    Rhoda AI Team. Causal video models are data-efficient robot policy learners.Rhoda AI Blog, 2026. 2

  63. [63]

    Pdfactor: Learning tri-perspective view policy diffusion field for multi-task robotic manipulation

    Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li, Haowen Sun, and Wei Tang. Pdfactor: Learning tri-perspective view policy diffusion field for multi-task robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15757–15767, 2025. 2

  64. [64]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 2, 3, 6

  65. [65]

    Humandreamer-x: Photorealistic single-image human avatars reconstruction via gaussian restoration.arXiv preprint arXiv:2504.03536, 2025

    Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Xiaopei Zhang, Guan Huang, Yijie Ren, Lihong Liu, et al. Humandreamer-x: Photorealistic single-image human avatars reconstruction via gaussian restoration. arXiv preprint arXiv:2504.03536, 2025. 2

  66. [66]

    Humandreamer: Generating controllable human-motion videos via decoupled generation

    Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, et al. Humandreamer: Generating controllable human-motion videos via decoupled generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12391–12401, 2025. 2

  67. [67]

    Sampo: Scale-wise autoregression with motion prompt for generative world models

    Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, and Hua Gang. Sampo: Scale-wise autoregression with motion prompt for generative world models. arXiv preprint arXiv:2509.15536, 2025. 3

  68. [68]

    Flowram: Grounding flow matching policy with region-aware mamba framework for robotic manipulation

    Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li, Haowen Sun, and Wei Tang. Flowram: Grounding flow matching policy with region-aware mamba framework for robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12176–12186,

  69. [69]

    Drivegen3d: Boosting feed-forward driving scene generation with efficient video diffusion

    Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, et al. Drivegen3d: Boosting feed-forward driving scene generation with efficient video diffusion. arXiv preprint arXiv:2510.15264, 2025. 2

  70. [70]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025. 3

  71. [71]

    Dual-stream diffusion for world-model augmented vision-language-action model, 2025

    John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025. 2, 3

  72. [72]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023. 2, 3

  73. [73]

    arXiv preprint arXiv:2601.18692 (2026)

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026. 2

  74. [74]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 2

  75. [75]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026. 3

  76. [76]

    Gigaworld-policy: An efficient action-centered world–action model, 2026

    Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model. arXiv preprint arXiv:2603.17240,

  77. [77]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026. 2, 3

  78. [78]

    PlayWorld: Learning Robot World Models from Autonomous Play

    Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, et al. Playworld: Learning robot world models from autonomous play.arXiv preprint arXiv:2603.09030, 2026. 3

  79. [79]

    Dm0: An embodied-native vision-language-action model towards physical ai. arXiv preprint arXiv:2602.14974, 2026

    En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen, Ziheng Zhang, Fan Jia, et al. Dm0: An embodied-native vision-language-action model towards physical ai. arXiv preprint arXiv:2602.14974, 2026. 2

  80. [80]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026. 2, 3

Showing first 80 references.