
arxiv: 2606.19531 · v1 · pith:4WTWMSAYnew · submitted 2026-06-17 · 💻 cs.CV · cs.RO

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords world action models · image editing · robot action prediction · KV caches · flow matching · video generation alternative · visual world modeling

The pith

Image editing models can serve as world action models for robots by conditioning actions on denoising caches instead of generating videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that world action models do not need to generate full future videos to predict robot actions. A pretrained image editing model only has to model the transformation from the current image to a target frame, which the paper argues is a better match for action prediction than video generation. By using the key-value caches from the image editing denoising process to condition a flow-matching action expert, the model avoids decoding the target image and reduces computational demands. The paper reports that this approach outperforms standard vision-language-action (VLA) baselines and is competitive with other world action models (WAMs) in simulator and real-world tests without additional policy pretraining, at about one-sixth the FLOPs and one-quarter the latency of video-based WAMs.

Core claim

ImageWAM repurposes image editing models for world action modeling by conditioning an action expert directly on the KV caches from image-editing denoising steps, providing a compact world-action context that captures task-relevant changes without full video prediction or image decoding.

What carries the argument

KV caches from the denoising process in a pretrained image editing model, serving as the world-action context for conditioning a flow-matching action expert.
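
The abstract does not say how the action expert consumes the caches, so the following is only a minimal sketch of one plausible reading: action tokens issue queries against cached keys and values through single-head cross-attention. The function name `attend_to_cache`, the projection `w_q`, and all shapes are hypothetical and chosen for illustration; the random arrays stand in for caches that a real editing model would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_to_cache(action_tokens, cached_k, cached_v, w_q):
    """Single-head cross-attention from action tokens to cached keys/values.

    Assumption: the editing model's denoising pass already produced
    cached_k and cached_v, so no target image is decoded here.
    """
    q = action_tokens @ w_q                            # (num_action, d)
    scores = q @ cached_k.T / np.sqrt(q.shape[-1])     # (num_action, num_cached)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ cached_v, weights

rng = np.random.default_rng(0)
d, num_cached, num_action = 16, 64, 8                  # hypothetical sizes
cached_k = rng.standard_normal((num_cached, d))        # stand-in for cached keys
cached_v = rng.standard_normal((num_cached, d))        # stand-in for cached values
action_tokens = rng.standard_normal((num_action, d))
w_q = rng.standard_normal((d, d)) / np.sqrt(d)

context, weights = attend_to_cache(action_tokens, cached_k, cached_v, w_q)
print(context.shape, weights.shape)
```

The point of the sketch is that the conditioning signal is a set of intermediate tensors rather than pixels, which is why skipping the image decoder is possible at all under this reading.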

If this is right

  • ImageWAM achieves higher performance than VLA baselines and competitive WAMs across simulator and real-world experiments without additional policy pretraining.
  • Computation is reduced to one-sixth the FLOPs and one-quarter the latency compared to video-based WAMs.
  • Attention in the editing caches concentrates on task-relevant change regions rather than irrelevant details.
  • The image editing prior grounds task instructions to localized visual changes more effectively than video generation.
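
The attention claim in the list above could be made quantitative with a simple score: the share of attention that lands on patches that actually change between the current and target frames, compared with the share of the image those patches occupy. The sketch below is an illustrative metric and not the paper's analysis; the helper names, the threshold, and the toy 4x4 patch grid are all assumptions.

```python
import numpy as np

def change_mask(current_patches, target_patches, threshold):
    """Flag patches whose mean absolute feature difference exceeds `threshold`."""
    diff = np.abs(target_patches - current_patches).mean(axis=-1)
    return diff > threshold

def attention_mass_on_change(attn_weights, mask):
    """Fraction of the (normalized) attention that lands on changed patches."""
    attn = attn_weights / attn_weights.sum()
    return float(attn[mask].sum())

# Toy 4x4 patch grid, flattened to 16 patches with 3 channels each.
current = np.zeros((16, 3))
target = current.copy()
target[[5, 6]] = 1.0                      # only patches 5 and 6 change

mask = change_mask(current, target, threshold=0.5)
area_fraction = float(mask.mean())        # share of the image that changed

uniform_attn = np.full(16, 1.0 / 16)      # attention with no spatial preference
focused_attn = np.full(16, 0.01)
focused_attn[[5, 6]] = 0.43               # attention concentrated on the change

uniform_mass = attention_mass_on_change(uniform_attn, mask)
focused_mass = attention_mass_on_change(focused_attn, mask)
print(area_fraction, uniform_mass, focused_mass)
```

Uniform attention scores exactly the area fraction, so a mass well above that baseline is what "concentrates on task-relevant change regions" would look like in numbers.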

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Single-frame image editing priors may be sufficient for many control tasks where only the next state matters, not the full trajectory.
  • This could lead to hybrid systems that combine image editing with other modalities for even more efficient robot policies.
  • Testing on longer-horizon tasks might reveal whether avoiding video prediction also reduces error accumulation over time.
  • Similar cache-based conditioning could be explored in other generative models for action prediction beyond robotics.

Load-bearing premise

The KV caches from image-editing denoising contain enough task-relevant world state information to let the action expert predict correctly without ever producing or using the actual edited image.

What would settle it

Training the action expert on KV caches from unrelated or random image edits and finding that performance matches the original ImageWAM would show the caches do not carry the necessary information.
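
As a rough illustration of that control, the sketch below fits the same action head twice: once on contexts paired with their own episodes and once on contexts shuffled across episodes. Everything here is a toy stand-in: the "caches" are random feature vectors, the action head is a least-squares linear map, and the actions depend on the contexts by construction, so the gap is guaranteed in this example. On the real model, a vanishing gap would be the evidence against the premise.

```python
import numpy as np

def fit_and_score(train_ctx, train_act, test_ctx, test_act):
    """Fit a linear action head by least squares and return test mean-squared error."""
    w, *_ = np.linalg.lstsq(train_ctx, train_act, rcond=None)
    return float(np.mean((test_ctx @ w - test_act) ** 2))

rng = np.random.default_rng(0)
n, d_ctx, d_act = 400, 32, 7                        # hypothetical sizes
true_map = rng.standard_normal((d_ctx, d_act))
ctx = rng.standard_normal((n, d_ctx))               # stand-in for pooled cache features
act = ctx @ true_map + 0.01 * rng.standard_normal((n, d_act))

train, test = slice(0, 300), slice(300, n)
shuffled_ctx = ctx[rng.permutation(n)]              # break the cache-to-episode pairing

mse_matched = fit_and_score(ctx[train], act[train], ctx[test], act[test])
mse_shuffled = fit_and_score(shuffled_ctx[train], act[train],
                             shuffled_ctx[test], act[test])
print(mse_matched, mse_shuffled)
```

The design choice worth keeping from the toy is that both arms share the same head, data split, and training budget, so any difference is attributable to the pairing between cache and episode alone.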

Original abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
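
For readers unfamiliar with the term, a flow-matching model learns a velocity field that transports noise to data, and sampling integrates that field from t = 0 to t = 1. The sketch below shows the common straight-line formulation as a general illustration; it is not taken from the paper. The `oracle_velocity` function is a stand-in for a trained network and is exact only for the toy case of a single valid action, and the 7-dimensional action vector is hypothetical. In ImageWAM the context would be the editing KV caches rather than the target action itself.

```python
import numpy as np

def flow_matching_pair(noise, action, t):
    """Build one training pair on the straight-line path from noise to action."""
    x_t = (1.0 - t) * noise + t * action   # point on the path at time t
    v_target = action - noise              # constant velocity along that path
    return x_t, v_target

def euler_sample(velocity_fn, noise, context, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, context) from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, context)
    return x

def oracle_velocity(x, t, context):
    """Stand-in for a trained network: exact field when `context` is the only valid action."""
    return (context - x) / (1.0 - t)

rng = np.random.default_rng(0)
target_action = np.array([0.1, -0.3, 0.5, 0.0, 0.0, 0.2, 1.0])  # hypothetical action
noise = rng.standard_normal(7)

x_t, v_target = flow_matching_pair(noise, target_action, t=0.25)
sampled = euler_sample(oracle_velocity, noise, target_action, num_steps=10)
print(np.round(sampled, 3))
```

During training a network would be regressed onto `v_target` given `x_t`, `t`, and the conditioning context; at inference only the short Euler loop runs, which is part of why action sampling is cheap relative to generating frames.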

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ImageWAM, a framework that repurposes pretrained image editing models for robot action prediction. Instead of generating video, it conditions a flow-matching action expert on the KV caches produced during the image-editing denoising process, without decoding the target image. The authors claim that this approach outperforms standard VLA baselines and competitive WAMs across simulator and real-world experiments, reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs, and is supported by attention analysis showing focus on task-relevant change regions.

Significance. If the empirical results hold, this work would be significant for the robotics and computer vision communities by showing that image editing can provide a more efficient and better-matched prior for world action modeling than video generation. The reported compute reductions could enable more practical deployment of such models. The multi-environment validation is a strength if the details are provided.

major comments (2)
  1. [Abstract] The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.
  2. [Abstract] The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.
minor comments (1)
  1. [Abstract] The term 'flow-matching action expert' is introduced without definition or reference to prior work on flow matching in this context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript accordingly to strengthen the abstract and supporting evidence.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.

    Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised version we will update the abstract to report concrete metrics, including success rates on named simulator and real-world benchmarks, explicit baseline comparisons (e.g., RT-X, Octo, and competitive video-based WAMs), the precise FLOPs and latency reductions (with error bars or statistical details from our experiments), and the datasets used. This will make the performance claims directly verifiable. revision: yes

  2. Referee: [Abstract] The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.

    Authors: The current manuscript supports the role of editing-derived KV caches primarily through attention visualizations showing focus on task-relevant regions. We acknowledge that a quantitative ablation would provide stronger evidence that these caches are the operative factor behind the gains relative to VLA baselines or non-editing conditionings. We will add such an ablation in the revised paper, comparing action-expert performance when conditioned on editing KV caches versus alternative sources (e.g., features from non-editing image models or standard VLA encoders) across the reported environments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baselines

Full rationale

The paper advances ImageWAM as an empirical alternative to video-based WAMs by conditioning an action expert on KV caches from a pretrained image-editing model. All load-bearing claims (outperformance vs. VLA baselines and competitive WAMs, FLOPs/latency reductions, attention focus on change regions) are presented as results of simulator and real-world experiments rather than derived from internal equations or self-referential definitions. No fitted parameters are renamed as predictions, no uniqueness theorems are invoked via self-citation, and the central justification (sufficiency of editing caches) is tested via attention analysis and performance metrics against external references. The derivation chain is therefore self-contained through direct experimental comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method is described as repurposing existing pretrained image-editing and flow-matching models.



Reference graph

Works this paper leans on

99 extracted references · 36 linked inside Pith

  1. [1]

    Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint, 2024

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint, 2024

  2. [2]

    World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  3. [3]

    Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  4. [4]

    Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2025

    Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2025

  5. [5]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  6. [6]

    Bagelvla: Enhancing long-horizon manipulation via interleaved vision- language-action generation.arXiv preprint arXiv:2602.09849, 2026

    Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. Bagelvla: Enhancing long-horizon manipulation via interleaved vision- language-action generation.arXiv preprint arXiv:2602.09849, 2026

  7. [7]

    Uam: A dual-stream perspective on forgetting in vla training.arXiv preprint arXiv:2605.15735, 2026

    Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, and Jianyu Chen. Uam: A dual-stream perspective on forgetting in vla training.arXiv preprint arXiv:2605.15735, 2026

  8. [8]

    Aim: Intent-aware unified world action modeling with spatial value maps.arXiv preprint arXiv:2604.11135, 2026

    Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, and Jiayu Chen. Aim: Intent-aware unified world action modeling with spatial value maps.arXiv preprint arXiv:2604.11135, 2026

  9. [9]

    Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint, 2025

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint, 2025

  10. [10]

    Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

    Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

  11. [11]

    Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

    Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin, and Li Zhang. Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

  12. [12]

    Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, ChendongXiang, YinzeRong, etal. Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

  13. [13]

    Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  14. [14]

    Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

    Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

  15. [15]

    Maskwam: Unifying mask prompting and prediction for world-action models.arXiv preprint arXiv:2606.13515, 2026

    Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, and Ping Tan. Maskwam: Unifying mask prompting and prediction for world-action models.arXiv preprint arXiv:2606.13515, 2026

  16. [16]

    Reworld: Multi-dimensional reward modeling for embodied world models.arXiv preprint arXiv:2601.12428, 2026

    Baorui Peng, Wenyao Zhang, Liang Xu, Zekun Qi, Jiazhao Zhang, Hongsi Liu, Wenjun Zeng, and Xin Jin. Reworld: Multi-dimensional reward modeling for embodied world models.arXiv preprint arXiv:2601.12428, 2026

  17. [17]

    Orv: 4d occupancy-centric robot video generation.arXiv preprint arXiv:2506.03079, 2025

    Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, et al. Orv: 4d occupancy-centric robot video generation.arXiv preprint arXiv:2506.03079, 2025

  18. [18]

    Tesseract: Learning 4d embodied world models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. 2025. URLhttps://arxiv.org/abs/2504.20995

  19. [19]

    Scene graph disentanglement and composition for generalizable complex image generation.Advances in Neural Information Processing Systems, 37:98478–98504, 2024

    Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, and Xin Jin. Scene graph disentanglement and composition for generalizable complex image generation.Advances in Neural Information Processing Systems, 37:98478–98504, 2024

  20. [20]

    Nano banana pro.https://deepmind.google/technologies/gemini/, 2025

    Google DeepMind. Nano banana pro.https://deepmind.google/technologies/gemini/, 2025. Built on Gem- ini 3 Pro. Image generation and editing model

  21. [21]

    GPT-Image-1.5.https://openai.com/index/new-chatgpt-images-is-here/, 2026

    OpenAI. GPT-Image-1.5.https://openai.com/index/new-chatgpt-images-is-here/, 2026. Accessed: 2026- 03-19

  22. [22]

    Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

  23. [23]

    Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  24. [24]

    Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026

    Zhipu AI. Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026

  25. [25]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025

  26. [26]

    Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

    Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

  27. [27]

    Uni-edit: Intelligent editing is a general task for unified model tuning.arXiv preprint arXiv:2605.21487, 2026

    Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, and Hongsheng Li. Uni-edit: Intelligent editing is a general task for unified model tuning.arXiv preprint arXiv:2605.21487, 2026

  28. [28]

    Z-image: An efficient image generation foundation model with single-stream diffusion transformer

    Z-Image Team. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

  29. [29]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. InAdvances in Neural Information Processing Systems, 2023

  30. [30]

    Guiding instruction-based image editing via multimodal large language models

    Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. InInternational Conference on Learning Representations, 2024

  31. [31]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  32. [32]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26125–26135, 2025

  33. [33]

    Image generators are generalist vision learners.arXiv preprint arXiv:2604.20329, 2026

    Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T Barron, et al. Image generators are generalist vision learners.arXiv preprint arXiv:2604.20329, 2026

  34. [34]

    Diffusion model as a generalist segmentation learner.arXiv preprint arXiv:2604.24575, 2026

    Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, et al. Diffusion model as a generalist segmentation learner.arXiv preprint arXiv:2604.24575, 2026

  35. [35]

    Leveraging image generators to address training data scarcity: The gen4regen dataset for forest regeneration mapping.arXiv preprint arXiv:2605.05627, 2026

    Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony De- schênes, François Pomerleau, and Philippe Giguere. Leveraging image generators to address training data scarcity: The gen4regen dataset for forest regeneration mapping.arXiv preprint arXiv:2605.05627, 2026

  36. [36]

    pi0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint, 2024

  37. [37]

    pi0.5: a vision-language-action model with open-world generalization.arXiv preprint, 2025

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint, 2025

  38. [38]

    Gr00t n1: An open foundation model for generalist humanoid robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint, 2025

  39. [39]

    Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint, 2025

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint, 2025

  40. [40]

    Reconvla: Reconstructive vision-language-action model as effective robot perceiver

    Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

  41. [41]

    Hy-embodied-0.5: Embodied foundation models for real-world agents.arXiv preprint arXiv:2604.07430, 2026

    HY Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, et al. Hy-embodied-0.5: Embodied foundation models for real-world agents.arXiv preprint arXiv:2604.07430, 2026

  42. [42]

    Universal pose pretraining for generalizable vision-language-action policies.arXiv preprint arXiv:2602.19710, 2026

    Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, and Yanwei Fu. Universal pose pretraining for generalizable vision-language-action policies.arXiv preprint arXiv:2602.19710, 2026

  43. [43]

    Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

    Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

  44. [44]

    Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint, 2025

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint, 2025

  45. [45]

    Predictive inverse dynamics models are scalable learners for robotic manipulation.ICLR, 2024

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.ICLR, 2024

  46. [46]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint, 2025

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint, 2025

  47. [47]

    Dig-flow: Discrepancy-guided flow matching for robust vla models.arXiv preprint arXiv:2512.01715, 2025

    Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Sipeng Zheng, Qin Jin, and Zongqing Lu. Dig-flow: Discrepancy-guided flow matching for robust vla models.arXiv preprint arXiv:2512.01715, 2025

  48. [48]

    Being-h0: Vision-language-action pretraining from large-scale human videos

    HaoLuo, YichengFeng, WanpengZhang, SipengZheng, YeWang, HaoqiYuan, JiazhengLiu, ChaoyiXu, QinJin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos. InInternational Conference on Machine Learning. PMLR, 2026

  49. [49]

    Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

    Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

  50. [50]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

  51. [51]

    Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

  52. [52]

    Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 18638–18646, 2026

  53. [53]

    A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

  54. [54]

    Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025

  55. [55]

    F1: A vision-language-action model bridging understanding and generation to actions.ArXiv, abs/2509.06951, 2025

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.ArXiv, abs/2509.06951, 2025. URLhttps://api.semanticscholar.org/CorpusID:281204333

  56. [56]

    Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments.arXiv preprint arXiv:2605.30280, 2026

    Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, et al. Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments.arXiv preprint arXiv:2605.30280, 2026

  57. [57]

    Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

    Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, and Yue Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

  58. [58]

    Learning universal policies via text-guided video generation.NeurIPS, 2024

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.NeurIPS, 2024

  59. [59]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint, 2023

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint, 2023

  60. [60]

    Generalist bimanual manipulation via foundation video diffusion models.arXiv preprint, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Generalist bimanual manipulation via foundation video diffusion models.arXiv preprint, 2025

  61. [61]

    Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, and Xiaodan Liang. Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation. NeurIPS, 2024

  62. [62]

    Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin P. Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. 2026. URL https://api.semanticscholar.org/CorpusID:287074218

  63. [63]

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control. ArXiv, abs/2512.15840, 2025. URL https://api.semanticscholar.org/CorpusID:283933826

  64. [64]

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation. arXiv preprint, 2025

  65. [65]

    Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, and Jian Tang. Tc-idm: Grounding video generation for executable zero-shot robot motion. ArXiv, abs/2601.18323, 2026. URL https://api.semanticscholar.org/CorpusID:285051517

  66. [66]

    Zhongrui Zhang, Cheng-Chuan Yang, Qin Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, and Jianyu Chen. Veo-act: How far can frontier video models advance generalizable robot manipulation? 2026. URL https://api.semanticscholar.org/CorpusID:287202336

  67. [67]

    Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, et al. Vampo: Policy optimization for improving visual dynamics in video action models. arXiv preprint arXiv:2603.19370, 2026

  68. [68]

    Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, et al. Do world action models generalize better than vlas? a robustness study. arXiv preprint arXiv:2603.22078, 2026

  69. [69]

    Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025

  70. [70]

    Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4d: Kinematic 4d world modeling for spatiotemporal embodied simulation. arXiv preprint arXiv:2603.16669, 2026

  71. [71]

    Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, et al. Wovr: World models as reliable simulators for post-training vla policies with rl. arXiv preprint arXiv:2602.13977, 2026

  72. [72]

    Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, and Xiu Li. Robostereo: Dual-tower 4d embodied world models for unified policy optimization. arXiv preprint arXiv:2603.12639, 2026

  73. [73]

    Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. Unit: Toward a unified physical language for human-to-humanoid policy learning and world modeling. arXiv preprint arXiv:2604.19734, 2026

  74. [74]

    Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabilizing multi-step rollouts via reinforcement learning. arXiv preprint arXiv:2603.25685, 2026

  75. [75]

    Bingchuan Wei, Bingqi Huang, Jingheng Ma, Sen Cui, et al. Fate: Closed-loop feasibility-aware task generation with active repair for physically grounded robotic curricula. arXiv preprint arXiv:2603.01505, 2026

  76. [76]

    Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li, Jiagang Zhu, Tianze Liu, Jiajun Lv, Xingxing Zuo, Yun Ye, et al. Vag: Dual-stream video-action generation for embodied data synthesis. arXiv preprint arXiv:2604.09330, 2026

  77. [77]

    Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. arXiv preprint arXiv:2603.08546, 2026

  78. [78]

    Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026

  79. [79]

    Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, and Donglin Wang. World-value-action model: Implicit planning for vision-language-action systems. arXiv preprint arXiv:2604.14732, 2026

  80. [80]

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Hu Yue, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. ArXiv, abs/2508.05635, 2025. URL https://api.semanticscholar.org/CorpusID:280545868
