ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3
The pith
Image editing models can serve as world action models for robots by conditioning actions on denoising caches instead of generating videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ImageWAM repurposes image editing models for world action modeling by conditioning an action expert directly on the KV caches from image-editing denoising steps, providing a compact world-action context that captures task-relevant changes without full video prediction or image decoding.
What carries the argument
KV caches from the denoising process in a pretrained image editing model, serving as the world-action context for conditioning a flow-matching action expert.
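As a minimal sketch of what conditioning on KV caches could look like, the snippet below lets action-token queries attend over keys and values cached from a denoising step. It is single-head, pure Python, and uses toy dimensions. The helper name `attend_to_cache` and all shapes are illustrative assumptions, not ImageWAM's actual architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_to_cache(queries, cached_keys, cached_values):
    """Single-head scaled dot-product attention from action-token
    queries to cached keys/values (hypothetical helper).

    queries:       list of d-dim vectors, one per action token
    cached_keys:   list of d-dim vectors, one per cached image token
    cached_values: list of vectors, one per cached image token
    """
    d = len(cached_keys[0])
    dim_v = len(cached_values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every cached key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in cached_keys]
        weights = softmax(scores)
        # Weighted average of the cached values.
        out = [sum(w * v[j] for w, v in zip(weights, cached_values))
               for j in range(dim_v)]
        outputs.append(out)
    return outputs

# Toy cache: two image tokens with 2-dim keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
context = attend_to_cache([[5.0, 0.0]], keys, values)
```

The point of the sketch is that the action side only reads the cache: no image is decoded, and the query that aligns with the first key draws most of its context from the first cached value.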
If this is right
- ImageWAM achieves higher performance than VLA baselines and competitive WAMs across simulator and real-world experiments without additional policy pretraining.
- FLOPs fall to one-sixth and latency to one-quarter of those of video-based WAMs.
- Attention in the editing caches concentrates on task-relevant change regions rather than irrelevant details.
- The image editing prior grounds task instructions to localized visual changes more effectively than video generation.
Where Pith is reading between the lines
- Single-frame image editing priors may be sufficient for many control tasks where only the next state matters, not the full trajectory.
- This could lead to hybrid systems that combine image editing with other modalities for even more efficient robot policies.
- Testing on longer-horizon tasks might reveal whether avoiding video prediction also reduces error accumulation over time.
- Similar cache-based conditioning could be explored in other generative models for action prediction beyond robotics.
Load-bearing premise
The KV caches from image-editing denoising contain enough task-relevant world state information to let the action expert predict correctly without ever producing or using the actual edited image.
What would settle it
Training the action expert on KV caches from unrelated or random image edits and finding that performance matches the original ImageWAM would show the caches do not carry the necessary information.
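The bookkeeping for such a control can be sketched as follows. This is an illustration only: it covers the evaluation-side pairing of each episode with a cache from a different episode, not the retraining the test calls for. The function names and the toy policy are assumptions made for the example.

```python
import random

def shuffle_caches(caches, seed=0):
    """Reassign caches so that no episode keeps its own cache
    (cyclic shift along a random permutation; needs >= 2 episodes)."""
    rng = random.Random(seed)
    order = list(range(len(caches)))
    rng.shuffle(order)
    mismatched = [None] * len(caches)
    # Episode order[i] receives the cache of episode order[i + 1].
    for i, episode in enumerate(order):
        donor = order[(i + 1) % len(order)]
        mismatched[episode] = caches[donor]
    return mismatched

def success_rate(policy, observations, caches, goals):
    """Fraction of episodes where policy(obs, cache) equals the goal."""
    hits = sum(policy(o, c) == g
               for o, c, g in zip(observations, caches, goals))
    return hits / len(goals)

# Toy stand-in: a "policy" that reads its answer straight off the cache,
# i.e. the extreme case where the cache carries all the information.
def toy_policy(observation, cache):
    return cache

observations = list(range(6))
caches = [f"cache-{i}" for i in range(6)]
goals = [f"cache-{i}" for i in range(6)]

matched = success_rate(toy_policy, observations, caches, goals)
control = success_rate(toy_policy, observations, shuffle_caches(caches), goals)
```

If the caches carry the task information, `control` should fall well below `matched`; if the two stay close, the caches are not doing the work attributed to them.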
Original abstract
World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ImageWAM, a framework that repurposes pretrained image editing models for robot action prediction. Instead of generating video, it conditions a flow-matching action expert on the KV caches produced during the image-editing denoising process, without decoding the target image. The authors claim that this approach outperforms standard VLA baselines and competitive WAMs across simulator and real-world experiments, reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs, and is supported by attention analysis showing focus on task-relevant change regions.
Significance. If the empirical results hold, this work would be significant for the robotics and computer vision communities by showing that image editing can provide a more efficient and better-matched prior for world action modeling than video generation. The reported compute reductions could enable more practical deployment of such models. The multi-environment validation is a strength, provided the details are reported.
major comments (2)
- [Abstract] Abstract: The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.
- [Abstract] Abstract: The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.
minor comments (1)
- [Abstract] Abstract: The term 'flow-matching action expert' is introduced without definition or reference to prior work on flow matching in this context.
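For readers unfamiliar with the term, flow matching in general trains a network to predict a velocity field that transports noise samples to data samples, and sampling integrates that field from t=0 to t=1. The sketch below shows only the sampling loop with fixed-step Euler updates. The velocity function is a hand-written stand-in for a trained network, and every name in it is an assumption for illustration rather than the paper's action expert.

```python
import random

def sample_actions(velocity, dim, steps=10, seed=0):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 with
    fixed-step Euler updates and return the final action vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for a trained network: the exact velocity of the straight-line
# path toward one known target action. A real expert would predict the
# velocity from observations and conditioning context instead.
target = [0.5, -0.2, 0.1]

def toy_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

action = sample_actions(toy_velocity, dim=len(target))
```

With the straight-line velocity used here, the Euler loop lands on `target` at the final step, which makes the toy easy to check by hand.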
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript accordingly to strengthen the abstract and supporting evidence.
- Referee: [Abstract] Abstract: The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.
  Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised version we will update the abstract to report concrete metrics, including success rates on named simulator and real-world benchmarks, explicit baseline comparisons (e.g., RT-X, Octo, and competitive video-based WAMs), the precise FLOPs and latency reductions (with error bars or statistical details from our experiments), and the datasets used. This will make the performance claims directly verifiable.
  revision: yes
- Referee: [Abstract] Abstract: The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.
  Authors: The current manuscript supports the role of editing-derived KV caches primarily through attention visualizations showing focus on task-relevant regions. We acknowledge that a quantitative ablation would provide stronger evidence that these caches are the operative factor behind the gains relative to VLA baselines or non-editing conditionings. We will add such an ablation in the revised paper, comparing action-expert performance when conditioned on editing KV caches versus alternative sources (e.g., features from non-editing image models or standard VLA encoders) across the reported environments.
  revision: yes
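The rebuttal promises error bars on success rates. One common way to attach an interval to a success rate measured over a fixed number of trials is the Wilson score interval, sketched below. The counts are hypothetical and chosen only to show the calculation; they are not results from the paper, and the authors may well use a different statistic.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate
    (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half

# Hypothetical counts, for illustration only (not results from the paper).
low_a, high_a = wilson_interval(successes=42, trials=50)
low_b, high_b = wilson_interval(successes=35, trials=50)
intervals_overlap = low_a <= high_b and low_b <= high_a
```

In this made-up example the two intervals overlap even though the raw rates differ by 14 points, which is why trial counts matter when comparing policies.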
Circularity Check
No circularity: empirical claims rest on external baselines
full rationale
The paper advances ImageWAM as an empirical alternative to video-based WAMs by conditioning an action expert on KV caches from a frozen image-editing model. All load-bearing claims (outperformance vs. VLA baselines and competitive WAMs, FLOPs/latency reductions, attention focus on change regions) are presented as results of simulator and real-world experiments rather than derived from internal equations or self-referential definitions. No fitted parameters are renamed as predictions, no uniqueness theorems are invoked via self-citation, and the central justification (sufficiency of editing caches) is tested via attention analysis and performance metrics against external references. The derivation chain is therefore self-contained through direct experimental comparison.