ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Haitao Lin; He Zhang; Jingbo Zhang; Wenjun Zeng; Wenyao Zhang; Xiaokang Yang; Xin Jin; Yao Mu; Yuyang Zhang; Zekun Qi

arxiv: 2606.19531 · v1 · pith:4WTWMSAYnew · submitted 2026-06-17 · 💻 cs.CV · cs.RO

Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords world action models · image editing · robot action prediction · KV caches · flow matching · video generation alternative · visual world modeling

The pith

Image editing models can serve as world action models for robots by conditioning actions on denoising caches instead of generating videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that world action models do not need to generate full future videos to predict robot actions. Repurposing pretrained image editing models focuses only on the target frame transformation from the current image, which matches the needs of action prediction better than video. By using the key-value caches from the image editing denoising process to condition a flow-matching action expert, the model avoids decoding the target image and reduces computational demands. This approach outperforms both standard vision-language-action baselines and other world action models in simulator and real-world tests without extra pretraining, while using far less compute.

Core claim

ImageWAM repurposes image editing models for world action modeling by conditioning an action expert directly on the KV caches from image-editing denoising steps, providing a compact world-action context that captures task-relevant changes without full video prediction or image decoding.

What carries the argument

KV caches from the denoising process in a pretrained image editing model, serving as the world-action context for conditioning a flow-matching action expert.
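
The conditioning path described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `attend`, `toy_velocity`, and `sample_action` are hypothetical names, the cache is a handful of hand-written key/value vectors, and the learned action expert is replaced by a closed-form velocity that points at the attended context. It only shows the data flow: an action query reads the cached keys and values, and the result steers an Euler-integrated flow from noise to an action.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def toy_velocity(action, t, context):
    """Hypothetical stand-in for a learned action expert.

    Returns the straight-line velocity that carries `action` to `context` by t = 1.
    """
    return [(c - a) / (1.0 - t) for a, c in zip(action, context)]


def sample_action(keys, values, query, steps=10, seed=0):
    """Start from Gaussian noise and Euler-integrate the velocity from t = 0 to t = 1."""
    rng = random.Random(seed)
    context = attend(query, keys, values)  # read the cache; no image is decoded
    action = [rng.gauss(0.0, 1.0) for _ in context]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        velocity = toy_velocity(action, t, context)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action, context


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[0.5, -0.5], [2.0, 1.0]]
    action, context = sample_action(keys, values, query=[10.0, 0.0])
    print(action, context)
```

In a real system the velocity would come from a trained network with many attention layers and heads; the closed form here is used only so the sketch runs without any training.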

If this is right

ImageWAM achieves higher performance than VLA baselines and competitive WAMs across simulator and real-world experiments without additional policy pretraining.
Computation is reduced to one-sixth the FLOPs and one-quarter the latency compared to video-based WAMs.
Attention in the editing caches concentrates on task-relevant change regions rather than irrelevant details.
The image editing prior grounds task instructions to localized visual changes more effectively than video generation.
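
One way to make the attention claim above checkable is to measure how much attention weight falls on patches that actually change between the current and target frames. The sketch below is an assumed metric for illustration only; the abstract does not say how the attention analysis was scored, and `change_mask`, `attention_mass_on_change`, and the threshold value are hypothetical.

```python
def change_mask(current, target, threshold=0.1):
    """Mark patches whose value differs between the two frames by more than `threshold`."""
    return [abs(c - t) > threshold for c, t in zip(current, target)]


def attention_mass_on_change(attention, mask):
    """Fraction of the total attention weight that lands on changed patches."""
    total = sum(attention)
    on_change = sum(a for a, changed in zip(attention, mask) if changed)
    return on_change / total


if __name__ == "__main__":
    # Four patches; an object moves from patch 2 to patch 3.
    current = [0.2, 0.2, 0.8, 0.2]
    target = [0.2, 0.2, 0.2, 0.8]
    attention = [0.05, 0.05, 0.5, 0.4]
    mask = change_mask(current, target)
    print(mask, attention_mass_on_change(attention, mask))
```

A score near 1 would mean attention sits almost entirely on changed regions; a score near the fraction of changed patches would mean attention is no better than uniform.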

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Single-frame image editing priors may be sufficient for many control tasks where only the next state matters, not the full trajectory.
This could lead to hybrid systems that combine image editing with other modalities for even more efficient robot policies.
Testing on longer-horizon tasks might reveal whether avoiding video prediction also reduces error accumulation over time.
Similar cache-based conditioning could be explored in other generative models for action prediction beyond robotics.

Load-bearing premise

The KV caches from image-editing denoising contain enough task-relevant world state information to let the action expert predict correctly without ever producing or using the actual edited image.

What would settle it

Training the action expert on KV caches from unrelated or random image edits and finding that performance matches the original ImageWAM would show the caches do not carry the necessary information.
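
The logic of this control can be shown on synthetic data. In the toy sketch below, which is an assumption-laden illustration and not an experiment from the paper, each "cache" is a single number, the action depends on it linearly, and the control breaks the pairing by shuffling. All names (`fit_slope`, `mse`, `cache_control_experiment`) are hypothetical.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mse(xs, ys, slope):
    """Mean squared error of the prediction slope * x against y."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cache_control_experiment(n=200, seed=0):
    """Compare a predictor fit on matched features with one fit on shuffled features."""
    rng = random.Random(seed)
    # Synthetic stand-in: one scalar "cache feature" per sample.
    features = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    actions = [2.0 * f + rng.gauss(0.0, 0.05) for f in features]
    shuffled = features[:]
    rng.shuffle(shuffled)  # break the pairing between caches and actions
    matched_error = mse(features, actions, fit_slope(features, actions))
    shuffled_error = mse(shuffled, actions, fit_slope(shuffled, actions))
    return matched_error, shuffled_error


if __name__ == "__main__":
    matched, shuffled = cache_control_experiment()
    print(f"matched MSE: {matched:.4f}  shuffled MSE: {shuffled:.4f}")
```

When the features carry the signal, shuffling destroys performance. If the real system showed no such gap between editing caches and unrelated caches, the load-bearing premise would fail.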

Original abstract

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
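
The first limitation, the cost of dense multi-frame tokens, follows from self-attention scaling quadratically with sequence length. The sketch below counts query-key pairs only, with assumed token and frame counts. It ignores MLP layers, the number of denoising steps, and any decoder, so it does not reproduce the 1/6 FLOPs figure reported in the abstract. The function names are hypothetical.

```python
def attention_pair_count(tokens_per_frame, frames):
    """Number of query-key pairs in full self-attention over all future tokens."""
    n = tokens_per_frame * frames
    return n * n


def relative_attention_cost(tokens_per_frame, video_frames):
    """Attention cost of predicting `video_frames` frames relative to one target frame."""
    video = attention_pair_count(tokens_per_frame, video_frames)
    single = attention_pair_count(tokens_per_frame, 1)
    return video / single


if __name__ == "__main__":
    # Assumed sizes for illustration: 256 tokens per frame, 4 future frames.
    print(relative_attention_cost(256, 4))
```

Under these assumptions the ratio equals the square of the frame count, whatever the per-frame token count is.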

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ImageWAM claims image-editing KV caches can replace video generation for robot world models with big efficiency wins, but the abstract gives no numbers or ablations to check if the caches actually carry the needed state.

The main thing here is a straightforward substitution: instead of generating future video frames for a world action model, the authors freeze a pretrained image editor, run its denoising steps on the current image plus instruction, and feed the resulting KV caches straight into a flow-matching action head. No target image is ever decoded at inference. That setup is presented as fixing three problems at once (token cost, irrelevant detail, and error accumulation) while cutting FLOPs to roughly 1/6 and latency to 1/4 of video-based WAMs.

What stands out is the clean prior argument. Image editing is already trained to produce localized changes from text, so its internal representations should be more action-relevant than a general video predictor. The attention maps they show, focusing on change regions, line up with that story. If the full experiments confirm the gains across sim and real without extra pretraining, the efficiency angle would be useful for anyone trying to run these models on-robot.

The soft spot is exactly the one the stress-test flags. The whole claim rests on those caches containing enough task-specific visual differences. The abstract mentions an attention check but does not report the obvious controls: what happens with caches from a non-editing image model, or even random ones? Without those ablations or the actual performance tables, error bars, and dataset sizes, the outperformance numbers cannot be judged. The abstract is unusually light on quantitative detail for a methods paper.

This is for people building efficient embodied policies who already work with diffusion or flow models. A reader who wants to test cheaper world-action conditioning would get value from the idea even if the results need tightening. The thinking is coherent on its own terms and engages the right prior work, so it clears the bar for serious refereeing. I would send it out, but expect the reviewers to demand the missing ablations on the cache content.

Referee Report

2 major / 1 minor

Summary. The paper proposes ImageWAM, a framework that repurposes pretrained image editing models for robot action prediction. Instead of generating video, it conditions a flow-matching action expert on the KV caches produced during the image-editing denoising process, without decoding the target image. The authors claim that this approach outperforms standard VLA baselines and competitive WAMs across simulator and real-world experiments, reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs, and is supported by attention analysis showing focus on task-relevant change regions.

Significance. If the empirical results hold, this work would be significant for the robotics and computer vision community by showing that image editing can provide a more efficient and better-matched prior for world action modeling than video generation. The reported compute reductions could enable more practical deployment of such models. The multi-environment validation is a strength if the details are provided.

major comments (2)

[Abstract] The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.
[Abstract] The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.

minor comments (1)

[Abstract] The term 'flow-matching action expert' is introduced without definition or reference to prior work on flow matching in this context.
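
For readers unfamiliar with the term, one common formulation of flow matching is given below. This is the general linear-path objective, not a formula taken from the paper. The symbols are assumptions: x_1 is a ground-truth action chunk, x_0 is Gaussian noise, c is the conditioning context (here presumably the editing KV caches), and v_\theta is the action expert. Time conventions vary between papers.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2
```

At inference, an action is obtained by integrating dx/dt = v_\theta(x, t, c) from t = 0 to t = 1, starting from a noise sample.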

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript accordingly to strengthen the abstract and supporting evidence.

Referee: [Abstract] The central performance claims that ImageWAM 'outperforms standard VLA baselines and matching competitive WAMs' and 'reduces FLOPs to 1/6 and latency to 1/4' are made without any quantitative results, specific baselines, datasets, error bars, or statistical details. This is load-bearing because the abstract supplies no verifiable evidence for these assertions, preventing assessment of the claimed gains.

Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised version we will update the abstract to report concrete metrics, including success rates on named simulator and real-world benchmarks, explicit baseline comparisons (e.g., RT-X, Octo, and competitive video-based WAMs), the precise FLOPs and latency reductions (with error bars or statistical details from our experiments), and the datasets used. This will make the performance claims directly verifiable.

Revision: yes

Referee: [Abstract] The key assumption that 'KV caches produced by image-editing denoising' contain sufficient task-relevant world state information to condition the action expert (without decoding the target image) is justified only by qualitative 'attention analysis.' No quantitative ablation is described that tests whether these caches are the operative source of improvement compared to VLA baselines or alternative conditioning (e.g., from non-editing models). This directly impacts the claim that image editing supplies a 'better-matched prior' for world-action modeling.

Authors: The current manuscript supports the role of editing-derived KV caches primarily through attention visualizations showing focus on task-relevant regions. We acknowledge that a quantitative ablation would provide stronger evidence that these caches are the operative factor behind the gains relative to VLA baselines or non-editing conditionings. We will add such an ablation in the revised paper, comparing action-expert performance when conditioned on editing KV caches versus alternative sources (e.g., features from non-editing image models or standard VLA encoders) across the reported environments.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baselines

The paper advances ImageWAM as an empirical alternative to video-based WAMs by conditioning an action expert on KV caches from a frozen image-editing model. All load-bearing claims (outperformance vs. VLA baselines and competitive WAMs, FLOPs/latency reductions, attention focus on change regions) are presented as results of simulator and real-world experiments rather than derived from internal equations or self-referential definitions. No fitted parameters are renamed as predictions, no uniqueness theorems are invoked via self-citation, and the central justification (sufficiency of editing caches) is tested via attention analysis and performance metrics against external references. The derivation chain is therefore self-contained through direct experimental comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method is described as repurposing existing pretrained image-editing and flow-matching models.

discussion (0)

Reference graph

Works this paper leans on

99 extracted references · 36 linked inside Pith

[1]

Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint, 2024

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint, 2024

2024
[2]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026
[3]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[4]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2025

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2025

arXiv 2025
[5]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[6]

Bagelvla: Enhancing long-horizon manipulation via interleaved vision- language-action generation.arXiv preprint arXiv:2602.09849, 2026

Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. Bagelvla: Enhancing long-horizon manipulation via interleaved vision- language-action generation.arXiv preprint arXiv:2602.09849, 2026

arXiv 2026
[7]

Uam: A dual-stream perspective on forgetting in vla training.arXiv preprint arXiv:2605.15735, 2026

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, and Jianyu Chen. Uam: A dual-stream perspective on forgetting in vla training.arXiv preprint arXiv:2605.15735, 2026

Pith/arXiv arXiv 2026
[8]

Aim: Intent-aware unified world action modeling with spatial value maps.arXiv preprint arXiv:2604.11135, 2026

Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, and Jiayu Chen. Aim: Intent-aware unified world action modeling with spatial value maps.arXiv preprint arXiv:2604.11135, 2026

Pith/arXiv arXiv 2026
[9]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint, 2025

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint, 2025

2025
[10]

Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Pith/arXiv arXiv 2026
[11]

Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Wenyao Zhang, Bozhou Zhang, Zekun Qi, Wenjun Zeng, Xin Jin, and Li Zhang. Disentangled robot learning via separate forward and inverse dynamics pretraining.arXiv preprint arXiv:2604.16391, 2026

Pith/arXiv arXiv 2026
[12]

Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, ChendongXiang, YinzeRong, etal. Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025
[13]

Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026
[14]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

arXiv 2026
[15]

Maskwam: Unifying mask prompting and prediction for world-action models.arXiv preprint arXiv:2606.13515, 2026

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, and Ping Tan. Maskwam: Unifying mask prompting and prediction for world-action models.arXiv preprint arXiv:2606.13515, 2026

Pith/arXiv arXiv 2026
[16]

Reworld: Multi-dimensional reward modeling for embodied world models.arXiv preprint arXiv:2601.12428, 2026

Baorui Peng, Wenyao Zhang, Liang Xu, Zekun Qi, Jiazhao Zhang, Hongsi Liu, Wenjun Zeng, and Xin Jin. Reworld: Multi-dimensional reward modeling for embodied world models.arXiv preprint arXiv:2601.12428, 2026

arXiv 2026
[17]

Orv: 4d occupancy-centric robot video generation.arXiv preprint arXiv:2506.03079, 2025

Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, et al. Orv: 4d occupancy-centric robot video generation.arXiv preprint arXiv:2506.03079, 2025

arXiv 2025
[18]

Tesseract: Learning 4d embodied world models

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. 2025. URLhttps://arxiv.org/abs/2504.20995

arXiv 2025
[19]

Scene graph disentanglement and composition for generalizable complex image generation.Advances in Neural Information Processing Systems, 37:98478–98504, 2024

Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, and Xin Jin. Scene graph disentanglement and composition for generalizable complex image generation.Advances in Neural Information Processing Systems, 37:98478–98504, 2024

2024
[20]

Nano banana pro.https://deepmind.google/technologies/gemini/, 2025

Google DeepMind. Nano banana pro.https://deepmind.google/technologies/gemini/, 2025. Built on Gem- ini 3 Pro. Image generation and editing model

2025
[21]

GPT-Image-1.5.https://openai.com/index/new-chatgpt-images-is-here/, 2026

OpenAI. GPT-Image-1.5.https://openai.com/index/new-chatgpt-images-is-here/, 2026. Accessed: 2026- 03-19

2026
[22]

Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

Pith/arXiv arXiv 2025
[23]

Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Pith/arXiv arXiv 2025
[24]

Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026

Zhipu AI. Glm-image.https://huggingface.co/zai-org/GLM-Image, 2026

2026
[25]

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025

arXiv 2025
[26]

Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens.arXiv preprint arXiv:2603.27538, 2026

arXiv 2026
[27]

Uni-edit: Intelligent editing is a general task for unified model tuning.arXiv preprint arXiv:2605.21487, 2026

Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, and Hongsheng Li. Uni-edit: Intelligent editing is a general task for unified model tuning.arXiv preprint arXiv:2605.21487, 2026

Pith/arXiv arXiv 2026
[28]

Z-image: An efficient image generation foundation model with single-stream diffusion transformer

Z-Image Team. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

Pith/arXiv arXiv 2025
[29]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. InAdvances in Neural Information Processing Systems, 2023

2023
[30]

Guiding instruction-based image editing via multimodal large language models

Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. InInternational Conference on Learning Representations, 2024

2024
[31]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

2024
[32]

Anyedit: Mastering unified high-quality image editing for any idea

Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26125–26135, 2025

2025
[33]

Image generators are generalist vision learners.arXiv preprint arXiv:2604.20329, 2026

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T Barron, et al. Image generators are generalist vision learners.arXiv preprint arXiv:2604.20329, 2026

Pith/arXiv arXiv 2026
[34]

Diffusion model as a generalist segmentation learner.arXiv preprint arXiv:2604.24575, 2026

Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, et al. Diffusion model as a generalist segmentation learner.arXiv preprint arXiv:2604.24575, 2026

Pith/arXiv arXiv 2026
[35]

Leveraging image generators to address training data scarcity: The gen4regen dataset for forest regeneration mapping.arXiv preprint arXiv:2605.05627, 2026

Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony De- schênes, François Pomerleau, and Philippe Giguere. Leveraging image generators to address training data scarcity: The gen4regen dataset for forest regeneration mapping.arXiv preprint arXiv:2605.05627, 2026

Pith/arXiv arXiv 2026
[36]

pi0: A vision-language-action flow model for general robot control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint, 2024

2024
[37]

pi0.5: a vision-language-action model with open-world generalization.arXiv preprint, 2025

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint, 2025

2025
[38]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint, 2025

2025
[39]

Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint, 2025

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint, 2025

2025
[40]

Reconvla: Reconstructive vision-language-action model as effective robot perceiver

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

2026
[41]

Hy-embodied-0.5: Embodied foundation models for real-world agents.arXiv preprint arXiv:2604.07430, 2026

HY Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, et al. Hy-embodied-0.5: Embodied foundation models for real-world agents.arXiv preprint arXiv:2604.07430, 2026

Pith/arXiv arXiv 2026
[42]

Universal pose pretraining for generalizable vision-language-action policies.arXiv preprint arXiv:2602.19710, 2026

Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, and Yanwei Fu. Universal pose pretraining for generalizable vision-language-action policies.arXiv preprint arXiv:2602.19710, 2026

Pith/arXiv arXiv 2026
[43]

Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

arXiv 2025
[44]

Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint, 2025

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint, 2025

2025
[45]

Predictive inverse dynamics models are scalable learners for robotic manipulation.ICLR, 2024

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.ICLR, 2024

2024
[46]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint, 2025

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint, 2025

2025
[47]

Dig-flow: Discrepancy-guided flow matching for robust vla models.arXiv preprint arXiv:2512.01715, 2025

Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Sipeng Zheng, Qin Jin, and Zongqing Lu. Dig-flow: Discrepancy-guided flow matching for robust vla models.arXiv preprint arXiv:2512.01715, 2025

arXiv 2025
[48]

Being-h0: Vision-language-action pretraining from large-scale human videos

HaoLuo, YichengFeng, WanpengZhang, SipengZheng, YeWang, HaoqiYuan, JiazhengLiu, ChaoyiXu, QinJin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos. InInternational Conference on Machine Learning. PMLR, 2026

2026
[49]

Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

arXiv 2025
[50]

Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

arXiv 2025
[51]

Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

arXiv 2026
[52]

Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 18638–18646, 2026

2026
[53]

A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

Pith/arXiv arXiv 2026
[54]

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

[55]

Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions. ArXiv, abs/2509.06951, 2025. URL https://api.semanticscholar.org/CorpusID:281204333

[56]

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, et al. Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments. arXiv preprint arXiv:2605.30280, 2026

[57]

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, and Yue Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy. arXiv preprint arXiv:2512.11218, 2025

[58]

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. NeurIPS, 2024

[59]

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint, 2023

[60]

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Generalist bimanual manipulation via foundation video diffusion models. arXiv preprint, 2025

[61]

Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, and Xiaodan Liang. Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation. NeurIPS, 2024

[62]

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin P. Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. 2026. URL https://api.semanticscholar.org/CorpusID:287074218

[63]

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control. ArXiv, abs/2512.15840, 2025. URL https://api.semanticscholar.org/CorpusID:283933826

[64]

Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation. arXiv preprint, 2025

[65]

Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, and Jian Tang. Tc-idm: Grounding video generation for executable zero-shot robot motion. ArXiv, abs/2601.18323, 2026. URL https://api.semanticscholar.org/CorpusID:285051517

[66]

Zhongrui Zhang, Cheng-Chuan Yang, Qin Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, and Jianyu Chen. Veo-act: How far can frontier video models advance generalizable robot manipulation? 2026. URL https://api.semanticscholar.org/CorpusID:287202336

[67]

Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, et al. Vampo: Policy optimization for improving visual dynamics in video action models. arXiv preprint arXiv:2603.19370, 2026

[68]

Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, et al. Do world action models generalize better than vlas? a robustness study. arXiv preprint arXiv:2603.22078, 2026

[69]

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025

[70]

Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4d: Kinematic 4d world modeling for spatiotemporal embodied simulation. arXiv preprint arXiv:2603.16669, 2026

[71]

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, et al. Wovr: World models as reliable simulators for post-training vla policies with rl. arXiv preprint arXiv:2602.13977, 2026

[72]

Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, and Xiu Li. Robostereo: Dual-tower 4d embodied world models for unified policy optimization. arXiv preprint arXiv:2603.12639, 2026

[73]

Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. Unit: Toward a unified physical language for human-to-humanoid policy learning and world modeling. arXiv preprint arXiv:2604.19734, 2026

[74]

Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabilizing multi-step rollouts via reinforcement learning. arXiv preprint arXiv:2603.25685, 2026

[75]

Bingchuan Wei, Bingqi Huang, Jingheng Ma, Sen Cui, et al. Fate: Closed-loop feasibility-aware task generation with active repair for physically grounded robotic curricula. arXiv preprint arXiv:2603.01505, 2026

[76]

Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li, Jiagang Zhu, Tianze Liu, Jiajun Lv, Xingxing Zuo, Yun Ye, et al. Vag: Dual-stream video-action generation for embodied data synthesis. arXiv preprint arXiv:2604.09330, 2026

[77]

Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. arXiv preprint arXiv:2603.08546, 2026

[78]

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026

[79]

Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, and Donglin Wang. World-value-action model: Implicit planning for vision-language-action systems. arXiv preprint arXiv:2604.14732, 2026

[80]

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Hu Yue, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. ArXiv, abs/2508.05635, 2025. URL https://api.semanticscholar.org/CorpusID:280545868

Showing first 80 references.

[49] [49]

Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025

arXiv 2025

[50] [50]

Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

arXiv 2025

[51] [51]

Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. Vla-jepa: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098, 2026

arXiv 2026

[52] [52]

Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 18638–18646, 2026

2026

[53] [53]

A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

Pith/arXiv arXiv 2026

[54] [54]

Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025

Pith/arXiv arXiv 2025

[55] [55]

F1: A vision-language-action model bridging understanding and generation to actions.ArXiv, abs/2509.06951, 2025

Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.ArXiv, abs/2509.06951, 2025. URLhttps://api.semanticscholar.org/CorpusID:281204333

Pith/arXiv arXiv 2025

[56] [56]

Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments.arXiv preprint arXiv:2605.30280, 2026

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, et al. Qwen-vla: Unifying vision-language-action modeling across tasks, environments, and robot embodiments.arXiv preprint arXiv:2605.30280, 2026

Pith/arXiv arXiv 2026

[57] [57]

Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, and Yue Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

arXiv 2025

[58] [58]

Learning universal policies via text-guided video generation.NeurIPS, 2024

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.NeurIPS, 2024

2024

[59] [59]

Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint, 2023

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint, 2023

2023

[60] [60]

Generalist bimanual manipulation via foundation video diffusion models.arXiv preprint, 2025

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Generalist bimanual manipulation via foundation video diffusion models.arXiv preprint, 2025

2025

[61] [61]

Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation.NeurIPS, 2024

Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, and Xiaodan Liang. Vidman: Exploiting implicit dynamics from video diffusion model for effective robot manipulation.NeurIPS, 2024

2024

[62] [62]

Murphy, Chelsea Finn, and Yilun Du

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin P. Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. 2026. URL https://api.semanticscholar.org/CorpusID:287074218

2026

[63] [63]

Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.ArXiv, abs/2512.15840, 2025. URLhttps://api.semanticscholar.org/CorpusID: 283933826

Pith/arXiv arXiv 2025

[64] [64]

Anypos: Automated task-agnostic actions for bimanual manipulation.arXiv preprint, 2025

Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation.arXiv preprint, 2025

2025

[65] [65]

Tc-idm: Grounding video generation for executable zero-shot robot motion.ArXiv, abs/2601.18323, 2026

Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, and Jian Tang. Tc-idm: Grounding video generation for executable zero-shot robot motion.ArXiv, abs/2601.18323, 2026. URLhttps://api.semanticscholar.org/CorpusID:285051517

arXiv 2026

[66]

Zhongrui Zhang, Cheng-Chuan Yang, Qin Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, and Jianyu Chen. Veo-act: How far can frontier video models advance generalizable robot manipulation? 2026. URL https://api.semanticscholar.org/CorpusID:287202336

[67]

Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, et al. Vampo: Policy optimization for improving visual dynamics in video action models. arXiv preprint arXiv:2603.19370, 2026

[68]

Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, et al. Do world action models generalize better than vlas? a robustness study. arXiv preprint arXiv:2603.22078, 2026

[69]

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025

[70]

Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4d: Kinematic 4d world modeling for spatiotemporal embodied simulation. arXiv preprint arXiv:2603.16669, 2026

[71]

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, et al. Wovr: World models as reliable simulators for post-training vla policies with rl. arXiv preprint arXiv:2602.13977, 2026

[72]

Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, and Xiu Li. Robostereo: Dual-tower 4d embodied world models for unified policy optimization. arXiv preprint arXiv:2603.12639, 2026

[73]

Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. Unit: Toward a unified physical language for human-to-humanoid policy learning and world modeling. arXiv preprint arXiv:2604.19734, 2026

[74]

Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabilizing multi-step rollouts via reinforcement learning. arXiv preprint arXiv:2603.25685, 2026

[75]

Bingchuan Wei, Bingqi Huang, Jingheng Ma, Sen Cui, et al. Fate: Closed-loop feasibility-aware task generation with active repair for physically grounded robotic curricula. arXiv preprint arXiv:2603.01505, 2026

[76]

Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li, Jiagang Zhu, Tianze Liu, Jiajun Lv, Xingxing Zuo, Yun Ye, et al. Vag: Dual-stream video-action generation for embodied data synthesis. arXiv preprint arXiv:2604.09330, 2026

[77]

Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. arXiv preprint arXiv:2603.08546, 2026

[78]

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, and Yilun Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026

[79]

Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, and Donglin Wang. World-value-action model: Implicit planning for vision-language-action systems. arXiv preprint arXiv:2604.14732, 2026

[80]

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Hu Yue, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. ArXiv, abs/2508.05635, 2025. URL https://api.semanticscholar.org/CorpusID:280545868