{"total":37,"items":[{"citing_arxiv_id":"2605.23856","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"model to generate actions with expressive multimodal distributions. DPPO [ 56] extends this idea107 by modeling a two-layer MDP structure, which enables ﬁne-tuning of diffusion-based policies in108 RL settings. Another line of work integrates diffusion models with value-based methods (e.g., Q-109 learning), to generate multimodal action distributions guided by learned value functions, such as110 Diffusion-QL [ 57], IDQL [ 58], CPQL [ 59], CEP [ 60], and DWM [ 61].111 3 Latent Identiﬁcation in POMDP112 In this section, we seek to formally model the structure of the decision-making system by answering:113 (1). 
Where the latent factors reside and how they influence the observable variables such as states, actions, and rewards? and (2). Whether they can be identified from demonstration data alone?"},{"citing_arxiv_id":"2605.12090","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ages the pre-trained semantic latent spaces of Large Language Models (LLMs) or Vision-Language Models (VLMs) to map perceptual inputs directly to the action space. Formally, the VLA objective is defined by the conditional probability of actions given the multimodal context: 3 Roadmap to WAM Background VLAs World Model Action-Conditioned iVideoGPT [23], FlowDreamer [24], EnerVerse [25], PlaNet [26], TransDreamer [27], V-JEPA [28]... Language-Conditioned MoCoGAN [29], U-Net [30], Latte [31], Wan [32], Sora 2 [33]... Embodied World Model SWIM [34], DreamDojo [35], RoboDreamer [36], RoboScape [37]... WM for VLA Imitation Learning Ctrl-World [38], RoboScape [37], DREMA [39] Reinforcement Learning Dreamer to Control [40] DreamerV2 [41], Dreamer 4 [42], RISE [43]"},{"citing_arxiv_id":"2605.07514","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is the Future Compatible? 
Diagnosing Dynamic Consistency in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T09:44:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"candidates N. The overhead of computing selection weights and consistency scores is negligible, taking approximately 0.7 ms for N = 8. 5.1 Results on RoboCasa. Table 1: Results on RoboCasa. \"∗\" denotes our reimplementation; all other results are taken from [22]. TTS indicates test-time scaling. Method | TTS | Average SR (%): UVA [26] | ✗ | 50.0; DP-VLA [16] | ✗ | 57.3; UWM [45] | ✗ | 60.8; π0 [6] | ✗ | 62.5; GR00T-N1.5 [5] | ✗ | 64.1; Video-Policy [27] | ✗ | 66.0; FLARE [44] | ✗ | 66.4; Cosmos-Policy∗ [22] | ✗ | 66.6; + Value-Prediction∗ | ✓ | 67.4; + Consistency-Consensus (ours) | ✓ | 67.3; + Consistency-Exploring (ours) | ✓ | 68.0. We follow the evaluation protocol of Cosmos-Policy [22] on RoboCasa [29], which contains 24 kitchen manipulation tasks performed by a single Franka Emika"},{"citing_arxiv_id":"2605.06481","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LIBERO-PRO: Towards robust and fair evaluation of 
Vision-Language-Action models beyond memorization.arXiv preprint arXiv:2510.03827, 2025. [100] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representa- tions in neural networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. arXiv:1812.07035. [101] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. [102] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al."},{"citing_arxiv_id":"2605.06222","ref_index":32,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When to Trust Imagination: Adaptive Action Execution for World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in the real world, it improves success rate by 35%. 2 Related work World action models.World Action Models (W AMs) extend standard VLA policies by explicitly modeling how future observations evolve under actions through joint video-action generation [33, 13, 1, 30, 12, 31]. 
This formulation allows WAMs to capture multiple control-relevant distributions within a unified framework, including forward dynamics p(o′ | o, a), inverse dynamics p(a | o, o′), the marginal action distribution p(a | o), and the marginal image distribution p(o′ | o) corresponding to video generation [33, 1, 13]. Compared with VLAs that primarily model the action modality,"},{"citing_arxiv_id":"2605.06192","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields","primary_cat":"cs.CV","submitted_at":"2026-05-07T13:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv:2605.06192v1 [cs.CV] 7 May 2026 Figure 1: Comparison between direct low-dimensional action conditioning and the proposed Structured Kinematic-to-Visual Action Fields (KVAFs). policy learning, and offer scalable proxies for evaluating VLA policies without repeatedly executing every policy on physical robots. Meanwhile, recent world-action models [31, 28, 7] have begun to jointly tune video generation and action modeling within unified generative frameworks, showing that future videos can provide dense world representations for action generation, policy learning, planning, and value estimation. 
However, these works predominantly treat video generation merely as an auxiliary representation to optimize"},{"citing_arxiv_id":"2605.05126","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-06T16:55:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024. 2, 3, 6 [93] Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kai- wen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and Rui Shao. Hiconagent: History context-aware policy optimiza- tion for gui agents.arXiv preprint arXiv:2512.01763, 2025. 2 [94] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burch- fiel, Paarth Shah, and Abhishek Gupta. Unified world mod- els: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. 3 [95] Yijie Zhu, Yibo Lyu, Zitong Yu, Rui Shao, Kaiyang Zhou, and Liqiang Nie. 
Emosym: A symbiotic framework for uni-"},{"citing_arxiv_id":"2605.02881","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MolmoAct2: Action Reasoning Models for Real-world Deployment","primary_cat":"cs.RO","submitted_at":"2026-05-04T17:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00080","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Model for Robot Learning: A Comprehensive Survey","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00078","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA 
efficiency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"6 [5] 3B 93.9 - - 36.0 47.6 4.60 4.24 - π0.5 [40] 3B 96.9 77.4 - 41.4 - 4.06 4.13 82.7/76.8 starVLA [113] 4B 96.5 77.0 - - 48.8 - - 88.2/88.3 MINT-4B [114] 4B 98.7 80.1 84.1 - - 4.57 - - ABot-M0 [115] 4B 98.6 80.5 - - 58.3 - - 86.1/85.1 LingBot-VLA [116]4B - - - - - - - 86.5/85.3 Being-H0.5 [7] 2B 98.9 78.5 83.1 53.5 - 4.63 4.48 - # W orld Model UWM [57] - 79.0 - - 48.2 - - - - UVA [56] - - - - 50.0 - - - - VPP [52] 1.5B - - - - - - 4.33 - DreamVLA [65] - 92.6 - - - - - 4.44 - JEPA-VLA [117] - 96.4 25.6 - - - - - 73.5/- VLA-JEPA [64] - 96.1 79.5 - - - - - - LingBot-VA [14] 5B 98.5 - - - - - - 92.9/91.6 Cosmos-Policy [13]2B 98.5 - - 67.1 - - - - Fast-WAM [15] 6B 97.6 - - - - - - 91.9/91.8 Being-H0."},{"citing_arxiv_id":"2604.26694","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising","primary_cat":"cs.RO","submitted_at":"2026-04-29T14:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"inverse dynamics models or extracting intermediate representations to convert world models into planners [22, 32-35]. Others augment VLAs with auxiliary future prediction objectives to inject dynamics awareness [36-41]. While both directions yield improvements, they remain loosely coupled rather than truly unified. Recently, a line of work has sought to build end-to-end unified video-action models from video foundation models. 
UWM [19] and Motus [20] formulate the problem as a Unified World Model, enabling flexible conditioning and multi-task generation. VideoVLA [21] and Cosmos Policy [11] directly append action tokens into video sequences for joint prediction. Other works [10, 27, 28] employ a Mixture of Transformer architecture with independent parameters and denoising timesteps"},{"citing_arxiv_id":"2604.21914","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-23T17:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ics models [26], [27], [28] or pose tracking models [29], [8]. VISTA [30], on the other hand, leverages generative models to synthesize images from novel viewpoints for data augmentation, thereby enhancing the policy's robustness to variations in camera pose. Another line of research proposes unified models that simultaneously predict both future frames and robot actions [31], [32], [19], handling both scene understanding and action generation within a single framework. Some approaches leverage generative models as world simulators to support model-based reinforcement learning algorithms [33], [34] or to enable closed-loop verification of policies [35]. 
Rooted in end-to-end robotic manipulation frameworks, our approach aims to produce latent representa-"},{"citing_arxiv_id":"2604.16592","ref_index":232,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"✗ ✓ ✗ ✓ ✓ ✗ ✗ | Dynamic 3DGS World Model predicting future Gaussian scenes under action for manipulation; RoboScape [148] | 2025 | Robot. | ✗ ✓ ✗ ✓ ✓ ✗ ✗ | Physics-informed World Model jointly learning video, depth, and keypoint dynamics; EnerVerse-AC [68] | 2025 | Robot. | ✓ ✓ ✗ ✗ ✓ ✗ ✗ | Chunk-wise autoregressive video diffusion with sparse memory and 4DGS for action-conditioned prediction; UWM [232] | 2025 | Robot. | ✓ ✗ ✗ ✗ ✓ ✗ ✗ | Couples video and action diffusion in one transformer; pretrained on video-only and video+action data; GR-1 [185] | 2024 | Robot. | ✓ ✓ ✓ ✗ ✓ ✗ ✗ | GPT transformer pretrained on 800K Ego4D clips jointly predicting actions and future frames; GR-2 [28] | 2024 | Robot. | ✓ ✓ ✓ ✗ ✓ ✗ ✗ | Scaled video-language-action model (719M) achieving 97.7%"},{"citing_arxiv_id":"2604.13645","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot 
Policies","primary_cat":"cs.RO","submitted_at":"2026-04-15T09:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11751","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11135","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps","primary_cat":"cs.RO","submitted_at":"2026-04-13T07:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16484","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied 
Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-13T03:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09527","ref_index":126,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Envisioning the Future, One Step at a Time","primary_cat":"cs.CV","submitted_at":"2026-04-10T17:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"able zero-shot planning. InForty-second International Con- ference on Machine Learning. 2, 3 [125] Mo Zhou, Jianwei Wang, Xuanmeng Zhang, Dylan Camp- bell, Kai Wang, Long Yuan, Wenjie Zhang, and Xuemin Lin. Probdiffflow: an efficient learning-free framework for probabilistic single-image optical flow estimation.Frontiers of Computer Science, 20(8):2008342, 2026. 3 [126] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burch- fiel, Paarth Shah, and Abhishek Gupta. Unified world mod- els: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. 2 Envisioning the Future, One Step at a Time Supplementary Material A. 
Additional Implementation Details We provide more context on implementation details of our"},{"citing_arxiv_id":"2604.03181","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-03T16:57:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"manipulation via multi-modal diffusion. arXiv preprint arXiv:2512.16023, 2025. [33] Yixiang Chen, Peiyan Li, Yan Huang, Jiabing Yang, Kehan Chen, and Liang Wang. Ec-flow: Enabling versatile robotic manipulation from action-unlabeled videos via embodiment-centric flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11958-11968, October 2025. [34] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. [35] Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. 
Prediction with action: Visual policy learning via joint denoising process."},{"citing_arxiv_id":"2603.16666","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","primary_cat":"cs.CV","submitted_at":"2026-03-17T15:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model, 2025. URL https://arxiv.org/abs/2512.13030. [6] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets, 2025. URL https://arxiv.org/abs/2504.02792. [7] Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation, 2025. URL https://arxiv.org/abs/2507.12898. [8] Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. 
Learning universal policies via text-guided video generation,"},{"citing_arxiv_id":"2603.15759","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation","primary_cat":"cs.RO","submitted_at":"2026-03-16T18:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20231","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-02-23T18:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15922","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models are Zero-shot Policies","primary_cat":"cs.RO","submitted_at":"2026-02-17T15:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16163","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","primary_cat":"cs.AI","submitted_at":"2026-01-22T18:09:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.07060","ref_index":153,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024. 3, 6, 7 [152] Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, and Jianlan Luo. Act2goal: From world model to general goal-conditioned policy.arXiv preprint arXiv:2512.23541, 2025. 3 [153] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burch- fiel, Paarth Shah, and Abhishek Gupta. 
Unified world mod- els: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. 2 [154] Jie Zhu, Xiao Guo, Yiyang Su, Anil Jain, and Xiaom- ing Liu. Fusionagent: A multimodal agent with dynamic model selection for human recognition."},{"citing_arxiv_id":"2512.21714","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AstraNav-World: World Model for Foresight Control and Consistency","primary_cat":"cs.CV","submitted_at":"2025-12-25T15:31:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15692","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs","primary_cat":"cs.RO","submitted_at":"2025-12-17T18:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13030","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Motus: A Unified Latent Action World 
Model","primary_cat":"cs.CV","submitted_at":"2025-12-15T06:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• WM:p(o t+1:t+k |o t,a t+1:t+k). • IDM:p(a t+1:t+k |o t:t+k). • VGM:p(o t+1:t+k |o t, ℓ). • Video-Action Joint Prediction Model: p(ot+1:t+k,a t+1:t+k |o t, ℓ). Two fundamental challenges (detailed in Sec. 3) hinder the integration of these capabilities. First,unifying such multimodal generative capabilitieswithin one framework is nontrivial. While unified world models (UWMs) [64] offer a theoretical prototype, they are typically trained from scratch or with limited priors, lacking either robust vision-language understanding from vision-language models (VLMs) or rich physical interaction knowledge from VGMs. 
Second, em- bodied intelligence demands the ability tolearn from large- scale heterogeneous data-including internet videos, ego-"},{"citing_arxiv_id":"2511.17792","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?","primary_cat":"cs.CV","submitted_at":"2025-11-21T21:36:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04812","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal Diffusion Forcing for Forceful Manipulation","primary_cat":"cs.RO","submitted_at":"2025-11-06T21:08:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10125","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ctrl-World: A Controllable Generative World Model for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2025-10-11T09:13:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy 
success rates by 44.7% via imagined trajectory fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06951","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions","primary_cat":"cs.RO","submitted_at":"2025-09-08T17:58:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04447","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","primary_cat":"cs.CV","submitted_at":"2025-07-06T16:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. 2, 3, 8 [59] Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, and Jiangmiao Pang. Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation. arXiv preprint arXiv:2504.17784, 2025. 
[60] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. [61] Hongyin Zhang, Zifeng Zhuang, Han Zhao, Pengxiang Ding, Hongchao Lu, and Donglin Wang. Reinbot: Amplifying robot visual-language manipulation with reinforcement learning."},{"citing_arxiv_id":"2507.01099","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Geometry-aware 4D Video Generation for Robot Manipulation","primary_cat":"cs.CV","submitted_at":"2025-07-01T18:01:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A geometry-aware 4D video generation model trained with cross-view pointmap alignment to produce spatio-temporally consistent future videos from novel viewpoints for robot manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.12705","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World 
Models","primary_cat":"cs.RO","submitted_at":"2025-05-19T04:55:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}