{"total":14,"items":[{"citing_arxiv_id":"2606.20781","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: A Survey","primary_cat":"cs.RO","submitted_at":"2026-06-18T17:05:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"PhysGen [144] adapts a continuous autoregressive video backbone, and DriveWAM [141] interleaves driving 8 video latents with ego-action tokens on a shared DiT. This group changes the coupling pattern, but the action token is still grounded by a pixel-decodable future. Diffusion and hybrid variants make the same joint target less sequential. PAD [52] jointly denoises future images and actions in one DiT. UD-VLA [19], CoVAR [183], and VideoVLA [140] vary the bridge through discrete diffusion, cross-stream attention, or video-generator co-training. Motus [10] is placed here by its joint-prediction path, where sparse future video and actions are denoised together. Its scheduler can switch to action-only or inverse-dynamics use cases, but those shortcuts do not change the category assignment."},{"citing_arxiv_id":"2606.19531","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?","primary_cat":"cs.CV","submitted_at":"2026-06-17T19:25:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16533","ref_index":158,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kairos: A Native World Model Stack for Physical AI","primary_cat":"cs.AI","submitted_at":"2026-06-15T10:37:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09811","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing","primary_cat":"cs.RO","submitted_at":"2026-06-08T17:55:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AHA-WAM is a dual-DiT asynchronous world-action model with horizon-adaptive offset training and OVCR routing that reports 92.8% success on RoboTwin and 78.3% on real tasks at 24.17 Hz without robot pretraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09258","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection","primary_cat":"cs.RO","submitted_at":"2026-06-08T09:30:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07107","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-05T10:01:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12369","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. [13] Alexandre Chapin, Emmanuel Dellandréa, and Liming Chen. Storm: Slot-based task-aware object-centric rep- resentation for robotic manipulation.arXiv preprint arXiv:2601.20381, 2026. [14] Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language- action model via joint discrete denoising diffusion pro- cess.arXiv preprint arXiv:2511.01718, 2025. [15] Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan,"},{"citing_arxiv_id":"2605.12090","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], LingBotV A [18], AIM [ 102] DexWorldModel [103], FastW AM [104], MotuBrain [105] AdaWorldPolicy [106], DiT4DiT [107], Motus [19], Act2Goal [108], PhysGen [22], GigaWorld-Policy [109], UD-VLA [110], X-W AM [111] Training data Robot-centric Teleoperation QT-Opt [112], MIME [ 113], RoboNet [114], Robo T urk-Real [115], BridgeData [116], MT-Opt [117] BC-Z [118], RT-1 [119], Language-Table [120], BridgeData v2 [ 121], Jaco Play [ 122] Cable Routing Dataset [ 123], RH20T [124], OXE [125], DROID [126], RH20T-P [127], RoboMIND [128] ARIO [129], RoboData [130], DexCap [131], FuSe [132], AgiBot World [133], REASSEMBLE [ 134] OmniAction [135], UnifoLM-WBT [136] UMI-style Human Demonstration"},{"citing_arxiv_id":"2605.00438","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation","primary_cat":"cs.AI","submitted_at":"2026-05-01T06:15:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2212.06817, 2022. [4] Jun Cen, Chaohui Y u, Hangjie Y uan, Y uming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025. [5] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. [6] Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Uniﬁed diffusion vla: Vision-language-action model via joint discrete denoising diffusion process. arXiv preprint arXiv:2511.01718, 2025. [7] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchﬁel, Russ Tedrake, and Shuran Song."},{"citing_arxiv_id":"2604.25050","ref_index":44,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors","primary_cat":"cs.RO","submitted_at":"2026-04-27T23:04:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Discrete diffusion policies act as natural asynchronous executors for robotics by treating action generation as iterative unmasking, yielding higher success rates and lower computation than flow-matching real-time chunking in dynamic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[42] Y . Wen, H. Li, K. Gu, Y . Zhao, T. Wang, and X. Sun. Llada-vla: Vision language diffusion action models.arXiv preprint arXiv:2509.06932, 2025. [43] J. Wen, M. Zhu, J. Liu, Z. Liu, Y . Yang, L. Zhang, S. Zhang, Y . Zhu, and Y . Xu. dvla: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025. [44] J. Chen, W. Song, P. Ding, Z. Zhou, H. Zhao, F. Tang, D. Wang, and H. Li. Unified diffu- sion vla: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025. [45] J. Ye, S. Gong, J. Gao, J. Fan, S. Wu, W. Bi, H. Bai, L. Shang, and L. Kong. Dream-vl & dream-vla: Open vision-language and vision-language-action models with diffusion language"},{"citing_arxiv_id":"2603.26320","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching","primary_cat":"cs.RO","submitted_at":"2026-03-27T11:38:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.25661","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance","primary_cat":"cs.RO","submitted_at":"2026-03-26T17:14:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervised finetuning plus a simple regularization term.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14148","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-11-18T05:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}