{"total":14,"items":[{"citing_arxiv_id":"2606.15032","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position","primary_cat":"cs.LG","submitted_at":"2026-06-13T00:21:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper proposes an L0-L7 evidential ladder for evaluating world models in embodied decision-making, prioritizing interventional action fidelity and policy optimization utility over visual plausibility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02274","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning","primary_cat":"cs.RO","submitted_at":"2026-06-01T14:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dexterity-BEV creates 3D vertex-based inputs and BEV-aligned outputs to reduce spatial-temporal misalignments in end-to-end robot policies trained on diverse datasets and embodiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01164","ref_index":195,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends","primary_cat":"cs.CV","submitted_at":"2026-05-31T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31286","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-29T13:20:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30877","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Wall-OSS-0.5 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-29T06:04:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26282","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-25T19:06:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment 
between search and value learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22446","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts","primary_cat":"cs.CV","submitted_at":"2026-05-21T13:13:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09423","ref_index":75,"ref_count":16,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":4,"top_context_role":"background","top_context_polarity":"background","context_text":"Grutopia: Dream general robots in a city at scale, 2024. URL https://arxiv.org/abs/2407.10943. [74] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. POET: Open-Ended Co-evolution of Environments and Their Optimized Solutions. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 142-151. ACM, 2019. doi: 10.1145/3321707.3321799. 
[75] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, and Kenneth O. Stanley. Enhanced POET: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. In Proceedings of the 37th International Conference on Machine Learning, pages 9940-9951. PMLR, 2020. URL http://proceedings.mlr.press/v119/wang20l/wang20l."},{"citing_arxiv_id":"2604.27711","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control","primary_cat":"cs.RO","submitted_at":"2026-04-30T10:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13942","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection","primary_cat":"cs.RO","submitted_at":"2026-04-15T14:53:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To mitigate visual distractors in cluttered environments, the high-level planner identifies task-irrelevant regions using $B_k$. The low-level executor subsequently converts these constraints into a geometry-preserving filtered observation. 
At the onset of a sub-task, $t_k^{\\text{start}}$, distractor bounding boxes are processed by a zero-shot segmentation model $S$ to generate a pixel-level mask: $Q_{t_k^{\\text{start}}}(u, v) = \\mathbb{I}\\left[\\exists\\, b_i \\in B_k \\text{ such that } (u, v) \\in S(I_{t_k^{\\text{start}}}, b_i)\\right]$. (15) Following initialization, the mask is propagated through time using a lightweight temporal update module: $Q_t = K(I_t, Q_{t-1}), \\quad t > t_k^{\\text{start}}$. (16) The resulting filtered image is computed as $\\hat{I}_t = \\Psi(I_t) = I_t \\odot (1 - Q_t)$, (17) where $\\odot$ denotes element-wise multiplication. This operation effectively suppresses distractor regions while preserving the task-relevant geometry of the scene."},{"citing_arxiv_id":"2604.11302","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS","primary_cat":"cs.RO","submitted_at":"2026-04-13T11:01:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16484","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-13T03:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09330","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-10T13:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025. 2 [59] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai. arXiv preprint arXiv:2511.19861, 2025. 2 [60] GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026. 
2 [61] Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek"},{"citing_arxiv_id":"2604.08168","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-04-09T12:28:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[43] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998. 2, 3 [44] GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025. 7 [45] GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026. 1 [46] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey"}],"limit":50,"offset":0}