{"total":15,"items":[{"citing_arxiv_id":"2605.21372","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training","primary_cat":"cs.CV","submitted_at":"2026-05-20T16:36:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17912","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform","primary_cat":"cs.RO","submitted_at":"2026-05-18T06:18:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03269","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the 
baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"//research.nvidia.com/labs/gear/gr00t-n1_6/, December 2025b. [37] Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. [38] Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. InRobotics: Science and Systems, 2024. [39] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025. [40] Google. Gemini API | Google AI for Developers.https://ai.google.dev/api, 2026. [Accessed 03-05-2026]. [41] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen,"},{"citing_arxiv_id":"2605.02130","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 8 [58] GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model.arXiv e-prints, pages arXiv- 2510, 2025. 
8 [59] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025. 8 [60] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are-"},{"citing_arxiv_id":"2605.00080","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Model for Robot Learning: A Comprehensive Survey","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26848","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-29T16:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising real-world success from 42.5% to 70.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26694","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unified 4D World Action Modeling from Video Priors with Asynchronous 
Denoising","primary_cat":"cs.RO","submitted_at":"2026-04-29T14:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024. [16] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. [17] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025. 
[18] Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al."},{"citing_arxiv_id":"2604.19092","ref_index":46,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-21T05:09:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(LVP) [9] investigates video-conditioned planning, leveraging predicted visual rollouts as intermediate representations for downstream control. WoW [12] emphasizes physically grounded intuition through large-scale embodied interaction data. EnerVerse [21] proposes an embodied future-space formulation for manipulation reasoning, while GigaWorld-0 [46] frames world models as scalable data engines for embodied AI. As world models evolve from perceptual generators into embodied simulators and planning engines, evaluation must extend beyond visual fidelity to verify physical consistency and control feasibility. 
2.2 Learning Robotic Actions from Video Learning robotic control from video has attracted increasing attention as large-"},{"citing_arxiv_id":"2604.15805","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation","primary_cat":"cs.RO","submitted_at":"2026-04-17T08:06:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025. [47] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat.NeurIPS, 2021. [48] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. 
Gigaworld-0: World models as data engine to empower embodied ai, 2025."},{"citing_arxiv_id":"2604.11751","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"not expect neural networks to produce trajectories that deviate too much from those demonstrated in D, especially when training is performed from scratch, and D is not sufficiently large. For the same reason, the language instruction ℓ tends to serve as a one-hot label [34], inducing poor novel instruction following ability; Moreover, the model may exploit visual shortcuts, selecting actions based on spurious correlations [54], such as associating the actions with the scene layout. Both issues indicate a lack of genuine vision-language understanding by the model, preventing extrapolation. VLAs are proposed to address this by initializing from pretrained foundation models. They are thus expected to possess the capability:semantic generalization. 
This aims at making a policy trained"},{"citing_arxiv_id":"2604.11302","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS","primary_cat":"cs.RO","submitted_at":"2026-04-13T11:01:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09330","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-10T13:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Anypos: Automated task-agnostic actions for bimanual manipulation. arXiv preprint arXiv:2507.12768, 2025. 2, 3, 6 [58] GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model.arXiv preprint arXiv:2510.19430, 2025. 2 [59] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025. 
2 [60] GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv"},{"citing_arxiv_id":"2604.04707","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenWorldLib: A Unified Codebase and Definition of Advanced World Models","primary_cat":"cs.CV","submitted_at":"2026-04-06T14:19:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Song, Delin Qu, et al. Are we ready for rl in text-to-3d generation? a progressive investigation.arXiv preprint arXiv:2512.10949, 2025. [113] GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model.arXiv preprint arXiv:2510.19430, 2025. [114] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025. 
[115] HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al."},{"citing_arxiv_id":"2603.28489","ref_index":201,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"WoVoGen [180],Cosmos-Drive-Dreams [181] Drive-WM [182], Vista [183], MiLA [184], ADriver-I [185], [186], Drivedreamer [19], MagicDrive-V2 [40], DriveArena [187], MAD [188] Epona [189], GenAD [190], DriveLaW [191], DrivingGPT [192], VaVAM [193] Embodied AI Vidar [194], DreamGen [195], GenMimic [196], RBench [197], GigaWorld-0 [198], RIGVid [199], LuciBot [200], Gen2Act [201], Dreamitate [202] World-Env [203], EVAC [204], Ctrl-World [205], VideoAgent [206], VIPER [207], WorldEval [208], Genie Envisioner [209], World-Gymnast [210], DreamDojo [211] GR-1 [212], VILP [213], UVA [214], RoboEnvision [215], GEVRM [216], EnerVerse [217], LingBot-VA [218], Cosmos Policy [219],Fast-WAM [220],LeWorldModel [221],DreamZero [222]"},{"citing_arxiv_id":"2603.12639","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization","primary_cat":"cs.CV","submitted_at":"2026-03-13T04:16:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dual-tower 4D embodied world model called RoboStereo 
reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}