{"total":11,"items":[{"citing_arxiv_id":"2605.13403","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[82] Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022. [83] Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126-4133, 2022. [84] Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023. [85] Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Jörn Vogel. Shared control templates for assistive robotics. 
In 2020 IEEE international conference on robotics and automation (ICRA), pages 1956-1962."},{"citing_arxiv_id":"2605.12090","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VLAs World Model Action-Conditioned iVideoGPT [23], FlowDreamer [24], EnerVerse [25], PlaNet [26], TransDreamer [27], V-JEPA [28]. . . Language-Conditioned MoCoGAN [29], U-Net [30], Latte [31], Wan [32], Sora 2 [33]. . . Embodied World Model SWIM [34], DreamDojo [35], RoboDreamer [36], RoboScape [37]. . . 
WM for VLA Imitation Learning Ctrl-World [38], RoboScape [37], DREMA [39] Reinforcement Learning Dreamer to Control [40] DreamerV2 [41], Dreamer 4 [42], RISE [43] DreamerV3 [44], DayDreamer [45], World-Env [46], RoboScape-R [47] WMPO [48], WoVR [49], VLA-RFT [50], RWML [51], MoDem-V2 [52] World-Gymnast [53], RWM-U [54], World4RL [55], VIPER [56] PhysWorld [57], Diffusion Reward [58], GenReward [59]"},{"citing_arxiv_id":"2605.01694","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent State Design for World Models under Sufficiency Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-03T03:19:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2023-2026 I-JEPA [2], V-JEPA 2 [3], V-JEPA 2.1 [44], LeWorldModel [40] Reward / value-shaped Reward and policy-relevant supervision 2019-2021 TPC [46], value-aligned latent planning [28] Value-equivalent Bellman-relevant statistics only 2020-2023 MuZero [52], EfficientZero [66], TD-MPC2 [26] Causal / counterfactual Intervention-sensitive structural variables 2026 Causal-JEPA [45], CausalVAE-WM [14] Table 1 maps design targets along this spectrum. Physical-reasoning probes [39, 67] reinforce the axis by separating visual fidelity from physical and causal correctness. 2.2 Relationships among sufficiency constraints The six roles are descriptive. The sufficiency constraints behind them have formal relationships. 
Three propositions"},{"citing_arxiv_id":"2604.07517","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-04-08T18:52:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04974","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11755","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints","primary_cat":"cs.CV","submitted_at":"2026-03-12T10:02:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new occlusion-aware control module generates high-fidelity egocentric videos from sparse 3D hand joints, supported by a million-clip dataset and cross-embodiment benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15493","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GR-3 Technical 
Report","primary_cat":"cs.RO","submitted_at":"2025-07-21T10:54:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00990","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations","primary_cat":"cs.RO","submitted_at":"2025-07-01T17:39:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895-10904, 2023. 3 [80] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879-893. PMLR, 2018. 3 [81] Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023. 3 [82] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos? arXiv preprint arXiv:2501.09038, 2025. 
1 [83] Shinichiro Nakaoka, Atsushi Nakazawa, Fumio Kanehiro,"},{"citing_arxiv_id":"2506.14135","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-06-17T02:55:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.04983","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning","primary_cat":"cs.RO","submitted_at":"2024-11-07T18:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.13139","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2023-12-20T16:00:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}