{"total":71,"items":[{"citing_arxiv_id":"2606.28128","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation","primary_cat":"cs.CV","submitted_at":"2026-06-26T14:30:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27677","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DIM-WAM: World-Action Modeling with Diverse Historical Event Memory","primary_cat":"cs.RO","submitted_at":"2026-06-26T03:17:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30011","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second 
levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00113","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2026-05-27T05:32:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27947","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SANTS: A State-Adaptive Scheduler for World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-27T04:40:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00110","ref_index":95,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action 
manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27817","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Turning Video Models into Generalist Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-05-27T01:21:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25829","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T13:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen 
motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21862","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control","primary_cat":"cs.RO","submitted_at":"2026-05-21T01:19:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18556","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Key-Gram: Extensible World Knowledge for Embodied Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-18T15:37:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18287","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableVLA: Towards Robust Vision-Language-Action Models without Extra Data","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15735","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":94,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation 
models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM Autoregression GR1 [86], grmg [ 87], GR2 [88], Co TVLA [89], WorldVLA [90], rynnvla2 [91] VLA-JEP A [92], F1-VLA [93] Diffusion-based P AD [21], VideoVLA [94], UWM [20], DreamZero [ 17], CosmosPolicy [16], FLARE [95], UV A [96] FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], LingBotV A [18], AIM [ 102] DexWorldModel [103], FastW AM [104], MotuBrain [105] AdaWorldPolicy [106], DiT4DiT [107], Motus [19], Act2Goal [108], PhysGen [22], GigaWorld-Policy [109], UD-VLA [110], X-W AM [111] Training data"},{"citing_arxiv_id":"2605.10942","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025. [9] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 
[10] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. [11] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei"},{"citing_arxiv_id":"2605.10925","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:56:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"[9] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yun- liang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024. [10] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
[11] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping"},{"citing_arxiv_id":"2605.10819","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09948","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A typical VLA system encodes visual observations and language instruc- tions into latent representations using a VLM backbone, followed by a policy head that maps these representations to control actions [20]. Recent efforts have focused on scaling data, model capac- ity, and task diversity, leading to improved generalization across tasks and environments [ 21, 22]. 
In most exist- ing approaches, action prediction is primarily based on the final-layer representations of the backbone, implicitly assuming that deeper representations are more suitable for decision making. While effective in practice, this de- sign largely treats the output of the backbone as a fixed representation for downstream control. Layer-wise Representation Readout."},{"citing_arxiv_id":"2605.07931","ref_index":11,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:04:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025. [10] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. [11] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
[12] Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xiang Li,"},{"citing_arxiv_id":"2605.07794","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T14:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. [22] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025. [23] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
[24] Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu,"},{"citing_arxiv_id":"2605.03269","ref_index":21,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025a. [20] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. InRobotics: Science and Systems, 2025b. [21] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 
[22] Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu"},{"citing_arxiv_id":"2605.02130","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[9] Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In ICRA, 2025. 8 [10] Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025. 2 [11] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 
2 [12] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia."},{"citing_arxiv_id":"2605.00078","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 19 [89] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023. [90] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
[91] Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv"},{"citing_arxiv_id":"2604.26565","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation","primary_cat":"cs.CV","submitted_at":"2026-04-29T11:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language models on DenseStep2M yields substantial performance gains across both generative (dense captioning and procedural grounding) and discriminative (cross-modal retrieval) video under- standing tasks. 2 Related Work Large-Scale Video-Text Datasets.Multimodal video-text datasets serve as fundamental resources for advancing video understanding research. Early datasets such as MSVD [17], MSR-VTT [71], and DiDeMo [4] provide high-quality, human-curated annotations but are limited in scale, often containing only a few thousand clips. To support the training of foundational models, the field has shifted toward web-scale corpora. 
Datasets like WebVid-2M [11], Youtube- 8M [1], and Instagram65M [29] leverage massive amounts of inter- net data, though they often rely on sparse user-provided tags or"},{"citing_arxiv_id":"2604.25859","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-28T16:58:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24681","ref_index":36,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-27T16:42:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. [34] Qingwen Bu, Yanting Yang, et al. Univla: Learning to act anywhere with task-centric latent actions. InRobotics: Science and Systems, 2025. [35] Hongtao Wu, Ya Jing, et al. Unleashing large-scale video generative pre-training for visual robot manipulation. 
InInternational Conference on Learning Representations, 2024. [36] Chi-Lam Cheang, Guangzeng Chen, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. [37] Homanga Bharadhwaj, Debidatta Dwibedi, et al. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. InConference on Robot Learning, 2025."},{"citing_arxiv_id":"2604.22615","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GazeVLA: Learning Human Intention for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-24T14:46:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Objects Fig. 2:We curate a large-scale egocentric dataset from diverse sources, containing both hand and gaze annotations with masks indicating validity. The dataset features a unified coordinate system and covers diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, resulting in a total of over 150M frames. models [3,13,30,52,66,83] via large-scale visual pretraining. Compared to these approaches, our work advocates moving beyond learning what human do toward understanding why they do it. 
By explicitly modeling human intent rather than merely imitating execution-level behaviors, we aim to enable deeper and more generalizable knowledge transfer from human to robots, which is particularly"},{"citing_arxiv_id":"2604.21241","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"OpenVLA [2] suggest that scaling multimodal backbones can translate into broader task coverage in robotics. At the same time, the field has been actively experimenting with different design choices-from diffusion/flow-based action heads that improve continuous control fidelity (e.g., Octo [3], pi0 [4], RDT [5]), to richer multimodal structures and training signals (e.g., GR-1/GR-2 [6], [7], RoboDreamer [8], and RL-augmented variants [9], [10]). These parallel threads reflect an ongoing evolution of VLA paradigms rather than a settled blueprint [11]. Alongside architectural progress, the robotics community continues to accumulate data from increasingly diverse plat- forms and setups. 
Differences in embodiments, controllers, camera configurations, and annotation conventions make it"},{"citing_arxiv_id":"2604.20246","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cortex 2.0: Grounding World Models in Real-World Industrial Deployment","primary_cat":"cs.RO","submitted_at":"2026-04-22T06:49:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and unpacking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"At internet scale, UniSim [8] and Cosmos [9] demonstrated that world models pretrained on large- scale video acquire broad physical priors transferable to robotic settings. Several concurrent works have explored using such models at inference time: IRASim [28] and GPC [29] showed that scoring candidate rollouts before execution improves task success over reactive policies, while GR-2 [30] and V-JEPA 2 [31] validated that joint pretraining on internet video and robot data supports strong physical reasoning with limited robot-specific supervision. Li et al. [32] further demonstrated this direction on deployment data. 
Cortex 2.0 builds on these findings by grounding world model training in continuously collected operational data and scoring imagined rollouts via PRO before any action is"},{"citing_arxiv_id":"2604.17887","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Closed-loop visuomotor control with generative expectation for robotic manipulation. Advances in Neural Information Processing Systems, 2024. 1 [11] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025. 2 [12] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
3 [13] Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li,"},{"citing_arxiv_id":"2604.17862","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"M100: An Orchestrated Dataflow Architecture Powering General AI Computing","primary_cat":"cs.LG","submitted_at":"2026-04-20T06:19:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16592","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"✓ ✓ ✗ ✗ ✓ ✗ ✗Chunk-wise autoregressive video diffusion with sparse memory and 4DGS for action-conditioned prediction UWM [232] 2025 Robot.✓ ✗ ✗ ✗ ✓ ✗ ✗Couples video and action dif- fusion in one transformer; pretrained on video-only and video+action data GR-1 [185] 2024 Robot.✓ ✓ ✓ ✗ ✓ ✗ ✗GPT transformer pretrained on 800K Ego4D clips jointly pre- dicting actions and future frames GR-2 [28] 2024 Robot.✓ ✓ ✓ ✗ ✓ ✗ ✗Scaled 
video-language-action model (719M) achieving 97.7% success across 100+ real tasks UniPi [44] 2023 Robot.✗ ✗ ✓ ✗ ✓ ✗ ✗Text-conditioned video diffusion as policy; extracts actions via in- verse dynamics SuSIE [19] 2024 Robot.✗ ✓ ✓ ✗ ✓ ✗ ✗Image-editing diffusion synthe- sizing subgoal images for goal- conditioned manipulation policy"},{"citing_arxiv_id":"2604.15483","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026. 3 [27] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video genera- tive pre-training for visual robot manipulation.Interna- tional Conference on Learning Representations (ICLR), 2024. [28] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr- 2: A generative video-language-action model with web- scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
3 [29] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov,"},{"citing_arxiv_id":"2604.13654","ref_index":178,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap","primary_cat":"cs.RO","submitted_at":"2026-04-15T09:20:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"states and evaluate potential actions before execution, while unified architectures like RoboBrain [177] bridge abstract reasoning with concrete manipulation through multi-level world modeling. This integration is further enhanced by techniques such as visual chain-of-thought reasoning (CoT-VLA) [95] and the incorporation of web-scale knowledge (GR-2) [178]. By combining the predictive power of world models with the actionable outputs of VLAs, these integrated systems represent a significant step toward truly generalist embodied agents capable of robust, long-horizon navigation in complex, dynamic environments. 
3.3.4 Architectural Integration: Hierarchical and Hybrid Agentic Systems Effective agentic UAVs increasingly adopt hierarchical or hy-"},{"citing_arxiv_id":"2604.11386","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation","primary_cat":"cs.RO","submitted_at":"2026-04-13T12:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LAPA [55], and OpenVLA-OFT [19], further demonstrate efficient finetuning across robots and sensing modalities. Collectively, these results point to a data-driven bottleneck: robust cross-task and cross-embodiment generalization hinges on large, diverse, and high-fidelity datasets that faithfully capture real-world appearance, sensing, and physics. 2.4 World Simulator for Robotic Manipulation Scalable robot learning [2,7,9,29,63] depends on abundant, realistic data, yet collecting real-world trajectories via human demonstrations is slow and labor-intensive, limiting broad access. Generative video models [1,50] offer a cost-effective way to synthesize policy training data. 
UniPi [14] and AVDC [21] cast robot planning as text-to-video generation (AVDC further estimates inverse"},{"citing_arxiv_id":"2604.10170","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-11T11:36:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09330","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-10T13:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tive policy learning due to the absence of paired and long-horizon action trajectories. To this end, current methods compensate by relying on externally provided trajectories as additional supervision signals during training. World-Action Model as Policy. To enhance action prediction, a complementary research direction incorporates future video generation as an auxiliary signal. 
GR1 [72], GR2 [11], WorldVLA [9], UVA [37], DUST [71], DreamZero [77], Motus [5], Cosmos-Policy [33], GigaWorld-Policy [76] and Fast-WAM [80] jointly predict next-step or multi-frame observations alongside actions. However, these methods primarily focus on improving action prediction via predictive visual signals rather than scalable video-action pair synthesis."},{"citing_arxiv_id":"2604.08544","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds","primary_cat":"cs.RO","submitted_at":"2026-04-09T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This mechanism ensures that the vertex does not overshoot the safety margin during a single iteration, effectively preventing tunneling and numerical instability. B.6. Position Updation The sequence concludes with the synthesis of the final vertex position e_i^new for the current iteration. By applying the clipped displacement to the iteration's starting point e_i^(n), we obtain: e_i^new = e_i^(n) + s·Δe_i. (10) This integration ensures that the updated mesh state is not only physically optimal according to the AVBD energy gradients but also strictly compliant with geometric safety constraints, leading to a robust and collision-free simulation. B.7. 
Simulation Infrastructure To support high-fidelity data collection for deformable manipulation, we develop a simulation in-"},{"citing_arxiv_id":"2604.08168","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-04-09T12:28:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 3 [6] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 23206-23217, 2023. 3 [7] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
3 [8] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia."},{"citing_arxiv_id":"2604.04974","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03181","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-03T16:57:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[8] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023. [9] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023. [10] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
[11] Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li,"},{"citing_arxiv_id":"2603.16666","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","primary_cat":"cs.CV","submitted_at":"2026-03-17T15:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, and Linxi Fan. Dreamgen: Unlocking generalization in robot learning through video world models, 2025. URLhttps://arxiv.org/abs/2505.12705. [26] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language- action models, 2025. URLhttps://arxiv.org/abs/2503.22020. 
[27] Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang,"},{"citing_arxiv_id":"2603.15759","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation","primary_cat":"cs.RO","submitted_at":"2026-03-16T18:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20231","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-02-23T18:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00110","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T14:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized 
models in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15922","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models are Zero-shot Policies","primary_cat":"cs.RO","submitted_at":"2026-02-17T15:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Do as i can, not as i say: Grounding language in robotic affordances. InConference on robot learning, pages 287-318. PMLR, 2023. 4 [16] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025. 4 [17] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 
5 [18] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia."},{"citing_arxiv_id":"2601.07060","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 2, 6 [11] Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal- conditioned reinforcement learning with imagined subgoals. InInternational Conference on Machine Learning (ICML), pages 1430-1440, 2021. 3 [12] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 
2 [13] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia."},{"citing_arxiv_id":"2512.21714","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AstraNav-World: World Model for Foresight Control and Consistency","primary_cat":"cs.CV","submitted_at":"2025-12-25T15:31:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.09928","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-12-10T18:59:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}