{"total":12,"items":[{"citing_arxiv_id":"2605.19940","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains","primary_cat":"cs.AI","submitted_at":"2026-05-19T15:00:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14211","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ASH: Agents that Self-Hone via Embodied Learning","primary_cat":"cs.AI","submitted_at":"2026-05-14T00:10:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11459","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Robotic manipulation in real-world settings frequently involves environments whose state changes during policy execution, ranging from regular motions such as 
objects on a conveyor belt to unexpected events such as external perturbations [1-3]. Handling such dynamic conditions has therefore become a central requirement for general-purpose manipulation policies [4, 5]. Among recent approaches, Vision-Language-Action (VLA) models map visual observations and language instructions directly to low-level control, and have emerged as a promising candidate for this setting [6-8]. However, most current VLAs adopt action chunking, where the model predicts a fixed-length sequence of future actions from a single visual frame at each inference call and the robot executes them open-"},{"citing_arxiv_id":"2604.13959","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI","primary_cat":"cs.AI","submitted_at":"2026-04-15T15:10:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10517","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning","primary_cat":"cs.AI","submitted_at":"2026-04-12T08:14:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological 
biases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07034","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis","primary_cat":"cs.RO","submitted_at":"2026-04-08T12:49:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu,et al., \"Palm-e: An embodied multimodal language model,\"arXiv preprint arXiv:2303.03378, 2023. [3] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y . Sulsky, J. Kay, J. T. Springenberg, et al., \"A generalist agent,\"arXiv preprint arXiv:2205.06175, 2022. [4] Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y . Xie, T. Zhang, Z. Zhao,et al., \"Toward general-purpose robots via foundation models: A survey and meta-analysis,\"arXiv preprint arXiv:2312.08782, 2023. [5] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, S. Song, A. Kapoor, K. 
Hausman,et al., \"Foundation models in"},{"citing_arxiv_id":"2603.12510","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-03-12T22:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06949","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos","primary_cat":"cs.RO","submitted_at":"2026-02-06T18:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"InProc. of the International Conf. on Learning Representations (ICLR), 2022. 15 [38] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis.arXiv preprint arXiv:2312.08782, 2023. 2 [39] Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting Video Diffusion Models to Interactive World Models.arXiv preprint arXiv:2505.14357, 2025. 
5, 6 [40] Xun Huang. Towards Video World Models, 2025. URLhttps://www.xunhuang.me/blogs/world_ model.html. 8, 16 [41] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman."},{"citing_arxiv_id":"2509.13414","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MapAnything: Universal Feed-Forward Metric 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2025-09-16T18:00:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InShape from shading, pages 123-171. MIT Press, 1989. 1 [18] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface nor- mal estimation.IEEE Trans. Pattern Anal. Mach. Intell., 46 (12):10579-10596, 2024. 8 [19] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, and Yonatan Bisk. 
Toward general-purpose robots via founda-"},{"citing_arxiv_id":"2505.07813","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies","primary_cat":"cs.RO","submitted_at":"2025-05-12T17:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexWild co-trains dexterous robot policies on in-the-wild human hand interactions recorded with a low-cost system and limited robot data, achieving 68.5% success in unseen environments and 5.8x better cross-embodiment generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.01652","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2024-09-03T06:45:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yang, C. R. Garrett, T. Lozano-P 'erez, L. Kaelbling, and D. Fox. Sequence-based plan feasibility prediction for efficient task and motion planning.arXiv preprint arXiv:2211.01576, 2022. [89] G. S. Camps, R. Dyro, M. Pavone, and M. Schwager. Learning deep sdf maps online for robot navigation and exploration. arXiv preprint arXiv:2207.10782, 2022. [90] Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y . Xie, T. Zhang, Z. Zhao, et al. 
Toward general-purpose robots via foundation models: A survey and meta-analysis. arXiv preprint arXiv:2312.08782, 2023. [91] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, et al. Foundation models in robotics: Applications, challenges, and the future."},{"citing_arxiv_id":"2405.14093","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To the best of our knowledge, this survey is the first to review the recent progress of VLA models, a rapidly emerging research area. Previous surveys have investigated other facets of embodied AI. Firoozi et al. [12] comprehensively summarized foundation models in robotics up to 2023, while Wang et al. [13] focused on LLMs in robotics. Hu et al. [14] examined more recent vision, language, and robotic foundation models for general-purpose robots. Kawaharazuka et al. [15] concentrated on real-world robot applications. In contrast, our work emphasizes VLA models, thereby complementing and extending the existing literature on embodied AI. B. Contributions To the best of our knowledge, this article is the first"}],"limit":50,"offset":0}