{"total":10,"items":[{"citing_arxiv_id":"2605.20246","ref_index":15,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents","primary_cat":"cs.LG","submitted_at":"2026-05-18T04:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GROW decomposes trajectories into state-action samples to enable GRPO for multi-turn VLM agents and reports state-of-the-art results on more than 800 Minecraft tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":127,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Motus [19], Act2Goal [108], PhysGen [22], GigaWorld-Policy [109], UD-VLA [110], X-WAM [111] Training data Robot-centric Teleoperation QT-Opt [112], MIME [113], RoboNet [114], RoboTurk-Real [115], BridgeData [116], MT-Opt [117] BC-Z [118], RT-1 [119], Language-Table [120], BridgeData v2 [121], Jaco Play [122] Cable Routing Dataset [123], RH20T [124], OXE [125], DROID [126], RH20T-P [127], RoboMIND [128] ARIO [129], RoboData [130], DexCap [131], FuSe [132], AgiBot World [133], REASSEMBLE [134] OmniAction [135], UnifoLM-WBT [136] UMI-style Human Demonstration UMI [137], FastUMI [138], FastUMI-100K [139], RealOmin [140], Hoi! 
[141], RDT2 [142] ActiveUMI [143], exUMI [144], Tactile-Conditioned Diffusion Policy [145], DexUMI [146]"},{"citing_arxiv_id":"2511.17855","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2025-11-22T00:45:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QuickLAP fuses LLM-extracted language observations with physical feedback in a closed-form Bayesian update to cut reward learning error by over 70% in a driving simulator and improve user preference in a 15-person study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14093","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"human instructions conveyed through language, enabling the completion of long-horizon rearrangement tasks. The efficacy of such language-based guidance is primarily attributed to the utilization of a meticulously collected dataset containing diverse language instructions, which surpasses previous datasets by an order of magnitude in scale. Hiveformer [89] places significant emphasis on leveraging multiview scene observations and maintaining the full observation history for a language-conditioned policy. 
This approach represents an advancement over previous systems, such as CLIPort and BC-Z, that only use the current observation. Notably, Hiveformer stands out as one of the early adopters IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 8"},{"citing_arxiv_id":"2312.13139","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2023-12-20T16:00:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.17596","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2023-10-26T17:17:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Learning multi-arm manipulation through collaborative teleoperation,\" arXiv preprint arXiv:2012.06738, 2020. [26] J. Wong, A. Tung, A. Kurenkov, A. Mandlekar, L. Fei-Fei, S. Savarese, and R. Martín-Martín, \"Error-aware imitation learning from teleoperation data for mobile manipulation,\" in Conference on Robot Learning. PMLR, 2022, pp. 1367-1378. [27] C. Lynch, A. Wahid, J. 
Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence, \"Interactive language: Talking to robots in real time,\" arXiv preprint arXiv:2210.06407, 2022. [28] D. A. Pomerleau, \"Alvinn: An autonomous land vehicle in a neural network,\" in Advances in neural information processing systems, 1989, pp. 305-313. [29] A. J."},{"citing_arxiv_id":"2310.06114","ref_index":199,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Interactive Real-World Simulators","primary_cat":"cs.AI","submitted_at":"2023-10-09T19:42:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.08708","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Consciousness in Artificial Intelligence: Insights from the Science of Consciousness","primary_cat":"cs.AI","submitted_at":"2023-08-17T00:10:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"No current AI systems exhibit the indicator properties derived from established scientific theories of consciousness, yet there appear to be no fundamental technical obstacles to implementing those properties in future systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.05973","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language 
Models","primary_cat":"cs.RO","submitted_at":"2023-07-12T07:40:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects. In 2022 International Conference on Robotics and Automation (ICRA), pages 6322-6329. IEEE, 2022. [52] C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648. [53] C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022. [54] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of"},{"citing_arxiv_id":"2303.03378","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaLM-E: An Embodied Multimodal Language Model","primary_cat":"cs.LG","submitted_at":"2023-03-06T18:58:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive transfer from joint training on language and robotics data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}