{"total":26,"items":[{"citing_arxiv_id":"2606.30613","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sequential Planning via Anchored Robotic Keypoints","primary_cat":"cs.RO","submitted_at":"2026-06-29T17:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13097","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents","primary_cat":"cs.PL","submitted_at":"2026-06-11T09:25:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FCGraft synthesizes code policies for embodied agents by grafting KV caches from a library of validated functions, claiming 18.31% higher success rate and 2.3x faster synthesis than prompt-level caching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07570","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can LLMs extract scientific consensus? 
A case study in high-temperature superconductivity","primary_cat":"cs.DL","submitted_at":"2026-05-26T03:52:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs recover coherent, interpretable structures from HTS literature including family-dependent mechanisms and temporal belief evolution via a constructed knowledge graph.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07568","ref_index":274,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Study of Behavioral Cloning for Scientific Data Annotation","primary_cat":"cs.HC","submitted_at":"2026-05-26T02:19:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26256","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions","primary_cat":"cs.AI","submitted_at":"2026-05-25T18:27:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM 
backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25646","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"G-DRAGON: Geospatial Reasoning and Dynamic Planning for Retrieval-Augmented Outdoor Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-25T09:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"G-DRAGON framework maps language commands to OSM coordinates via lightweight LLM for global planning and uses frontier exploration for local targets, outperforming baselines in simulation and completing real UGV person-search missions up to 500m.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19587","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects","primary_cat":"cs.AI","submitted_at":"2026-05-19T09:31:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SceneCode compiles natural language prompts into executable code programs that generate editable, articulated indoor scenes for physics simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18109","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the 
FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17077","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14504","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution","primary_cat":"cs.AI","submitted_at":"2026-05-14T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11951","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T11:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In such systems, perception agents powered by Vision-Language Models (VLMs), including CLIP [40], Grounding DINO [35], and SAM3 [6], enable open-vocabulary recognition and segmentation for flexible object manipulation in dynamic environments. Reasoning and planning agents based on Large Language Models (LLMs) or code-generation frameworks such as ProgPrompt [43] and Code-as-Policy [32] further allow robots to generate task plans and actions from high-level language instructions. 
Beyond using a single foundation model, recent work increasingly focuses on coordinating multiple foundation models as distinct agents [8, 14, 15, 20, 27, 33, 41, 48, 55], enabling more structured perception, planning, and execution."},{"citing_arxiv_id":"2604.18463","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Using large language models for embodied planning introduces systematic safety risks","primary_cat":"cs.AI","submitted_at":"2026-04-20T16:18:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"SayCan introduced affordance grounding, combining LLM proposals with learned value functions to ensure only physically feasible actions are selected [2]. Code as Policies showed that code-writing LLMs can generate interpretable robot policy programs from natural language [3], while ProgPrompt employed programmatic prompt structures with precondition checking to constrain outputs to valid actions [37]. More recent work has explored multimodal embodied language models that incorporate sensor data directly [38], vision-language-action models that output robot actions as text tokens [39], and closed-loop reasoning systems that incorporate environment feedback [40]. 
Additional approaches include hierarchical policies bridging high-level language"},{"citing_arxiv_id":"2604.17807","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement","primary_cat":"cs.CV","submitted_at":"2026-04-20T04:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Re²MoGen generates open-vocabulary motions via MCTS-enhanced LLM keyframe planning, pose-prior optimization with dynamic temporal matching fine-tuning, and physics-aware RL post-training, claiming SOTA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08224","ref_index":131,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-09T13:19:41+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07395","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring","primary_cat":"cs.RO","submitted_at":"2026-04-08T08:01:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete 
outcome events that trigger retries or user escalation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.08388","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation","primary_cat":"cs.AI","submitted_at":"2026-03-09T13:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HECG combines multi-dimensional metrics for strategy choice, ten-type error classification with recoverability details, and causal-context graphs to improve LLM agent reliability in complex tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20867","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SoK: Agentic Skills -- Beyond Tool Use in LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-02-24T13:11:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet 
fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1, 3 [98] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894-906. PMLR, 2022. [99] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014. 3 [100] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022. [101] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason,"},{"citing_arxiv_id":"2512.10605","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator","primary_cat":"cs.RO","submitted_at":"2025-12-11T12:58:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LEO-RobotAgent is a general-purpose framework that enables LLMs to independently plan, use tools, and collaborate with humans while operating multiple robot types for unpredictable tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.05973","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","primary_cat":"cs.RO","submitted_at":"2023-07-12T07:40:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoxPoser uses 
LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Finn. Language as an abstraction for hierarchical deep reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019. [58] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler. Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022. [59] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022. [60] C. Huang, O. Mees, A. Zeng, and W. Burgard. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022. [61] S."},{"citing_arxiv_id":"2305.17144","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory","primary_cat":"cs.AI","submitted_at":"2023-05-25T17:59:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16291","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Voyager: An Open-Ended Embodied Agent with Large Language 
Models","primary_cat":"cs.AI","submitted_at":"2023-05-25T17:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"research [75-78] has witnessed a significant increase in the utilization of LLMs for planning purposes. Recent efforts can be roughly classified into two groups. 1) Large language models for robot learning: Many prior works apply LLMs to generate subgoals for robot planning [27, 27, 25, 79, 80]. Inner Monologue [26] incorporates environment feedback for robot planning with LLMs. Code as Policies [16] and ProgPrompt [22] directly leverage LLMs to generate executable robot policies. VIMA [19] and PaLM-E [59] fine-tune pre-trained LLMs to support multimodal prompts. 2) Large language models for text agents: ReAct [29] leverages chain-of-thought prompting [46] and generates both reasoning traces and task-specific actions with LLMs. 
Reflexion [30] is built upon ReAct [29] with self-reflection to enhance reasoning."},{"citing_arxiv_id":"2305.14992","ref_index":147,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reasoning with Language Model is Planning with World Model","primary_cat":"cs.CL","submitted_at":"2023-05-24T10:28:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.11477","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM+P: Empowering Large Language Models with Optimal Planning Proficiency","primary_cat":"cs.AI","submitted_at":"2023-04-22T20:34:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"veloped in recent years, such as Bert [27], CodeX [28], Opt [29], GPT-3 [10], ChatGPT [30], GPT-4 [2], Llama [31], Llama2 [32], and PaLM [33]. As LLMs are pretrained with a tremendous amount of offline text data, they can emerge with surprising zero-shot generalization ability, which can be leveraged for robot planning tasks [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45]. Several recent methods had successes in extracting task knowledge from LLMs to decompose commands or instructions for robots in natural language. For instance, the work of Huang et al. 
showed that LLMs can be used for task planning in household domains by iteratively augmenting prompts [38]. SayCan is another approach that enabled robot planning with"},{"citing_arxiv_id":"2303.03378","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PaLM-E: An Embodied Multimodal Language Model","primary_cat":"cs.LG","submitted_at":"2023-03-06T18:58:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive transfer from joint training on language and robotics data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2302.01560","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents","primary_cat":"cs.AI","submitted_at":"2023-02-03T06:06:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}