{"total":14,"items":[{"citing_arxiv_id":"2605.25059","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-24T13:10:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VEOcc is a voxel-based online semantic occupancy prediction method using recursive assimilation and three update modules (TLA, RCM, CSU) that reports new SOTA results on Occ-ScanNet and EmbodiedOcc-ScanNet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19420","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-19T06:12:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A vision-language model outputs dual heatmaps for navigation affordance and facing to ground semantic instructions into executable free space, achieving higher affordance rates than waypoint regression across simulated robot embodiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18197","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots","primary_cat":"cs.RO","submitted_at":"2026-05-18T10:37:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RGB-only active 3D scene graph generation unifies perception and planning to achieve depth-baseline parity and more than double object detection in active indoor 
exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18184","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation","primary_cat":"cs.RO","submitted_at":"2026-05-18T10:26:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fixed external cameras as Common Prior Maps boost initial object recall in 3D scene graph generation by up to 79% and improve active exploration efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18109","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14504","ref_index":31,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution","primary_cat":"cs.AI","submitted_at":"2026-05-14T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach 
only 59% goal completion and 16% full-task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21924","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-Horizon Manipulation via Trace-Conditioned VLA Planning","primary_cat":"cs.RO","submitted_at":"2026-04-23T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07034","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis","primary_cat":"cs.RO","submitted_at":"2026-04-08T12:49:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rely on task-specific fine-tuning or large memory structures. Structured Representations for Multimodal Reasoning. Recent work suggests that structured scene abstractions can inject useful inductive bias into language-guided reasoning without requiring full retraining. Examples include 3D scene graphs for task grounding [41], bird's-eye-view (BEV) interfaces for multimodal reasoning [42], and diagrammatic abstractions for improved visual understanding [43]. 
KITE is most closely aligned with this direction. Compared with REFLECT [9], which reasons over summarized robot memories and multisensory logs, KITE focuses on the representation interface itself: it converts long execution videos into a compact, keyframe-indexed, layout-grounded evidence bundle that can"},{"citing_arxiv_id":"2602.08392","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"recognition, pages 8494-8502, 2018. [91] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. 3 [92] Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. arXiv preprint arXiv:2307.06135, 2023. 
[93] Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, SM Towhidul Islam Tonmoy, Aman Chadha, Amit Sheth, and Amitava Das."},{"citing_arxiv_id":"2511.12676","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections","primary_cat":"cs.CV","submitted_at":"2025-11-16T16:30:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13778","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","primary_cat":"cs.RO","submitted_at":"2025-10-15T17:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.13107","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models","primary_cat":"cs.RO","submitted_at":"2024-09-19T22:24:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Digital twin representations from vision foundation models enable LLM-based planning for robust peg transfer and gauze retrieval on the 
dVRK surgical platform with claimed generalizability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.07864","ref_index":261,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Rise and Potential of Large Language Model Based Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2023-09-14T17:12:03+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The remarkable nature of the human brain is largely attributed to its high degree of plasticity and adaptability. It can continuously adjust its structure and function in response to external stimuli and internal needs, thereby adapting to different environments and tasks. These years, plenty of research indicates that pre-trained models on large-scale corpora can learn universal language representations [36; 261; 262]. Leveraging the power of pre-trained models, with only a small amount of data for fine-tuning, LLMs can demonstrate excellent performance in downstream tasks [263]. There is no need to train new models from scratch, which saves a lot of computation resources. 
However, through this task-specific fine-tuning, the models lack versatility and struggle to be generalized to other tasks."},{"citing_arxiv_id":"2304.11477","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM+P: Empowering Large Language Models with Optimal Planning Proficiency","primary_cat":"cs.AI","submitted_at":"2023-04-22T20:34:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"produce when presented with such a task is often incorrect in the sense that following the output plan will not actually solve the task. Therefore, in this work, we focus on resolving this issue by leveraging the properties of classical planners. Similarly, some recent work also investigates approaches for combining classical planning with LLMs [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57]. They either use prompting or fine-tuning to make LLMs capable of solving PDDL planning problems. Improvements to long-horizon planning capabilities have also been made by iteratively querying LLMs, as demonstrated in Minecraft [58]. In contrast, we do not solely rely on LLM as the problem solver, but are more into taking the advantage of both the"}],"limit":50,"offset":0}