hub
SayPlan: Grounding large language models using 3d scene graphs for scalable robot task planning
14 Pith papers cite this work.
roles: background 4
citing papers explorer
- When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
  LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.
- KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
  KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
- ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
  ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
- BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
  BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
  LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
- VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding
  VEOcc is a voxel-based online semantic occupancy prediction method using recursive assimilation and three update modules (TLA, RCM, CSU) that reports new SOTA results on Occ-ScanNet and EmbodiedOcc-ScanNet.
- RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots
  RGB-only active 3D scene graph generation unifies perception and planning to achieve depth-baseline parity and more than double object detection in active indoor exploration.
- Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
  Fixed external cameras as Common Prior Maps boost initial object recall in 3D scene graph generation by up to 79% and improve active exploration efficiency.
- Long-Horizon Manipulation via Trace-Conditioned VLA Planning
  LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
  InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
- Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation
  A vision-language model outputs dual heatmaps for navigation affordance and facing to ground semantic instructions into executable free space, achieving higher affordance rates than waypoint regression across simulated robot embodiments.
- TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
  TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.
- Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models
  Digital twin representations from vision foundation models enable LLM-based planning for robust peg transfer and gauze retrieval on the dVRK surgical platform with claimed generalizability.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.