VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei · 2023 · cs.RO · arXiv 2307.05973

Canonical reference. 65 Pith papers cite this work; 80% of classified citations cite it as background.

abstract

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
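
The idea of composing value maps and planning over them can be illustrated with a small sketch. This is not the authors' implementation: the function names, the voxel grid size, the cost shaping (`avoid_radius`, `avoid_weight`), and the greedy planner are all assumptions chosen for illustration. In the paper, the target and constraint regions come from LLM-written code that queries a VLM; here they are hard-coded voxel coordinates.

```python
import numpy as np

def distance_map(shape, center):
    """Euclidean distance (in voxels) from every voxel to `center`."""
    idx = np.indices(shape).astype(float)                  # (3, X, Y, Z)
    c = np.asarray(center, dtype=float).reshape(3, 1, 1, 1)
    return np.sqrt(((idx - c) ** 2).sum(axis=0))

def compose_cost_map(shape, target, obstacle, avoid_radius=3.0, avoid_weight=10.0):
    """Sum an attraction term (low near target) and an avoidance bump (high near obstacle)."""
    attract = distance_map(shape, target)
    attract /= attract.max()                               # normalize to [0, 1]
    avoid = np.clip(1.0 - distance_map(shape, obstacle) / avoid_radius, 0.0, None)
    return attract + avoid_weight * avoid

def greedy_path(cost, start, max_steps=200):
    """Step to the lowest-cost 26-neighbor until no neighbor improves (hypothetical planner)."""
    offsets = np.array(list(np.ndindex(3, 3, 3))) - 1      # (27, 3), entries in {-1, 0, 1}
    upper = np.array(cost.shape) - 1
    pos = np.array(start)
    path = [tuple(int(v) for v in pos)]
    for _ in range(max_steps):
        nbrs = np.clip(pos + offsets, 0, upper)            # stay inside the grid
        vals = cost[nbrs[:, 0], nbrs[:, 1], nbrs[:, 2]]
        best = int(np.argmin(vals))
        if vals[best] >= cost[tuple(pos)]:                 # local minimum reached
            break
        pos = nbrs[best]
        path.append(tuple(int(v) for v in pos))
    return path

# Illustrative scene: the obstacle lies off the straight line from start to target.
shape = (20, 20, 20)
cost = compose_cost_map(shape, target=(17, 17, 17), obstacle=(10, 10, 4))
path = greedy_path(cost, start=(2, 2, 2))
print(len(path), path[-1])
```

Greedy descent is used only to keep the sketch short. When an obstacle sits directly between start and target, greedy descent can stall in a local minimum of the composed map, which is one reason a real system would use a proper planner with replanning.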

citation-role summary

background 17 · baseline 2 · method 1

citation-polarity summary

background 16 · baseline 2 · unclear 1 · use method 1

representative citing papers

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Improving Robotic Generalist Policies via Flow Reversal Steering

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

UAVFF3D introduces a geometry-aware real-synthetic benchmark and evaluation protocol for feed-forward UAV 3D reconstruction that supports domain adaptation and reduces errors in camera pose and scene geometry.

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

cs.RO · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-rich robotic scenarios.

PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

cs.RO · 2026-04-26 · unverdicted · novelty 7.0

PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

cs.CV · 2026-04-17 · conditional · novelty 7.0

Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

cs.RO · 2026-02-09 · unverdicted · novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

Large Video Planner Enables Generalizable Robot Control

cs.RO · 2025-12-17 · conditional · novelty 7.0

A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

cs.RO · 2023-10-13 · unverdicted · novelty 7.0

A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Automating the Design of Embodied Agent Architectures

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

Closed-Loop Trace Distillation distills one-line natural-language prompts from labeled training traces to improve VLM accuracy on predicting minimal-success action chains in Exploratory Manipulation Trace QA by 0.38-0.47 across simulator and real-robot tasks.

Efficient Skill Grounding via Code Refactoring with Small Language Models

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

RECENT decouples skill semantics from embodiment-specific bindings via code refactoring to let small language models achieve skill grounding performance matching large language model baselines.

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

TLVS mitigates hallucinations in LVLMs via token-level extraction and visual-sensitivity-adaptive steering applied only at critical decoding steps.

Continuous Reasoning for Vision-Language-Action

cs.RO · 2026-05-29 · unverdicted · novelty 6.0

Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.

Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity

cs.RO · 2026-05-28 · unverdicted · novelty 6.0

Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

cs.CV · 2026-05-24 · unverdicted · novelty 6.0

VEOcc is a voxel-based online semantic occupancy prediction method using recursive assimilation and three update modules (TLA, RCM, CSU) that reports new SOTA results on Occ-ScanNet and EmbodiedOcc-ScanNet.

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.

citing papers explorer

Showing 3 of 3 citing papers after filters.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · none · ref 3 · internal anchor

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

cs.CV · 2024-02-19 · unverdicted · none · ref 25 · internal anchor

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · none · ref 37 · internal anchor

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.