{"work":{"id":"5a5edf95-2538-4e2b-8dfa-da39cec89f22","openalex_id":null,"doi":null,"arxiv_id":"2307.05973","raw_key":null,"title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","authors":null,"authors_text":"Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei","year":2023,"venue":"cs.RO","abstract":"Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io","external_url":"https://arxiv.org/abs/2307.05973","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T03:56:37.163611+00:00","pith_arxiv_id":"2307.05973","created_at":"2026-05-09T06:30:42.721919+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","render_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models"},"hub":{"state":{"work_id":"5a5edf95-2538-4e2b-8dfa-da39cec89f22","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":50,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2023-10-04T07:56:42+00:00","last_pith_cited_at":"2026-05-22T17:08:37+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-12T04:19:08.574454+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":17},{"context_role":"baseline","n":2},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":16},{"context_polarity":"baseline","n":2},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}