JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
hub Canonical reference
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
hub tools
citation-role summary
citation-polarity summary
representative citing papers
UAVFF3D introduces a geometry-aware real-synthetic benchmark and evaluation protocol for feed-forward UAV 3D reconstruction that supports domain adaptation and reduces errors in camera pose and scene geometry.
CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-rich robotic scenarios.
PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.
VEOcc is a voxel-based online semantic occupancy prediction method using recursive assimilation and three update modules (TLA, RCM, CSU) that reports new SOTA results on Occ-ScanNet and EmbodiedOcc-ScanNet.
Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Robots autonomously convert LLM-guided experiences into a reusable local method library, reducing average execution time from 7.7772s to 6.7779s and LLM calls per task from 1.0 to 0.2 in repeated-task experiments.
citing papers explorer
No citing papers match the current filters.