Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

2026 · cs.CV · arXiv 2604.09781

2 Pith papers cite this work.


abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than recent Vision-Language-Action models (VLAs). Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
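The loop described in the abstract (observe, evaluate, propose, apply, render) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Pose`, `PoseUpdate`, `apply_update`, `closed_loop`, and the `render`, `is_faithful`, and `propose` callables standing in for the renderer and the VLM calls) is hypothetical. The sketch assumes translations are expressed in the world frame and that each rotation update is about a single axis of the object's own coordinate frame, in the spirit of the single-axis rotation prediction technique.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List

Matrix = List[List[float]]


def axis_rotation(axis: str, degrees: float) -> Matrix:
    """Return the 3x3 rotation matrix about one coordinate axis."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "x":
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == "y":
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    if axis == "z":
        return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    raise ValueError(f"unknown axis: {axis}")


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


@dataclass
class Pose:
    position: List[float]  # object origin in world coordinates
    rotation: Matrix       # object-to-world rotation


@dataclass
class PoseUpdate:
    translation: List[float]  # world-frame offset (an assumption of this sketch)
    axis: str                 # the single local axis to rotate about
    degrees: float


def apply_update(pose: Pose, update: PoseUpdate) -> Pose:
    """Apply one proposed update and return the new pose."""
    position = [p + t for p, t in zip(pose.position, update.translation)]
    # Right-multiplication rotates about the object's own (local) axis.
    rotation = matmul(pose.rotation, axis_rotation(update.axis, update.degrees))
    return Pose(position, rotation)


def closed_loop(pose: Pose,
                render: Callable[[Pose], Any],
                is_faithful: Callable[[Any], bool],
                propose: Callable[[Any], PoseUpdate],
                max_steps: int = 10) -> Pose:
    """Repeat observe, evaluate, propose, apply until faithful or out of steps."""
    for _ in range(max_steps):
        observation = render(pose)          # observe the current scene
        if is_faithful(observation):        # evaluate against the instruction
            break
        pose = apply_update(pose, propose(observation))  # propose and apply
    return pose


# Toy stand-ins for the renderer and the VLM (purely illustrative):
# the "image" is the pose itself, the goal is a 90 degree yaw, and the
# proposal is always a 30 degree step about the object's z axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
start = Pose([0.0, 0.0, 0.0], identity)
final = closed_loop(
    start,
    render=lambda p: p,
    is_faithful=lambda obs: obs.rotation[1][0] > 0.999,
    propose=lambda obs: PoseUpdate([0.0, 0.0, 0.0], "z", 30.0),
)
```

In a real system the three callables would wrap a renderer and VLM queries; here they are plain functions so the control flow can be read and run in isolation. Restricting each update to one axis keeps every step easy to state and to check, at the cost of needing several iterations for a compound rotation.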

representative citing papers

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

SimuScene feeds physics simulation diagnostics back into shape and layout estimation to correct geometric errors and output simulation-ready compositional scenes from single images.

ZeroDex: Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

ZeroDex grounds VLM outputs into 3D keypoints via multi-view triangulation and ray voting to enable zero-shot long-horizon dexterous manipulation with closed-loop replanning.

