pith. machine review for the scientific record.

arxiv: 2604.09781 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords: 6D object pose · text-guided rearrangement · closed-loop VLM agents · vision-language models · robot manipulation · inference-time techniques · multi-view reasoning · pose prediction

The pith

Closed-loop VLM agents predict text-guided goal 6D object poses through iterative reasoning and incremental pose updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can overcome their weak 3D reasoning for text-guided 6D object pose rearrangement by running in a closed loop. In each iteration, the VLM observes the current RGB-D or mesh scene, checks whether it matches the text instruction, proposes an incremental pose update for the target object, applies the change, and re-renders the scene before the next pass. Supporting techniques include multi-view reasoning with supporting view selection, object-centered coordinate visualization, and single-axis rotation prediction. A sympathetic reader would care because this turns existing VLMs into effective agents for precise 3D manipulation using only inference-time steps, beating prior methods without any fine-tuning and improving downstream robot tasks.
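Read as pseudocode, the loop is simple. A minimal sketch in Python, assuming hypothetical callables for rendering, evaluation, proposal, and pose application; none of the names, dictionary fields, or the iteration cap come from the paper.

```python
from typing import Any, Callable, Dict, List

# Hypothetical placeholder types; the paper's actual prompts, renderer,
# and VLM interface are not reproduced here.
Scene = Dict[str, Any]   # e.g. object meshes plus their 6D poses
View = Any               # a rendered RGB image

def rearrange_object(
    scene: Scene,
    instruction: str,
    target: str,
    render_views: Callable[[Scene], List[View]],
    evaluate: Callable[[List[View], str], Dict[str, Any]],   # faithfulness + best view
    propose: Callable[[View, str, str], Dict[str, Any]],     # incremental pose update
    apply_update: Callable[[Scene, str, Dict[str, Any]], Scene],
    max_iters: int = 10,
) -> Scene:
    """Observe -> evaluate -> propose -> apply -> re-render, until faithful."""
    for _ in range(max_iters):
        views = render_views(scene)                       # observe: multi-view renders
        verdict = evaluate(views, instruction)            # is the scene faithful to the text?
        if verdict["faithful"]:
            return scene                                  # instructed goal pose reached
        support = views[verdict["best_view"]]             # supporting-view selection
        update = propose(support, instruction, target)    # translation + single-axis rotation
        scene = apply_update(scene, target, update)       # apply, then the loop re-renders
    return scene                                          # best effort after max_iters
```

In this reading, termination is delegated entirely to the VLM's faithfulness judgment, which is exactly the load-bearing premise flagged below.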

Core claim

Given a 3D scene and a text instruction specifying a desired state change, the VLM repeatedly observes the scene, evaluates whether it is faithful to the instruction, proposes a pose update for the target object, applies the update, and renders the updated scene, achieving superior accuracy in predicting the text-guided goal 6D pose across both closed-source and open-source models.

What carries the argument

Closed-loop VLM agent that cycles through observe-evaluate-propose-update-render steps, augmented by multi-view reasoning with supporting view selection, object-centered coordinate system visualization, and single-axis rotation prediction.
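Of the three techniques, the object-centered coordinate system visualization is the most mechanical: project axes anchored at the target object into the rendered view and draw them as arrows so the VLM can name directions. A minimal sketch assuming a pinhole camera with known intrinsics and world-to-camera extrinsics; the paper's prompt describes the x-axis as a red arrow, while the remaining colors, the camera model, and the drawing details here are illustrative assumptions.

```python
import numpy as np
import cv2

def draw_object_axes(image, center_world, R_obj, K, R_cam, t_cam, axis_len=1.0):
    """Overlay the target object's x/y/z axes on a rendered view.

    center_world: (3,) object center in world coordinates
    R_obj:        (3,3) object orientation; columns are the object's axes
    K:            (3,3) camera intrinsics
    R_cam, t_cam: world-to-camera rotation (3,3) and translation (3,)
    axis_len:     world length of each drawn arrow
    """
    colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]  # BGR: x red, y green, z blue (assumed)
    points_world = [center_world] + [center_world + axis_len * R_obj[:, i] for i in range(3)]
    pixels = []
    for p in points_world:
        p_cam = R_cam @ p + t_cam            # world -> camera
        uvw = K @ p_cam                      # pinhole projection
        pixels.append((int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))))
    for i, color in enumerate(colors):
        cv2.arrowedLine(image, pixels[0], pixels[i + 1], color, thickness=3, tipLength=0.2)
    return image
```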

Load-bearing premise

The VLM can reliably and consistently evaluate whether the current scene matches the text instruction and propose accurate incremental pose updates that converge to the goal without getting stuck or hallucinating incorrect changes.

What would settle it

A benchmark of text instructions and starting scenes where the agent is run until termination and checked for failure to reach the instructed final 6D pose or for repeated incorrect proposals.
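A minimal sketch of such a settling experiment, assuming a hypothetical run_agent interface that returns the predicted pose and the per-iteration proposals, ground-truth goal poses per case, and 5 cm / 15° success thresholds; none of these choices are the paper's protocol.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def evaluate_benchmark(cases, run_agent, t_thresh=0.05, r_thresh=15.0):
    """cases: dicts with 'scene', 'instruction', 'target', 'goal_t' (3,), 'goal_R' (3,3).
    run_agent returns (pred_t, pred_R, proposals) after running to termination."""
    results = []
    for case in cases:
        pred_t, pred_R, proposals = run_agent(case["scene"], case["instruction"], case["target"])
        t_err = float(np.linalg.norm(pred_t - case["goal_t"]))
        r_err = rotation_error_deg(pred_R, case["goal_R"])
        # "repeated incorrect proposals": the same update proposed more than once
        repeats = len(proposals) - len({str(p) for p in proposals})
        results.append({
            "success": t_err <= t_thresh and r_err <= r_thresh,
            "translation_error_m": t_err,
            "rotation_error_deg": r_err,
            "iterations": len(proposals),
            "repeated_proposals": repeats,
        })
    return results
```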

Figures

Figures reproduced from arXiv: 2604.09781 by Gunhee Kim, Hanbyul Joo, Mingi Choi, Sangwon Baik.

Figure 1: Given a 3D scene and a text instruction specifying the desired goal state, our method uses a VLM to iteratively refine the target object's 6D pose in a closed loop until it reaches the goal.

Figure 2: Our Method Overview. (a) We first ask the evaluator to assess how well the rendered multi-view images satisfy the input text instruction. If the current scene is judged inconsistent with the instruction, we provide the proposer with the supporting view, augmented with coordinate system visualization, that best explains the evaluator's judgment. Based on this input, the proposer predicts an incremental 6D …

Figure 3: Qualitative Comparison in the 6-DoF Rearrangement Task of the …

Figure 4: Qualitative Comparison on Robot Manipulation Using Open6DOR.

Figure 5: Qualitative Ablation Study of Inference-time Techniques.

Figure 6: Qualitative Comparison in the Robot Manipulation Task of the SIM…

Figure 7: Effect of the Maximum Number of Iterations on Open6DOR V2.

Figure 8: Additional Qualitative Results on the 6-DoF Rearrangement Task.

Figure 9: Additional Qualitative Results on the Robot Manipulation Task.

Figure 10: System Prompt for Target and Related Object Selection.

Figure 11: System Prompt for Faithfulness Check. This prompt guides the VLM to assess whether the multi-view images are faithful to the text instruction and to select the view that best supports its judgment.

Figure 12: System Prompt for 6D Pose Rearrangement Prediction.
read the original abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
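The third inference-time technique, single-axis rotation prediction, constrains each proposed update to a translation plus one rotation about a single object-centered axis. A minimal sketch of applying such an update to a 6D pose; the axis/angle fields and right-hand-rule convention echo the paper's prompt, but the composition order and translation scaling here are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_single_axis_update(t_obj, R_obj, update, axis_scale=1.0):
    """Apply one proposed increment to a 6D pose.

    t_obj:  (3,) object translation in world coordinates
    R_obj:  (3,3) object rotation; columns are the object's axes in the world frame
    update: {"translation": [Tx, Ty, Tz],   # expressed along the object-centered axes
             "axis": "x" | "y" | "z",       # dominant rotation axis
             "angle_deg": float}            # counterclockwise, right-hand rule
    axis_scale: world length of one drawn axis arrow (translations are scaled by it)
    """
    axis_idx = {"x": 0, "y": 1, "z": 2}[update["axis"]]
    axis_world = R_obj[:, axis_idx]                    # rotate about the object's own axis
    dR = R.from_rotvec(np.radians(update["angle_deg"]) * axis_world).as_matrix()
    dt = axis_scale * (R_obj @ np.asarray(update["translation"], dtype=float))
    return t_obj + dt, dR @ R_obj                      # assumed world-frame composition
```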

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes a closed-loop VLM agent for text-guided 6D object pose rearrangement. Given an RGB-D image or compositional 3D mesh scene and a text instruction describing a desired state change for a target object, the VLM iteratively observes the current scene, evaluates whether it matches the instruction, proposes an incremental single-axis pose update (rotation or translation) in the object-centered frame, applies the update, and re-renders the scene until convergence. Three inference-time techniques are introduced as essential: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. The central claim is that this approach, without any fine-tuning or new modules, surpasses prior methods in predicting the text-guided goal 6D pose, works consistently across closed-source and open-source VLMs, and yields higher success rates in downstream robot manipulation when paired with motion planning. An ablation study is presented to show the necessity of each technique.

Significance. If the quantitative results and consistency claims hold, the work would be significant for demonstrating that targeted inference-time interventions and closed-loop iteration can unlock substantially better 3D spatial reasoning in existing VLMs for pose rearrangement and manipulation tasks. This offers a training-free alternative to specialized 3D models or fine-tuning, with potential impact on zero-shot robot planning. The agentic formulation and the three specific techniques provide a concrete, reproducible recipe that could be adopted broadly in vision-language robotics.

major comments (3)
  1. [§4 and ablation study] §4 (Experiments) and ablation study: The central claim that the method 'surpasses prior methods' and 'works consistently across both closed-source and open-source VLMs' rests on the VLM reliably judging scene-instruction match and proposing converging single-axis updates. However, the ablation only demonstrates necessity of the three techniques; it does not report quantitative metrics on closed-loop reliability such as per-iteration success rate, average iterations to convergence, rate of hallucinated mismatches, or divergence into local minima. This omission leaves the weakest assumption (VLM consistency) insufficiently tested, especially for weaker open-source models.
  2. [Main results table / §4] Main results (presumably Table 1 or equivalent): The abstract asserts 'dramatic performance gains' and superiority over priors, yet the manuscript provides no concrete numbers, baseline comparisons, or error analysis in the visible description. Without these, it is impossible to evaluate whether the gains are robust or merely marginal, which is load-bearing for the headline claim.
  3. [Method / closed-loop procedure] Method section (closed-loop description): The process assumes that VLM-proposed updates can always be applied without producing invalid configurations (e.g., collisions or out-of-bounds poses). No mechanism for validity checking or recovery is described; if such cases occur, the loop may fail to converge, undermining the reported robot manipulation improvements.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'dramatic performance gains' is qualitative; replace or supplement with a brief quantitative statement (e.g., 'X% improvement in pose error') that points to the main table.
  2. [Figures and §3] Notation and figures: Ensure that the object-centered coordinate visualization and single-axis prediction are illustrated with clear before/after renderings and coordinate axes in the figures; current description is high-level.
  3. [Related work] Related work: The positioning against prior VLM-agent and 6D pose papers could be expanded to clarify novelty of the three inference-time techniques versus existing prompting strategies.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These have helped us strengthen the evaluation of the closed-loop reliability and clarify key aspects of the results and method. We address each major comment below and have made corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [§4 and ablation study] §4 (Experiments) and ablation study: The central claim that the method 'surpasses prior methods' and 'works consistently across both closed-source and open-source VLMs' rests on the VLM reliably judging scene-instruction match and proposing converging single-axis updates. However, the ablation only demonstrates necessity of the three techniques; it does not report quantitative metrics on closed-loop reliability such as per-iteration success rate, average iterations to convergence, rate of hallucinated mismatches, or divergence into local minima. This omission leaves the weakest assumption (VLM consistency) insufficiently tested, especially for weaker open-source models.

    Authors: We agree that additional quantitative metrics on closed-loop behavior are necessary to fully support the consistency claims. In the revised manuscript, we have expanded §4 with a new analysis subsection that reports average iterations to convergence, per-iteration match judgment success rates, rates of hallucinated mismatches, and divergence cases, broken down by VLM type. These additions directly test the reliability of the iterative process beyond the existing ablation on the three techniques. revision: yes

  2. Referee: [Main results table / §4] Main results (presumably Table 1 or equivalent): The abstract asserts 'dramatic performance gains' and superiority over priors, yet the manuscript provides no concrete numbers, baseline comparisons, or error analysis in the visible description. Without these, it is impossible to evaluate whether the gains are robust or merely marginal, which is load-bearing for the headline claim.

    Authors: The full manuscript presents the main quantitative results, including success rates and baseline comparisons, in Table 1 of §4. We have added an error analysis subsection in the revised version to discuss the robustness of these gains and have updated the abstract to reference the key quantitative improvements for greater clarity. revision: partial

  3. Referee: [Method / closed-loop procedure] Method section (closed-loop description): The process assumes that VLM-proposed updates can always be applied without producing invalid configurations (e.g., collisions or out-of-bounds poses). No mechanism for validity checking or recovery is described; if such cases occur, the loop may fail to converge, undermining the reported robot manipulation improvements.

    Authors: We acknowledge that the original method section did not explicitly describe handling of invalid configurations. In the revised manuscript, we have added a validity verification step that checks proposed updates against scene boundaries and potential collisions using the available depth information, along with a recovery mechanism to revert to the prior valid state and attempt an alternative update if needed. revision: yes
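A minimal sketch of what the validity verification and recovery step described in the last response could look like, using axis-aligned bounding boxes as a stand-in for the depth-based collision test; the helper interfaces, workspace-bounds representation, and rejection policy are assumptions, not the authors' implementation.

```python
import numpy as np

def aabb_overlap(box_a, box_b):
    """box = (min_xyz, max_xyz), both (3,) arrays. True if the boxes intersect."""
    return bool(np.all(box_a[0] < box_b[1]) and np.all(box_b[0] < box_a[1]))

def checked_apply(scene, target, update, apply_update, get_aabbs, workspace):
    """Apply a proposed update only if it stays in bounds and collision-free.

    apply_update, get_aabbs: hypothetical helpers (apply the pose change; return
    {name: (min_xyz, max_xyz)} for all objects). workspace: (min_xyz, max_xyz).
    Returns (new_scene, accepted); on rejection the prior scene is kept so the
    agent can be re-prompted for an alternative update.
    """
    candidate = apply_update(scene, target, update)
    boxes = get_aabbs(candidate)
    tgt = boxes[target]
    in_bounds = bool(np.all(tgt[0] >= workspace[0]) and np.all(tgt[1] <= workspace[1]))
    collides = any(aabb_overlap(tgt, box) for name, box in boxes.items() if name != target)
    if in_bounds and not collides:
        return candidate, True
    return scene, False   # revert to the prior valid state; try a different proposal
```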

Circularity Check

0 steps flagged

Empirical closed-loop procedure with no derivation chain or self-referential reductions

full rationale

The paper describes an iterative empirical procedure: observe scene, evaluate against text instruction, propose single-axis pose update, apply and re-render. No equations, first-principles derivations, or parameter-fitting steps are claimed. The three inference-time techniques (multi-view selection, coordinate visualization, single-axis prediction) are presented as practical additions whose value is shown via ablation experiments rather than derived from prior results by construction. No self-citations are invoked as uniqueness theorems or load-bearing premises; performance gains are reported as direct empirical outcomes across VLMs. The method is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that off-the-shelf VLMs have sufficient 3D reasoning ability to serve as reliable agents in the described loop; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: VLMs can evaluate scene faithfulness to text instructions and propose useful pose updates without fine-tuning
    The entire closed-loop method depends on this capability being present in current models.

pith-pipeline@v0.9.0 · 5579 in / 1287 out tokens · 41614 ms · 2026-05-10T16:49:48.536375+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 15 canonical work pages · 13 internal anchors

  1. [1]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734 (2025)

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv:2410.24164 (2024)

  3. [3]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv:2212.06817 (2022)

  4. [4]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  5. [5]

    In: CVPR (2024)

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In: CVPR (2024)

  6. [6]

    In: NeurIPS (2024)

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: NeurIPS (2024)

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., et al.: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 (2025)

  8. [8]

    In: IROS (2024)

    Ding, Y., Geng, H., Xu, C., Fang, X., Zhang, J., Wei, S., Dai, Q., Zhang, Z., Wang, H.: Open6dor: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In: IROS (2024)

  9. [9]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., et al.: Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv:2505.20279 (2025)

  10. [10]

    IJRR (2023)

    Fang, H.S., Gou, M., Wang, C., Lu, C.: Robust grasping across diverse sensor qualities: The graspnet-1billion dataset. IJRR (2023)

  11. [11]

    In: CVPR (2020)

    Fang, H.S., Wang, C., Gou, M., Lu, C.: Graspnet-1billion: A large-scale benchmark for general object grasping. In: CVPR (2020)

  12. [12]

    In: ACL (2024)

    He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., Yu, D.: Webvoyager: Building an end-to-end web agent with large multimodal models. In: ACL (2024)

  13. [13]

    In: CVPR (2024)

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., Tang, J.: Cogagent: A visual language model for gui agents. In: CVPR (2024)

  14. [14]

    arXiv:2511.21688 (2025)

    Hu, W., Lin, J., Long, Y., Ran, Y., Jiang, L., Wang, Y., Zhu, C., Xu, R., Wang, T., Pang, J.: G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv:2511.21688 (2025)

  15. [15]

    In: CoRL (2023)

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: Voxposer: Composable 3d value maps for robotic manipulation with language models. In: CoRL (2023)

  16. [16]

    In: CVPR (2025)

    Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., et al.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In: CVPR (2025)

  17. [17]

    In: ICRA (2024)

    Kapelyukh, I., Ren, Y., Alzugaray, I., Johns, E.: Dream2real: Zero-shot 3d object rearrangement with vision-language models. In: ICRA (2024)

  18. [18]

    In: CoRL (2024)

    Kim, M., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: Openvla: An open-source vision-language-action model. In: CoRL (2024)

  19. [19]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

  20. [20]

    In: CoRL (2024)

    Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. In: CoRL (2024)

  21. [21]

    In: CVPR (2024)

    Lin, J., Liu, L., Lu, D., Jia, K.: Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In: CVPR (2024)

  22. [22]

    In: CVPR (2025)

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: Showui: One vision-language-action model for gui visual agent. In: CVPR (2025)

  23. [23]

    In: NeurIPS (2023)

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. In: NeurIPS (2023)

  24. [24]

    In: CoRL (2025)

    Liu, L., Lin, J., Liu, Z., Jia, K.: Picopose: Progressive pixel-to-pixel correspondence learning for novel object pose estimation. In: CoRL (2025)

  25. [25]

    In: NeurIPS (2024)

    Ma, C., Lu, K., Cheng, T.Y., Trigoni, N., Markham, A.: Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. In: NeurIPS (2024)

  26. [26]

    AVA: Attentive VLM Agent for Mastering StarCraft II

    Ma, W., Fu, Y., Zhang, Z., Ghanem, B., Li, G.: Ava: Attentive vlm agent for mastering starcraft ii. arXiv:2503.05383 (2025)

  27. [27]

    In: CVPR (2024)

    Nguyen, V.N., Groueix, T., Salzmann, M., Lepetit, V.: Gigapose: Fast and robust novel object pose estimation via one correspondence. In: CVPR (2024)

  28. [28]

    OpenAI: Chatgpt: Optimizing language models for dialogue (2023), https://openai.com/blog/chatgpt

  29. [29]

    In: ECCV (2024)

    Örnek, E.P., Labbé, Y., Tekin, B., Ma, L., Keskin, C., Forster, C., Hodan, T.: Foundpose: Unseen object pose estimation with foundation features. In: ECCV (2024)

  30. [30]

    In: ICRA (2024)

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In: ICRA (2024)

  31. [31]

    In: NeurIPS (2025)

    Qi, Z., Zhang, W., Ding, Y., Dong, R., Yu, X., Li, J., Xu, L., Li, B., He, X., Fan, G., Zhang, J., He, J., Gu, J., Jin, X., Ma, K., Zhang, Z., Wang, H., Yi, L.: Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation. In: NeurIPS (2025)

  32. [32]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv:2501.12326 (2025)

  33. [33]

    In: RSS (2025)

    Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: Spatialvla: Exploring spatial representations for visual-language-action model. In: RSS (2025)

  34. [34]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv:2408.00714 (2024)

  35. [35]

    IEEE Robotics & Automation Magazine 19(4), 72–82 (2012)

    Sucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robotics & Automation Magazine 19(4), 72–82 (2012)

  36. [36]

    In: RSS (2024)

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. In: RSS (2024)

  37. [37]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al.: Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv:2509.02544 (2025)

  38. [38]

    In: CVPR (2024)

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects. In: CVPR (2024)

  39. [39]

    In: NeurIPS (2025)

    Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. In: NeurIPS (2025)

  40. [40]

    In: CVPR (2024)

    Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., Yuan, L.: Florence-2: Advancing a unified representation for a variety of vision tasks. In: CVPR (2024)

  41. [41]

    Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

    Yin, S., Ge, J., Wang, Z.Z., Li, X., Black, M.J., Darrell, T., Kanazawa, A., Feng, H.: Vision-as-inverse-graphics agent via interleaved multimodal reasoning. arXiv:2601.11109 (2026)

  42. [42]

    In: ICML (2024)

    Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. In: ICML (2024)

  43. [43]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v(ision) is a generalist web agent, if grounded. arXiv:2401.01614 (2024)

  44. [44]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479 (2025)

  45. [45]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.W.E., Leal, I., Kuang, Y., ...
