pith. machine review for the scientific record.

arxiv: 2604.09781 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords: 6D object pose · text-guided rearrangement · closed-loop VLM agents · vision-language models · robot manipulation · inference-time techniques · multi-view reasoning · pose prediction

The pith

Closed-loop VLM agents predict text-guided goal 6D object poses through iterative reasoning and incremental pose updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can overcome their weak 3D reasoning for text-guided 6D object pose rearrangement by running in a closed loop. In each iteration, the VLM observes the current RGB-D or mesh scene, checks whether it matches the text instruction, proposes an incremental pose update for the target object, applies the change, and re-renders the scene before the next pass. Supporting techniques include multi-view reasoning with supporting view selection, object-centered coordinate visualization, and single-axis rotation prediction. A sympathetic reader would care because this turns existing VLMs into effective agents for precise 3D manipulation using only inference-time steps, beating prior methods without any fine-tuning and improving downstream robot tasks.
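Read as pseudocode, the loop is simple. A minimal sketch in Python, assuming hypothetical callables for rendering, evaluation, proposal, and pose application; none of the names, dictionary fields, or the iteration cap come from the paper.

```python
from typing import Any, Callable, Dict, List

# Hypothetical placeholder types; the paper's actual prompts, renderer,
# and VLM interface are not reproduced here.
Scene = Dict[str, Any]   # e.g. object meshes plus their 6D poses
View = Any               # a rendered RGB image

def rearrange_object(
    scene: Scene,
    instruction: str,
    target: str,
    render_views: Callable[[Scene], List[View]],
    evaluate: Callable[[List[View], str], Dict[str, Any]],   # faithfulness + best view
    propose: Callable[[View, str, str], Dict[str, Any]],     # incremental pose update
    apply_update: Callable[[Scene, str, Dict[str, Any]], Scene],
    max_iters: int = 10,
) -> Scene:
    """Observe -> evaluate -> propose -> apply -> re-render, until faithful."""
    for _ in range(max_iters):
        views = render_views(scene)                       # observe: multi-view renders
        verdict = evaluate(views, instruction)            # is the scene faithful to the text?
        if verdict["faithful"]:
            return scene                                  # instructed goal pose reached
        support = views[verdict["best_view"]]             # supporting-view selection
        update = propose(support, instruction, target)    # translation + single-axis rotation
        scene = apply_update(scene, target, update)       # apply, then the loop re-renders
    return scene                                          # best effort after max_iters
```

In this reading, termination is delegated entirely to the VLM's faithfulness judgment, which is exactly the load-bearing premise flagged below.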

Core claim

Given a 3D scene and a text instruction specifying a desired state change, the VLM repeatedly observes the scene, evaluates whether it is faithful to the instruction, proposes a pose update for the target object, applies the update, and renders the updated scene, achieving superior accuracy in predicting the text-guided goal 6D pose across both closed-source and open-source models.

What carries the argument

Closed-loop VLM agent that cycles through observe-evaluate-propose-update-render steps, augmented by multi-view reasoning with supporting view selection, object-centered coordinate system visualization, and single-axis rotation prediction.
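Of the three techniques, the object-centered coordinate system visualization is the most mechanical: project axes anchored at the target object into the rendered view and draw them as arrows so the VLM can name directions. A minimal sketch assuming a pinhole camera with known intrinsics and world-to-camera extrinsics; the paper's prompt describes the x-axis as a red arrow, while the remaining colors, the camera model, and the drawing details here are illustrative assumptions.

```python
import numpy as np
import cv2

def draw_object_axes(image, center_world, R_obj, K, R_cam, t_cam, axis_len=1.0):
    """Overlay the target object's x/y/z axes on a rendered view.

    center_world: (3,) object center in world coordinates
    R_obj:        (3,3) object orientation; columns are the object's axes
    K:            (3,3) camera intrinsics
    R_cam, t_cam: world-to-camera rotation (3,3) and translation (3,)
    axis_len:     world length of each drawn arrow
    """
    colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]  # BGR: x red, y green, z blue (assumed)
    points_world = [center_world] + [center_world + axis_len * R_obj[:, i] for i in range(3)]
    pixels = []
    for p in points_world:
        p_cam = R_cam @ p + t_cam            # world -> camera
        uvw = K @ p_cam                      # pinhole projection
        pixels.append((int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))))
    for i, color in enumerate(colors):
        cv2.arrowedLine(image, pixels[0], pixels[i + 1], color, thickness=3, tipLength=0.2)
    return image
```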

Load-bearing premise

The VLM can reliably and consistently evaluate whether the current scene matches the text instruction and propose accurate incremental pose updates that converge to the goal without getting stuck or hallucinating incorrect changes.

What would settle it

A benchmark of text instructions and starting scenes where the agent is run until termination and checked for failure to reach the instructed final 6D pose or for repeated incorrect proposals.
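A minimal sketch of such a settling experiment, assuming a hypothetical run_agent interface that returns the predicted pose and the per-iteration proposals, ground-truth goal poses per case, and 5 cm / 15° success thresholds; none of these choices are the paper's protocol.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def evaluate_benchmark(cases, run_agent, t_thresh=0.05, r_thresh=15.0):
    """cases: dicts with 'scene', 'instruction', 'target', 'goal_t' (3,), 'goal_R' (3,3).
    run_agent returns (pred_t, pred_R, proposals) after running to termination."""
    results = []
    for case in cases:
        pred_t, pred_R, proposals = run_agent(case["scene"], case["instruction"], case["target"])
        t_err = float(np.linalg.norm(pred_t - case["goal_t"]))
        r_err = rotation_error_deg(pred_R, case["goal_R"])
        # "repeated incorrect proposals": the same update proposed more than once
        repeats = len(proposals) - len({str(p) for p in proposals})
        results.append({
            "success": t_err <= t_thresh and r_err <= r_thresh,
            "translation_error_m": t_err,
            "rotation_error_deg": r_err,
            "iterations": len(proposals),
            "repeated_proposals": repeats,
        })
    return results
```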

Figures

Figures reproduced from arXiv: 2604.09781 by Gunhee Kim, Hanbyul Joo, Mingi Choi, Sangwon Baik.

Figure 1: Given a 3D scene and a text instruction specifying the desired goal state, our method uses a VLM to iteratively refine the target object's 6D pose in a closed loop until it reaches the goal.

Figure 2: Our Method Overview. (a) We first ask the evaluator to assess how well the rendered multi-view images satisfy the input text instruction. If the current scene is judged inconsistent with the instruction, we provide the proposer with the supporting view, augmented with coordinate system visualization, that best explains the evaluator's judgment. Based on this input, the proposer predicts an incremental 6D …

Figure 3: Qualitative Comparison in the 6-DoF Rearrangement Task of the …

Figure 4: Qualitative Comparison on Robot Manipulation Using Open6DOR.

Figure 5: Qualitative Ablation Study of Inference-time Techniques.

Figure 6: Qualitative Comparison in the Robot Manipulation Task of the SIM…

Figure 7: Effect of the Maximum Number of Iterations on Open6DOR V2.

Figure 8: Additional Qualitative Results on the 6-DoF Rearrangement Task.

Figure 9: Additional Qualitative Results on the Robot Manipulation Task.

Figure 10: System Prompt for Target and Related Object Selection.

Figure 11: System Prompt for Faithfulness Check. This prompt guides the VLM to assess whether the multi-view images are faithful to the text instruction and to select the view that best supports its judgment.

Figure 12: System Prompt for 6D Pose Rearrangement Prediction.
read the original abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
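The third inference-time technique, single-axis rotation prediction, constrains each proposed update to a translation plus one rotation about a single object-centered axis. A minimal sketch of applying such an update to a 6D pose; the axis/angle fields and right-hand-rule convention echo the paper's prompt, but the composition order and translation scaling here are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_single_axis_update(t_obj, R_obj, update, axis_scale=1.0):
    """Apply one proposed increment to a 6D pose.

    t_obj:  (3,) object translation in world coordinates
    R_obj:  (3,3) object rotation; columns are the object's axes in the world frame
    update: {"translation": [Tx, Ty, Tz],   # expressed along the object-centered axes
             "axis": "x" | "y" | "z",       # dominant rotation axis
             "angle_deg": float}            # counterclockwise, right-hand rule
    axis_scale: world length of one drawn axis arrow (translations are scaled by it)
    """
    axis_idx = {"x": 0, "y": 1, "z": 2}[update["axis"]]
    axis_world = R_obj[:, axis_idx]                    # rotate about the object's own axis
    dR = R.from_rotvec(np.radians(update["angle_deg"]) * axis_world).as_matrix()
    dt = axis_scale * (R_obj @ np.asarray(update["translation"], dtype=float))
    return t_obj + dt, dR @ R_obj                      # assumed world-frame composition
```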

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes a closed-loop VLM agent for text-guided 6D object pose rearrangement. Given an RGB-D image or compositional 3D mesh scene and a text instruction describing a desired state change for a target object, the VLM iteratively observes the current scene, evaluates whether it matches the instruction, proposes an incremental single-axis pose update (rotation or translation) in the object-centered frame, applies the update, and re-renders the scene until convergence. Three inference-time techniques are introduced as essential: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. The central claim is that this approach, without any fine-tuning or new modules, surpasses prior methods in predicting the text-guided goal 6D pose, works consistently across closed-source and open-source VLMs, and yields higher success rates in downstream robot manipulation when paired with motion planning. An ablation study is presented to show the necessity of each technique.

Significance. If the quantitative results and consistency claims hold, the work would be significant for demonstrating that targeted inference-time interventions and closed-loop iteration can unlock substantially better 3D spatial reasoning in existing VLMs for pose rearrangement and manipulation tasks. This offers a training-free alternative to specialized 3D models or fine-tuning, with potential impact on zero-shot robot planning. The agentic formulation and the three specific techniques provide a concrete, reproducible recipe that could be adopted broadly in vision-language robotics.

major comments (3)
  1. [§4 and ablation study] §4 (Experiments) and ablation study: The central claim that the method 'surpasses prior methods' and 'works consistently across both closed-source and open-source VLMs' rests on the VLM reliably judging scene-instruction match and proposing converging single-axis updates. However, the ablation only demonstrates necessity of the three techniques; it does not report quantitative metrics on closed-loop reliability such as per-iteration success rate, average iterations to convergence, rate of hallucinated mismatches, or divergence into local minima. This omission leaves the weakest assumption (VLM consistency) insufficiently tested, especially for weaker open-source models.
  2. [Main results table / §4] Main results (presumably Table 1 or equivalent): The abstract asserts 'dramatic performance gains' and superiority over priors, yet the manuscript provides no concrete numbers, baseline comparisons, or error analysis in the visible description. Without these, it is impossible to evaluate whether the gains are robust or merely marginal, which is load-bearing for the headline claim.
  3. [Method / closed-loop procedure] Method section (closed-loop description): The process assumes that VLM-proposed updates can always be applied without producing invalid configurations (e.g., collisions or out-of-bounds poses). No mechanism for validity checking or recovery is described; if such cases occur, the loop may fail to converge, undermining the reported robot manipulation improvements.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'dramatic performance gains' is qualitative; replace or supplement with a brief quantitative statement (e.g., 'X% improvement in pose error') that points to the main table.
  2. [Figures and §3] Notation and figures: Ensure that the object-centered coordinate visualization and single-axis prediction are illustrated with clear before/after renderings and coordinate axes in the figures; current description is high-level.
  3. [Related work] Related work: The positioning against prior VLM-agent and 6D pose papers could be expanded to clarify novelty of the three inference-time techniques versus existing prompting strategies.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These have helped us strengthen the evaluation of the closed-loop reliability and clarify key aspects of the results and method. We address each major comment below and have made corresponding revisions to the manuscript.

read point-by-point responses
  1. Referee: [§4 and ablation study] §4 (Experiments) and ablation study: The central claim that the method 'surpasses prior methods' and 'works consistently across both closed-source and open-source VLMs' rests on the VLM reliably judging scene-instruction match and proposing converging single-axis updates. However, the ablation only demonstrates necessity of the three techniques; it does not report quantitative metrics on closed-loop reliability such as per-iteration success rate, average iterations to convergence, rate of hallucinated mismatches, or divergence into local minima. This omission leaves the weakest assumption (VLM consistency) insufficiently tested, especially for weaker open-source models.

    Authors: We agree that additional quantitative metrics on closed-loop behavior are necessary to fully support the consistency claims. In the revised manuscript, we have expanded §4 with a new analysis subsection that reports average iterations to convergence, per-iteration match judgment success rates, rates of hallucinated mismatches, and divergence cases, broken down by VLM type. These additions directly test the reliability of the iterative process beyond the existing ablation on the three techniques. revision: yes

  2. Referee: [Main results table / §4] Main results (presumably Table 1 or equivalent): The abstract asserts 'dramatic performance gains' and superiority over priors, yet the manuscript provides no concrete numbers, baseline comparisons, or error analysis in the visible description. Without these, it is impossible to evaluate whether the gains are robust or merely marginal, which is load-bearing for the headline claim.

    Authors: The full manuscript presents the main quantitative results, including success rates and baseline comparisons, in Table 1 of §4. We have added an error analysis subsection in the revised version to discuss the robustness of these gains and have updated the abstract to reference the key quantitative improvements for greater clarity. revision: partial

  3. Referee: [Method / closed-loop procedure] Method section (closed-loop description): The process assumes that VLM-proposed updates can always be applied without producing invalid configurations (e.g., collisions or out-of-bounds poses). No mechanism for validity checking or recovery is described; if such cases occur, the loop may fail to converge, undermining the reported robot manipulation improvements.

    Authors: We acknowledge that the original method section did not explicitly describe handling of invalid configurations. In the revised manuscript, we have added a validity verification step that checks proposed updates against scene boundaries and potential collisions using the available depth information, along with a recovery mechanism to revert to the prior valid state and attempt an alternative update if needed. revision: yes
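A minimal sketch of what the validity verification and recovery step described in the last response could look like, using axis-aligned bounding boxes as a stand-in for the depth-based collision test; the helper interfaces, workspace-bounds representation, and rejection policy are assumptions, not the authors' implementation.

```python
import numpy as np

def aabb_overlap(box_a, box_b):
    """box = (min_xyz, max_xyz), both (3,) arrays. True if the boxes intersect."""
    return bool(np.all(box_a[0] < box_b[1]) and np.all(box_b[0] < box_a[1]))

def checked_apply(scene, target, update, apply_update, get_aabbs, workspace):
    """Apply a proposed update only if it stays in bounds and collision-free.

    apply_update, get_aabbs: hypothetical helpers (apply the pose change; return
    {name: (min_xyz, max_xyz)} for all objects). workspace: (min_xyz, max_xyz).
    Returns (new_scene, accepted); on rejection the prior scene is kept so the
    agent can be re-prompted for an alternative update.
    """
    candidate = apply_update(scene, target, update)
    boxes = get_aabbs(candidate)
    tgt = boxes[target]
    in_bounds = bool(np.all(tgt[0] >= workspace[0]) and np.all(tgt[1] <= workspace[1]))
    collides = any(aabb_overlap(tgt, box) for name, box in boxes.items() if name != target)
    if in_bounds and not collides:
        return candidate, True
    return scene, False   # revert to the prior valid state; try a different proposal
```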

Circularity Check

0 steps flagged

Empirical closed-loop procedure with no derivation chain or self-referential reductions

full rationale

The paper describes an iterative empirical procedure: observe scene, evaluate against text instruction, propose single-axis pose update, apply and re-render. No equations, first-principles derivations, or parameter-fitting steps are claimed. The three inference-time techniques (multi-view selection, coordinate visualization, single-axis prediction) are presented as practical additions whose value is shown via ablation experiments rather than derived from prior results by construction. No self-citations are invoked as uniqueness theorems or load-bearing premises; performance gains are reported as direct empirical outcomes across VLMs. The method is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that off-the-shelf VLMs have sufficient 3D reasoning ability to serve as reliable agents in the described loop; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: VLMs can evaluate scene faithfulness to text instructions and propose useful pose updates without fine-tuning
    The entire closed-loop method depends on this capability being present in current models.

pith-pipeline@v0.9.0 · 5579 in / 1287 out tokens · 41614 ms · 2026-05-10T16:49:48.536375+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 15 canonical work pages · 13 internal anchors

  1. [1]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734 (2025)

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv:2410.24164 (2024)

  3. [3]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv:2212.06817 (2022)

  4. [4]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  5. [5]

    In: CVPR (2024)

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In: CVPR (2024)

  6. [6]

    In: NeurIPS (2024)

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: NeurIPS (2024)

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., et al.: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 (2025)

  8. [8]

    In: IROS (2024)

    Ding, Y., Geng, H., Xu, C., Fang, X., Zhang, J., Wei, S., Dai, Q., Zhang, Z., Wang, H.: Open6dor: Benchmarking open-instruction 6-dof object rearrangement and a vlm-based approach. In: IROS (2024)

  9. [9]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., et al.: Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv:2505.20279 (2025)

  10. [10]

    IJRR (2023)

    Fang, H.S., Gou, M., Wang, C., Lu, C.: Robust grasping across diverse sensor qualities: The graspnet-1billion dataset. IJRR (2023)

  11. [11]

    In: CVPR (2020)

    Fang, H.S., Wang, C., Gou, M., Lu, C.: Graspnet-1billion: A large-scale benchmark for general object grasping. In: CVPR (2020)

  12. [12]

    In: ACL (2024)

    He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., Yu, D.: Webvoyager: Building an end-to-end web agent with large multimodal models. In: ACL (2024)

  13. [13]

    In: CVPR (2024)

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., Tang, J.: Cogagent: A visual language model for gui agents. In: CVPR (2024)

  14. [14]

    arXiv:2511.21688 (2025)

    Hu, W., Lin, J., Long, Y., Ran, Y., Jiang, L., Wang, Y., Zhu, C., Xu, R., Wang, T., Pang, J.: G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv:2511.21688 (2025)

  15. [15]

    In: CoRL (2023)

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: Voxposer: Composable 3d value maps for robotic manipulation with language models. In: CoRL (2023)

  16. [16]

    In: CVPR (2025)

    Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., et al.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In: CVPR (2025)

  17. [17]

    In: ICRA (2024)

    Kapelyukh, I., Ren, Y., Alzugaray, I., Johns, E.: Dream2real: Zero-shot 3d object rearrangement with vision-language models. In: ICRA (2024)

  18. [18]

    In: CoRL (2024)

    Kim, M., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: Openvla: An open-source vision-language-action model. In: CoRL (2024)

  19. [19]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

  20. [20]

    In: CoRL (2024)

    Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. In: CoRL (2024)

  21. [21]

    In: CVPR (2024)

    Lin, J., Liu, L., Lu, D., Jia, K.: Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In: CVPR (2024)

  22. [22]

    In: CVPR (2025)

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: Showui: One vision-language-action model for gui visual agent. In: CVPR (2025)

  23. [23]

    In: NeurIPS (2023)

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. In: NeurIPS (2023)

  24. [24]

    In: CoRL (2025)

    Liu, L., Lin, J., Liu, Z., Jia, K.: Picopose: Progressive pixel-to-pixel correspondence learning for novel object pose estimation. In: CoRL (2025)

  25. [25]

    In: NeurIPS (2024)

    Ma, C., Lu, K., Cheng, T.Y., Trigoni, N., Markham, A.: Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. In: NeurIPS (2024)

  26. [26]

    AVA: Attentive VLM Agent for Mastering StarCraft II

    Ma, W., Fu, Y., Zhang, Z., Ghanem, B., Li, G.: Ava: Attentive vlm agent for mastering starcraft ii. arXiv:2503.05383 (2025)

  27. [27]

    In: CVPR (2024)

    Nguyen, V.N., Groueix, T., Salzmann, M., Lepetit, V.: Gigapose: Fast and robust novel object pose estimation via one correspondence. In: CVPR (2024)

  28. [28]

    OpenAI: Chatgpt: Optimizing language models for dialogue (2023), https://openai.com/blog/chatgpt

  29. [29]

    In: ECCV (2024)

    Örnek, E.P., Labbé, Y., Tekin, B., Ma, L., Keskin, C., Forster, C., Hodan, T.: Foundpose: Unseen object pose estimation with foundation features. In: ECCV (2024)

  30. [30]

    In: ICRA (2024)

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In: ICRA (2024)

  31. [31]

    In: NeurIPS (2025)

    Qi, Z., Zhang, W., Ding, Y., Dong, R., Yu, X., Li, J., Xu, L., Li, B., He, X., Fan, G., Zhang, J., He, J., Gu, J., Jin, X., Ma, K., Zhang, Z., Wang, H., Yi, L.: Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation. In: NeurIPS (2025)

  32. [32]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv:2501.12326 (2025)

  33. [33]

    In: RSS (2025)

    Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: Spatialvla: Exploring spatial representations for visual-language-action model. In: RSS (2025)

  34. [34]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv:2408.00714 (2024)

  35. [35]

    IEEE Robotics & Automation Magazine 19(4), 72–82 (2012)

    Sucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robotics & Automation Magazine 19(4), 72–82 (2012)

  36. [36]

    In: RSS (2024)

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. In: RSS (2024)

  37. [37]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al.: Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv:2509.02544 (2025)

  38. [38]

    In: CVPR (2024)

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects. In: CVPR (2024)

  39. [39]

    In: NeurIPS (2025)

    Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. In: NeurIPS (2025)

  40. [40]

    In: CVPR (2024)

    Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., Yuan, L.: Florence-2: Advancing a unified representation for a variety of vision tasks. In: CVPR (2024)

  41. [41]

    Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

    Yin, S., Ge, J., Wang, Z.Z., Li, X., Black, M.J., Darrell, T., Kanazawa, A., Feng, H.: Vision-as-inverse-graphics agent via interleaved multimodal reasoning. arXiv:2601.11109 (2026)

  42. [42]

    In: ICML (2024)

    Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. In: ICML (2024)

  43. [43]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v(ision) is a generalist web agent, if grounded. arXiv:2401.01614 (2024)

  44. [44]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479 (2025)

  45. [45]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.W.E., Leal, I., Kuang, Y., ...
