pith. machine review for the scientific record.

arxiv: 2507.05791 · v5 · pith:OCCWOOYXnew · submitted 2025-07-08 · 💻 cs.AI

GTA1: GUI Test-time Scaling Agent

Pith reviewed 2026-05-17 13:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · test-time scaling · reinforcement learning · action proposal · visual grounding · agent benchmarks

The pith

GTA1 uses test-time scaling to select optimal action proposals and reinforcement learning to enhance visual grounding for GUI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces GTA1 to tackle two key issues in GUI agents: choosing the right sequence of actions from many possibilities and precisely interacting with visual elements on screens. It samples multiple action proposals at each step and employs a judge model to pick the best one, using extra computation to improve decision making. It also trains a model with reinforcement learning to better align actions with interface elements by rewarding correct clicks. A sympathetic reader would care because this could make autonomous agents more capable at completing tasks on computers and other platforms without constant human oversight.

Core claim

GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks by conducting test-time scaling to select the most appropriate action proposal and proposing a model that improves grounding through reinforcement learning.

What carries the argument

Test-time scaling that samples multiple candidate action proposals and selects the best via a judge model, along with reinforcement learning to reward successful clicks for better grounding.
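
The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_action`, `toy_propose`, `toy_judge`, and the toy scoring rule are hypothetical stand-ins for the proposer and judge models, which in practice would be model calls.

```python
import random
from typing import Callable, List, Tuple


def select_action(
    propose: Callable[[str, random.Random], str],
    judge: Callable[[str, List[str]], int],
    observation: str,
    num_samples: int = 4,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Sample several candidate proposals and return the one the judge picks."""
    rng = random.Random(seed)
    # Sampling more candidates spends more inference compute per step.
    candidates = [propose(observation, rng) for _ in range(num_samples)]
    chosen_index = judge(observation, candidates)
    return candidates[chosen_index], candidates


def toy_propose(observation: str, rng: random.Random) -> str:
    # Hypothetical proposer: picks one of three canned actions at random.
    return rng.choice(["click('Save')", "click('Cancel')", "hotkey('ctrl', 's')"])


def toy_judge(observation: str, candidates: List[str]) -> int:
    # Hypothetical judge: prefers hot-key actions, purely for illustration.
    scores = [1.0 if c.startswith("hotkey") else 0.0 for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


action, candidates = select_action(
    toy_propose, toy_judge, "editor with an unsaved file", num_samples=8
)
```

The proposer and judge are passed in as separate callables, so the same loop can be rerun with a different judge to check how much of any gain comes from the selection step itself.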

If this is right

  • More reliable task completion on complex graphical interfaces.
  • Improved accuracy in selecting action plans under large choice spaces.
  • Trade-off allowing higher performance at the cost of additional computation during inference.
  • Applicability across platforms such as Linux.
  • State-of-the-art results on standard benchmarks for GUI agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar test-time scaling could be applied to other types of AI agents beyond GUI tasks.
  • Combining this with larger models might further amplify the benefits of the selection process.
  • Future work could explore better judge models to reduce any potential selection bias.
  • The RL approach for grounding might extend to other visual interaction tasks.

Load-bearing premise

The judge model can accurately and without bias choose the best action proposal from the sampled candidates.

What would settle it

Demonstrating a consistent failure mode where the judge selects inferior actions leading to task failure on specific interfaces would challenge the core selection mechanism.
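
One way to look for such a failure mode is to measure how often the judge rejects a correct candidate that was available. The sketch below assumes per-step correctness labels for every sampled candidate, which the text above does not say the benchmarks provide; `judge_miss_rate` and the example data are hypothetical.

```python
from typing import Sequence


def judge_miss_rate(
    correct_flags: Sequence[Sequence[bool]], chosen: Sequence[int]
) -> float:
    """Fraction of recoverable steps where the judge picked an incorrect candidate.

    A step counts as recoverable when at least one sampled candidate was correct.
    """
    recoverable = 0
    missed = 0
    for flags, index in zip(correct_flags, chosen):
        if any(flags):
            recoverable += 1
            if not flags[index]:
                missed += 1
    return missed / recoverable if recoverable else 0.0


# Hypothetical labels: four steps, three candidates each.
flags = [
    [True, False, False],
    [False, False, False],  # no correct candidate, so not counted
    [False, True, True],
    [True, True, False],
]
chosen = [1, 0, 2, 0]
rate = judge_miss_rate(flags, chosen)  # one miss out of three recoverable steps
```

A miss rate that stays high on particular interfaces would point at the judge rather than at the proposer.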

original abstract

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., the action proposal sequence) under expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces GTA1, a GUI agent for autonomous task completion on interfaces such as Linux. It addresses expansive action spaces via test-time scaling: at each step, multiple action proposals are concurrently sampled and the best one is selected by a judge model. Grounding of the selected actions to visual elements is improved via reinforcement learning that rewards successful clicks. The manuscript reports state-of-the-art results on both grounding and agent task execution benchmarks and releases code and models.

Significance. If the reported gains hold under rigorous controls, the work would show that test-time compute can be traded for better planning quality in GUI agents and that RL provides an objective alignment benefit for grounding. The open release of code and models strengthens reproducibility and would allow the community to test the scaling hypothesis directly.

major comments (3)
  1. Abstract and experimental sections: the SOTA claim on agent task execution is presented without any reported baselines, statistical significance tests, error bars, or explicit data splits, so the central empirical result cannot be evaluated from the provided text.
  2. Method description of test-time scaling: no ablation or architectural detail is given showing that the judge model is trained or prompted independently of the proposer; without this separation the selection step risks reducing to self-consistency or majority voting, which would not substantiate the claimed new scaling benefit.
  3. Grounding via RL section: the claim that RL inherently aligns with successful clicks is stated without a concrete reward formulation, training objective, or comparison to supervised grounding baselines, leaving the contribution of the RL component unclear.
minor comments (1)
  1. Notation for action proposals and judge scoring should be formalized with equations rather than prose descriptions to improve clarity.
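
As an illustration of what such a formalization could look like, the block below writes the sampling and selection step as equations. The symbols are assumptions introduced here, not the paper's notation: $I$ is the user instruction, $o_t$ the screenshot at step $t$, $h_t$ the interaction history, $\pi_{\text{plan}}$ the proposer, $J$ the judge's scoring function, $K$ the number of samples, and $g$ the grounding model that maps the selected proposal to an executable action $a_t$.

```latex
\begin{align}
p_t^{(k)} &\sim \pi_{\text{plan}}(\cdot \mid I, o_t, h_t), \qquad k = 1, \dots, K, \\
k^\star &= \arg\max_{k \in \{1, \dots, K\}} J\big(p_t^{(k)} \mid I, o_t, h_t\big), \\
a_t &= g\big(p_t^{(k^\star)}, o_t\big).
\end{align}
```

Written this way, the referee's second major comment becomes a concrete question: whether $J$ is a different function from $\pi_{\text{plan}}$, or only a vote over its samples.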

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications based on the manuscript content and indicating planned revisions to improve clarity and completeness.

point-by-point responses
  1. Referee: Abstract and experimental sections: the SOTA claim on agent task execution is presented without any reported baselines, statistical significance tests, error bars, or explicit data splits, so the central empirical result cannot be evaluated from the provided text.

    Authors: Section 4 of the manuscript reports results on standard GUI agent benchmarks with direct numerical comparisons to prior methods serving as baselines. To address the concern about evaluation rigor, we will add error bars from multiple runs, statistical significance tests, and explicit statements of the data splits in the revised experimental section and abstract where appropriate. revision: yes

  2. Referee: Method description of test-time scaling: no ablation or architectural detail is given showing that the judge model is trained or prompted independently of the proposer; without this separation the selection step risks reducing to self-consistency or majority voting, which would not substantiate the claimed new scaling benefit.

    Authors: The judge model is architecturally and prompt-wise independent from the proposer model, using a distinct evaluation-focused prompt. We will expand the method section with these details and add an ablation comparing judge-based selection to self-consistency and majority voting to demonstrate the distinct benefit of the scaling approach. revision: yes

  3. Referee: Grounding via RL section: the claim that RL inherently aligns with successful clicks is stated without a concrete reward formulation, training objective, or comparison to supervised grounding baselines, leaving the contribution of the RL component unclear.

    Authors: We will revise the RL grounding section to include the explicit reward formulation (positive reward for successful element clicks), the policy optimization objective, and direct comparisons against supervised grounding baselines to clarify the RL contribution. revision: yes
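
A reward of the kind described in the third response could look like the sketch below. It assumes a binary reward and an axis-aligned target bounding box in pixel coordinates; the text above does not give the manuscript's exact formulation, and `click_reward` and `BBox` are hypothetical names.

```python
from typing import Tuple

# (left, top, right, bottom) in pixels; an assumed representation of the target element.
BBox = Tuple[float, float, float, float]


def click_reward(click: Tuple[float, float], target: BBox) -> float:
    """Return 1.0 when the predicted click lands inside the target box, else 0.0."""
    x, y = click
    left, top, right, bottom = target
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0


save_button: BBox = (40.0, 10.0, 100.0, 30.0)
hit = click_reward((50.0, 20.0), save_button)
miss = click_reward((5.0, 5.0), save_button)
```

The reward depends only on whether the click succeeds, not on its distance to a labeled center point, which is one reading of the "objective alignment" claim in the abstract.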

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

full rationale

The paper presents GTA1 as an empirical GUI agent that performs test-time scaling by sampling multiple action proposals and selecting via a judge model, plus RL-based grounding improvement. No equations, derivations, first-principles results, or mathematical chains appear in the abstract or described content. Performance claims are supported by benchmark outcomes rather than any self-referential fitting, parameter renaming, or self-citation that reduces the central result to its own inputs by construction. The method is a standard test-time compute technique without evident self-definitional or load-bearing circular steps, making the contribution self-contained as an experimental system.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes that sampling multiple proposals and judging them improves decision quality and that RL rewards align with successful visual grounding; these are domain assumptions without independent verification in the provided text.

axioms (2)
  • domain assumption: A judge model can accurately rank action proposals sampled from the policy.
    Invoked when describing test-time scaling selection.
  • domain assumption: The reinforcement learning objective aligns with precise visual grounding.
    Stated as the key insight for the grounding model.

pith-pipeline@v0.9.0 · 5562 in / 1150 out tokens · 53687 ms · 2026-05-17T13:49:50.823266+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  3. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  4. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  5. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  6. Benchmarking and Improving GUI Agents in High-Dynamic Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...

  7. Benchmarking and Improving GUI Agents in High-Dynamic Environments

    cs.CV 2026-04 conditional novelty 7.0

    DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...

  8. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  9. Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

    cs.SE 2026-03 unverdicted novelty 7.0

    VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

  10. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  11. AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.

  12. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  13. IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 5.0

    An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  16. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  17. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  18. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  19. See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

    cs.CV 2026-04 unverdicted novelty 5.0

    Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
