GTA1: GUI Test-time Scaling Agent
Pith reviewed 2026-05-17 13:49 UTC · model grok-4.3
The pith
GTA1 uses test-time scaling to select optimal action proposals and reinforcement learning to enhance visual grounding for GUI agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks by combining two components: test-time scaling, which selects the most appropriate action proposal at each step, and a grounding model improved through reinforcement learning.
What carries the argument
Test-time scaling that samples multiple candidate action proposals and selects the best via a judge model, along with reinforcement learning to reward successful clicks for better grounding.
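The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `select_action`, `toy_propose`, and `toy_judge` are hypothetical names, the judge is assumed to return the index of its preferred candidate, and candidates are sampled sequentially here for simplicity, whereas the abstract describes concurrent sampling.

```python
import random
from typing import Callable, List, Sequence


def select_action(
    propose: Callable[[str, random.Random], str],
    judge: Callable[[str, Sequence[str]], int],
    observation: str,
    num_candidates: int = 4,
    seed: int = 0,
) -> str:
    """Sample several candidate action proposals, then let a judge pick one.

    `propose` and `judge` are hypothetical stand-ins for the proposer and
    judge models; the judge is assumed to return an index into the candidates.
    """
    rng = random.Random(seed)
    candidates: List[str] = [
        propose(observation, rng) for _ in range(num_candidates)
    ]
    chosen = judge(observation, candidates)
    if not 0 <= chosen < len(candidates):
        raise ValueError("judge returned an out-of-range index")
    return candidates[chosen]


def toy_propose(observation: str, rng: random.Random) -> str:
    """Toy proposer: picks one of three fixed actions at random."""
    return rng.choice(["click('Save')", "hotkey('ctrl', 's')", "click('Cancel')"])


def toy_judge(observation: str, candidates: Sequence[str]) -> int:
    """Toy judge: prefers the first hotkey action, else the first candidate."""
    for index, candidate in enumerate(candidates):
        if candidate.startswith("hotkey"):
            return index
    return 0


if __name__ == "__main__":
    print(select_action(toy_propose, toy_judge, "screenshot placeholder"))
```

Raising `num_candidates` is the knob that trades extra inference compute for a larger pool for the judge to choose from.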
If this is right
- More reliable task completion on complex graphical interfaces.
- Improved accuracy in selecting action plans under large choice spaces.
- Trade-off allowing higher performance at the cost of additional computation during inference.
- Applicability across platforms (Linux is the example the abstract names).
- State-of-the-art results on standard benchmarks for GUI agents.
Where Pith is reading between the lines
- Similar test-time scaling could be applied to other types of AI agents beyond GUI tasks.
- Combining this with larger models might further amplify the benefits of the selection process.
- Future work could explore better judge models to reduce any potential selection bias.
- The RL approach for grounding might extend to other visual interaction tasks.
Load-bearing premise
The judge model can accurately and without bias choose the best action proposal from the sampled candidates.
What would settle it
Demonstrating a consistent failure mode where the judge selects inferior actions leading to task failure on specific interfaces would challenge the core selection mechanism.
Original abstract
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., the action proposal sequence) under expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GTA1, a GUI agent for autonomous task completion on interfaces such as Linux. It addresses expansive action spaces via test-time scaling: at each step, multiple action proposals are concurrently sampled and the best one is selected by a judge model. Grounding of the selected actions to visual elements is improved via reinforcement learning that rewards successful clicks. The manuscript reports state-of-the-art results on both grounding and agent task execution benchmarks and releases code and models.
Significance. If the reported gains hold under rigorous controls, the work would show that test-time compute can be traded for better planning quality in GUI agents and that RL provides an objective alignment benefit for grounding. The open release of code and models strengthens reproducibility and would allow the community to test the scaling hypothesis directly.
major comments (3)
- Abstract and experimental sections: the SOTA claim on agent task execution is presented without any reported baselines, statistical significance tests, error bars, or explicit data splits, so the central empirical result cannot be evaluated from the provided text.
- Method description of test-time scaling: no ablation or architectural detail is given showing that the judge model is trained or prompted independently of the proposer; without this separation the selection step risks reducing to self-consistency or majority voting, which would not substantiate the claimed new scaling benefit.
- Grounding via RL section: the claim that RL inherently aligns with successful clicks is stated without a concrete reward formulation, training objective, or comparison to supervised grounding baselines, leaving the contribution of the RL component unclear.
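The distinction raised in the second major comment, between judge-based selection and majority voting, can be made concrete with a small sketch. Everything here is illustrative: `majority_vote`, `judge_select`, and the toy scoring function are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(candidates: Sequence[str]) -> str:
    """Self-consistency style selection: the most frequent candidate wins.

    Ties are broken in favour of the candidate that appears first.
    """
    counts = Counter(candidates)
    best = max(counts.values())  # raises ValueError on an empty sequence
    return next(c for c in candidates if counts[c] == best)


def judge_select(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Judge-based selection: the highest-scoring candidate wins.

    `score` is a hypothetical stand-in for a judge that evaluates each
    proposal on its merits rather than on how often it was sampled.
    """
    return max(candidates, key=score)  # raises ValueError on an empty sequence


if __name__ == "__main__":
    sampled = ["click('Cancel')", "click('Cancel')", "hotkey('ctrl', 's')"]

    def toy_score(action: str) -> float:
        # Toy preference, assumed only for this example.
        return 1.0 if action.startswith("hotkey") else 0.0

    print(majority_vote(sampled))            # frequency-based choice
    print(judge_select(sampled, toy_score))  # merit-based choice
```

The two rules disagree whenever the best proposal is sampled less often than a weaker one, which is the case an ablation of the kind the referee requests would need to isolate.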
minor comments (1)
- Notation for action proposals and judge scoring should be formalized with equations rather than prose descriptions to improve clarity.
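One way the requested notation might look is sketched below. It is an assumption offered for illustration, not the paper's own formalization: the symbols $g$ (user instruction), $o_t$ (observation at step $t$), $\pi$ (proposer), $J$ (judge), and $K$ (number of sampled candidates) are all introduced here.

```latex
% Hypothetical notation; none of these symbols come from the paper.
\begin{align}
  a_t^{(k)} &\sim \pi(\,\cdot \mid g, o_t), \qquad k = 1, \dots, K, \\
  k^\ast &= J\bigl(g, o_t, \{a_t^{(k)}\}_{k=1}^{K}\bigr) \in \{1, \dots, K\}, \\
  \hat{a}_t &= a_t^{(k^\ast)}.
\end{align}
```

Under this reading, the proposer draws $K$ candidates per step, the judge returns the index of one of them, and the selected proposal $\hat{a}_t$ is the one passed on to grounding.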
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications based on the manuscript content and indicating planned revisions to improve clarity and completeness.
- Referee: Abstract and experimental sections: the SOTA claim on agent task execution is presented without any reported baselines, statistical significance tests, error bars, or explicit data splits, so the central empirical result cannot be evaluated from the provided text.
  Authors: Section 4 of the manuscript reports results on standard GUI agent benchmarks with direct numerical comparisons to prior methods serving as baselines. To address the concern about evaluation rigor, we will add error bars from multiple runs, statistical significance tests, and explicit statements of the data splits in the revised experimental section and abstract where appropriate.
  Revision: yes
- Referee: Method description of test-time scaling: no ablation or architectural detail is given showing that the judge model is trained or prompted independently of the proposer; without this separation the selection step risks reducing to self-consistency or majority voting, which would not substantiate the claimed new scaling benefit.
  Authors: The judge model is architecturally and prompt-wise independent from the proposer model, using a distinct evaluation-focused prompt. We will expand the method section with these details and add an ablation comparing judge-based selection to self-consistency and majority voting to demonstrate the distinct benefit of the scaling approach.
  Revision: yes
- Referee: Grounding via RL section: the claim that RL inherently aligns with successful clicks is stated without a concrete reward formulation, training objective, or comparison to supervised grounding baselines, leaving the contribution of the RL component unclear.
  Authors: We will revise the RL grounding section to include the explicit reward formulation (positive reward for successful element clicks), the policy optimization objective, and direct comparisons against supervised grounding baselines to clarify the RL contribution.
  Revision: yes
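A "positive reward for successful element clicks" could take a form like the sketch below. This is an assumption for illustration only: the binary 0/1 values, the axis-aligned bounding box, the inclusive boundary check, and the names `Box` and `click_reward` are not specified anywhere in the text above.

```python
from typing import Tuple

# Hypothetical target representation: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def click_reward(click: Tuple[float, float], target: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target box, else 0.0.

    The binary values and the inclusive boundary are assumptions; the source
    only states that successful clicks receive a positive reward.
    """
    x, y = click
    left, top, right, bottom = target
    if left > right or top > bottom:
        raise ValueError("target box must satisfy left <= right and top <= bottom")
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0


if __name__ == "__main__":
    button = (100.0, 200.0, 180.0, 230.0)
    print(click_reward((140.0, 215.0), button))  # click inside the button
    print(click_reward((90.0, 215.0), button))   # click left of the button
```

A reward of this shape depends only on where the click lands, which is one way to read the abstract's remark that the RL objective aligns with the grounding task.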
Circularity Check
No significant circularity; claims rest on empirical benchmarks
The paper presents GTA1 as an empirical GUI agent that performs test-time scaling by sampling multiple action proposals and selecting via a judge model, plus RL-based grounding improvement. No equations, derivations, first-principles results, or mathematical chains appear in the abstract or described content. Performance claims are supported by benchmark outcomes rather than any self-referential fitting, parameter renaming, or self-citation that reduces the central result to its own inputs by construction. The method is a standard test-time compute technique without evident self-definitional or load-bearing circular steps, making the contribution self-contained as an experimental system.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A judge model can accurately rank action proposals sampled from the policy
- domain assumption Reinforcement learning objective aligns with precise visual grounding
Forward citations
Cited by 20 Pith papers
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
  An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Benchmarking and Improving GUI Agents in High-Dynamic Environments
  DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
- Benchmarking and Improving GUI Agents in High-Dynamic Environments
  DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
  GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
- Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
  VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
  ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
  AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
  IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
- MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
  MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
  An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
  The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
  A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
  HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
  Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.