Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

Ludwig Schmidt; Serena Yeung-Levy; Xiaohan Wang; Xueqiao Sun; Yuhui Zhang

arXiv: 2606.31270 · v1 · pith:ZDDH5CEBnew · submitted 2026-06-30 · 💻 cs.CV · cs.AI · cs.CL · cs.CY · cs.LG

Pith reviewed 2026-07-01 06:14 UTC · model grok-4.3

keywords: computer-use agents, self-improvement, failure-driven learning, inference-time patching, OSWorld benchmark, multimodal LLMs, agent trajectories, code patches

The pith

Agents improve at inference time by learning from their own failed trajectories through LLM-generated code patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that computer-use agents can be improved by analyzing their failures rather than only successes. An LLM diagnoses problems in failed trajectories and generates code patches to fix them. These patches are lightly checked by humans and then applied during inference. This leads to better performance without any model retraining.

Core claim

By using an LLM to diagnose failure modes from agent trajectories, propose solutions, and generate code patches that are lightly verified by humans, the success rate of the OpenCUA-72B model on the OSWorld benchmark is improved from 42.3% to 48.9%, a gain of 6.6 percentage points, without additional training cost and with only modest inference overhead.
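
The headline number is an absolute gain in percentage points, which is easy to confuse with a relative improvement. A minimal sketch of the arithmetic, using only the two success rates quoted above (the helper name `gain` is illustrative, not from the paper):

```python
from typing import Tuple


def gain(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after_pct - before_pct
    relative_pct = 100.0 * absolute_pp / before_pct
    return absolute_pp, relative_pct


# Success rates reported for OpenCUA-72B on OSWorld, before and after patching.
absolute_pp, relative_pct = gain(42.3, 48.9)
print(f"absolute: {absolute_pp:.1f} pp, relative: {relative_pct:.1f}%")
# prints "absolute: 6.6 pp, relative: 15.6%"
```

The 6.6-point figure is the absolute difference; relative to the 42.3% baseline it corresponds to roughly a 15.6% improvement.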

What carries the argument

The failure-driven self-improvement loop that turns failed trajectories into inference-time code patches via LLM diagnosis.
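
One round of such a loop can be sketched as follows. This is a hedged illustration of the control flow described on this page, not the authors' implementation: every name (`Trajectory`, `Patch`, `Agent`, `failure_loop_round`, and the three callables) is hypothetical, and treating a non-positive reward as failure is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task_id: str
    steps: List[str]
    reward: float  # score from the environment's evaluator


@dataclass
class Patch:
    diagnosis: str  # LLM's description of the failure mode
    code: str       # proposed inference-time code change


@dataclass
class Agent:
    patches: List[Patch] = field(default_factory=list)


def failure_loop_round(
    agent: Agent,
    rollout: Callable[[Agent], List[Trajectory]],
    propose_patches: Callable[[List[Trajectory]], List[Patch]],
    human_approves: Callable[[Patch], bool],
) -> List[Patch]:
    """Run one round: roll out, keep failures, diagnose, verify, apply."""
    trajectories = rollout(agent)
    # ASSUMPTION: a reward of zero or less marks a failed trajectory.
    failures = [t for t in trajectories if t.reward <= 0.0]
    accepted: List[Patch] = []
    for patch in propose_patches(failures):  # stands in for the LLM step
        if human_approves(patch):            # light human verification
            agent.patches.append(patch)
            accepted.append(patch)
    return accepted
```

Keeping the LLM call and the human check as injected callables makes it straightforward to log how many patches were proposed versus accepted in each round.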

If this is right

Failed trajectories can be converted into effective agent upgrades at inference time.
Performance gains occur without collecting new successful data or retraining.
The approach requires only modest additional computation during inference.
Light human verification ensures patch quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method could be combined with success-based self-improvement for faster gains.
It may apply to other multimodal agent tasks if failure diagnosis generalizes well.

Load-bearing premise

An LLM can reliably diagnose why the agent failed and create code patches that fix the issues in a general way.

What would settle it

Apply the generated patches to the agent and re-measure the success rate on OSWorld; a result showing no gain, or a loss, would disprove the claim.

Figures

Figures reproduced from arXiv: 2606.31270 by Ludwig Schmidt, Serena Yeung-Levy, Xiaohan Wang, Xueqiao Sun, Yuhui Zhang.

**Figure 1.** Illustration of self-improving loops. While prior work focuses on fine-tuning the agent with collected successful trajectories, we explore the failure-case loop, which makes use of the large number of failure cases through LLM analysis and self-improvement.

**Figure 2.** Overview of our failure-case loop, a self-evolving framework. In each round, failed trajectories are collected through agent rollout; a large language model (LLM) then acts as a meta-controller that diagnoses failure modes, proposes inference-time solutions, and applies code patches. The recipe for OSWorld is generated over multiple rounds of this failure-case loop.

**Figure 3.** Case studies of the visual search and terminal execution strategies. In both cases, our framework guides the agent to correctly improve its behavior.

**Figure 4.** Case studies of the knowledge support strategy. In both cases, our framework guides the agent to correctly improve its behavior.

**Figure 5.** Repetition-warning strategy. The agent initially gets stuck at a screen state; after three rounds of unproductive attempts, recovery mode is triggered, prompting the agent to adopt an alternative strategy and succeed.
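
The repetition-warning idea in this caption can be illustrated with a small monitor that counts consecutive steps leaving the screen unchanged. This is a sketch under stated assumptions, not the paper's patch: the class name `RepetitionMonitor` is hypothetical, "unproductive" is approximated as a byte-identical screenshot, and the threshold of three is taken from the caption.

```python
import hashlib
from typing import Optional


class RepetitionMonitor:
    """Flags recovery mode after `threshold` consecutive unchanged screens."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last_digest: Optional[str] = None
        self._repeats = 0

    def observe(self, screenshot: bytes) -> bool:
        """Record the screen seen after an action; return True to trigger recovery."""
        # ASSUMPTION: exact byte equality stands in for "same screen state".
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1      # the last action changed nothing
        else:
            self._last_digest = digest
            self._repeats = 0       # progress was made, so reset the counter
        return self._repeats >= self.threshold
```

In practice a live desktop rarely produces byte-identical screenshots (clocks and cursors change), so a perceptual or region-based comparison would likely be needed; exact hashing is used here only to keep the sketch short.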

**Figure 6.** Failure-mode distribution. The agent’s initial failures mainly consist of grounding errors, redundant action loops, and a lack of recovery; our self-evolving framework helps the agent overcome these deficiencies. The remaining failure modes span task comprehension, reasoning consistency, and visual understanding, reflecting higher-level cognitive aspects of agent ability. (Failure-mode proportions are app…

**Figure 1.** Prompt for verification process.

**Figure 2.** Prompt for Search Engine.

Original abstract

Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these agents is collecting large-scale, high-quality trajectories. The standard approach generates synthetic data through a self-improving loop: an agent is placed in a verifiable environment and iteratively fine-tuned on its successful trajectories. Despite its effectiveness, this paradigm exploits only successful trajectories and discards the failed ones, even though failures carry rich information about a model's weaknesses. In this work, we explore a complementary failure-driven self-improvement loop, a data-centric paradigm that turns failed trajectories into agent improvements. Specifically, we employ an LLM to diagnose failure modes, propose inference-time solutions, and generate code patches -- lightly verified by humans -- that upgrade the agent. We validate this approach with the state-of-the-art OpenCUA-72B model on the OSWorld benchmark, improving the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points, without any additional training cost and with only modest inference overhead. Our results demonstrate that failure-driven self-improvement is a viable complement to success-based pipelines, enabling more efficient agent improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 6.6pp OSWorld gain from LLM-diagnosed failure patches is a practical idea but rests on unquantified human verification with no ablations shown.

The main thing to know is that the authors turn failed agent trajectories into inference-time code patches. An LLM diagnoses the failure, proposes a fix, and generates a patch that gets light human verification; they apply this to OpenCUA-72B on OSWorld and report a lift from 42.3% to 48.9% success with no retraining and modest extra inference cost.

What is new is the explicit failure-driven loop as a complement to the usual success-only self-improvement pipelines. Most prior work discards failures; here they are turned into patches that upgrade the agent directly.

The paper does a clean job framing the problem and showing a concrete, low-cost gain on a known benchmark. The idea is straightforward and the result is reported without overclaiming training improvements.

The soft spot is exactly the one the stress-test flags. The abstract says patches are “lightly verified by humans” but supplies no counts of patches generated versus accepted, no description of the verification protocol, and no ablation that isolates the patches from extra LLM calls or prompt changes. If the human edits are doing real work or if the patches overfit to the observed failures, the claim of scalable LLM-driven self-improvement does not fully hold. The central empirical result therefore has limited evidential support until those details appear.

This is for researchers building computer-use agents who want to extract more performance from existing models without new training runs. A reader focused on practical agent improvement methods would find it worth reading.

It deserves a serious referee because the benchmark result is concrete and the framing is distinct from cited success-based work, even though the methods section will need expansion.

Referee Report

2 major / 1 minor

Summary. The paper proposes a failure-driven self-improvement paradigm for computer-use agents that uses an LLM to diagnose failure modes in trajectories, propose inference-time solutions, and generate code patches (lightly verified by humans) to upgrade the agent without additional training. It validates the approach on the OpenCUA-72B model using the OSWorld benchmark, reporting an improvement in success rate from 42.3% to 48.9%.

Significance. If the central result holds, the work demonstrates a practical complement to success-only self-improvement loops by extracting value from discarded failure trajectories at inference time, with the reported 6.6pp gain on a known benchmark and absence of training cost as concrete strengths.

major comments (2)

[Abstract] The reported 6.6pp gain is presented without any quantification of patches generated versus accepted, the human verification protocol (e.g., what constitutes 'light' verification), or an ablation isolating the patches from other inference-time factors such as extra LLM calls or prompt changes; this directly undermines attribution of the improvement to the failure-driven method.
[Results] Validation on OSWorld: no statistical significance, variance estimates, or controls are described for the 42.3% to 48.9% comparison, leaving open whether the gain exceeds baseline variability or depends on the specific set of inspected failures.
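
To make the significance concern concrete, the following sketch runs a plain two-proportion z-test on the two reported success rates. It is a rough illustration only: the task count is not stated on this page, so `N_TASKS = 369` is an assumption, and the function name `two_proportion_z` is illustrative.

```python
import math
from typing import Tuple


def two_proportion_z(p_before: float, p_after: float,
                     n_before: int, n_after: int) -> Tuple[float, float]:
    """Unpaired two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p_before * n_before + p_after * n_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_before + 1.0 / n_after))
    z = (p_after - p_before) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of N(0, 1)
    return z, p_value


# ASSUMPTION: 369 tasks per run; the page does not state the task count.
N_TASKS = 369
z, p_value = two_proportion_z(0.423, 0.489, N_TASKS, N_TASKS)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under that assumed task count the statistic comes out near 1.8, which would not clear a conventional 5% two-sided threshold. Because both runs presumably cover the same tasks, a paired analysis (for example McNemar's test on per-task outcomes) would be the more appropriate check if those outcomes were available.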

minor comments (1)

[Abstract] The abstract could more explicitly state the number of trajectories or tasks involved in the failure analysis to contextualize the scale of the improvement.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental attribution and statistical robustness. We address each major comment below and will revise the manuscript accordingly where appropriate.

Referee: [Abstract] The reported 6.6pp gain is presented without any quantification of patches generated versus accepted, the human verification protocol (e.g., what constitutes 'light' verification), or an ablation isolating the patches from other inference-time factors such as extra LLM calls or prompt changes; this directly undermines attribution of the improvement to the failure-driven method.

Authors: We agree that the abstract omits these details, which are necessary for clear attribution. In the revision we will expand the abstract to report the number of patches generated versus accepted, provide a precise description of the light human verification protocol (including criteria for acceptance), and explicitly reference the ablation study in the results section that isolates the contribution of the generated code patches from additional LLM calls and prompt modifications.

Revision: yes

Referee: [Results] Validation on OSWorld: no statistical significance, variance estimates, or controls are described for the 42.3% to 48.9% comparison, leaving open whether the gain exceeds baseline variability or depends on the specific set of inspected failures.

Authors: We acknowledge that the current results section lacks statistical analysis. We will add variance estimates (e.g., standard deviation across multiple evaluation seeds where feasible) and controls to assess whether the observed gain exceeds baseline variability. We will also clarify that the inspected failures are drawn from the full OSWorld test set rather than a curated subset. Due to the high computational cost of repeated full-benchmark runs, the added analysis will be partial but sufficient to address the concern.

Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical benchmark result only

The paper reports an empirical improvement (42.3% to 48.9% on OSWorld) via LLM-generated patches from failed trajectories, with light human verification. No equations, derivations, fitted parameters presented as predictions, or self-citation chains for uniqueness theorems appear in the abstract or described method. The central claim is a direct benchmark measurement, independent of any self-referential construction or renaming of known results. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes LLM diagnostic capability.

pith-pipeline@v0.9.1-grok · 5775 in / 1150 out tokens · 40197 ms · 2026-07-01T06:14:43.645092+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 30 canonical work pages · 13 internal anchors

[1]

arXiv preprint arXiv:2410.08164 (2024)

Agashe, S., Han, J., Gan, S., Yang, J., Li, A., Wang, X.E.: Agent s: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164 (2024)

work page arXiv 2024
[2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agashe, S., Wong, K., Tu, V., Yang, J., Li, A., Wang, X.E.: Agent s2: A com- positional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Technical report, Anthropic (2025), https://www.anthropic.com/news/claude-3-7-sonnet, system Card

Anthropic: Claude 3.7 sonnet and claude code. Technical report, Anthropic (2025), https://www.anthropic.com/news/claude-3-7-sonnet, system Card

2025
[4]

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2web: Towards a generalist agent for the web (2023),https://arxiv.org/ abs/2306.06070

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Grounding Computer Use Agents on Human Demonstrations

Feizi, A.: Grounding computer use agents on human demonstrations. arXiv 2511.07332(Nov 2025).https://doi.org/10.48550/arXiv.2511.07332,https: //arxiv.org/abs/2511.07332, v1

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.07332 2025
[6]

He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., Yu, D.: Webvoyager: Building an end-to-end web agent with large multimodal models (2024),https: //arxiv.org/abs/2401.13919

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

arXiv preprint arXiv:2505.13909 (2025).https://doi.org/10.48550/arXiv.2505.13909, arXiv:2505.13909v1

He, Y.: Efficient agent training for computer use. arXiv preprint arXiv:2505.13909 (2025).https://doi.org/10.48550/arXiv.2505.13909, arXiv:2505.13909v1

work page doi:10.48550/arxiv.2505.13909 2025
[8]

Zaletel, and Joel E

Hu, X.: Os agents: A survey on mllm-based agents for general computing devices use. arXiv preprint arXiv:2508.04482 (2025),https://doi.org/10.48550/arXiv. 2508.04482, accepted by ACL 2025 (Oral)

work page internal anchor Pith review doi:10.48550/arxiv 2025
[9]

In: Che, W., Nabende, J., Shutova, E., Pilehvar,M.T.(eds.)Proceedingsofthe63rdAnnualMeetingoftheAssociationfor Computational Linguistics (Volume 1: Long Papers)

Hu, X., Xiong, T., Yi, B., Wei, Z., Xiao, R., Chen, Y., Ye, J., Tao, M., Zhou, X., Zhao, Z., Li, Y., Xu, S., Wang, S., Xu, X., Qiao, S., Wang, Z., Kuang, K., Zeng, T., Wang, L., Li, J., Jiang, Y.E., Zhou, W., Wang, G., Yin, K., Zhao, Z., Yang, H., Wu, F., Zhang, S., Wu, F.: OS agents: A survey on MLLM-based agents for computer, phone and browser use. In: ...

2025
[10]

arXiv (Aug 2025),https://arxiv.org/abs/2508.04037, arXiv:2508.04037 [cs.AI]

Huo, Y.: Sea: Self-evolution agent with step-wise reward for computer use. arXiv (Aug 2025),https://arxiv.org/abs/2508.04037, arXiv:2508.04037 [cs.AI]

work page arXiv 2025
[11]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Juan, X.: A survey of self-evolving agents: On path to artificial super intelligence. arXiv2507.21046(2025).https://doi.org/10.48550/arXiv.2507.21046, v3

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21046 2025
[12]

Kapoor, R., Butala, Y.P., Russak, M., Koh, J.Y., Kamble, K., Alshikh, W., Salakhutdinov, R.: Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web (2024),https://arxiv.org/ abs/2402.17553

work page arXiv 2024
[13]

Li, K., Meng, Z., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., Chua, T.S.: Screenspot-pro: Gui grounding for professional high-resolution computer use (2025),https://arxiv.org/abs/2504.07981

work page arXiv 2025
[14]

org/abs/2406.03679

Li, W., Bishop, W., Li, A., Rawles, C., Campbell-Ajala, F., Tyamagundlu, D., Riva, O.: On the effects of data scale on ui control agents (2024),https://arxiv. org/abs/2406.03679

work page arXiv 2024
[15]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Liu, Y., Li, P., Xie, C., Hu, X., Han, X., Zhang, S., Yang, H., Wu, F.: Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025) Inference-Time Self-Improvement for Computer-Use Agents 17

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Technical report, OpenAI (2025), https://cdn.openai.com/pdf/2221c875- 02dc- 4789- 800b- e7758f3722c1/o3- and-o4-mini-system-card.pdf, system Card

OpenAI: Openai o3 and o4-mini system card. Technical report, OpenAI (2025), https://cdn.openai.com/pdf/2221c875- 02dc- 4789- 800b- e7758f3722c1/o3- and-o4-mini-system-card.pdf, system Card

2025
[18]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

Song, L.: Coact-1: Computer-using agents with coding as actions. arXiv 2508.03923(2025).https://doi.org/10.48550/arXiv.2508.03923, v2

work page doi:10.48550/arxiv.2508.03923 2025
[20]

arXiv2508(2025),https : / / arxiv

Sun, Z.: Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv2508(2025),https : / / arxiv . org / abs / 2508 . 04700, arXiv:2508.04700 [cs.AI]

work page arXiv 2025
[21]

Wang, B., Li, G., Zhou, X., Chen, Z., Grossman, T., Li, Y.: Screen2words: Automatic mobile ui summarization with multimodal learning (2021),https: //arxiv.org/abs/2108.03353

work page arXiv 2021
[22]

Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C.H., Shen, Z., Li, Z., Li, R., Li, X., Chen, J., Zheng, B., Li, P., Lei, F., Cao, R., Fu, Y., Shin, D., Shin, M., Hu, J., Wang, Y., Chen, J., Ye, Y., Zhang, D., Du, D., Hu, H., Chen, H., Zhou, Z., Yao, H., Chen, Z., Gu, Q., Wang, Y., Wang, H., Yang, D., Zhong, V., Sun...

work page arXiv 2025
[23]

In: The 34th Annual ACM Symposium on User Interface Software and Technology

Wu, J., Zhang, X., Nichols, J., Bigham, J.P.: Screen parsing: Towards reverse engineering of ui models from screenshots. In: The 34th Annual ACM Symposium on User Interface Software and Technology. p. 470–483. UIST ’21, ACM (Oct 2021).https://doi.org/10.1145/3472749.3474763,http://dx.doi.org/10. 1145/3472749.3474763

work page doi:10.1145/3472749.3474763 2021
[24]

Wu, Z.: See, think, act: Teaching multimodal agents to effectively interact with gui by identifying toggles (09 2025).https://doi.org/10.48550/arXiv.2509.13615, https://arxiv.org/abs/2509.13615

work page doi:10.48550/arxiv.2509.13615 2025
[25]

arXiv preprint arXiv:2505.13227 (2025)

Xie, T., Deng, J., Li, X., Yang, J., Wu, H., Chen, J., Hu, W., Wang, X., Xu, Y., Wang, Z., et al.: Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227 (2025)

work page arXiv 2025
[26]

Advances in Neural Information Processing Systems37, 52040–52094 (2024)

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., et al.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems37, 52040–52094 (2024)

2024
[27]

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., Yu, T.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2024),https://arxiv.org/abs/2404.07972

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., Xiong, C.: Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Aria-ui: Visual grounding for gui instruc- tions.arXiv preprint arXiv:2412.16256, 2024

Yang, Y., Wang, Y., Li, D., Luo, Z., Chen, B., Huang, C., Li, J.: Aria-ui: Visual grounding for gui instructions. arXiv preprint arXiv:2412.16256 (2024) 18 Sun et al

work page arXiv 2024
[30]

A Survey on Agentic Multimodal Large Language Models,

Yao,H.:Asurveyonagenticmultimodallargelanguagemodels.arXiv2510.10991 (Oct 2025).https://doi.org/10.48550/arXiv.2510.10991,https://arxiv. org/abs/2510.10991

work page doi:10.48550/arxiv.2510.10991 2025
[31]

Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., Liao, J., Zheng, Q., Huang, F., Zhou, J., Yan, M.: Mobile-agent-v3: Fundamental agents for gui automation (2025),https://arxiv.org/abs/2508. 15144

2025
[32]

arXiv2510.19949(2025).https : / / doi

Yuan, K.: Surfer 2: The next generation of cross-platform computer use agents. arXiv2510.19949(2025).https : / / doi . org / 10 . 48550 / arXiv . 2510 . 19949, https://arxiv.org/abs/2510.19949, v2

work page arXiv 2025
[34]

Large Language Model-Brained GUI Agents: A Survey

Zhang, C.: Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279 (May 2025).https://doi.org/10.48550/arXiv.2411.18279, https://arxiv.org/abs/2411.18279

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2411.18279 2025
[35]

Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v(ision) is a generalist web agent, if grounded (2024),https://arxiv.org/abs/2401.01614

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

arXiv (May 2025).https://doi.org/10.48550/arXiv.2505

Zhou, A.: Ui-genie: A self-improving approach for iteratively boosting mllm-based mobile gui agents. arXiv (May 2025).https://doi.org/10.48550/arXiv.2505. 21496,https://arxiv.org/abs/2505.21496, version 1

work page doi:10.48550/arxiv.2505 2025
[37]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., Neubig, G.: Webarena: A realistic web environment for building autonomous agents (2024),https://arxiv.org/abs/2307.13854 Supplementary Materials for Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents Xueqiao Sun1,2, Xia...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

**Analyze the Goal:** First, state what the user wants to accomplish in this ﬁnal step
[39]

What is its function?

**Analyze the Visuals:** Identify the UI element located inside the red circle. What is its function?
[40]

**Consider History:** Review your previous actions to understand the task progression and ensure this action aligns with the overall goal
[41]

ctrl", "alt

**Synthesize and Decide:** Based on the goal, visual evidence, and action history, was this the correct element to click? If not, what element *should* have been clicked, and where is it? ## Action Space con ﬁrm_coordinates() # Conﬁrm that the coordinates are correct adjust_coordinates(coordinate=[x, y]) # Provide corrected coordinates if needed ## Note -...

[1] [1]

arXiv preprint arXiv:2410.08164 (2024)

Agashe, S., Han, J., Gan, S., Yang, J., Li, A., Wang, X.E.: Agent s: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164 (2024)

work page arXiv 2024

[2] [2]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agashe, S., Wong, K., Tu, V., Yang, J., Li, A., Wang, X.E.: Agent s2: A com- positional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Technical report, Anthropic (2025), https://www.anthropic.com/news/claude-3-7-sonnet, system Card

Anthropic: Claude 3.7 sonnet and claude code. Technical report, Anthropic (2025), https://www.anthropic.com/news/claude-3-7-sonnet, system Card

2025

[4] [4]

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2web: Towards a generalist agent for the web (2023),https://arxiv.org/ abs/2306.06070

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Grounding Computer Use Agents on Human Demonstrations

Feizi, A.: Grounding computer use agents on human demonstrations. arXiv 2511.07332(Nov 2025).https://doi.org/10.48550/arXiv.2511.07332,https: //arxiv.org/abs/2511.07332, v1

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.07332 2025

[6] [6]

He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., Yu, D.: Webvoyager: Building an end-to-end web agent with large multimodal models (2024),https: //arxiv.org/abs/2401.13919

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

arXiv preprint arXiv:2505.13909 (2025).https://doi.org/10.48550/arXiv.2505.13909, arXiv:2505.13909v1

He, Y.: Efficient agent training for computer use. arXiv preprint arXiv:2505.13909 (2025).https://doi.org/10.48550/arXiv.2505.13909, arXiv:2505.13909v1

work page doi:10.48550/arxiv.2505.13909 2025

[8] [8]

Zaletel, and Joel E

Hu, X.: Os agents: A survey on mllm-based agents for general computing devices use. arXiv preprint arXiv:2508.04482 (2025),https://doi.org/10.48550/arXiv. 2508.04482, accepted by ACL 2025 (Oral)

work page internal anchor Pith review doi:10.48550/arxiv 2025

[9] [9]

In: Che, W., Nabende, J., Shutova, E., Pilehvar,M.T.(eds.)Proceedingsofthe63rdAnnualMeetingoftheAssociationfor Computational Linguistics (Volume 1: Long Papers)

Hu, X., Xiong, T., Yi, B., Wei, Z., Xiao, R., Chen, Y., Ye, J., Tao, M., Zhou, X., Zhao, Z., Li, Y., Xu, S., Wang, S., Xu, X., Qiao, S., Wang, Z., Kuang, K., Zeng, T., Wang, L., Li, J., Jiang, Y.E., Zhou, W., Wang, G., Yin, K., Zhao, Z., Yang, H., Wu, F., Zhang, S., Wu, F.: OS agents: A survey on MLLM-based agents for computer, phone and browser use. In: ...

2025

[10] [10]

arXiv (Aug 2025),https://arxiv.org/abs/2508.04037, arXiv:2508.04037 [cs.AI]

Huo, Y.: Sea: Self-evolution agent with step-wise reward for computer use. arXiv (Aug 2025),https://arxiv.org/abs/2508.04037, arXiv:2508.04037 [cs.AI]

work page arXiv 2025

[11] [11]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Juan, X.: A survey of self-evolving agents: On path to artificial super intelligence. arXiv2507.21046(2025).https://doi.org/10.48550/arXiv.2507.21046, v3

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21046 2025

[12] [12]

Kapoor, R., Butala, Y.P., Russak, M., Koh, J.Y., Kamble, K., Alshikh, W., Salakhutdinov, R.: Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web (2024),https://arxiv.org/ abs/2402.17553

work page arXiv 2024

[13] [13]

Li, K., Meng, Z., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., Chua, T.S.: Screenspot-pro: Gui grounding for professional high-resolution computer use (2025),https://arxiv.org/abs/2504.07981

work page arXiv 2025

[14] [14]

org/abs/2406.03679

Li, W., Bishop, W., Li, A., Rawles, C., Campbell-Ajala, F., Tyamagundlu, D., Riva, O.: On the effects of data scale on ui control agents (2024),https://arxiv. org/abs/2406.03679

work page arXiv 2024

[15] [15]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Liu, Y., Li, P., Xie, C., Hu, X., Han, X., Zhang, S., Yang, H., Wu, F.: Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025) Inference-Time Self-Improvement for Computer-Use Agents 17

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Technical report, OpenAI (2025), https://cdn.openai.com/pdf/2221c875- 02dc- 4789- 800b- e7758f3722c1/o3- and-o4-mini-system-card.pdf, system Card

OpenAI: Openai o3 and o4-mini system card. Technical report, OpenAI (2025), https://cdn.openai.com/pdf/2221c875- 02dc- 4789- 800b- e7758f3722c1/o3- and-o4-mini-system-card.pdf, system Card

2025

[17] [18]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [19]

Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

Song, L.: Coact-1: Computer-using agents with coding as actions. arXiv 2508.03923(2025).https://doi.org/10.48550/arXiv.2508.03923, v2

[19] [20]

Sun, Z.: Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv 2508 (2025), https://arxiv.org/abs/2508.04700, arXiv:2508.04700 [cs.AI]

[20] [21]

Wang, B., Li, G., Zhou, X., Chen, Z., Grossman, T., Li, Y.: Screen2words: Automatic mobile ui summarization with multimodal learning (2021), https://arxiv.org/abs/2108.03353

[21] [22]

Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C.H., Shen, Z., Li, Z., Li, R., Li, X., Chen, J., Zheng, B., Li, P., Lei, F., Cao, R., Fu, Y., Shin, D., Shin, M., Hu, J., Wang, Y., Chen, J., Ye, Y., Zhang, D., Du, D., Hu, H., Chen, H., Zhou, Z., Yao, H., Chen, Z., Gu, Q., Wang, Y., Wang, H., Yang, D., Zhong, V., Sun...

[22] [23]

Wu, J., Zhang, X., Nichols, J., Bigham, J.P.: Screen parsing: Towards reverse engineering of ui models from screenshots. In: The 34th Annual ACM Symposium on User Interface Software and Technology. pp. 470–483. UIST ’21, ACM (Oct 2021). https://doi.org/10.1145/3472749.3474763, http://dx.doi.org/10.1145/3472749.3474763

[23] [24]

Wu, Z.: See, think, act: Teaching multimodal agents to effectively interact with gui by identifying toggles (Sep 2025). https://doi.org/10.48550/arXiv.2509.13615, https://arxiv.org/abs/2509.13615

[24] [25]

Xie, T., Deng, J., Li, X., Yang, J., Wu, H., Chen, J., Hu, W., Wang, X., Xu, Y., Wang, Z., et al.: Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227 (2025)

[25] [26]

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., et al.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 37, 52040–52094 (2024)

[26] [27]

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., Yu, T.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2024), https://arxiv.org/abs/2404.07972

[27] [28]

Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., Xiong, C.: Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454 (2024)

[28] [29]

Yang, Y., Wang, Y., Li, D., Luo, Z., Chen, B., Huang, C., Li, J.: Aria-ui: Visual grounding for gui instructions. arXiv preprint arXiv:2412.16256 (2024)

[29] [30]

Yao, H.: A survey on agentic multimodal large language models. arXiv 2510.10991 (Oct 2025). https://doi.org/10.48550/arXiv.2510.10991, https://arxiv.org/abs/2510.10991

[30] [31]

Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., Liao, J., Zheng, Q., Huang, F., Zhou, J., Yan, M.: Mobile-agent-v3: Fundamental agents for gui automation (2025), https://arxiv.org/abs/2508.15144

[31] [32]

Yuan, K.: Surfer 2: The next generation of cross-platform computer use agents. arXiv 2510.19949 (2025). https://doi.org/10.48550/arXiv.2510.19949, https://arxiv.org/abs/2510.19949, v2

[32] [34]

Zhang, C.: Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279 (May 2025). https://doi.org/10.48550/arXiv.2411.18279, https://arxiv.org/abs/2411.18279

[33] [35]

Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v(ision) is a generalist web agent, if grounded (2024), https://arxiv.org/abs/2401.01614

[34] [36]

Zhou, A.: Ui-genie: A self-improving approach for iteratively boosting mllm-based mobile gui agents. arXiv (May 2025). https://doi.org/10.48550/arXiv.2505.21496, https://arxiv.org/abs/2505.21496, version 1

[35] [37]

Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., Neubig, G.: Webarena: A realistic web environment for building autonomous agents (2024), https://arxiv.org/abs/2307.13854

Supplementary Materials for Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

Xueqiao Sun1,2, Xia...

**Analyze the Goal:** First, state what the user wants to accomplish in this final step.

**Analyze the Visuals:** Identify the UI element located inside the red circle. What is its function?

**Consider History:** Review your previous actions to understand the task progression and ensure this action aligns with the overall goal.

**Synthesize and Decide:** Based on the goal, visual evidence, and action history, was this the correct element to click? If not, what element *should* have been clicked, and where is it?

## Action Space
confirm_coordinates() # Confirm that the coordinates are correct
adjust_coordinates(coordinate=[x, y]) # Provide corrected coordinates if needed

## Note -...