OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
Pith reviewed 2026-05-08 03:45 UTC · model grok-4.3
The pith
OS-SPEAR introduces four specialized subsets to benchmark OS agents on safety, performance, efficiency, and robustness, exposing key trade-offs in existing systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OS-SPEAR comprises an S-subset for diverse hazards, a P-subset for trajectory evaluation via value estimation and sampling, an E-subset for temporal and token-based efficiency, and an R-subset applying cross-modal disturbances, plus an automated analysis tool. When run on 22 OS agents, the toolkit demonstrates a prevalent trade-off between efficiency and safety or robustness, performance advantages for specialized agents over general-purpose models, and varying robustness issues across input modalities.
What carries the argument
The OS-SPEAR toolkit with its four subsets (Safety for hazards, Performance via stratified trajectory sampling, Efficiency via latency and token consumption, Robustness via cross-modal disturbances) and automated diagnostic report generator.
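A minimal sketch of what "trajectory value estimation and stratified sampling" could look like in practice; the value function, the strata, and the `Trajectory` fields here are illustrative assumptions, since the paper's summary does not specify its estimator:

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    steps: int
    success: bool

def value_estimate(traj: Trajectory) -> float:
    # Hypothetical stand-in for the paper's undisclosed value estimator:
    # reward task success, lightly penalize trajectory length.
    return (1.0 if traj.success else 0.0) - 0.01 * traj.steps

def stratified_sample(trajectories, n_strata=4, n_per_stratum=50, seed=0):
    """Bin trajectories into value strata, then sample uniformly per bin."""
    rng = random.Random(seed)
    scored = sorted(trajectories, key=value_estimate)
    size = max(1, len(scored) // n_strata)
    picked = []
    for i in range(n_strata):
        # The last stratum absorbs the remainder so no trajectory is dropped.
        hi = len(scored) if i == n_strata - 1 else (i + 1) * size
        stratum = scored[i * size:hi]
        picked.extend(rng.sample(stratum, min(n_per_stratum, len(stratum))))
    return picked
```

Whatever the actual estimator, the referee's concern below applies: if `value_estimate` is biased, the strata inherit that bias, and so does any ranking computed on the sampled subset.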
If this is right
- Specialized OS agents outperform general-purpose models on performance metrics.
- Current agents exhibit a trade-off where gains in efficiency reduce safety or robustness.
- Robustness vulnerabilities differ between visual and textual input modalities.
- The toolkit supplies a standardized multidimensional ranking for comparing and improving agents.
Where Pith is reading between the lines
- If the subsets capture representative scenarios, agent developers could use the toolkit to balance efficiency against safety rather than optimizing one at the expense of others.
- Adding dynamic or multi-step disturbances to the robustness subset could expose further weaknesses not visible in the current cross-modal tests.
- Feeding the automated reports back into agent training might reduce the documented trade-offs in future designs.
Load-bearing premise
The four proposed subsets sufficiently represent real-world hazards, trajectories, and disturbances for OS agents without significant coverage gaps or labeling noise.
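For concreteness, a minimal sketch of the kind of cross-modal disturbance the R-subset description implies, with one visual and one textual perturbation; the specific choices here (Gaussian blur, typo noise) are assumptions for illustration and may differ from the paper's actual taxonomy:

```python
import random
from PIL import Image, ImageFilter

def perturb_screenshot(img: Image.Image, blur_radius: float = 2.0) -> Image.Image:
    """Visual disturbance: Gaussian blur, degrading icon and label legibility."""
    return img.filter(ImageFilter.GaussianBlur(blur_radius))

def perturb_instruction(text: str, swap_rate: float = 0.05, seed: int = 0) -> str:
    """Textual disturbance: random adjacent-character swaps (typo noise)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```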
What would settle it
A new OS agent that scores highly across all four subsets while avoiding the efficiency-safety and efficiency-robustness trade-offs observed in the 22 evaluated agents would show that the reported patterns are not universal.
read the original abstract
The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OS-SPEAR, a toolkit and benchmark for evaluating OS agents across four dimensions: Safety (S-subset with environment- and human-induced hazards), Performance (P-subset via trajectory value estimation and stratified sampling), Efficiency (E-subset with temporal latency and token metrics), and Robustness (R-subset with cross-modal disturbances). It evaluates 22 agents, reports a trade-off between efficiency and safety/robustness, superiority of specialized agents over general-purpose models, and modality-specific robustness vulnerabilities, while providing an automated diagnostic tool and open dataset/code.
Significance. If the subset curation and evaluations prove sound, OS-SPEAR would offer a valuable standardized, multidimensional framework that addresses gaps in existing OS agent benchmarks (narrow scenarios, noisy labels, limited metrics). The open-source release and automated reports could facilitate reproducible research and development of more reliable agents; the empirical insights on trade-offs and agent types would be useful if externally validated.
major comments (3)
- [Abstract / P-subset description] The P-subset construction (abstract and methods) relies on 'trajectory value estimation and stratified sampling' without detailing the estimation algorithm, feature set used for value scoring, sampling strata, or any validation (e.g., inter-annotator agreement or external grounding). This is load-bearing for the performance superiority and efficiency-safety trade-off claims, as biased sampling could artifactually produce the reported rankings and correlations.
- [Abstract / subset definitions] The S-subset and R-subset taxonomies for hazards and cross-modal disturbances are introduced without evidence of exhaustiveness, coverage analysis, or inter-rater validation; gaps here directly weaken the claims of 'varying robustness vulnerabilities across different modalities' and 'critical insights' into the landscape.
- [Evaluation results] No statistical significance tests, confidence intervals, or error bars are reported for the trade-off observations, agent rankings, or modality comparisons across the 22 agents, making it impossible to assess whether the 'prevalent trade-off' and 'superiority' findings exceed sampling noise.
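A minimal sketch of one analysis that would address the last point: a paired bootstrap confidence interval on the cross-agent correlation between efficiency and safety scores. The scores below are synthetic placeholders standing in for the 22 agents' results:

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Pearson correlation between
    per-agent efficiency and safety scores (paired resampling)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample agents with replacement
        stats[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return np.corrcoef(x, y)[0, 1], (lo, hi)

# Synthetic example with 22 agents: a negative correlation supports the
# claimed efficiency-safety trade-off only if the CI excludes zero.
rng = np.random.default_rng(1)
efficiency = rng.uniform(0, 1, 22)
safety = 1 - efficiency + rng.normal(0, 0.2, 22)
r, (lo, hi) = bootstrap_corr_ci(efficiency, safety)
print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```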
minor comments (2)
- [Abstract] The abstract states current benchmarks suffer from 'noisy trajectory labeling' yet provides no comparison of OS-SPEAR's own labeling process against that baseline.
- [Figures/Tables] Figure and table captions should explicitly state the number of trajectories per subset and the exact metrics used for the E-subset (latency vs. tokens) to improve clarity.
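As a reference point for what the E-subset captions should pin down, a minimal sketch of per-trajectory efficiency accounting, assuming each step logs wall-clock latency and prompt/completion token counts; the field names are hypothetical, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Step:
    latency_s: float        # wall-clock time for one agent action
    prompt_tokens: int      # tokens sent to the model
    completion_tokens: int  # tokens generated by the model

def efficiency_metrics(steps: list[Step]) -> dict:
    """Aggregate the two E-subset lenses, temporal latency and token
    consumption, over one trajectory (assumed non-empty)."""
    total_latency = sum(s.latency_s for s in steps)
    return {
        "total_latency_s": total_latency,
        "mean_step_latency_s": total_latency / len(steps),
        "total_tokens": sum(s.prompt_tokens + s.completion_tokens for s in steps),
        "steps": len(steps),
    }
```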
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve transparency and rigor where the original submission was lacking.
read point-by-point responses
-
Referee: [Abstract / P-subset description] The P-subset construction (abstract and methods) relies on 'trajectory value estimation and stratified sampling' without detailing the estimation algorithm, feature set used for value scoring, sampling strata, or any validation (e.g., inter-annotator agreement or external grounding). This is load-bearing for the performance superiority and efficiency-safety trade-off claims, as biased sampling could artifactually produce the reported rankings and correlations.
Authors: We acknowledge that the manuscript provides only a high-level description of the P-subset curation. The full methods section expands on the process but does not include the requested specifics on the value estimation algorithm, feature set, sampling strata, or validation metrics. In the revision, we will add a detailed subsection describing the trajectory value estimation procedure, the features used for scoring, the stratification criteria, and any internal validation steps performed during curation. This will allow readers to evaluate potential sampling biases directly. revision: yes
-
Referee: [Abstract / subset definitions] The S-subset and R-subset taxonomies for hazards and cross-modal disturbances are introduced without evidence of exhaustiveness, coverage analysis, or inter-rater validation; gaps here directly weaken the claims of 'varying robustness vulnerabilities across different modalities' and 'critical insights' into the landscape.
Authors: The taxonomies were derived from a synthesis of prior work on OS agent safety hazards and multimodal robustness challenges. However, the original submission does not include formal coverage analysis or inter-rater agreement statistics. We will revise the relevant sections to document the taxonomy construction process, including the literature sources consulted and qualitative steps taken to promote comprehensiveness. We will also explicitly note the absence of quantitative coverage metrics as a limitation and discuss how future extensions could address this. revision: partial
-
Referee: [Evaluation results] No statistical significance tests, confidence intervals, or error bars are reported for the trade-off observations, agent rankings, or modality comparisons across the 22 agents, making it impossible to assess whether the 'prevalent trade-off' and 'superiority' findings exceed sampling noise.
Authors: We agree that the lack of statistical tests and uncertainty measures weakens the strength of the empirical claims. In the revised manuscript, we will add appropriate statistical analyses, including significance tests for the efficiency-safety/robustness trade-offs and modality comparisons, as well as confidence intervals or error bars on key metrics and rankings. These additions will be placed in the results section and will use standard methods suitable for the evaluation setup. revision: yes
Circularity Check
No circularity: empirical benchmark with independent evaluation subsets
full rationale
The paper introduces OS-SPEAR as a toolkit with four subsets (S, P, E, R) for evaluating 22 OS agents and reports empirical observations such as efficiency-safety trade-offs. The P-subset curation via 'trajectory value estimation and stratified sampling' is described only at a high level with no equations, fitted parameters, or self-referential definitions that would make subsequent performance rankings or trade-off claims reduce to the curation inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed as novel derivations. All reported insights are direct measurements on the proposed subsets rather than tautological predictions. This is a standard empirical benchmark proposal with self-contained evaluation content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: OS agents interact with environments primarily through GUI actions that can be logged as trajectories.
invented entities (4)
- S-subset (safety hazards): no independent evidence
- P-subset (performance via value estimation): no independent evidence
- E-subset (efficiency metrics): no independent evidence
- R-subset (cross-modal robustness): no independent evidence