OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
Pith reviewed 2026-05-08 03:45 UTC · model grok-4.3
The pith
OS-SPEAR introduces four specialized subsets to benchmark OS agents on safety, performance, efficiency, and robustness, exposing key trade-offs in existing systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OS-SPEAR comprises an S-subset for diverse hazards, a P-subset for trajectory evaluation via value estimation and sampling, an E-subset for temporal and token-based efficiency, and an R-subset applying cross-modal disturbances, plus an automated analysis tool. When run on 22 OS agents, the toolkit demonstrates a prevalent trade-off between efficiency and safety or robustness, performance advantages for specialized agents over general-purpose models, and varying robustness issues across input modalities.
What carries the argument
The OS-SPEAR toolkit with its four subsets (Safety for hazards, Performance via stratified trajectory sampling, Efficiency via latency and token consumption, Robustness via cross-modal disturbances) and automated diagnostic report generator.
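A minimal sketch of what "trajectory value estimation and stratified sampling" could look like in practice; the value function, the strata, and the `Trajectory` fields here are illustrative assumptions, since the paper's summary does not specify its estimator:

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    steps: int
    success: bool

def value_estimate(traj: Trajectory) -> float:
    # Hypothetical stand-in for the paper's undisclosed value estimator:
    # reward task success, lightly penalize trajectory length.
    return (1.0 if traj.success else 0.0) - 0.01 * traj.steps

def stratified_sample(trajectories, n_strata=4, n_per_stratum=50, seed=0):
    """Bin trajectories into value strata, then sample uniformly per bin."""
    rng = random.Random(seed)
    scored = sorted(trajectories, key=value_estimate)
    size = max(1, len(scored) // n_strata)
    picked = []
    for i in range(n_strata):
        # The last stratum absorbs the remainder so no trajectory is dropped.
        hi = len(scored) if i == n_strata - 1 else (i + 1) * size
        stratum = scored[i * size:hi]
        picked.extend(rng.sample(stratum, min(n_per_stratum, len(stratum))))
    return picked
```

Whatever the actual estimator, the referee's concern below applies: if `value_estimate` is biased, the strata inherit that bias, and so does any ranking computed on the sampled subset.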
If this is right
- Specialized OS agents outperform general-purpose models on performance metrics.
- Current agents exhibit a trade-off where gains in efficiency reduce safety or robustness.
- Robustness vulnerabilities differ between visual and textual input modalities.
- The toolkit supplies a standardized multidimensional ranking for comparing and improving agents.
Where Pith is reading between the lines
- If the subsets capture representative scenarios, agent developers could use the toolkit to balance efficiency against safety rather than optimizing one at the expense of others.
- Adding dynamic or multi-step disturbances to the robustness subset could expose further weaknesses not visible in the current cross-modal tests.
- Feeding the automated reports back into agent training might reduce the documented trade-offs in future designs.
Load-bearing premise
The four proposed subsets sufficiently represent real-world hazards, trajectories, and disturbances for OS agents without significant coverage gaps or labeling noise.
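For concreteness, a minimal sketch of the kind of cross-modal disturbance the R-subset description implies, with one visual and one textual perturbation; the specific choices here (Gaussian blur, typo noise) are assumptions for illustration and may differ from the paper's actual taxonomy:

```python
import random
from PIL import Image, ImageFilter

def perturb_screenshot(img: Image.Image, blur_radius: float = 2.0) -> Image.Image:
    """Visual disturbance: Gaussian blur, degrading icon and label legibility."""
    return img.filter(ImageFilter.GaussianBlur(blur_radius))

def perturb_instruction(text: str, swap_rate: float = 0.05, seed: int = 0) -> str:
    """Textual disturbance: random adjacent-character swaps (typo noise)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```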
What would settle it
A new OS agent that scores highly across all four subsets while avoiding the efficiency-safety and efficiency-robustness trade-offs observed in the 22 evaluated agents would show that the reported patterns are not universal.
read the original abstract
The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OS-SPEAR, a toolkit and benchmark for evaluating OS agents across four dimensions: Safety (S-subset with environment- and human-induced hazards), Performance (P-subset via trajectory value estimation and stratified sampling), Efficiency (E-subset with temporal latency and token metrics), and Robustness (R-subset with cross-modal disturbances). It evaluates 22 agents, reports a trade-off between efficiency and safety/robustness, superiority of specialized agents over general-purpose models, and modality-specific robustness vulnerabilities, while providing an automated diagnostic tool and open dataset/code.
Significance. If the subset curation and evaluations prove sound, OS-SPEAR would offer a valuable standardized, multidimensional framework that addresses gaps in existing OS agent benchmarks (narrow scenarios, noisy labels, limited metrics). The open-source release and automated reports could facilitate reproducible research and development of more reliable agents; the empirical insights on trade-offs and agent types would be useful if externally validated.
major comments (3)
- [Abstract / P-subset description] The P-subset construction (abstract and methods) relies on 'trajectory value estimation and stratified sampling' without detailing the estimation algorithm, feature set used for value scoring, sampling strata, or any validation (e.g., inter-annotator agreement or external grounding). This is load-bearing for the performance superiority and efficiency-safety trade-off claims, as biased sampling could artifactually produce the reported rankings and correlations.
- [Abstract / subset definitions] The S-subset and R-subset taxonomies for hazards and cross-modal disturbances are introduced without evidence of exhaustiveness, coverage analysis, or inter-rater validation; gaps here directly weaken the claims of 'varying robustness vulnerabilities across different modalities' and 'critical insights' into the landscape.
- [Evaluation results] No statistical significance tests, confidence intervals, or error bars are reported for the trade-off observations, agent rankings, or modality comparisons across the 22 agents, making it impossible to assess whether the 'prevalent trade-off' and 'superiority' findings exceed sampling noise.
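A minimal sketch of one analysis that would address the last point: a paired bootstrap confidence interval on the cross-agent correlation between efficiency and safety scores. The scores below are synthetic placeholders standing in for the 22 agents' results:

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Pearson correlation between
    per-agent efficiency and safety scores (paired resampling)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample agents with replacement
        stats[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return np.corrcoef(x, y)[0, 1], (lo, hi)

# Synthetic example with 22 agents: a negative correlation supports the
# claimed efficiency-safety trade-off only if the CI excludes zero.
rng = np.random.default_rng(1)
efficiency = rng.uniform(0, 1, 22)
safety = 1 - efficiency + rng.normal(0, 0.2, 22)
r, (lo, hi) = bootstrap_corr_ci(efficiency, safety)
print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```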
minor comments (2)
- [Abstract] The abstract states current benchmarks suffer from 'noisy trajectory labeling' yet provides no comparison of OS-SPEAR's own labeling process against that baseline.
- [Figures/Tables] Figure and table captions should explicitly state the number of trajectories per subset and the exact metrics used for the E-subset (latency vs. tokens) to improve clarity.
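As a reference point for what the E-subset captions should pin down, a minimal sketch of per-trajectory efficiency accounting, assuming each step logs wall-clock latency and prompt/completion token counts; the field names are hypothetical, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Step:
    latency_s: float        # wall-clock time for one agent action
    prompt_tokens: int      # tokens sent to the model
    completion_tokens: int  # tokens generated by the model

def efficiency_metrics(steps: list[Step]) -> dict:
    """Aggregate the two E-subset lenses, temporal latency and token
    consumption, over one trajectory (assumed non-empty)."""
    total_latency = sum(s.latency_s for s in steps)
    return {
        "total_latency_s": total_latency,
        "mean_step_latency_s": total_latency / len(steps),
        "total_tokens": sum(s.prompt_tokens + s.completion_tokens for s in steps),
        "steps": len(steps),
    }
```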
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to improve transparency and rigor where the original submission was lacking.
read point-by-point responses
-
Referee: [Abstract / P-subset description] The P-subset construction (abstract and methods) relies on 'trajectory value estimation and stratified sampling' without detailing the estimation algorithm, feature set used for value scoring, sampling strata, or any validation (e.g., inter-annotator agreement or external grounding). This is load-bearing for the performance superiority and efficiency-safety trade-off claims, as biased sampling could artifactually produce the reported rankings and correlations.
Authors: We acknowledge that the manuscript provides only a high-level description of the P-subset curation. The full methods section expands on the process but does not include the requested specifics on the value estimation algorithm, feature set, sampling strata, or validation metrics. In the revision, we will add a detailed subsection describing the trajectory value estimation procedure, the features used for scoring, the stratification criteria, and any internal validation steps performed during curation. This will allow readers to evaluate potential sampling biases directly. revision: yes
-
Referee: [Abstract / subset definitions] The S-subset and R-subset taxonomies for hazards and cross-modal disturbances are introduced without evidence of exhaustiveness, coverage analysis, or inter-rater validation; gaps here directly weaken the claims of 'varying robustness vulnerabilities across different modalities' and 'critical insights' into the landscape.
Authors: The taxonomies were derived from a synthesis of prior work on OS agent safety hazards and multimodal robustness challenges. However, the original submission does not include formal coverage analysis or inter-rater agreement statistics. We will revise the relevant sections to document the taxonomy construction process, including the literature sources consulted and qualitative steps taken to promote comprehensiveness. We will also explicitly note the absence of quantitative coverage metrics as a limitation and discuss how future extensions could address this. revision: partial
-
Referee: [Evaluation results] No statistical significance tests, confidence intervals, or error bars are reported for the trade-off observations, agent rankings, or modality comparisons across the 22 agents, making it impossible to assess whether the 'prevalent trade-off' and 'superiority' findings exceed sampling noise.
Authors: We agree that the lack of statistical tests and uncertainty measures weakens the strength of the empirical claims. In the revised manuscript, we will add appropriate statistical analyses, including significance tests for the efficiency-safety/robustness trade-offs and modality comparisons, as well as confidence intervals or error bars on key metrics and rankings. These additions will be placed in the results section and will use standard methods suitable for the evaluation setup. revision: yes
Circularity Check
No circularity: empirical benchmark with independent evaluation subsets
full rationale
The paper introduces OS-SPEAR as a toolkit with four subsets (S, P, E, R) for evaluating 22 OS agents and reports empirical observations such as efficiency-safety trade-offs. The P-subset curation via 'trajectory value estimation and stratified sampling' is described only at a high level with no equations, fitted parameters, or self-referential definitions that would make subsequent performance rankings or trade-off claims reduce to the curation inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed as novel derivations. All reported insights are direct measurements on the proposed subsets rather than tautological predictions. This is a standard empirical benchmark proposal with self-contained evaluation content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: OS agents interact with environments primarily through GUI actions that can be logged as trajectories.
invented entities (4)
- S-subset (safety hazards): no independent evidence
- P-subset (performance via value estimation): no independent evidence
- E-subset (efficiency metrics): no independent evidence
- R-subset (cross-modal robustness): no independent evidence