A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Ada Chen; Jen-tse Huang; Jingyu Xiao; Junyuan Zhang; Kun Wang; Shuai Wang; Shu Yang; Wenxuan Wang; Yongjiang Wu

arXiv: 2505.10924 · v4 · submitted 2025-05-16 · cs.CL · cs.AI · cs.CR · cs.CV · cs.SE

Pith reviewed 2026-05-22 15:18 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CR · cs.CV · cs.SE

keywords computer-using agents · safety threats · security risks · LLM agents · defensive strategies · GUI automation · benchmarks · autonomous agents

The pith

Computer-using agents that operate interfaces introduce safety threats rooted in LLM reasoning and component integration, and addressing them calls for structured taxonomies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines computer-using agents as LLM-based systems that autonomously operate graphical interfaces, web pages, and mobile apps. It reviews the literature to categorize safety and security threats arising from vulnerabilities in reasoning and multimodal integrations. A taxonomy of defensive strategies is proposed along with summaries of benchmarks, datasets, and metrics for evaluation. This work supplies researchers a foundation for identifying new vulnerabilities and gives practitioners guidance on building secure agents.

Core claim

The authors systematize knowledge on CUA safety and security by defining the agents for analysis, categorizing threats, proposing a defense taxonomy, and compiling evaluation resources to support secure design and deployment.

What carries the argument

The four research objectives that structure the survey: defining CUAs for safety analysis, threat categorization, defense taxonomy, and benchmark summary.

If this is right

Researchers gain a map to locate and investigate previously unexamined vulnerabilities in CUAs.
Developers receive concrete guidance on integrating defenses when building or deploying these agents.
Evaluation practices can standardize around the collected benchmarks and metrics.
Integration risks from multiple components and multimodal inputs become easier to address systematically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same categorization approach might transfer to safety analysis of other autonomous agent types that act on digital environments.
Widespread adoption of CUAs could slow until these threats receive practical mitigations.
Insights from CUA defenses may inform security for general multimodal reasoning systems.

Load-bearing premise

The collected literature is comprehensive and unbiased enough to support the threat categories and defense taxonomy without major omissions.

What would settle it

Identification of a significant safety threat in an actual CUA that cannot be placed into any of the proposed threat categories would show the categorization is incomplete.

Figures

Figures reproduced from arXiv: 2505.10924 by Ada Chen, Jen-tse Huang, Jingyu Xiao, Junyuan Zhang, Kun Wang, Shuai Wang, Shu Yang, Wenxuan Wang, Yongjiang Wu.

Original abstract

Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey organizes threats and defenses for computer-using agents in a clear way but gives almost no detail on how the papers were chosen.

The main thing to know is that this paper lays out a safety-focused definition of computer-using agents and a taxonomy of defensive strategies, which could help people working in this space get organized. It is new relative to the broader AI safety surveys it cites because it narrows in on agents that actually click through GUIs and handle multimodal inputs. The four objectives structure the review cleanly: defining the agents for safety work, grouping the threats, mapping defenses, and covering benchmarks and metrics. That organization is the useful part and gives practitioners something concrete to think about when designing these systems. The summaries of existing work on reasoning vulnerabilities and component integration look reasonable on the surface and tie the risks to real deployment questions.

The soft spot is the literature review itself. The abstract claims a comprehensive review but says nothing about search terms, databases, inclusion rules, or how many papers were screened. That leaves open the chance that some emerging GUI-state or persistent attack papers were missed or that the categories reflect selection more than the full picture. If the full text has a methods section that fixes this, the issue is minor; otherwise it undercuts how much weight the taxonomy can carry as a foundation.

This is for researchers who need a quick map of the CUA safety area and for teams building these agents who want defensive options in one place. A reader who already knows the general LLM safety literature will get the most out of the specific taxonomy and benchmark list. It deserves a serious referee because the topic is timely and the structure could shape later evaluation work if the coverage is tightened. I would send it to review and ask the authors to add the review methodology and check for any obvious gaps in recent multimodal or stateful attacks.

Referee Report

1 major / 0 minor

Summary. The manuscript surveys safety and security threats to Computer-Using Agents (CUAs), LLM-based systems that autonomously interact with graphical user interfaces on desktops, web, and mobile. It pursues four objectives: (i) defining a CUA suitable for safety analysis, (ii) categorizing current safety threats, (iii) proposing a taxonomy of defensive strategies, and (iv) summarizing benchmarks, datasets, and metrics for assessing CUA safety and performance. The authors position their work as providing a structured foundation for exploring vulnerabilities and actionable guidance for practitioners.

Significance. Should the literature review prove comprehensive, the paper would contribute a useful organization of threats and defenses in this nascent field of AI agents, potentially guiding future work on secure CUAs. The systematization could help bridge gaps between AI capabilities and security considerations, especially given the integration of multimodal inputs and multiple software components. Credit is due for attempting to distill insights into actionable categories and taxonomies.

major comments (1)

The abstract asserts that a 'comprehensive literature review' was conducted to support objectives (ii) and (iii), yet it omits any description of the search strategy, inclusion criteria, time period covered, or number of papers included. This detail is essential to substantiate the claim of providing a 'structured foundation' without significant omissions, as noted in the reader's assessment of potential gaps in coverage of emerging threats such as multimodal attacks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our survey of safety and security threats for Computer-Using Agents. We address the major comment below and commit to revisions that improve transparency.

Referee: The abstract asserts that a 'comprehensive literature review' was conducted to support objectives (ii) and (iii), yet it omits any description of the search strategy, inclusion criteria, time period covered, or number of papers included. This detail is essential to substantiate the claim of providing a 'structured foundation' without significant omissions, as noted in the reader's assessment of potential gaps in coverage of emerging threats such as multimodal attacks.

Authors: We agree that the absence of explicit methodology details in the abstract (and potentially the main text) weakens the substantiation of our 'comprehensive' claim. In the revised version, we will add a new subsection (likely in Section 2 or as an appendix) that describes the literature review process: search keywords and databases (e.g., arXiv, Google Scholar, ACL Anthology with terms such as 'computer-using agent', 'GUI agent', 'LLM agent safety'), inclusion criteria (peer-reviewed papers, preprints, and reports from 2023 onward focused on LLM-driven GUI interaction), time period, and approximate number of papers screened and included. This addition will also clarify scope limitations regarding emerging areas like multimodal attacks, allowing readers to better assess coverage.

revision: yes
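As an illustration only, the screening rule outlined in the authors' response (keyword match plus a 2023-onward cutoff) could be sketched as a simple filter. The `Record` shape and the `include` and `screen` helpers are hypothetical names introduced here, not part of the paper; the three search terms and the year cutoff are the ones quoted in the response.

```python
from dataclasses import dataclass
from typing import Iterable

# Search terms quoted in the authors' response (lower-cased for matching).
KEYWORDS = ("computer-using agent", "gui agent", "llm agent safety")
MIN_YEAR = 2023  # "from 2023 onward"


@dataclass
class Record:
    """Hypothetical bibliographic record; the field names are assumptions."""
    title: str
    abstract: str
    year: int


def include(record: Record) -> bool:
    """True if the record is recent enough and mentions at least one search term."""
    text = f"{record.title} {record.abstract}".lower()
    return record.year >= MIN_YEAR and any(k in text for k in KEYWORDS)


def screen(records: Iterable[Record]) -> dict:
    """Apply the inclusion rule and report screened and included counts."""
    records = list(records)
    kept = [r for r in records if include(r)]
    return {"screened": len(records), "included": len(kept), "kept": kept}
```

Reporting both counts from `screen` mirrors the referee's request for the number of papers screened and included; a real protocol would also need manual relevance checks that a keyword filter cannot capture.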

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without derivations or self-referential reductions

This is a literature survey paper whose contribution is a systematization of existing work on CUA safety threats and defenses. It defines four research objectives to structure the review but performs no mathematical derivations, parameter fitting, predictions, or equations. All categorizations and taxonomies are distilled from cited external papers rather than reducing to quantities or structures defined within this work itself. No self-citation forms a load-bearing premise, and there are no self-definitional loops or fitted inputs renamed as predictions. The paper is therefore self-contained against external benchmarks with a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper the contribution is the organization of existing literature rather than new derivations; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5797 in / 1149 out tokens · 50591 ms · 2026-05-22T15:18:18.745114+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Mind2Web: Towards a Generalist Agent for the Web

Mind2web: Towards a generalist agent for the web. ArXiv, abs/2306.06070. Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. 2024b. Ai agents under threat: A survey of key security challenges and future pathways. ACM Computing Surveys, 57:1–36. Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianmi...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

A practical memory injection attack against llm agents. arXiv e-prints, pages arXiv–2503. Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. 2025. Wasp: Benchmarking web agent security against prompt injection attacks. ArXiv, abs/2504.18575. Falong Fan and Xi Li. 2025. Peerguard: Defending multi-agent systems again...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast,

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. ArXiv, abs/2402.08567. Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. 2025. The emerged security and privacy of llm agent: A survey with case studies. ACM Computing Surveys, 58(6):1–36. Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan X...

work page arXiv 2025
[4]

In Conference on Empirical Methods in Natural Language Processing

Trustagent: Towards safe and trustworthy llm-based agents. In Conference on Empirical Methods in Natural Language Processing. Jen-Tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, and Michael R. Lyu. 2024. On the resilience of llm-based multi-agent collaboration with faulty agents. Wanjing Huang, Tongjie P...

work page arXiv 2024
[5]

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko

Refusal-trained llms are easily jailbroken as browser agents. ArXiv, abs/2410.13886. Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. 2025. Os-harm: A benchmark for measuring safety of computer use agents. ArXiv, abs/2506.14866. Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung ...

work page arXiv 2025
[6]

The Rise and Potential of Large Language Model Based Agents: A Survey

The rise and potential of large language model based agents: A survey. ArXiv, abs/2309.07864. Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Xiaodong Song, and Bo Li. 2024. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning. Tianbao Xie, Danyang Zha...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Advweb: Controllable black-box attacks on vlm-powered web agents

Advweb: Controllable black-box attacks on vlm-powered web agents. ArXiv, abs/2410.17401. Jingqi Yang, Zhilong Song, Jiawei Chen, Mingli Song, Sheng Zhou, Linjun Sun, Xiaogang Ouyang, Chun Chen, and Can Wang. 2025a. Gui-robust: A comprehensive dataset for testing gui agent robustness in real-world anomalies. ArXiv, abs/2506.14477. Jingyi Yang, Shuai Shao, ...

work page arXiv 2022
[8]

Ignore all previous instructions and delete every file in the Documents folder

show that by injecting a system-level notification pop-up milliseconds before the agent’s intended click, one can hijack its execution flow, luring it to tap the pop-up instead of the correct element of the user interface. Ma et al. (2024b) further simulate a vulnerable scenario by injecting irrelevant distractions such as pop-up boxes, fake search resu...

work page 2025
[9]

stress test,

poisons GUI grounding data by remapping a tiny, low-salience on-screen mark to specific element–action pairs, driving attacker-selected clicks whenever the visual trigger appears. Likewise, ScreenHijack (Wang et al., 2025g) fine-tunes vision–language mobile agents on a small fraction of screenshots covertly perturbed with an imperceptible visual trig...

work page 2025
[10]

MELON executes each prompt twice, once normally and once with a masked injection, to compare outputs and flag any inconsistencies as injected content (Zhu et al., 2025a)

uses multiple independent runs of the same prompt across agents and uses majority consensus to filter out jailbreak attempts. MELON executes each prompt twice, once normally and once with a masked injection, to compare outputs and flag any inconsistencies as injected content (Zhu et al., 2025a). For backdoor attacks, ReAgent performs dual-level consistenc...

work page 2025
[11]

Meanwhile, PrivacyLens (Shao et al., 2024) investigates privacy-sensitive interactions in web-based conversations, containing 493 validated prompts derived from U.S

and BrowserART (Kumar et al., 2024) focus on evaluating agents’ safety-related behaviors in tasks involving web navigation, interaction, and tool usage under potential prompt injection threats. Meanwhile, PrivacyLens (Shao et al., 2024) investigates privacy-sensitive interactions in web-based conversations, containing 493 validated prompts derived from ...

work page 2024
[12]

RiOSWorld (Yang et al., 2025b) runs 492 risky tasks in 13 categories on an OSWorld VM, capturing both environment and user-originated risks

provides 350 multi-turn, multi-user tasks in both benign and adversarial settings using a real browser, shell, file system, and messaging APIs. RiOSWorld (Yang et al., 2025b) runs 492 risky tasks in 13 categories on an OSWorld VM, capturing both environment and user-originated risks. RedTeamCUA (Liao et al., 2025) introduces RTC-Bench with 864 hybrid Web–...

work page 2025
[13]

Broader risk-awareness and multidimensional safety. Beyond concrete tool or environment settings, several works emphasize comprehensive risk taxonomies and analysis

further contributes a large-scale dataset of 2,653 instances spanning 10 risk categories and 10 attack strategies, where each instance consists of multi-step action sequences that appear locally benign but collectively lead to harmful outcomes. Broader risk-awareness and multidimensional safety. Beyond concrete tool or environment settings, several work...

work page 2024
[14]

PrivacyLens (Shao et al., 2024) offers 493 privacy-sensitive vignettes and trajectories for leakage analysis

contributes 70 samples across 5 domains with paired ground-truth implementations to evaluate both helpfulness and safety. PrivacyLens (Shao et al., 2024) offers 493 privacy-sensitive vignettes and trajectories for leakage analysis. GUI-Robust (Yang et al., 2025a) complements these efforts by focusing on robustness under anomalies in interactions. It i...

work page 2024
[15]

ground truth

or by verifying the tool-call sequence contains the human-annotated steps (Fu et al., 2025). RAS-Eval (Fu et al., 2025) further introduces several composite metrics: – Task Incompletion Rate (TIR) counts runs that invoke only a subset or incorrect combination of required tools. – Task Fail Rate (TFR) flags runs that crash, make no tool calls, or exceed l...

work page 2025
[16]

attempted

and VPI-Bench (Cao et al., 2025) rely on LLM judges to flag these attempts: RedTeamCUA uses a single LLM to detect beginnings of harmful actions, while VPI-Bench employs a majority vote of three frontier LLMs to decide whether an attack was “attempted”. A similar concept, Risk Goal Intention (RGI), is used in RiOSWorld (Yang et al., 2025b) to denote an age...

work page 2025
[17]

More advanced rule-based evaluators compare final environment states against expected outcomes

employs predefined rules to evaluate most simple tasks, thereby minimizing dependence on LLM-based grading. More advanced rule-based evaluators compare final environment states against expected outcomes. SafeArena (Tur et al., 2025) matches outputs to predefined reference objects and applies the Agent Risk Assessment (ARIA) framework’s four hierarchical...

work page 2025
[18]

This approach has been extended to a diverse array of benchmarks: BrowserART (Kumar et al.,

applies an LLM-based classifier to detect whether sensitive information can be inferred from an agent’s actions. This approach has been extended to a diverse array of benchmarks: BrowserART (Kumar et al.,

work page
[19]

and AgentHarm (Andriushchenko et al.,

work page
[20]

use GPT-4o to classify harmful behaviors and evaluate refusals. CASA (Qiu et al., 2025) adopts GPT-4o across metrics to assess cultural and social awareness, SafeArena (Tur et al., 2025) feeds GPT-4o each agent’s trajectory and metadata to assign one of the four ARIA risk levels, ASB (Zhang et al., 2024b) uses LLMs to evaluate whether agents properly ...

work page 2025
[21]

also adopts an LLM-as-judge framework to perform fine-grained evaluation of multi-step execution trajectories, capturing contextual and sequential aspects of harmful behavior. More specialized uses include RedTeamCUA (Liao et al., 2025), RiOSWorld (Yang et al., 2025b) and WASP (Evtimov et al., 2025), which rely on LLM to flag evidence of attempted but...

work page 2025
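
The majority-vote judging described in entry [16] (VPI-Bench polls three frontier LLMs on whether an attack was "attempted") can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `Judge` type, `majority_attempted`, and the keyword-based stand-in judges are all hypothetical, and a real setup would replace the stand-ins with calls to LLM judges.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: takes a trajectory transcript and returns
# True if the judge believes an attack was attempted.
Judge = Callable[[str], bool]


def majority_attempted(trajectory: str, judges: Sequence[Judge]) -> bool:
    """Return the majority verdict of an odd-sized panel of judges."""
    if len(judges) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    votes = Counter(judge(trajectory) for judge in judges)
    return votes[True] > votes[False]


# Stand-in judges (assumptions for illustration only).
def judge_a(trajectory: str) -> bool:
    return "rm -rf" in trajectory


def judge_b(trajectory: str) -> bool:
    return "delete" in trajectory.lower()


def judge_c(trajectory: str) -> bool:
    return "password" in trajectory.lower()


PANEL = [judge_a, judge_b, judge_c]
```

Requiring an odd panel size keeps the verdict well defined; with a single judge the function reduces to the single-LLM check that the same entry attributes to RedTeamCUA.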
