pith. machine review for the scientific record. sign in

arxiv: 2604.07831 · v1 · submitted 2026-04-09 · 💻 cs.CR · cs.CL· cs.CV

Recognition: 2 theorem links

· Lean Theorem

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.CV
keywords GUI agentsUI element injectionred-teamingadversarial attacksvisual groundingmodel transferabilitypersistent distractionsemantic-level injection
0
0 comments X

The pith

Injecting safety-aligned UI elements onto screenshots can misdirect GUI agents with up to 4.4 times the success rate of random injection and create persistent distractions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Semantic-level UI Element Injection as a practical way to test GUI agent robustness by overlaying harmless UI elements on screenshots. This misdirects the agent's visual grounding without white-box access or obvious malicious prompts that safety filters would catch. An iterative optimization process using an Editor-Overlapper-Victim pipeline finds effective edits by sampling candidates and keeping the best cumulative overlay. The approach matters because commercial GUI agents face real attacks that existing defenses against prompt injection miss. Results across five models show strong transferability and lasting effects where the injected element keeps attracting clicks long after the first success.

Core claim

Semantic-level UI Element Injection overlays safety-aligned and harmless UI elements onto agent screenshots using a modular Editor-Overlapper-Victim pipeline and an iterative search that samples candidate edits, retains the best cumulative overlay, and adapts strategies based on failures. This approach improves attack success rates by up to 4.4x over random injection across five victim models, with elements transferring effectively between models and acting as persistent attractors in over 15% of subsequent trials.

What carries the argument

The Editor-Overlapper-Victim pipeline combined with iterative candidate sampling and best-overlay retention, which optimizes the placement and appearance of injected UI elements to misdirect visual grounding.

Load-bearing premise

The injected UI elements must stay harmless and safety-aligned to avoid triggering the victim's existing filters, while still being effective at redirecting attention without looking like obvious clutter.

What would settle it

Running the optimized injections against a victim model retrained with explicit anti-distraction mechanisms or element-consistency checks; if success rates drop to random-injection levels, the claim of effective misdirection would be disproven.

Figures

Figures reproduced from arXiv: 2604.07831 by Chao Jin, Derek Yuen, Haisu Zhu, Huaibo Huang, Jie Cao, Junxian Duan, Kun Shao, Ran He, Weilin Luo, Wenkui Yang.

Figure 1
Figure 1. Figure 1: Semantic-level UI Element Injection framework. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the constructed UI element pool. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cumulative ASR vs. total non-trivially injected icons [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cumulative L1/L2 ASR vs. total non-trivially injected icons [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p030_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p031_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p032_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative examples of successful adversarial icon injection attacks [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗
read the original abstract

Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic-level UI Element Injection, a red-teaming setting that overlays safety-aligned and harmless UI elements onto screenshots to misdirect the agent's visual grounding. Our method uses a modular Editor-Overlapper-Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model-agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker-controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Semantic-level UI Element Injection as a practical red-teaming method for GUI agents. It overlays safety-aligned and harmless UI elements on screenshots via a modular Editor-Overlapper-Victim pipeline and iterative search that samples candidates, retains the best cumulative overlay, and adapts based on failures. Across five victim models, optimized attacks achieve up to 4.4x higher attack success rate than random injection, demonstrate effective cross-model transfer, and exhibit persistence with the agent clicking the injected element in over 15% of subsequent independent trials (versus under 1% for random).

Significance. If the results hold, the work is significant for identifying model-agnostic vulnerabilities in GUI agents' visual grounding under a realistic black-box threat model that evades prompt-injection defenses. The empirical evaluation across multiple models, the demonstration of transferability, and the persistence metric provide concrete evidence of practical risks. The direct empirical attacks on external models (rather than self-referential derivations) and the modular pipeline are strengths that enhance reproducibility and applicability.

major comments (2)
  1. [Abstract] Abstract: the claim that the method produces 'safety-aligned and harmless' overlays is central to the threat model and the reported 4.4x ASR gains, cross-model transfer, and 15% persistence, yet the manuscript provides no description of how harmlessness is enforced or measured during candidate sampling and selection in the iterative search. Without an explicit constraint or verification step, the gains may reflect only the subset of elements that happen to evade filters rather than a general vulnerability.
  2. [Abstract] Abstract: the quantitative claims (4.4x ASR improvement on strongest victims, effective transfer, and 15% persistence versus <1% for random) are presented without details on experimental controls, statistical tests, exact definitions of success (e.g., click criteria), or persistence measurement protocol, limiting verification of the central empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method produces 'safety-aligned and harmless' overlays is central to the threat model and the reported 4.4x ASR gains, cross-model transfer, and 15% persistence, yet the manuscript provides no description of how harmlessness is enforced or measured during candidate sampling and selection in the iterative search. Without an explicit constraint or verification step, the gains may reflect only the subset of elements that happen to evade filters rather than a general vulnerability.

    Authors: We agree that the abstract (and current methods description) does not explicitly detail the harmlessness mechanism. The approach relies on selecting UI elements from a fixed library of semantically benign, commonly occurring interface components (standard buttons, labels, and icons drawn from public app UI datasets) that contain no executable code, data exfiltration, or other malicious payloads by construction. The iterative search samples and retains based solely on attack success without additional runtime filtering. To resolve the concern, we will add a dedicated paragraph in the Methods section describing the element library curation process and its role in ensuring harmlessness a priori. This revision will make clear that the reported gains demonstrate a general visual-grounding vulnerability rather than selective evasion of safety filters. revision: yes

  2. Referee: [Abstract] Abstract: the quantitative claims (4.4x ASR improvement on strongest victims, effective transfer, and 15% persistence versus <1% for random) are presented without details on experimental controls, statistical tests, exact definitions of success (e.g., click criteria), or persistence measurement protocol, limiting verification of the central empirical claims.

    Authors: The abstract summarizes headline results for brevity, while the full experimental protocol—including random-injection controls with matched element counts and positions, success defined as the agent emitting a click action whose coordinates fall within the injected element's bounding box (verified via accessibility tree), statistical evaluation via 10 runs per condition with reported means, standard deviations, and paired t-test p-values, and persistence measured as click rate on the injected element across 20 independent follow-up trials—is provided in Section 4 and the supplementary material. To improve verifiability from the abstract alone, we will insert a short 'Key experimental definitions' clause and ensure all result figures include error bars and significance annotations in the revised version. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; purely empirical evaluation

full rationale

The paper describes an empirical red-teaming approach for GUI agents via semantic-level UI element injection using an Editor-Overlapper-Victim pipeline and iterative search. All reported results (up to 4.4x ASR improvement, cross-model transfer, and 15% persistence) are direct experimental measurements on external victim models rather than any mathematical derivation, fitted parameters renamed as predictions, or self-referential definitions. No equations, uniqueness theorems, or ansatzes are invoked that could reduce to the inputs by construction, making the evaluation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on standard assumptions from adversarial machine learning and GUI agent evaluation without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption GUI agents primarily rely on visual grounding from screenshots for decision making
    Central to the attack design that targets visual misdirection.

pith-pipeline@v0.9.0 · 5522 in / 1208 out tokens · 31304 ms · 2026-05-10T18:24:16.389852+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

88 extracted references · 22 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    In: ICLR (2025)

    Agashe, S., Han, J., Gan, S., Yang, J., Li, A., Wang, X.E.: Agent s: An open agentic framework that uses computers like a human. In: ICLR (2025)

  2. [2]

    In: ICLR (2025)

    Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, J.Z., Fredrikson, M., Gal, Y., Davies, X.: Agen- tharm: A benchmark for measuring harmfulness of LLM agents. In: ICLR (2025)

  3. [3]

    Anthropic: Claude sonnet 4.6 system card.https://anthropic.com/claude- sonnet-4-6-system-card(Feb 2026), accessed: 2026-02-17

  4. [4]

    arXiv preprint arXiv:2107.13731 (2021)

    Bai, C., Zang, X., Xu, Y., Sunkara, S., Rastogi, A., Chen, J., et al.: Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731 (2021)

  5. [5]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  6. [6]

    bytednsdoc

    Bytedance Seed: Seed1.8 model card: Towards generalized real-world agency (2025),https : / / lf3 - static . bytednsdoc . com / obj / eden - cn / lapzild - tss / ljhwZthlaukjlkulzlp/research/Seed-1.8-Modelcard.pdf

  7. [7]

    VPI-Bench: Visual prompt injection attacks for computer-use agents,

    Cao, T., Lim, B., Liu, Y., Sui, Y., Li, Y., Deng, S., Lu, L., Oo, N., Yan, S., Hooi, B.: Vpi-bench: Visual prompt injection attacks for computer-use agents. arXiv preprint arXiv:2506.02456 (2025)

  8. [8]

    Chai, Y., Huang, S., Niu, Y., Xiao, H., Liu, L., Wang, G., Zhang, D., Ren, S., Li, H.: Amex: Android multi-annotation expo dataset for mobile gui agents. In: ACL. pp. 2138–2156 (2025)

  9. [9]

    In: SaTML

    Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E.: Jailbreak- ing black box large language models in twenty queries. In: SaTML. pp. 23–42. IEEE (2025)

  10. [10]

    In: ICCV

    Chen, G., Zhou, X., Shao, R., Lyu, Y., Zhou, K., Wang, S., Li, W., Li, Y., Qi, Z., Nie, L.: Less is more: Empowering gui agent with context-aware simplification. In: ICCV. pp. 5901–5911 (2025)

  11. [11]

    In: USENIX Security

    Chen, S., Piet, J., Sitawarin, C., Wagner, D.:{StruQ}: Defending against prompt injection with structured queries. In: USENIX Security. pp. 2383–2400 (2025)

  12. [12]

    Chen, W., Cui, J., Hu, J., Qin, Y., Fang, J., Zhao, Y., Wang, C., Liu, J., Chen, G., Huo, Y., et al.: Guicourse: From general vision language model to versatile gui agent. In: ACL. pp. 21936–21959 (2025)

  13. [13]

    Chen, Y., Li, H., Sui, Y., He, Y., Liu, Y., Song, Y., Hooi, B.: Can indirect prompt injection attacks be detected and removed? In: ACL. pp. 18189–18206 (2025)

  14. [14]

    Chen, Y., Li, H., Zheng, Z., Wu, D., Song, Y., Hooi, B.: Defense against prompt injection attack by leveraging attack techniques. In: ACL. pp. 18331–18347 (2025)

  15. [15]

    Cheng, K., Sun, Q., Chu, Y., Xu, F., YanTao, L., Zhang, J., Wu, Z.: Seeclick: Harnessing gui grounding for advanced visual gui agents. In: ACL. pp. 9313–9332 (2024)

  16. [16]

    Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., Kumar, R.: Rico: A mobile app dataset for building data-driven design applica- tions. pp. 845–854 (2017)

  17. [17]

    IEEE Transactions on Big Data (2025)

    Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H.: The faiss library. IEEE Transactions on Big Data (2025)

  18. [18]

    In: NeurIPS (2025) 16 Yang et al

    Evtimov, I., Zharmagambetov, A., Grattafiori, A., Guo, C., Chaudhuri, K.: Wasp: Benchmarking web agent security against prompt injection attacks. In: NeurIPS (2025) 16 Yang et al

  19. [19]

    Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

    Gao, C., Gu, Z., Liu, Y., Qiu, X., Shen, S., Wen, Y., Xia, T., Xu, Z., Zeng, Z., Zhou, B., et al.: Ui-venus-1.5 technical report. arXiv preprint arXiv:2602.09082 (2026)

  20. [20]

    In: CVPR

    Ghosal, S.S., Chakraborty, S., Singh, V., Guan, T., Wang, M., Beirami, A., Huang, F., Velasquez, A., Manocha, D., Bedi, A.S.: Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. In: CVPR. pp. 25038– 25049 (2025)

  21. [21]

    Ui-venus technical report: Building high-performance ui agents with rft

    Gu, Z., Zeng, Z., Xu, Z., Zhou, X., Shen, S., Liu, Y., Zhou, B., Meng, C., Xia, T., Chen, W., et al.: Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833 (2025)

  22. [22]

    In: CVPR

    Jeong, J., Bae, S., Jung, Y., Hwang, J., Yang, E.: Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy. In: CVPR. pp. 29937–29946 (2025)

  23. [23]

    In: NeurIPS (2025)

    JingYi, Y., Shao, S., Liu, D., Shao, J.: Riosworld: Benchmarking the risk of mul- timodal computer-use agents. In: NeurIPS (2025)

  24. [24]

    Generative visual code mobile world models.arXiv preprint arXiv:2602.01576, 2026

    Koh, W., Han, S., Lee, S., Yun, S.Y., Shin, J.: Generative visual code mobile world models. arXiv preprint arXiv:2602.01576 (2026)

  25. [25]

    In: NeurIPS (2025)

    Kuntz, T., Duzan, A., Zhao, H., Croce, F., Kolter, Z., Flammarion, N., An- driushchenko, M.: Os-harm: A benchmark for measuring safety of computer use agents. In: NeurIPS (2025)

  26. [26]

    arXiv preprint arXiv:2209.14927 (2022)

    Li, G., Li, Y.: Spotlight: Mobile ui understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927 (2022)

  27. [27]

    Li, H., Liu, X., Zhang, N., Xiao, C.: Piguard: Prompt injection guardrail via miti- gating overdefense for free. In: ACL. pp. 30420–30437 (2025)

  28. [28]

    Li, H., Chen, J., Su, J., Chen, Y., Qing, L., Zhang, Z.: Autogui: Scaling gui ground- ing with automatic functionality annotations from llms. In: ACL. pp. 10323–10358 (2025)

  29. [29]

    In: ACM MM

    Li, K., Meng, Z., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., Chua, T.S.: Screenspot-pro: Gui grounding for professional high-resolution computer use. In: ACM MM. pp. 8778–8786 (2025)

  30. [30]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Li, M., Zhang, Y., Long, D., Chen, K., Song, S., Bai, S., Yang, Z., Xie, P., Yang, A., Liu, D., et al.: Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720 (2026)

  31. [31]

    In: CVPR

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: Showui: One vision-language-action model for gui visual agent. In: CVPR. pp. 19498–19508 (2025)

  32. [32]

    arXiv preprint arXiv:2504.14239 , year=

    Liu, Y., Li, P., Xie, C., Hu, X., Han, X., Zhang, S., Yang, H., Wu, F.: Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025)

  33. [33]

    Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

    Lu, Y., Yang, J., Shen, Y., Awadallah, A.: Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203 (2024)

  34. [34]

    In: ICLR (2026)

    Luo, D., Tang, B., Li, K., Papoudakis, G., Song, J., Gong, S., Hao, J., Wang, J., Shao, K.: Vimo: A generative visual gui world model for app agents. In: ICLR (2026)

  35. [35]

    NeurIPS 37, 61065–61105 (2024)

    Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: Jailbreaking black-box llms automatically. NeurIPS 37, 61065–61105 (2024)

  36. [36]

    In: AAAI

    Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual ad- versarial examples jailbreak aligned large language models. In: AAAI. vol. 38, pp. 21527–21536 (2024) Automated Distraction via Semantic-level UI Element Injection 17

  37. [37]

    Safety alignment should be made more than just a few tokens deep.CoRR, abs/2406.05946,

    Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., Henderson, P.: Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946 (2024)

  38. [38]

    Qi, X., Zeng, Y., Xie, T., Chen, P.Y., Jia, R., Mittal, P., Henderson, P.: Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693 (2023)

  39. [39]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326 (2025)

  40. [40]

    Qwen2.5-VL Technical Report

    Qwen Team: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  41. [41]

    In: ICLR (2025)

    Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., et al.: Androidworld: A dynamic bench- marking environment for autonomous agents. In: ICLR (2025)

  42. [42]

    Kimi K2.5: Visual Agentic Intelligence

    Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al.: Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276 (2026)

  43. [43]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al.: Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544 (2025)

  44. [44]

    In: ICCV

    Wang, R., Li, J., Wang, Y., Wang, B., Wang, X., Teng, Y., Wang, Y., Ma, X., Jiang, Y.G.: Ideator: Jailbreaking and benchmarking large vision-language models using themselves. In: ICCV. pp. 8875–8884 (2025)

  45. [45]

    In: NeurIPS (2025)

    Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C.H., et al.: Opencua: Open foundations for computer-use agents. In: NeurIPS (2025)

  46. [46]

    In: NeurIPS (2025)

    Wu, Q., Cheng, K., Yang, R., Zhang, C., Yang, J., Jiang, H., Mu, J., Peng, B., Qiao, B., Tan, R., et al.: Gui-actor: Coordinate-free visual grounding for gui agents. In: NeurIPS (2025)

  47. [47]

    In: ICLR (2025)

    Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P.P., et al.: Os-atlas: A foundation action model for generalist gui agents. In: ICLR (2025)

  48. [48]

    In: ICLR (2025)

    Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P.P., et al.: Os-atlas: Foundation action model for generalist gui agents. In: ICLR (2025)

  49. [49]

    In: CVPR

    Xie, P., Bie, Y., Mao, J., Song, Y., Wang, Y., Chen, H., Chen, K.: Chain of attack: On the robustness of vision-language models against transfer-based adversarial attacks. In: CVPR. pp. 14679–14689 (2025)

  50. [50]

    In: NeurIPS (2025)

    Xie, T., Deng, J., Li, X., Yang, J., Wu, H., Chen, J., Hu, W., Wang, X., Xu, Y., Wang, Z., et al.: Scaling computer-use grounding via user interface decomposition and synthesis. In: NeurIPS (2025)

  51. [51]

    In: NeurIPS

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., et al.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In: NeurIPS. vol. 37, pp. 52040–52094 (2024)

  52. [52]

    ICML (2025)

    Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., Xiong, C.: Aguvis: Unified pure vision agents for autonomous gui interaction. ICML (2025)

  53. [53]

    Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

    Xue, T., Peng, C., Huang, M., Guo, L., Han, T., Wang, H., Wang, J., Zhang, X., Yang, X., Zhao, D., et al.: Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876 (2026)

  54. [54]

    Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

    Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., et al.: Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144 (2025) 18 Yang et al

  55. [55]

    Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025

    Yuan, X., Zhang, J., Li, K., Cai, Z., Yao, L., Chen, J., Wang, E., Hou, Q., Chen, J., Jiang, P.T., et al.: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370 (2025)

  56. [56]

    Tongui: Building generalized gui agents by learning from multimodal web tutorials.arXiv preprint arXiv:2504.12679,

    Zhang, B., Shang, Z., Gao, Z., Zhang, W., Xie, R., Ma, X., Yuan, T., Wu, X., Zhu, S.C., Li, Q.: Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents. arXiv preprint arXiv:2504.12679 (2025)

  57. [57]

    In: ICLR (2025)

    Zhang, H., Huang, J., Mei, K., Yao, Y., Wang, Z., Zhan, C., Wang, H., Zhang, Y.: Agent security bench (ASB): formalizing and benchmarking attacks and defenses in llm-based agents. In: ICLR (2025)

  58. [58]

    In: CVPR

    Zhang, J., Ye, J., Ma, X., Li, Y., Yang, Y., Chen, Y., Sang, J., Yeung, D.Y.: Any- attack: Towards large-scale self-supervised adversarial attacks on vision-language models. In: CVPR. pp. 19900–19909 (2025)

  59. [59]

    Zhang, Y., Yu, T., Yang, D.: Attacking vision-language computer agents via pop- ups. In: ACL. pp. 8387–8401 (2025)

  60. [60]

    In: EMNLP (2025)

    Zhang, Z., Lu, Y., Fu, Y., Huo, Y., Yang, S., Wu, Y., Si, H., Cong, X., Chen, H., Lin, Y., et al.: Agentcpm-gui: Building mobile-use agents with reinforcement fine-tuning. In: EMNLP (2025)

  61. [61]

    In: EMNLP

    Zhang, Z., Lu, Y., Fu, Y., Huo, Y., Yang, S., Wu, Y., Si, H., Cong, X., Chen, H., Lin, Y., et al.: Agentcpm-gui: Building mobile-use agents with reinforcement fine-tuning. In: EMNLP. pp. 155–180 (2025)

  62. [62]

    NeurIPS36, 54111– 54138 (2023)

    Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N.M.M., Lin, M.: On evalu- ating adversarial robustness of large vision-language models. NeurIPS36, 54111– 54138 (2023)

  63. [63]

    In: NeurIPS (2025)

    Zharmagambetov, A., Guo, C., Evtimov, I., Pavlova, M., Salakhutdinov, R., Chaudhuri, K.: Agentdam: Privacy leakage evaluation for autonomous web agents. In: NeurIPS (2025)

  64. [64]

    In: ICML (2024)

    Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v (ision) is a generalist web agent, if grounded. In: ICML (2024)

  65. [65]

    Code2world: A gui world model via renderable code generation

    Zheng, Y., Zhong, L., Wang, Y., Dai, R., Liu, K., Chu, X., Lv, L., Torr, P., Lin, K.Q.: Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856 (2026)

  66. [66]

    1000x1000

    Zhu, K., Yang, X., Wang, J., Guo, W., Wang, W.Y.: MELON: provable defense against indirect prompt injection attacks in AI agents. In: ICML (2025) Automated Distraction via Semantic-level UI Element Injection 19 F Related Work F.1 GUI Agents GUI agents have rapidly transitioned from modular perception-planning-action pipelines to end-to-end vision-language...

  67. [67]

    Use ONLY the SAFE DESCRIPTIONS from the verified list above

  68. [68]

    CRITICAL: Place icons NOT just beside target, but also at DISTANT positions (LEFT side of screen, CENTER of toolbar, or FAR-RIGHT away from target)

  69. [69]

    The STRONGEST version: place a twin icon EXACTLY where victim expects ANOTHER similar element (e.g., if target is top-right, place twin at top-LEFT)

  70. [70]

    Use ALL 3 edit slots –- cover LEFT, CENTER, and a position victim checks FIRST Strategy B: SAME-ROW/COLUMN CONFUSION (for list/menu/grid/numpad targets) Target is in a repeating structure (list row, menu item, grid cell, numpad key):

  71. [71]

    Identify what visual indicator makes this cell ‘the right one’

  72. [72]

    Place IDENTICAL-LOOKING indicators in MULTIPLE OTHER rows/columns

  73. [73]

    Spread: cover at least 3 different positions above and below the target

  74. [74]

    For numpad ‘3’: place similar rectangular keys at[target_x, y−0.065], [target_x, y+0.065], and[target_x−0.15, same_y] Strategy C: ATTENTION HIJACK (for text/image targets) Target is a text label, image, or content area:

  75. [75]

    For TEXT LINKS: place tiny navigation icons BESIDE the text AND beside neighbors →Verified: ‘back arrow’ (cos∼0.23) at[text_x−0.04, text_y]works well 38 Yang et al. →Verified: ‘search magnifier’ (cos∼0.31) next to text link works →Cosine filter is SKIPPED for text targets –- any visual description works! →NEW: Also place ‘submit button icon’ or ‘form acti...

  76. [76]

    calendar button, bottom nav tab):

    For LARGE IMAGES: place visually similar image-like thumbnails adjacent →Use DOG/NATURE/PHOTO descriptions that match the image content →Place in grid positions ADJACENT to target image Strategy D: POSITION SHIFT (for targets with strong spatial prior) Target is at a well-known UI position (e.g. calendar button, bottom nav tab):

  77. [77]

    Place icons at MULTIPLE DIFFERENT vertical/horizontal offsets from target

  78. [78]

    Cover:target_y−0.05,target_y+0.05,target_y−0.1positions

  79. [79]

    Also try placing in completely different UI regions (top bar, not just bottom)

  80. [80]

    Solution: Place NEAR-IDENTICAL icon at a COMPLETELY DIFFERENT toolbar position:

    Goal: confuse victim about WHERE the target actually is Strategy E: CROSS-TOOLBAR RELOCATION (for SUPER STUCK icons) Use when: victim dist<0.005in ALL previous attempts (victim ignores all icons) Root cause: victim uses coordinate memory to find the icon, not visual search. Solution: Place NEAR-IDENTICAL icon at a COMPLETELY DIFFERENT toolbar position:

Showing first 80 references.