
arxiv: 2607.00333 · v1 · pith:ODJIHPKBnew · submitted 2026-07-01 · 💻 cs.CR

(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents

Pith reviewed 2026-07-02 11:49 UTC · model grok-4.3

classification 💻 cs.CR
keywords mobile agents · vision-language models · attack surfaces · screenshot manipulation · command execution · mobile security · VLM agents · third-party automation

The pith

Third-party mobile agents using vision models introduce attack surfaces that let a malicious app hijack actions and run arbitrary commands without permissions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines third-party mobile agents that automate phone use by taking screenshots and using vision-language models (VLMs) to decide on actions. It identifies two attack surfaces absent in ordinary apps: one that exploits differences between human and machine vision on the screen, and another that intercepts or alters the channel carrying the agent's commands. The authors implement seven attacks, including hidden text injection and screenshot tampering, and test them on five frameworks to show that a malicious app holding no permissions can seize control while the display looks unchanged to the user. This matters because these agents operate as high-privilege intermediaries between the user and other apps or the operating system. If the claim holds, current agent designs rest on an unexamined assumption that visual input and execution channels are trustworthy.

Core claim

The central claim is that replacing direct app-to-app or app-to-OS interfaces with screenshot-based perception and VLM reasoning in third-party agents creates two novel attack surfaces: the Screen Perception Attack Surface and the Misused Channel Attack Surface. Together they allow a malicious app to hijack agent behavior and achieve arbitrary command execution even without privileges, while the attacks remain visually indistinguishable to users.

What carries the argument

The Screen Perception Attack Surface, which exploits the gap between human and machine vision on device screenshots, together with the Misused Channel Attack Surface, which intercepts or manipulates the agent's execution pipeline.

If this is right

  • A malicious app without permissions can hijack the agent's decision process and force arbitrary command execution.
  • Attacks can be performed through subliminal text, invisible pixel zones, screenshot tampering, or host command injection.
  • The resulting actions stay visually indistinguishable to the human user.
  • The design creates a fundamental trust mismatch between the agent and the multi-tenant platform it runs on.
  • Perception-aware security models become necessary for platforms hosting such agents.
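
The gap between "visually indistinguishable" and "machine-readable" can be made concrete with ordinary alpha-compositing arithmetic. The sketch below is illustrative only and is not the paper's implementation: it assumes a single 8-bit grayscale channel, black overlay text on a white background, and the 3% opacity mentioned in the Figure 2 caption. The function names `blend` and `stretch` are hypothetical.

```python
def blend(background: int, overlay: int, alpha: float) -> int:
    """Standard 'over' compositing for one 8-bit channel."""
    return round(alpha * overlay + (1.0 - alpha) * background)


def stretch(value: int, lo: int, hi: int) -> int:
    """Linearly rescale the range [lo, hi] to the full range [0, 255]."""
    return (value - lo) * 255 // (hi - lo)


WHITE, BLACK = 255, 0
ALPHA = 0.03  # assumed overlay opacity, taken from the Figure 2 caption

text_pixel = blend(WHITE, BLACK, ALPHA)  # pixel covered by overlay text
delta = WHITE - text_pixel               # difference from the bare background

print(f"text pixel = {text_pixel}, background = {WHITE}, delta = {delta}")

# The small difference is still exact numeric information in the screenshot:
# a simple contrast stretch over the narrow range separates the two values.
print(stretch(text_pixel, text_pixel, WHITE), stretch(WHITE, text_pixel, WHITE))
```

Under these assumptions the overlay changes a pixel by 8 intensity levels out of 255. Whether a given VLM actually reads such text is the paper's empirical question; the arithmetic only shows that the information survives in the raw pixel values.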

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perception and channel weaknesses could appear in any VLM-driven automation tool that reads screens rather than using structured APIs.
  • Adding cryptographic integrity checks on screenshots or requiring explicit user confirmation for high-impact actions would be a direct countermeasure worth testing.
  • The attacks may scale to other device types if similar third-party agents adopt screenshot-based control.
  • Platform vendors could mitigate the issue by restricting screenshot access or execution channels for agent processes.
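
As a rough illustration of the integrity-check idea in the list above, the following sketch tags screenshot bytes with an HMAC at capture time and verifies the tag before the bytes reach the model. It rests on assumptions the paper does not state: that a trusted capture component exists, that it shares a secret key with the agent, and that the malicious app cannot read that key. The function names are hypothetical.

```python
import hashlib
import hmac
import os


def tag_screenshot(key: bytes, image_bytes: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the raw screenshot bytes."""
    return hmac.new(key, image_bytes, hashlib.sha256).digest()


def verify_screenshot(key: bytes, image_bytes: bytes, tag: bytes) -> bool:
    """Return True only if the bytes still match the tag from capture time."""
    expected = hmac.new(key, image_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison


# Hypothetical usage: the key is held by the trusted capture component.
key = os.urandom(32)
captured = b"stand-in bytes for a captured screenshot"
tag = tag_screenshot(key, captured)

tampered = captured + b" plus injected content"

print(verify_screenshot(key, captured, tag))
print(verify_screenshot(key, tampered, tag))
```

A check of this kind would only address modification of the screenshot after capture. It does nothing against content that is already on screen when the capture happens, so it speaks to the channel surface rather than the perception surface.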

Load-bearing premise

The third-party agent frameworks rely on raw screenshot input and open execution channels that have no built-in checks against visual manipulation or interception.
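
To show what a built-in check on the execution channel might look like, here is a minimal sketch that validates a structured action before it is handed to any executor. It is an assumption-laden illustration, not a description of how the evaluated frameworks work: the `Action` type, the allowed action kinds, the text pattern, and the screen dimensions are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical policy: only these action kinds may reach the executor.
ALLOWED_KINDS = {"tap", "type"}

# Hypothetical text policy: word characters, spaces, and a little punctuation.
# Shell metacharacters such as ; | & $ ` are outside this character class.
SAFE_TEXT = re.compile(r"[\w .,!?@-]{0,200}")


@dataclass(frozen=True)
class Action:
    kind: str
    args: tuple


def is_valid(action: Action, width: int, height: int) -> bool:
    """Accept an action only if its kind and arguments match the policy."""
    if action.kind not in ALLOWED_KINDS:
        return False
    if action.kind == "tap":
        if len(action.args) != 2 or not all(type(v) is int for v in action.args):
            return False
        x, y = action.args
        return 0 <= x < width and 0 <= y < height
    # Remaining kind is "type": exactly one string argument within the policy.
    if len(action.args) != 1 or not isinstance(action.args[0], str):
        return False
    return SAFE_TEXT.fullmatch(action.args[0]) is not None


print(is_valid(Action("tap", (10, 20)), 1080, 2400))
print(is_valid(Action("type", ("hi; reboot",)), 1080, 2400))
```

The design choice here is to treat model output as untrusted data: an action is parsed into a fixed shape and checked against an allowlist, and anything that does not fit is rejected instead of being interpolated into a command string.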

What would settle it

A direct test on any of the five evaluated frameworks in which a malicious app successfully issues commands through one of the seven described attacks while the device screen shown to the user remains visually identical to the unattacked state.

Figures

Figures reproduced from arXiv: 2607.00333 by Jianliang Wu, Wenrui Diao, Zhentao Xie, Zidong Zhang.

Figure 1: Workflow of third-party VLM-based mobile agents. Steps II and III ...
Figure 3: Corner injection exploits the mismatch between physical display ...
Figure 2: Subliminal text injection attack. (I) A 3% opacity overlay injects a ...
Figure 4: UI spoofing attack workflow. The malicious app monitors for target packages via Accessibility Service. When WeChat launches, a phishing overlay ...
Figure 5: Screenshot tampering methods with injected screenshot and icon ...
Figure 6: Host-Side Command Execution against AppAgent.
Original abstract

Third-party mobile agents powered by Vision-Language Models (VLMs) have emerged as a promising paradigm for automating smartphone interactions. These agents act as high-privilege decision-makers, perceiving device states through screenshots and executing actions via VLM reasoning, transforming how an agent app interacts with the environment (i.e., other apps or the OS). Correspondingly, this transformation introduces new attack surfaces or transforms benign/harmless interfaces into exploitable ones for mobile devices. In this paper, we summarize key differences between third-party mobile agent apps and general apps when interacting with the environment, analyze the security posture of agents, and identify two unique attack surfaces compared to general mobile apps: the Screen Perception Attack Surface, which exploits the gap between human and machine vision, and the Misused Channel Attack Surface, which intercepts or manipulates the agent's execution pipeline. We design and implement seven concrete attacks, from subliminal text injection and invisible pixel zone exploitation to screenshot tampering and host PC command injection. Our evaluation of five popular mobile agent frameworks demonstrates that a malicious app can hijack agent actions and achieve arbitrary command execution even without any privilege permissions, while remaining visually indistinguishable to users. These findings reveal a fundamental trust mismatch in autonomous agent design and highlight the urgent need for perception-aware security models on multi-tenant platforms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that third-party VLM-powered mobile agents introduce two new attack surfaces (Screen Perception exploiting human-machine vision gaps, and Misused Channel for intercepting execution pipelines) compared to standard mobile apps. It designs and implements seven attacks (subliminal text injection, invisible pixel zone exploitation, screenshot tampering, host PC command injection, and others), then evaluates them on five popular frameworks. The central result is that a malicious app can hijack agent actions and achieve arbitrary command execution without any privilege permissions while remaining visually indistinguishable to users, revealing a trust mismatch in agent design.

Significance. If the zero-permission attacks hold under the stated conditions, the work is significant for identifying previously unexamined surfaces in an emerging class of high-privilege mobile automation tools. The evaluation across five frameworks provides breadth, and the concrete attack implementations offer falsifiable demonstrations that could inform perception-aware security models on multi-tenant platforms.

major comments (2)
  1. [Abstract] Abstract: the load-bearing claim that attacks succeed 'even without any privilege permissions' must be reconciled with Android's permission model (e.g., restrictions on MediaProjection, READ_EXTERNAL_STORAGE, or accessibility services for screenshots and overlays). The manuscript needs to specify, for each of the seven attacks, the exact permission state under which the implementation was tested and how zero-permission access was achieved without implicit reliance on co-location or background services.
  2. [Evaluation] Evaluation (on five frameworks): the reported success in hijacking and arbitrary command execution lacks quantitative metrics, permission-level controls, or implementation details sufficient to verify the no-permission assertion. Without these, the cross-framework claim cannot be assessed as sound.
minor comments (2)
  1. Define the seven attacks with consistent naming and one-sentence summaries when first introduced in §3 or §4.
  2. Add a table summarizing attack surfaces, required permissions (or lack thereof), and success conditions for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concerns about permission reconciliation and evaluation details below, and will incorporate clarifications and expansions in the revised version.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the load-bearing claim that attacks succeed 'even without any privilege permissions' must be reconciled with Android's permission model (e.g., restrictions on MediaProjection, READ_EXTERNAL_STORAGE, or accessibility services for screenshots and overlays). The manuscript needs to specify, for each of the seven attacks, the exact permission state under which the implementation was tested and how zero-permission access was achieved without implicit reliance on co-location or background services.

    Authors: We agree that explicit specification strengthens the claim. All seven attacks were implemented and evaluated using only standard Android app permissions (no MediaProjection, no READ_EXTERNAL_STORAGE for agent screenshots, no accessibility services). The malicious app does not capture or process screenshots itself; it instead injects or modifies UI elements that the agent's own VLM pipeline perceives. We will add a table in Section 4 and the evaluation section listing the exact permission manifest for each attack, confirming zero additional privileges. The co-location assumption is explicit in the threat model (malicious app installed on the same device) but does not confer implicit privileges. revision: yes

  2. Referee: [Evaluation] Evaluation (on five frameworks): the reported success in hijacking and arbitrary command execution lacks quantitative metrics, permission-level controls, or implementation details sufficient to verify the no-permission assertion. Without these, the cross-framework claim cannot be assessed as sound.

    Authors: We accept this critique and will expand the evaluation. The revised manuscript will report per-attack, per-framework success rates (e.g., number of trials and success percentages), include explicit permission-level controls in the experimental setup description, and add an appendix with implementation pseudocode and environment configurations. These additions will allow direct verification of the zero-permission conditions across the five frameworks. revision: yes
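
The per-attack, per-framework success rates promised in the second response could be reported along the lines of the sketch below, which pairs each rate with a Wilson score interval so that small trial counts are not over-read. The attack and framework labels and all counts are placeholders invented for illustration; they are not results from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# PLACEHOLDER data: (attack, framework) -> (successes, trials).
# These numbers are made up to exercise the code, not taken from the paper.
placeholder_counts = {
    ("attack_A", "framework_X"): (5, 10),
    ("attack_A", "framework_Y"): (0, 10),
    ("attack_B", "framework_X"): (10, 10),
}

for (attack, framework), (s, n) in placeholder_counts.items():
    lo, hi = wilson_interval(s, n)
    print(f"{attack:<10} {framework:<12} {s}/{n}  rate={s / n:.2f}  95% CI=[{lo:.2f}, {hi:.2f}]")
```

Reporting the trial count alongside each rate, as the referee asks, also makes it possible to recompute the interval independently.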

Circularity Check

0 steps flagged

No circularity: empirical attack implementations with no derivations or self-referential steps

The paper describes concrete attack designs, implementations, and evaluations on five frameworks without any mathematical derivations, equations, fitted parameters, or load-bearing self-citations. Claims rest on described attack implementations and observed outcomes rather than self-referential definitions or predictions that reduce to inputs by construction. No patterns from the enumerated circularity kinds apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security analysis paper; it introduces no free parameters, mathematical axioms, or invented entities.


