(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents
Pith reviewed 2026-07-02 11:49 UTC · model grok-4.3
The pith
Third-party mobile agents using vision models introduce attack surfaces that let a malicious app hijack actions and run arbitrary commands without permissions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that replacing direct app-to-app or app-to-OS interfaces with screenshot-based perception and VLM reasoning in third-party agents creates two novel attack surfaces: the Screen Perception Attack Surface and the Misused Channel Attack Surface. Together they allow a malicious app to hijack agent behavior and achieve arbitrary command execution even without privileges, while the attacks remain visually indistinguishable to users.
What carries the argument
Two attack surfaces carry the argument: the Screen Perception Attack Surface, which exploits the gap between human and machine vision on device screenshots, and the Misused Channel Attack Surface, which intercepts or manipulates the agent's execution pipeline.
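The gap between human and machine vision can be made concrete with a luminance-contrast calculation. The sketch below uses the WCAG-style relative luminance and contrast ratio formulas for sRGB colors to flag text that a person would struggle to see but that still differs from its background at the pixel level. The function names, the example colors, and the 1.1 threshold are illustrative assumptions, not values or methods taken from the paper.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple) -> float:
    """WCAG-style relative luminance of an (R, G, B) color."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple, bg: tuple) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def is_suspiciously_faint(fg: tuple, bg: tuple, threshold: float = 1.1) -> bool:
    """True when text differs from its background but only barely.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    return fg != bg and contrast_ratio(fg, bg) < threshold


# Hypothetical example: near-white text on a white background.
faint = is_suspiciously_faint((254, 254, 254), (255, 255, 255))
normal = is_suspiciously_faint((0, 0, 0), (255, 255, 255))
```

A check of this kind only covers low-contrast text; it says nothing about the other attacks the paper describes.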
If this is right
- A malicious app without permissions can hijack the agent's decision process and force arbitrary command execution.
- Attacks can be performed through subliminal text, invisible pixel zones, screenshot tampering, or host PC command injection.
- The resulting actions stay visually indistinguishable to the human user.
- The design creates a fundamental trust mismatch between the agent and the multi-tenant platform it runs on.
- Perception-aware security models become necessary for platforms hosting such agents.
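Command injection on a host machine generally arises when untrusted text is spliced into a string that a shell later parses. The sketch below contrasts that anti-pattern with passing an argument vector. It assumes a hypothetical agent helper that forwards on-screen text to a device through `adb shell input text`; it is not the code of any framework evaluated in the paper, and the payload is invented for illustration. Nothing is executed; the functions only build the command.

```python
import shlex


def unsafe_host_command(screen_text: str) -> str:
    """Anti-pattern: untrusted text is spliced into a string meant for a shell."""
    return f"adb shell input text {screen_text}"


def safer_host_argv(screen_text: str) -> list:
    """Build an argument vector so the host shell never parses the text.

    The text is additionally quoted with shlex.quote because `adb shell`
    hands its arguments to a shell on the device, which is assumed here to
    follow POSIX quoting rules.
    """
    return ["adb", "shell", "input", "text", shlex.quote(screen_text)]


# Hypothetical payload that a malicious app might place on screen.
payload = "hello; touch /tmp/pwned"

unsafe = unsafe_host_command(payload)  # the ';' would start a second command
argv = safer_host_argv(payload)        # the payload stays one quoted argument
```

With the argument vector, a caller would pass `argv` to `subprocess.run` without `shell=True`, so the host never interprets the semicolon.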
Where Pith is reading between the lines
- The same perception and channel weaknesses could appear in any VLM-driven automation tool that reads screens rather than using structured APIs.
- Adding cryptographic integrity checks on screenshots or requiring explicit user confirmation for high-impact actions would be a direct countermeasure worth testing.
- The attacks may scale to other device types if similar third-party agents adopt screenshot-based control.
- Platform vendors could mitigate the issue by restricting screenshot access or execution channels for agent processes.
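The integrity-check idea mentioned in the list above can be sketched with a keyed hash. The example assumes a trusted capture component that holds a secret key and tags each screenshot before it crosses any channel another app could reach; the function names and the key handling are hypothetical, and this is not a countermeasure proposed or evaluated in the paper.

```python
import hashlib
import hmac
import os


def sign_screenshot(image_bytes: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the raw screenshot bytes."""
    return hmac.new(key, image_bytes, hashlib.sha256).hexdigest()


def verify_screenshot(image_bytes: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time before the image reaches the model."""
    expected = sign_screenshot(image_bytes, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical flow: the capture side signs, the agent side verifies.
key = os.urandom(32)                    # assumed to be unreadable by other apps
original = b"\x89PNG...screenshot bytes"  # placeholder bytes, not a real image
tag = sign_screenshot(original, key)

tampered = original + b"extra"
ok_original = verify_screenshot(original, key, tag)
ok_tampered = verify_screenshot(tampered, key, tag)
```

A tag like this only detects modification after capture. It does not help when the manipulated content is already on screen at capture time, which is the case for the screen perception attacks.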
Load-bearing premise
The third-party agent frameworks rely on raw screenshot input and open execution channels that have no built-in checks against visual manipulation or interception.
What would settle it
A direct test on any of the five evaluated frameworks in which a malicious app successfully issues commands through one of the seven described attacks while the device screen shown to the user remains visually identical to the unattacked state.
Original abstract
Third-party mobile agents powered by Vision-Language Models (VLMs) have emerged as a promising paradigm for automating smartphone interactions. These agents act as high-privilege decision-makers, perceiving device states through screenshots and executing actions via VLM reasoning, transforming how an agent app interacts with the environment (i.e., other apps or the OS). Correspondingly, this transformation introduces new attack surfaces or transforms benign/harmless interfaces into exploitable ones for mobile devices. In this paper, we summarize key differences between third-party mobile agent apps and general apps when interacting with the environment, analyze the security posture of agents, and identify two unique attack surfaces compared to general mobile apps: the Screen Perception Attack Surface, which exploits the gap between human and machine vision, and the Misused Channel Attack Surface, which intercepts or manipulates the agent's execution pipeline. We design and implement seven concrete attacks, from subliminal text injection and invisible pixel zone exploitation to screenshot tampering and host PC command injection. Our evaluation of five popular mobile agent frameworks demonstrates that a malicious app can hijack agent actions and achieve arbitrary command execution even without any privilege permissions, while remaining visually indistinguishable to users. These findings reveal a fundamental trust mismatch in autonomous agent design and highlight the urgent need for perception-aware security models on multi-tenant platforms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that third-party VLM-powered mobile agents introduce two new attack surfaces (Screen Perception exploiting human-machine vision gaps, and Misused Channel for intercepting execution pipelines) compared to standard mobile apps. It designs and implements seven attacks (subliminal text injection, invisible pixel zone exploitation, screenshot tampering, host PC command injection, and others), then evaluates them on five popular frameworks. The central result is that a malicious app can hijack agent actions and achieve arbitrary command execution without any privilege permissions while remaining visually indistinguishable to users, revealing a trust mismatch in agent design.
Significance. If the zero-permission attacks hold under the stated conditions, the work is significant for identifying previously unexamined surfaces in an emerging class of high-privilege mobile automation tools. The evaluation across five frameworks provides breadth, and the concrete attack implementations offer falsifiable demonstrations that could inform perception-aware security models on multi-tenant platforms.
major comments (2)
- [Abstract] Abstract: the load-bearing claim that attacks succeed 'even without any privilege permissions' must be reconciled with Android's permission model (e.g., restrictions on MediaProjection, READ_EXTERNAL_STORAGE, or accessibility services for screenshots and overlays). The manuscript needs to specify, for each of the seven attacks, the exact permission state under which the implementation was tested and how zero-permission access was achieved without implicit reliance on co-location or background services.
- [Evaluation] Evaluation (on five frameworks): the reported success in hijacking and arbitrary command execution lacks quantitative metrics, permission-level controls, or implementation details sufficient to verify the no-permission assertion. Without these, the cross-framework claim cannot be assessed as sound.
minor comments (2)
- Define the seven attacks with consistent naming and one-sentence summaries when first introduced in §3 or §4.
- Add a table summarizing attack surfaces, required permissions (or lack thereof), and success conditions for quick reference.
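One way to populate the permissions column of such a table is to read the declared permissions straight from each test app's manifest. The sketch below parses a decoded, plain-text `AndroidManifest.xml` (for example, one produced by a decompilation tool; the binary form inside an APK would need decoding first) and lists its `uses-permission` entries. The helper name and the sample manifests are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"


def declared_permissions(manifest_xml: str) -> list:
    """Return the sorted permission names declared in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    tags = ("uses-permission", "uses-permission-sdk-23")
    return sorted(
        el.get(name_attr)
        for tag in tags
        for el in root.iter(tag)
        if el.get(name_attr)
    )


# Hypothetical manifests used only to exercise the helper.
NO_PERMISSIONS = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <application android:label="demo"/>
</manifest>"""

ONE_PERMISSION = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.INTERNET"/>
  <application android:label="demo"/>
</manifest>"""

none_declared = declared_permissions(NO_PERMISSIONS)
one_declared = declared_permissions(ONE_PERMISSION)
```

An empty list shows only that the manifest requests nothing; it does not by itself rule out access obtained through other components on the device.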
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the concerns about permission reconciliation and evaluation details below, and will incorporate clarifications and expansions in the revised version.
- Referee: [Abstract] Abstract: the load-bearing claim that attacks succeed 'even without any privilege permissions' must be reconciled with Android's permission model (e.g., restrictions on MediaProjection, READ_EXTERNAL_STORAGE, or accessibility services for screenshots and overlays). The manuscript needs to specify, for each of the seven attacks, the exact permission state under which the implementation was tested and how zero-permission access was achieved without implicit reliance on co-location or background services.
  Authors: We agree that explicit specification strengthens the claim. All seven attacks were implemented and evaluated using only standard Android app permissions (no MediaProjection, no READ_EXTERNAL_STORAGE for agent screenshots, no accessibility services). The malicious app does not capture or process screenshots itself; it instead injects or modifies UI elements that the agent's own VLM pipeline perceives. We will add a table in Section 4 and the evaluation section listing the exact permission manifest for each attack, confirming zero additional privileges. The co-location assumption is explicit in the threat model (malicious app installed on the same device) but does not confer implicit privileges.
  Revision: yes
- Referee: [Evaluation] Evaluation (on five frameworks): the reported success in hijacking and arbitrary command execution lacks quantitative metrics, permission-level controls, or implementation details sufficient to verify the no-permission assertion. Without these, the cross-framework claim cannot be assessed as sound.
  Authors: We accept this critique and will expand the evaluation. The revised manuscript will report per-attack, per-framework success rates (e.g., number of trials and success percentages), include explicit permission-level controls in the experimental setup description, and add an appendix with implementation pseudocode and environment configurations. These additions will allow direct verification of the zero-permission conditions across the five frameworks.
  Revision: yes
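Per-attack, per-framework success rates of the kind promised here are easier to compare when each rate carries an interval that reflects the number of trials. The sketch below computes a Wilson score interval for each cell. The attack and framework labels and all counts are made-up placeholders, not results from the paper, and the choice of the Wilson interval is an assumption about how such rates might be reported.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def summarize(results: dict) -> dict:
    """Map (attack, framework) -> (rate, lower bound, upper bound)."""
    summary = {}
    for cell, (successes, trials) in results.items():
        low, high = wilson_interval(successes, trials)
        summary[cell] = (successes / trials, low, high)
    return summary


# Placeholder counts for illustration only; these are not measured values.
hypothetical_results = {
    ("attack_A", "framework_X"): (10, 10),
    ("attack_A", "framework_Y"): (7, 10),
}
table = summarize(hypothetical_results)
```

With only ten trials per cell, even a perfect success count leaves a wide interval, which is why the trial counts matter as much as the percentages.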
Circularity Check
No circularity: empirical attack implementations with no derivations or self-referential steps
The paper describes concrete attack designs, implementations, and evaluations on five frameworks without any mathematical derivations, equations, fitted parameters, or load-bearing self-citations. Claims rest on described attack implementations and observed outcomes rather than self-referential definitions or predictions that reduce to inputs by construction. No patterns from the enumerated circularity kinds apply.