pith · machine review for the scientific record

arxiv: 2604.18860 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.AI

Recognition: unknown

Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:47 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords GUI agents · TOCTOU attacks · UI security · desktop automation · visual verification · action interception · computer-use agents · observation-action gap

The pith

The delay between screenshot and click in GUI agents creates a TOCTOU window that attackers can exploit; immediate pre-action re-verification completely blocks two of the three attack types at negligible cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents decide actions from a screenshot but execute them seconds later, opening a window during which an attacker can alter the desktop UI without the agent noticing. The paper formalizes this inconsistency as a Visual Atomicity Violation and shows three concrete ways an unprivileged attacker can redirect or corrupt the intended action. A lightweight defense called Pre-execution UI State Verification rechecks the target region, the full screen, and the window state right before dispatching the action. This layered check intercepts every overlay and focus attack in extensive trials while adding less than a tenth of a second and producing no false blocks on normal use. The same checks fail against attacks that change the UI without leaving any visible trace, exposing a remaining structural gap.
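The gap the summary describes can be made concrete with a small timing harness; a minimal sketch, assuming stand-in `observe`/`decide`/`dispatch` callables (the names are illustrative, not the paper's):

```python
import time

def run_step(observe, decide, dispatch):
    """One screenshot-decide-click step of a GUI agent.

    Returns the observation-to-action gap: everything the UI does between
    t_check and t_use is invisible to the decision. That interval is the
    TOCTOU window.
    """
    t_check = time.monotonic()   # time of check: screenshot captured
    frame = observe()
    action = decide(frame)       # model inference dominates the gap
    t_use = time.monotonic()     # time of use: action dispatched
    dispatch(action)
    return t_use - t_check
```

On real OSWorld workloads the paper reports this gap averaging 6.51 s; an attacker only needs the dispatch to land on a UI it controls.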

Core claim

The observation-to-action gap in screenshot-driven GUI agents constitutes a Visual Atomicity Violation that permits three attack primitives—notification overlay hijack, window focus manipulation, and zero-visual-footprint DOM injection—to redirect or corrupt actions. Pre-execution UI State Verification, which re-validates the click target with masked SSIM, the global screenshot with pixel diff, and the X Window snapshot immediately before action dispatch, achieves 100% interception of the first two primitives across 180 trials with zero false positives and under 0.1 s overhead, while exposing that no visual check can detect the third primitive.

What carries the argument

Pre-execution UI State Verification (PUSV), a three-layer check that re-verifies UI state at the moment of action dispatch using local pixel similarity at the target, global screenshot differences, and window-level snapshot differences.
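A minimal sketch of how the three layers could compose, with grayscale screenshots as nested pixel lists and window state as (id, geometry, focused) tuples. The thresholds and the mean pixel-agreement stand-in for masked SSIM are illustrative assumptions, not the paper's exact implementation:

```python
def region_similarity(a, b):
    """Mean per-pixel agreement in [0, 1] over a grayscale region;
    a crude stand-in for the paper's masked SSIM (layer L1)."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    diff = sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / (255.0 * len(flat_a))
    return 1.0 - diff

def crop(img, box):
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in img[y0:y1]]

def pusv_check(obs_screen, now_screen, target_box, obs_windows, now_windows,
               l1_thresh=0.95, l2a_thresh=0.02):
    """Three-layer check run immediately before action dispatch."""
    # L1: local similarity at the click target
    if region_similarity(crop(obs_screen, target_box),
                         crop(now_screen, target_box)) < l1_thresh:
        return False, "L1: click target changed"
    # L2a: global diff -- fraction of pixels that changed anywhere on screen
    total = sum(len(row) for row in obs_screen)
    changed = sum(1 for ra, rb in zip(obs_screen, now_screen)
                  for pa, pb in zip(ra, rb) if pa != pb)
    if changed / total > l2a_thresh:
        return False, "L2a: global screenshot changed"
    # L2b: window snapshot diff -- (id, geometry, focused) tuples from the WM
    if obs_windows != now_windows:
        return False, "L2b: window state changed"
    return True, "ok"
```

Each layer returns a distinct reason, mirroring the paper's finding that different primitives require different detection signals: an overlay trips L1/L2a, a focus swap trips L2b, and a zero-pixel DOM change trips none of them.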

If this is right

  • GUI agents equipped with PUSV can execute actions without risk from overlay or focus manipulations.
  • No single detection signal covers all attack primitives, so the layered combination of local, global, and window checks is required for the reported coverage.
  • The defense incurs less than 0.1 seconds of added latency while producing zero false blocks on legitimate UI changes.
  • Attacks that modify underlying structure without altering visible pixels remain possible even when PUSV is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing PUSV with operating-system hooks that expose DOM or accessibility tree state could close the remaining blind spot.
  • Reducing the raw observation-to-action latency in agent architectures would shrink the attack window even before any verification layer is applied.
  • Similar re-verification patterns could be adapted to mobile or web-based agents that also rely on visual observation loops.
  • Agent platforms may need defense-in-depth that combines visual checks with structural state inspection rather than relying on images alone.
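The last two bullets amount to pairing the visual verdict with a structural snapshot. A sketch under the assumption that a serialized DOM or accessibility-tree dump is obtainable at both observation and dispatch time; `structural_fingerprint` and the dump format are hypothetical, not from the paper:

```python
import hashlib

def structural_fingerprint(tree_dump: str) -> str:
    """Hash of a serialized DOM/accessibility snapshot (format hypothetical)."""
    return hashlib.sha256(tree_dump.encode("utf-8")).hexdigest()

def defense_in_depth(visual_ok: bool, obs_tree: str, now_tree: str) -> bool:
    """Visual layers (PUSV) catch overlay and focus attacks; the structural
    comparison targets zero-visual-footprint changes such as DOM injection."""
    if not visual_ok:
        return False  # PUSV already blocked the action
    return structural_fingerprint(obs_tree) == structural_fingerprint(now_tree)
```

A node injected with `opacity:0` leaves pixels untouched but changes the serialized tree, so the hash comparison flags exactly the case the paper's visual layers cannot.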

Load-bearing premise

Re-verifying the UI state right before the agent dispatches an action is always feasible and the chosen visual signals can reliably separate attacker changes from ordinary UI updates without extra context.

What would settle it

Running the full set of 180 adversarial trials with PUSV enabled and confirming whether any Primitive C DOM injection still succeeds in redirecting the action would show whether the structural blind spot is real.

Figures

Figures reproduced from arXiv: 2604.18860 by Wenpeng Xu.

Figure 1. Timeline of the TOCTOU vulnerability window.

Figure 2. Primitive A (Fullscreen Overlay): agent observes …

Figure 3. Primitive B (Window Focus Swap): zero visual evidence at T_obs; the attacker window is pre-staged but unmapped (withdrawn) until 1 s after the screenshot. Why the GNOME dock is not a viable Primitive B target: the GNOME Shell dock is managed by the Mutter compositor and is always rendered above regular X11 windows, so an attacker Tkinter window cannot be raised to intercept clicks on dock icons (Trigger-ASR=…).

Figure 4. Primitive C (DOM Injection): both frames are …

Figure 5. PUSV architecture: three layers applied sequentially …
Original abstract

GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability: the observation-to-action gap (mean 6.51 s on real OSWorld workloads) creates a Time-Of-Check, Time-Of-Use (TOCTOU) window during which an unprivileged attacker can manipulate the UI state. We formalize this as a Visual Atomicity Violation and characterize three concrete attack primitives: (A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection. Primitive B, the closest desktop analog to Android Action Rebinding, achieves 100% action-redirection success rate with zero visual evidence at the observation time. We propose Pre-execution UI State Verification (PUSV), a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch: masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and X Window snapshot diff (L2b). PUSV achieves 100% Action Interception Rate across 180 adversarial trials (135 Primitive A + 45 Primitive B) with zero false positives and < 0.1 s overhead. Against Primitive C (zero-visual-footprint DOM injection), PUSV reveals a structural blind spot (~0% AIR), motivating future OS+DOM defense-in-depth architectures. No single PUSV layer alone achieves full coverage; different primitives require different detection signals, validating the layered design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes the observation-to-action gap (mean 6.51 s) in desktop GUI agents as a Visual Atomicity Violation enabling TOCTOU attacks. It defines three attack primitives—(A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection—and proposes Pre-execution UI State Verification (PUSV), a three-layer defense using masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and X Window snapshot diff (L2b). Empirical results claim 100% Action Interception Rate on 180 adversarial trials (135 A + 45 B) with zero false positives and <0.1 s overhead, while noting a structural blind spot (~0% AIR) against Primitive C that motivates OS+DOM defense-in-depth.

Significance. If the empirical claims hold under broader conditions, the work is significant for identifying a new vulnerability class in screenshot-and-click GUI agents and for demonstrating that no single detection signal covers all primitives, validating the layered PUSV design. The low-overhead defense and explicit characterization of Primitive B as a desktop analog to Android Action Rebinding provide concrete, actionable insights for securing computer-use agents.

major comments (2)
  1. [Abstract] The central claim of 100% AIR and zero false positives across 180 trials for Primitives A and B is load-bearing for the defense's practicality, yet the manuscript supplies no quantitative details on benign workload construction, animation coverage, or dynamic-content cases used to establish that masked SSIM + diffs reliably separate attacker TOCTOU changes from legitimate UI updates.
  2. [Defense Evaluation] The assumption that immediate pre-dispatch re-verification is always feasible and that the three signals distinguish malicious from benign updates without extra context is not supported by reported benign test volume or variance metrics; adversarial trials alone do not substantiate the zero-FP generalization.
minor comments (2)
  1. [Abstract] The overhead figure (<0.1 s) would benefit from explicit measurement methodology and hardware specification to aid reproducibility.
  2. Notation for the three PUSV layers (L1, L2a, L2b) is introduced in the abstract but could be cross-referenced more explicitly when results per layer are discussed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting the importance of substantiating the zero false-positive claims for PUSV. We address each major comment below and will revise the manuscript accordingly to strengthen the evaluation section.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 100% AIR and zero false positives across 180 trials for Primitives A and B is load-bearing for the defense's practicality, yet the manuscript supplies no quantitative details on benign workload construction, animation coverage, or dynamic-content cases used to establish that masked SSIM + diffs reliably separate attacker TOCTOU changes from legitimate UI updates.

    Authors: We agree that the manuscript would be strengthened by providing explicit quantitative details on the benign workloads used to validate zero false positives. In the revised version, we will add a dedicated paragraph in the Evaluation section describing the benign test construction: the specific OSWorld workloads employed, the number of trials covering animations (window transitions, notifications) and dynamic content (web elements, scrolling), and the observed mean/variance of the masked SSIM and diff signals under legitimate updates. This will clarify the threshold selection process and how the signals separate TOCTOU changes from benign ones. revision: yes

  2. Referee: [Defense Evaluation] The assumption that immediate pre-dispatch re-verification is always feasible and that the three signals distinguish malicious from benign updates without extra context is not supported by reported benign test volume or variance metrics; adversarial trials alone do not substantiate the zero-FP generalization.

    Authors: We acknowledge that the current text assumes feasibility of immediate pre-dispatch verification without fully elaborating the implementation or providing benign volume/variance data. In revision, we will expand the PUSV description to detail its integration into the agent loop (executing in <0.1 s immediately before dispatch) and add the benign test volume along with variance metrics for L1/L2a/L2b signals. We will also note that the signals are intentionally context-light for low overhead, while acknowledging that additional context could further reduce edge cases; the layered design is presented as necessary precisely because no single signal covers all primitives. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack/defense measurements are direct experimental outcomes

Full rationale

The manuscript formalizes TOCTOU vulnerabilities via three attack primitives and evaluates the PUSV defense through explicit trial counts (180 adversarial cases for A+B, zero-FP claims on described signals). No equations, fitted parameters, or first-principles derivations appear; interception rates and overhead figures are reported measurements from the stated workloads rather than outputs computed from the inputs by construction. No self-citation chains, ansatzes, or renamings reduce the central claims to tautologies. The work is self-contained as an empirical security study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security evaluation paper. No mathematical derivations, fitted constants, or new postulated entities are introduced; the work rests on standard assumptions about unprivileged attacker capabilities and OS-provided snapshot APIs.

pith-pipeline@v0.9.0 · 5569 in / 1153 out tokens · 34044 ms · 2026-05-10T03:47:07.222223+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1] T. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  2. [2] Y. Qian, K. Qian, X. He, L. Chen, J. Zhang, T. Zhang, H. Wei, L. Wang, H. Wu, and B. Mao. Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents? arXiv:2601.12349, 2026.
  3. [3] L. Jiang, Z. Liu, H. Luo, and Z. Lin. Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents. arXiv:2603.00476, 2026.
  4. [4] S. Zhou, F. F. Hou, Y. Cheng, et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. In International Conference on Learning Representations (ICLR), 2024.
  5. [5] X. Deng, Y. Gu, B. Zheng, et al. Mind2Web: Towards a Generalist Agent for the Web. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  6. [6] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual Instruction Tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  7. [7] J. Bai, S. Bai, S. Yang, et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966, 2023.
  8. [8] OpenAI. GPT-4V(ision) System Card. OpenAI Technical Report, 2023.
  9. [9] Z. Wu, C. Han, Z. Ding, et al. OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. In International Conference on Learning Representations (ICLR), 2024.
  10. [10] C. Zhang, Z. Yang, et al. AppAgent: Multimodal Agents as Smartphone Users. arXiv:2312.13771, 2023.
  11. [11] OpenAI. GPT-4o System Card. Technical Report, OpenAI, 2024.
  12. [12] Alibaba Cloud. Qwen3 Technical Report. arXiv preprint, 2025.
  13. [13] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In ACM Workshop on Artificial Intelligence and Security (AISec), 2023.
  14. [14] X. Liu, H. Nan, P. Gu, J. Chen, C. Mao, and T. Fang. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In International Conference on Learning Representations (ICLR), 2024.
  15. [15] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023.
  16. [16] E. Bagdasaryan, T.-Y. Hsieh, B. Nassi, and V. Shmatikov. Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv:2307.10490, 2023.
  17. [17] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal. Visual Adversarial Examples Jailbreak Aligned Large Language Models. In AAAI Conference on Artificial Intelligence, 2024.
  18. [18] T. Cao, B. Lim, Y. Liu, Y. Sui, Y. Li, S. Deng, L. Lu, N. Oo, S. Yan, and B. Hooi. VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents. In International Conference on Learning Representations (ICLR), 2026.
  19. [19] T. Kuntz, A. Duzan, H. Zhao, F. Croce, Z. Kolter, N. Flammarion, and M. Andriushchenko. OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. In NeurIPS Datasets and Benchmarks, 2025.
  20. [20] Y. Lu, T. Ju, M. Zhao, X. Ma, Y. Guo, and Z. Zhang. EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection. arXiv:2505.14289, 2025.
  21. [21] M. Bishop and M. Dilger. Checking for Race Conditions in File Accesses. Computing Systems, 9(2), 1996.
  22. [22] Y. Fratantonio, C. Chen, A. Bianchi, C. Kruegel, and G. Vigna. Cloak and Dagger: From Two Permissions to Complete Control of the Android UI. In IEEE Symposium on Security and Privacy (S&P), 2017.
  23. [23] L.-S. Huang, A. Moshchuk, H. J. Wang, S. Schecter, and C. Jackson. Clickjacking: Attacks and Defenses. In USENIX Security Symposium, 2012.
  24. [24] X. Liu, B. He, X. Liu, A. Luo, H. Zhang, and H. Chen. Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents. arXiv:2603.14707, 2026.