
arxiv: 2605.16402 · v1 · pith:F6QQXLNSnew · submitted 2026-05-13 · 💻 cs.CV

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

Pith reviewed 2026-05-20 22:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI grounding · multimodal large language models · benchmark · multi-window environments · occlusion robustness · desktop automation · visual clutter

The pith

State-of-the-art multimodal models for GUI tasks lose accuracy when desktop windows overlap and create occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that leading multimodal large language models perform strongly on simple single-window GUI grounding but show clear accuracy declines in realistic multi-window desktop environments with partial window overlaps and visual clutter. The authors introduce WinDeskGround, a benchmark and parametric synthesis framework that generates complex scenarios by varying occlusion levels, layout density, and semantic similarity to better match real desktop conditions. They created a meta-dataset of 1,356 instruction-target pairs and evaluated five top MLLMs, confirming the performance gap. A sympathetic reader would care because practical GUI automation must operate amid stacked application windows rather than clean single-layer views, so current models may not transfer reliably to everyday use.

Core claim

The paper claims that while state-of-the-art MLLMs excel at GUI grounding in idealized single-layer interfaces, their accuracy declines under partial occlusion in multi-window desktop environments, as demonstrated through evaluations on the WinDeskGround benchmark of parametrically generated scenarios that control window occlusion, layout density, and semantic similarity.

What carries the argument

The WinDeskGround synthesis framework, which parametrically generates high-fidelity multi-window desktop scenarios by controlling occlusion, density, and semantic similarity to simulate real workflow distribution shifts.
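
To make the occlusion parameter concrete, here is a minimal sketch of how the hidden fraction of a target element could be measured from window rectangles. The rectangle convention, the function names, and the example coordinates are illustrative assumptions; none of them are taken from the WinDeskGround code.

```python
from typing import List, Optional, Tuple

# Assumed convention: (left, top, right, bottom) in pixels, right/bottom exclusive.
Rect = Tuple[int, int, int, int]


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they do not overlap."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left >= right or top >= bottom:
        return None
    return (left, top, right, bottom)


def occlusion_ratio(target: Rect, occluders: List[Rect]) -> float:
    """Fraction of the target's area hidden by the union of the occluders."""
    area = (target[2] - target[0]) * (target[3] - target[1])
    if area <= 0:
        raise ValueError("target must have positive area")

    # Keep only the parts of each occluder that fall inside the target.
    clipped = [r for r in (intersect(target, o) for o in occluders) if r is not None]
    if not clipped:
        return 0.0

    # Coordinate compression: split the target into grid cells along every
    # occluder edge, so overlapping occluders are not counted twice.
    xs = sorted({target[0], target[2], *(c for r in clipped for c in (r[0], r[2]))})
    ys = sorted({target[1], target[3], *(c for r in clipped for c in (r[1], r[3]))})

    covered = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if any(r[0] <= x0 and x1 <= r[2] and r[1] <= y0 and y1 <= r[3] for r in clipped):
                covered += (x1 - x0) * (y1 - y0)
    return covered / area


if __name__ == "__main__":
    # Hypothetical button partly covered by two overlapping windows.
    button = (0, 0, 100, 50)
    windows_on_top = [(50, 0, 200, 50), (75, 0, 150, 100)]
    print(occlusion_ratio(button, windows_on_top))
```

Measuring the union rather than summing per-window overlaps matters once several windows stack over the same element, which is the situation the benchmark targets.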

If this is right

  • GUI automation agents require targeted improvements in handling partial occlusions to function reliably on typical user desktops.
  • Evaluation protocols for GUI agents should shift from single-layer tests to include multi-window and cluttered settings for better relevance.
  • Training data for these models should incorporate parametric generation of occluded scenes to build robustness.
  • Advances in visual reasoning for layered interfaces could directly improve grounding performance in desktop environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the generated scenarios match real workflows well, agents fine-tuned on WinDeskGround data could show improved generalization to live user desktops.
  • The robustness gap may appear in other layered visual tasks such as multi-document editing or overlapping map interfaces.
  • Extending the benchmark to dynamic cases like window resizing or dragging could reveal additional failure modes not captured in static scenes.

Load-bearing premise

The parametrically generated scenarios with controlled occlusion and density accurately simulate the visual challenges of authentic real-world multi-window desktop workflows.

What would settle it

Running the same five MLLMs on real captured screenshots from actual multi-window desktop sessions and finding no accuracy decline compared to single-window cases would challenge the central claim about occlusion effects.
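
As a rough sketch of how such a comparison could be scored, the snippet below runs a two-proportion z-test on accuracy counts from two conditions (for example, single-window versus multi-window screenshots). The choice of test is an assumption, and the counts in the usage lines are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    correct_a: int, total_a: int, correct_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two accuracies."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both conditions need at least one sample")

    p_a = correct_a / total_a
    p_b = correct_b / total_b

    # Pooled accuracy under the null hypothesis that both conditions are equal.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        # All predictions correct (or all wrong) in both conditions: no difference.
        return 0.0, 1.0

    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 90/100 correct on single-window, 60/100 on multi-window.
    z, p = two_proportion_z_test(90, 100, 60, 100)
    print(f"z = {z:.3f}, p = {p:.2e}")
```

A large p-value on real multi-window captures would be the outcome that challenges the paper's claim; a small one would support it. The test assumes independent samples, so paired evaluations on the same instructions would call for a paired test instead.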

Figures

Figures reproduced from arXiv: 2605.16402 by Haoren Zhao, Tianyi Chen, Zhen Wang.

Figure 1: The distribution shift and research gap between existing datasets and real-world desktops. While existing datasets focus on idealized scenarios (Mobile/Web), real-world desktops exhibit complexity through multi-window stacking and visual clutter, leading to low robustness in out-of-domain settings.
Figure 2: The pipeline of data construction for the metadata.
Figure 3: Detailed metadata distribution across domains.
Figure 5: Multi-level Difficulty Evaluation. Comparison of Click Accuracy across varying difficulty levels. While all models perform well in the Single Window setting, InfiGUI demonstrates superior resilience in the mid-to-high complexity range (L2–L4).
Figure 4: Robustness analysis under controlled variables. We evaluate the models' Click Accuracy against varying levels of (a) Visual Clutter, (b) Occlusion, and (c) Semantic Interference.
Figure 6: Qualitative analysis of UGround failure cases. The Green Box denotes the ground truth target, while the Red Dot indicates the model's incorrect prediction. (a) Semantic Interference: The model fails to distinguish the primary "Like" button from a background distractor. (b) Occlusion: The model skips the partially occluded target "Paste" option in favor of a fully visible but incorrect icon.
Original abstract

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WinDeskGround, a benchmark and parametric synthesis framework for assessing GUI grounding robustness of MLLMs in multi-window desktop settings. It generates 1,356 instruction-target pairs by controlling window occlusion, layout density, and semantic similarity to simulate real-world distribution shifts, evaluates five leading MLLMs, and reports that accuracy declines under partial occlusion relative to simplified single-layer conditions. The code is released publicly.

Significance. If the parametrically generated scenarios prove representative of authentic desktop workflows, the work identifies a practically relevant robustness gap in current MLLMs for GUI automation and supplies a controllable testbed for diagnosing failure modes such as occlusion. The public release of the synthesis pipeline supports reproducibility and extension by the community.

major comments (2)
  1. [Section 3] Section 3: The central claim that accuracy declines under partial occlusion in multi-window environments rests on the assumption that the parametrically generated scenarios (via tunable overlap, density, and semantic similarity) produce distribution shifts representative of real desktops. No quantitative validation is provided, such as KL divergence on visual statistics, feature distributions, or user-study fidelity scores comparing the 1,356 synthetic pairs to real multi-window traces. This is load-bearing because synthesis artifacts (e.g., geometrically clean occlusions or predictable semantics) could artifactually inflate the measured performance drop.
  2. [Evaluations] Evaluations section: The reported accuracy decline across five MLLMs lacks accompanying details on the precise metrics (e.g., grounding accuracy definition, IoU thresholds), the distribution of occlusion ratios tested, statistical significance tests, or any data exclusion criteria. Without these, the magnitude and reliability of the robustness gap cannot be fully assessed.
minor comments (2)
  1. [Abstract] Abstract: The statement that 'top-tier agents excel in simplified settings' would be strengthened by an explicit reference to the specific baseline or prior single-layer benchmark used for comparison.
  2. Figure captions and tables: Ensure all figures showing example desktop scenarios include clear annotations for occlusion levels and window boundaries to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and outlining planned revisions where appropriate.

  1. Referee: [Section 3] Section 3: The central claim that accuracy declines under partial occlusion in multi-window environments rests on the assumption that the parametrically generated scenarios (via tunable overlap, density, and semantic similarity) produce distribution shifts representative of real desktops. No quantitative validation is provided, such as KL divergence on visual statistics, feature distributions, or user-study fidelity scores comparing the 1,356 synthetic pairs to real multi-window traces. This is load-bearing because synthesis artifacts (e.g., geometrically clean occlusions or predictable semantics) could artifactually inflate the measured performance drop.

    Authors: We thank the referee for this insightful comment. The strength of our parametric synthesis lies in its ability to generate controlled variations that isolate the effects of occlusion and other factors, which is essential for understanding specific failure modes in MLLMs. Although we have not performed quantitative distribution matching (e.g., KL divergence) or user studies against real desktop traces—primarily because obtaining large-scale, privacy-compliant real-world multi-window interaction data is non-trivial—we maintain that the benchmark still provides valuable insights into robustness under realistic conditions simulated parametrically. In the revised paper, we will add a more detailed justification of the parameter choices based on common desktop usage patterns and include additional visualizations to illustrate the generated scenarios. revision: partial

  2. Referee: [Evaluations] Evaluations section: The reported accuracy decline across five MLLMs lacks accompanying details on the precise metrics (e.g., grounding accuracy definition, IoU thresholds), the distribution of occlusion ratios tested, statistical significance tests, or any data exclusion criteria. Without these, the magnitude and reliability of the robustness gap cannot be fully assessed.

    Authors: We agree that more details are needed for full assessment. In the updated manuscript, we will specify that grounding accuracy is defined as the percentage of instructions where the predicted bounding box has an IoU greater than 0.5 with the ground-truth target. We will include histograms or tables showing the distribution of occlusion ratios (ranging from 0% to over 50% overlap) in the 1,356 pairs. Furthermore, we will report p-values from appropriate statistical tests to validate the significance of accuracy differences across conditions and confirm that the entire generated set was used without additional exclusions. These enhancements will improve the transparency of our evaluation protocol. revision: yes
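
The metric described in this response can be sketched as follows. This is a minimal illustration, assuming axis-aligned boxes given as (left, top, right, bottom); the function names are hypothetical. The figure captions report Click Accuracy instead, which is usually read as whether a predicted point falls inside the target box, so that variant is included too; treating it that way is an assumption, since the page does not define it.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Point = Tuple[float, float]              # (x, y)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    predictions: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Share of predictions whose IoU with the target strictly exceeds the threshold."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need equally sized, non-empty prediction and target lists")
    hits = sum(iou(p, t) > threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


def click_hit(point: Point, target: Box) -> bool:
    """Assumed Click Accuracy criterion: the predicted point lies inside the target box."""
    x, y = point
    return target[0] <= x < target[2] and target[1] <= y < target[3]


if __name__ == "__main__":
    # Hypothetical predictions: one exact match, one box shifted by half its width.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
    print(grounding_accuracy(preds, gts))
    print(click_hit((5, 5), gts[0]))
```

The two criteria can disagree on small targets: a click can land inside a button while a predicted box around it still scores an IoU below 0.5, so the revised manuscript would need to state which one backs each reported number.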

standing simulated objections not resolved
  • Quantitative comparison of the synthetic benchmark to real multi-window desktop environments using metrics like KL divergence or user-study fidelity scores, as we do not possess or have collected such real-world trace data for this study.
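
For readers unfamiliar with the comparison named in this objection, the following is a minimal sketch of a KL divergence between two histograms of some visual statistic, for example occlusion ratios binned into ranges. The binning, the smoothing constant, and the counts are illustrative assumptions; no real desktop trace data is available on this page.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Turn raw bin counts into probabilities, with additive smoothing for empty bins."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float], eps: float = 1e-9) -> float:
    """KL(P || Q) in nats between two histograms defined over the same bins."""
    if len(p_counts) != len(q_counts) or not p_counts:
        raise ValueError("histograms must be non-empty and share the same bins")
    p = normalize(p_counts, eps)
    q = normalize(q_counts, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Hypothetical bin counts for a synthetic set and a real-capture set.
    synthetic_hist = [40, 30, 20, 10]
    real_hist = [25, 25, 25, 25]
    print(kl_divergence(synthetic_hist, real_hist))
```

KL divergence is asymmetric, so a report would need to state which distribution is the reference; a symmetric alternative such as Jensen-Shannon divergence is a common substitute when neither side is privileged.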

Circularity Check

0 steps flagged

Empirical benchmark with no derivation or self-referential reduction


The paper constructs WinDeskGround via a parametric synthesis pipeline that controls occlusion, density, and similarity to produce 1,356 instruction-target pairs, then reports direct MLLM evaluation results showing accuracy decline under occlusion. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text or abstract. The central claim rests on external model evaluations rather than any quantity defined in terms of itself or reduced by construction to the generation inputs. This is a standard empirical benchmark contribution whose methodology is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that parametric control of occlusion, density, and semantic similarity produces realistic distribution shifts; no free parameters are fitted to target results and no new entities are postulated.

free parameters (2)
  • occlusion control parameters
    Parameters that set the degree of window overlap and visibility in generated scenes.
  • layout density parameters
    Parameters controlling the number and arrangement of windows on the desktop.
axioms (1)
  • domain assumption: Parametrically generated desktop scenes with controlled occlusion and density simulate authentic real-world distribution shifts.
    Invoked in the description of the synthesis framework as the basis for the benchmark's relevance.

pith-pipeline@v0.9.0 · 5716 in / 1281 out tokens · 45905 ms · 2026-05-20T22:14:00.947854+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
