Pith · machine review for the scientific record

arXiv:2604.14262 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI

Recognition: unknown

GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:35 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: GUI grounding · domain randomization · model robustness · spatial reasoning · instruction perturbation · benchmark evaluation · fine-tuning degradation

The pith

GUI grounding models suffer large accuracy drops when instructions require relational spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a perturbation framework to test how GUI grounding models respond to changes in both the visual interface and the wording of instructions. It shows that high performance on standard tests does not hold when instructions use relational terms like "next to" instead of naming elements directly, with drops of 27 to 56 percentage points across models. The same framework reveals that changes in zoom level also reduce accuracy and that fine-tuning on perturbed data does not improve results. A reader should care because real user interactions with graphical interfaces often involve varied instructions and viewing conditions that current evaluation methods ignore.

Core claim

By varying visual scenes and instructions along independent axes, the GUI-Perturbed framework isolates specific weaknesses in GUI grounding: relational instructions cause systematic accuracy collapse in all tested 7B models, a 70 percent browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance instead of improving it. This provides diagnostic information on capabilities like spatial reasoning and visual robustness that aggregate benchmarks cannot supply.

What carries the argument

GUI-Perturbed, the controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness.
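
To make the independent-axes design concrete, here is a minimal sketch of how such a perturbation grid could be enumerated. The names are hypothetical and only the axes the paper itself names are used (browser zoom; direct vs. relational phrasing); this is an illustration, not the authors' released pipeline.

```python
# Hypothetical sketch: cross one visual axis with one instruction axis so an
# accuracy drop can be attributed to a single axis at a time.
from itertools import product

AXES = {
    "zoom": [1.0, 0.7],                  # 100% baseline vs. the paper's 70% condition
    "style": ["direct", "relational"],   # "Click 'Done'" vs. "Click the button next to 'Cancel'"
}

def perturbation_grid(axes):
    """Yield one evaluation configuration per combination of axis settings."""
    for combo in product(*axes.values()):
        yield dict(zip(axes.keys(), combo))

for cfg in perturbation_grid(AXES):
    print(cfg)  # e.g. {'zoom': 0.7, 'style': 'relational'}
```

Because each screenshot is evaluated in every cell, a failure in the (0.7, relational) cell can be compared against the two single-axis cells to localize the cause.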

If this is right

  • Standard benchmarks that use a single fixed instruction per screenshot overestimate model capabilities.
  • GUI grounding models lack reliable performance on instructions requiring spatial relations between elements.
  • Visual changes such as browser zoom levels can significantly affect model accuracy in grounding tasks.
  • Simple data augmentation and low-rank fine-tuning may not address the brittleness and can even reduce performance.
  • Diagnostic signals from separate perturbation axes can guide targeted improvements in spatial reasoning and calibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world GUI applications would likely encounter similar instruction variations, making current models unreliable without changes in training or evaluation.
  • Future benchmarks should routinely include controlled perturbations along visual and linguistic axes to better predict deployment performance.
  • Alternative training approaches beyond standard LoRA on augmented data may be needed to build robustness to relational instructions.
  • The released dataset and pipeline could support development of models that handle varied user phrasings more consistently.

Load-bearing premise

The specific perturbations chosen, such as relational instructions and particular zoom levels, represent the variations that matter most in actual GUI usage.

What would settle it

Testing the three models or similar ones on the released GUI-Perturbed dataset with relational instructions and observing whether accuracy remains above 85 percent or drops as reported.
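
Mechanically, that test is a hit-rate comparison with uncertainty estimates, the same quantity Figure 1 plots. Below is a minimal NumPy sketch of a hit rate with a 95% percentile-bootstrap CI; the resampling details are assumptions, since the excerpt does not specify the authors' scheme.

```python
import numpy as np

def hit_rate_ci(hits, n_boot=10_000, seed=0):
    """Hit rate with a 95% percentile-bootstrap CI over per-example 0/1 outcomes."""
    hits = np.asarray(hits, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample examples with replacement and recompute the mean each time.
    idx = rng.integers(0, len(hits), size=(n_boot, len(hits)))
    boot_means = hits[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return hits.mean(), (lo, hi)

direct = [1, 1, 1, 0, 1, 1, 1, 1]        # toy outcomes, not the paper's data
relational = [1, 0, 0, 1, 0, 0, 1, 0]
print(hit_rate_ci(direct))
print(hit_rate_ci(relational))
```

Non-overlapping intervals of roughly the reported 27-56 point gap would corroborate the claim; a relational interval still sitting above 85 percent would contradict it.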

Figures

Figures reproduced from arXiv:2604.14262 by Harshvardhan Sikka, Jinu Nyachhyon, Pranav Guruprasad, Tony Zhou, Yangyue Wang, Yash Mathur.

Figure 1. Hit rates with 95% bootstrap confidence intervals across models and configurations.
Figure 2. Flip rate decomposition for baseline models under each perturbation type.
Figure 3. Direct vs. relational instruction accuracy across models. The 95% bootstrap CIs do not …
Figure 4. Accuracy by spatial direction (above, below, left, right) across models.
Figure 5. Effect of different augmentation types on model performance. All variants cause slight …
Figure 6. Flip rate decomposition for fine-tuned models under each augmentation type.
Figure 7. Effect of scaling training data from 6.5k to 25k samples. More data leads to worse …
Figure 8. Comparison of real (Salesforce mix) vs. synthetic (GUI-Perturbed) training data. Neither …
Figure 9. ScreenSpot-v2 accuracy by platform and element type: Baseline vs. FT-All (6.5k) vs. …
Figure 10. Click Region Error. The model selects the correct UI element conceptually but clicks the wrong physical area of it. Instruction: “Click on ‘Done’ button.” Model output: click(start_box=’(639,438)’).
Figure 11. Location Hallucination. The model correctly identifies what to click but fabricates or misplaces its on-screen coordinates. Instruction: “Click on ‘Notifications’ div.” Model output: “Thought: I noticed that there is a ‘Notifications’ option in the left sidebar… This option is located just below ‘Privacy and data’ and above ‘Security and logins.’” The model’s reasoning is correct but the predicted co…
Figure 12. Spatial Reasoning Error. The model incorrectly interprets relative spatial relationships. Instruction: “Click on the link to the left of ‘Side effects’.” Model output: “Thought: I noticed that there is a link labeled ‘Interactions’ located to the left of ‘Side effects’…” The model correctly names the target but clicks a link on the right instead of the left.
Figure 13. Goal Hallucination. The model invents user intentions not specified in the instruction. Instruction: “Click on the button above ‘May 17th 2023’.” Model output: “Thought: I noticed that there is a gray heart-shaped button… clicking on it should allow me to save this cooking activity to my favorites.” The “save to favorites” goal is absent from the instruction.
Figure 14. Instruction Misinterpretation. The model misunderstands what the instruction refers to and selects a related but incorrect element. Instruction: “Click on ‘Egypt’.” Model output: “Thought: …you need to click on the dropdown button located to the right of the word ‘Egypt.’” The model reinterprets “click on Egypt” as “click the dropdown arrow beside Egypt.”
Figure 15. Text Matching Bias. The model interacts with visible text matching the instruction without properly grounding it to the correct UI element. Instruction: “Click on ‘First Name’ textbox.” Model output: click(start_box=’(1242,509)’). The model clicks the “First Name” label text rather than the input field beneath it.
Figure 16. Visual Confusion. The model relies on superficial visual cues (shape, color, position) and misidentifies the functional element. Instruction: “Type ‘bed sheets queen’ in ‘Search: suggestions appear below’ searchbox.” Model output: “Thought: I noticed that there are two search boxes at the top of the page. The one on the left is the one I need.” The model mistakes a light-colored button with faint text for…
Figure 17. Reasoning Drift. Adding explicit reasoning causes the model’s grounded action prediction to deteriorate. Instruction: “Click on ‘Verizon’ img.” Model output: “Thought: The task is to click on the ‘Verizon’ image… The image is located at the bottom of the product details section.” The phrase “located at the bottom” in the model’s own reasoning misleads it to click an unrelated image at the bottom of th…
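
Figures 2 and 6 rest on a flip-rate decomposition. The excerpt does not define it precisely; under the plausible reading that it counts matched examples whose correctness changes under a perturbation, split by direction, a minimal sketch:

```python
def flip_rates(base, perturbed):
    """Decompose flips between matched runs. base[i] and perturbed[i] are 0/1
    correctness for the same screenshot-instruction pair; the exact definition
    is assumed, not taken verbatim from the paper."""
    n = len(base)
    good_to_bad = sum(b == 1 and p == 0 for b, p in zip(base, perturbed)) / n
    bad_to_good = sum(b == 0 and p == 1 for b, p in zip(base, perturbed)) / n
    return {"correct->wrong": good_to_bad, "wrong->correct": bad_to_good}

print(flip_rates([1, 1, 0, 1], [0, 1, 1, 0]))  # {'correct->wrong': 0.5, 'wrong->correct': 0.25}
```

Separating the two directions matters: a perturbation that only shuffles which examples are solved looks flat in aggregate accuracy but shows up in both flip rates.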
Original abstract

GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause systematic accuracy collapse across all models, a 70% browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance rather than improving it. By perturbing along independent axes, GUI-Perturbed isolates which specific capability axes are affected (spatial reasoning, visual robustness, reasoning calibration), providing diagnostic signal that aggregate benchmarks cannot. We release the dataset, augmentation pipeline, and a fine-tuned model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes (e.g., browser zoom) and instructions (direct element naming vs. relational/spatial) to diagnose robustness in GUI grounding models. It reports that three 7B models suffer 27-56pp accuracy drops on relational instructions, statistically significant degradation at 70% zoom, and that rank-8 LoRA fine-tuning on augmented data worsens rather than improves performance. The framework and released dataset/pipeline are positioned as providing diagnostic axes (spatial reasoning, visual robustness, calibration) missing from standard single-instruction benchmarks.

Significance. If the perturbations successfully isolate the claimed capability axes, the work is significant for revealing systematic brittleness in GUI grounding that aggregate benchmarks overlook, with direct implications for real-world GUI agents. The release of the dataset, augmentation pipeline, and fine-tuned model is a concrete strength that supports reproducibility and follow-on work.

major comments (3)
  1. [Perturbation Framework] The central claim that relational instructions cause accuracy collapse specifically due to spatial reasoning requirements is load-bearing. The perturbation framework description does not report matching or controlling for instruction length, token count, syntactic complexity, parse-tree depth, or lexical difficulty between direct and relational variants. If relational phrasings are longer or more complex on average, the 27-56pp drop may reflect general instruction-following brittleness rather than a targeted spatial-reasoning failure.
  2. [Experimental Setup and Results] Full details on exact perturbation generation (including how relational instructions and zoom levels are constructed) and any exclusion criteria are required to confirm the reported statistical significance and rule out post-hoc selection. This directly affects the soundness of the cross-model consistency claims.
  3. [Fine-tuning Experiments] The finding that rank-8 LoRA fine-tuning with augmented data degrades performance (rather than improving it) is a key negative result. More information on the augmentation process, data mixture, and training details is needed to interpret whether this reflects true brittleness or an artifact of the fine-tuning protocol.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction would benefit from explicitly naming the three 7B models evaluated and their base checkpoints for reproducibility.
  2. [Figures and Tables] Figure and table captions should explicitly state the perturbation axes, metrics (e.g., accuracy delta), and statistical tests used so that readers can interpret results without cross-referencing the main text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our perturbation framework, experimental details, and fine-tuning results. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Perturbation Framework] The central claim that relational instructions cause accuracy collapse specifically due to spatial reasoning requirements is load-bearing. The perturbation framework description does not report matching or controlling for instruction length, token count, syntactic complexity, parse-tree depth, or lexical difficulty between direct and relational variants. If relational phrasings are longer or more complex on average, the 27-56pp drop may reflect general instruction-following brittleness rather than a targeted spatial-reasoning failure.

    Authors: We agree that explicit controls for linguistic factors are necessary to isolate spatial reasoning. Our instruction generation used parallel template families (direct naming vs. relational/spatial descriptions) applied to the same UI elements, with an effort to keep surface forms comparable; however, we did not report aggregate statistics on token counts or parse complexity. In revision we will add a supplementary table reporting mean token length, word count, dependency parse depth, and lexical diversity for both instruction classes across the full dataset. This will allow direct assessment of whether the observed drops exceed what would be expected from complexity alone. The high performance on direct instructions and the consistency of the relational drop across three independently trained 7B models provide supporting evidence for a spatial-specific effect, but the added metrics will make the claim more robust. revision: yes

  2. Referee: [Experimental Setup and Results] Full details on exact perturbation generation (including how relational instructions and zoom levels are constructed) and any exclusion criteria are required to confirm the reported statistical significance and rule out post-hoc selection. This directly affects the soundness of the cross-model consistency claims.

    Authors: We will expand the Methods section with complete generation procedures: the exact template sets for relational instructions, the browser-level CSS and viewport scaling used to produce the 70% zoom condition, and the full list of exclusion criteria (e.g., screenshots containing overlapping clickable regions, rendering failures, or elements outside the viewport after perturbation). Statistical significance was obtained via pre-specified paired t-tests on matched screenshot-instruction pairs; no post-hoc filtering of results occurred. The released augmentation pipeline already contains the generation scripts; we will also include a detailed pseudocode description and the precise exclusion rules in the revised manuscript to eliminate any ambiguity (one plausible construction of the zoom condition is sketched after these responses). revision: yes

  3. Referee: [Fine-tuning Experiments] The finding that rank-8 LoRA fine-tuning with augmented data degrades performance (rather than improving it) is a key negative result. More information on the augmentation process, data mixture, and training details is needed to interpret whether this reflects true brittleness or an artifact of the fine-tuning protocol.

    Authors: We will add a dedicated appendix with the full fine-tuning protocol: the exact data mixture ratios (original vs. perturbed samples), the specific perturbation variants included in augmentation, LoRA configuration (rank 8, alpha, dropout), optimizer, learning rate schedule, batch size, number of epochs, and early-stopping criteria. Validation loss curves and per-epoch accuracy on held-out perturbed test sets will also be reported. The consistent degradation across multiple random seeds and the fact that direct-instruction performance remained stable while relational performance declined suggest the result reflects genuine brittleness rather than a training artifact; the additional details will allow readers to reproduce and interpret the outcome. revision: yes
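
Two of the commitments above are easy to picture in code. For the zoom condition in response 2, one plausible browser-level construction with Playwright (the API calls are real Playwright; the URL, viewport, and CSS-zoom mechanism are assumptions, not the authors' exact pipeline):

```python
from playwright.sync_api import sync_playwright  # assumes Playwright is installed

def screenshot_at_zoom(url, zoom=0.7, path="zoomed.png"):
    """Render a page at a CSS zoom factor and capture it: one assumed way to
    realize the paper's 70% zoom condition."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        # CSS zoom rescales the layout while the viewport stays fixed.
        page.evaluate(f"document.body.style.zoom = '{zoom}'")
        page.screenshot(path=path)
        browser.close()

screenshot_at_zoom("https://example.com")  # placeholder URL
```

For response 3, a minimal sketch of a rank-8 LoRA configuration in Hugging Face PEFT; only the rank is stated in the excerpt, so alpha, dropout, and target modules here are illustrative assumptions:

```python
from peft import LoraConfig, get_peft_model  # assumes HF PEFT is installed

lora_cfg = LoraConfig(
    r=8,                                   # rank 8, as reported in the paper
    lora_alpha=16,                         # assumption: not specified in the excerpt
    lora_dropout=0.05,                     # assumption
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections only
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_cfg)  # base_model: one of the 7B checkpoints
```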

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct measurements

Full rationale

The paper introduces GUI-Perturbed as a controlled perturbation framework and evaluates three models via direct accuracy measurements on relational vs. direct instructions, zoom levels, and fine-tuning. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on observed percentage-point drops and statistical significance from experiments, not on any reduction to inputs by construction. This matches the reader's assessment of an empirical study without self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework rests on the domain assumption that independent axis perturbations isolate spatial reasoning, visual robustness, and calibration without introducing new confounds.

axioms (1)
  • domain assumption Perturbations along visual and instruction axes are representative of real-world GUI variations and do not introduce unrelated artifacts.
    Invoked to claim that observed drops reflect specific capability failures rather than test artifacts.

pith-pipeline@v0.9.0 · 5472 in / 1065 out tokens · 57068 ms · 2026-05-10T13:35:09.237069+00:00 · methodology


Reference graph

Works this paper leans on

29 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1] K. Cheng et al. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935, 2024. doi: 10.48550/arXiv.2401.10935
  2. [2] K. Li et al. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025. doi: 10.48550/arXiv.2504.07981
  3. [3] T. Xie et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024. doi: 10.48550/arXiv.2404.07972
  4. [4] X. Deng et al. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023. doi: 10.48550/arXiv.2306.06070
  5. [5] J. Yang et al. GUI-Robust: A comprehensive dataset for testing GUI agent robustness in real-world anomalies. arXiv preprint arXiv:2506.14477, 2025. doi: 10.48550/arXiv.2506.14477
  6. [6] H. H. Zhao, K. Yang, W. Yu, D. Gao, and M. Z. Shou. WorldGUI: An interactive benchmark for desktop GUI automation from any starting point. arXiv preprint arXiv:2502.08047, 2026. doi: 10.48550/arXiv.2502.08047
  7. [7] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017. doi: 10.48550/arXiv.1703.06907
  8. [8] S. Bai et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. doi: 10.48550/arXiv.2502.13923
  9. [9] Y. Qin et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025. doi: 10.48550/arXiv.2501.12326
  10. [10] Y. Yang et al. GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025. doi: 10.48550/arXiv.2507.05791
  11. [11] E. J. Hu et al. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. doi: 10.48550/arXiv.2106.09685
  12. [12] Boyu Gou et al. Mind2Web 2: Evaluating agentic search with agent-as-a-judge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL: https://openreview.net/forum?id=AUaW6DS9si
  13. [13] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024
  14. [14] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144. PMLR, 2017
  15. [15] T. Xie et al. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227, 2025. doi: 10.48550/arXiv.2505.13227
  16. [16] T. Xue et al. An illusion of progress? Assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025. doi: 10.48550/arXiv.2504.01382
  17. [17] X. Wang et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025. doi: 10.48550/arXiv.2508.09123
  18. [18] Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual grounding for GUI instructions. arXiv preprint arXiv:2412.16256, 2024
  19. [19] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web, 2024
  20. [20] Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5495–5510, 2020
  21. [21] Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, and Sai Rajeswar. UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction, 2025. URL: https://arxiv.org/abs/2503.15661
  22. [22] Z. Wu et al. OS-ATLAS: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024. doi: 10.48550/arXiv.2410.23218
  23. [23] K. Yu, N. Yu, H. Wang, R. Yang, and H. Zhang. How do visual attributes influence web agents? A comprehensive evaluation of user interface design factors. arXiv preprint arXiv:2601.21961, 2026. doi: 10.48550/arXiv.2601.21961
  25. [25] S.-Y. Liu et al. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024. doi: 10.48550/arXiv.2402.09353
  26. [26] Y. Peng, P. Wang, J. Liu, and S. Chen. GLAD: Generalizable tuning for vision-language models. arXiv preprint arXiv:2507.13089, 2025. doi: 10.48550/arXiv.2507.13089
  27. [27] T. Xue et al. EvoCUA: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026. doi: 10.48550/arXiv.2601.15876
  28. [28] H. Li et al. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025. doi: 10.48550/arXiv.2510.08531
  29. [29] W. Kang, B. Lei, G. Liu, C. Ding, and Y. Yan. GuirlVG: Incentivize GUI visual grounding via empirical exploration on reinforcement learning. arXiv preprint arXiv:2508.04389, 2025. doi: 10.48550/arXiv.2508.04389