GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
Pith reviewed 2026-05-10 13:35 UTC · model grok-4.3
The pith
GUI grounding models suffer large accuracy drops when instructions require relational spatial reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By varying visual scenes and instructions along independent axes, the GUI-Perturbed framework isolates specific weaknesses in GUI grounding: relational instructions cause systematic accuracy collapse in all tested 7B models, a 70 percent browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance instead of improving it. This provides diagnostic information on capabilities like spatial reasoning and visual robustness that aggregate benchmarks cannot supply.
What carries the argument
GUI-Perturbed, the controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness.
If this is right
- Standard benchmarks that use a single fixed instruction per screenshot overestimate model capabilities.
- GUI grounding models lack reliable performance on instructions requiring spatial relations between elements.
- Visual changes such as browser zoom levels can significantly affect model accuracy in grounding tasks.
- Simple data augmentation and low-rank fine-tuning may not address the brittleness and can even reduce performance.
- Diagnostic signals from separate perturbation axes can guide targeted improvements in spatial reasoning and calibration.
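The two-axis design behind these implications can be sketched as a small evaluation grid. This is a minimal illustration, not the released pipeline's API: `model_predict` and the sample records are hypothetical stand-ins.

```python
from itertools import product

# Hypothetical perturbation axes, mirroring the paper's two-axis design:
# visual perturbations and instruction styles are varied independently.
VISUAL = ["baseline", "zoom_70"]
INSTRUCTION = ["direct", "relational"]

def evaluate_grid(samples, model_predict):
    """Accuracy per (visual, instruction) cell. `samples` and
    `model_predict` are placeholders for the released dataset
    and a grounding model."""
    results = {}
    for vis, instr in product(VISUAL, INSTRUCTION):
        cell = [s for s in samples
                if s["visual"] == vis and s["instruction"] == instr]
        if not cell:
            continue
        hits = sum(model_predict(s) == s["target"] for s in cell)
        results[(vis, instr)] = hits / len(cell)
    return results

def axis_deltas(results, base=("baseline", "direct")):
    """Accuracy deltas against the unperturbed cell; because the axes are
    varied independently, each delta isolates one perturbation axis."""
    return {cell: results[cell] - results[base]
            for cell in results if cell != base}
```

Because every screenshot appears under every instruction style, a per-cell delta attributes a drop to one axis rather than to the aggregate, which is exactly the diagnostic signal single-instruction benchmarks cannot supply.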
Where Pith is reading between the lines
- Real-world GUI applications would likely encounter similar instruction variations, making current models unreliable without changes in training or evaluation.
- Future benchmarks should routinely include controlled perturbations along visual and linguistic axes to better predict deployment performance.
- Alternative training approaches beyond standard LoRA on augmented data may be needed to build robustness to relational instructions.
- The released dataset and pipeline could support development of models that handle varied user phrasings more consistently.
Load-bearing premise
The specific perturbations chosen, such as relational instructions and particular zoom levels, represent the variations that matter most in actual GUI usage.
What would settle it
Testing the three models or similar ones on the released GUI-Perturbed dataset with relational instructions and observing whether accuracy remains above 85 percent or drops as reported.
Original abstract
GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause systematic accuracy collapse across all models, a 70% browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance rather than improving it. By perturbing along independent axes, GUI-Perturbed isolates which specific capability axes are affected (spatial reasoning, visual robustness, reasoning calibration), providing diagnostic signal that aggregate benchmarks cannot. We release the dataset, augmentation pipeline, and a fine-tuned model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes (e.g., browser zoom) and instructions (direct element naming vs. relational/spatial) to diagnose robustness in GUI grounding models. It reports that three 7B models suffer 27-56pp accuracy drops on relational instructions, statistically significant degradation at 70% zoom, and that rank-8 LoRA fine-tuning on augmented data worsens rather than improves performance. The framework and released dataset/pipeline are positioned as providing diagnostic axes (spatial reasoning, visual robustness, calibration) missing from standard single-instruction benchmarks.
Significance. If the perturbations successfully isolate the claimed capability axes, the work is significant for revealing systematic brittleness in GUI grounding that aggregate benchmarks overlook, with direct implications for real-world GUI agents. The release of the dataset, augmentation pipeline, and fine-tuned model is a concrete strength that supports reproducibility and follow-on work.
major comments (3)
- [Perturbation Framework] The central claim that relational instructions cause accuracy collapse specifically due to spatial reasoning requirements is load-bearing. The perturbation framework description does not report matching or controlling for instruction length, token count, syntactic complexity, parse-tree depth, or lexical difficulty between direct and relational variants. If relational phrasings are longer or more complex on average, the 27-56pp drop may reflect general instruction-following brittleness rather than a targeted spatial-reasoning failure.
- [Experimental Setup and Results] Full details on exact perturbation generation (including how relational instructions and zoom levels are constructed) and any exclusion criteria are required to confirm the reported statistical significance and rule out post-hoc selection. This directly affects the soundness of the cross-model consistency claims.
- [Fine-tuning Experiments] The finding that rank-8 LoRA fine-tuning with augmented data degrades performance (rather than improving it) is a key negative result. More information on the augmentation process, data mixture, and training details is needed to interpret whether this reflects true brittleness or an artifact of the fine-tuning protocol.
minor comments (2)
- [Abstract and §1] The abstract and introduction would benefit from explicitly naming the three 7B models evaluated and their base checkpoints for reproducibility.
- [Figures and Tables] Figure and table captions should explicitly state the perturbation axes, metrics (e.g., accuracy delta), and statistical tests used so that readers can interpret results without cross-referencing the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our perturbation framework, experimental details, and fine-tuning results. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Perturbation Framework] The central claim that relational instructions cause accuracy collapse specifically due to spatial reasoning requirements is load-bearing. The perturbation framework description does not report matching or controlling for instruction length, token count, syntactic complexity, parse-tree depth, or lexical difficulty between direct and relational variants. If relational phrasings are longer or more complex on average, the 27-56pp drop may reflect general instruction-following brittleness rather than a targeted spatial-reasoning failure.
Authors: We agree that explicit controls for linguistic factors are necessary to isolate spatial reasoning. Our instruction generation used parallel template families (direct naming vs. relational/spatial descriptions) applied to the same UI elements, with an effort to keep surface forms comparable; however, we did not report aggregate statistics on token counts or parse complexity. In revision we will add a supplementary table reporting mean token length, word count, dependency parse depth, and lexical diversity for both instruction classes across the full dataset. This will allow direct assessment of whether the observed drops exceed what would be expected from complexity alone. The high performance on direct instructions and the consistency of the relational drop across three independently trained 7B models provide supporting evidence for a spatial-specific effect, but the added metrics will make the claim more robust. revision: yes
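The surface-level half of the promised supplementary statistics can be computed with stdlib tooling alone; dependency parse depth would need a parser (e.g., spaCy) and is omitted here. The instruction lists in the test are illustrative, not drawn from the released dataset.

```python
import statistics

def surface_stats(instructions):
    """Surface-level complexity statistics for one instruction class:
    whitespace-token counts, character lengths, and type/token lexical
    diversity. A proxy only; syntactic depth needs a parser."""
    tokens = [len(s.split()) for s in instructions]
    chars = [len(s) for s in instructions]
    total = sum(tokens)
    vocab = {w.lower() for s in instructions for w in s.split()}
    return {
        "mean_tokens": statistics.mean(tokens),
        "sd_tokens": statistics.pstdev(tokens),
        "mean_chars": statistics.mean(chars),
        "lexical_diversity": len(vocab) / max(1, total),
    }

def compare_classes(direct, relational):
    """Side-by-side statistics for the two instruction classes, exposing
    any length or diversity confound between them."""
    return {"direct": surface_stats(direct),
            "relational": surface_stats(relational)}
```

If the relational class shows substantially higher mean token counts, the reported drop would need to be benchmarked against length-matched controls before being attributed to spatial reasoning alone.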
- Referee: [Experimental Setup and Results] Full details on exact perturbation generation (including how relational instructions and zoom levels are constructed) and any exclusion criteria are required to confirm the reported statistical significance and rule out post-hoc selection. This directly affects the soundness of the cross-model consistency claims.
Authors: We will expand the Methods section with complete generation procedures: the exact template sets for relational instructions, the browser-level CSS and viewport scaling used to produce the 70% zoom condition, and the full list of exclusion criteria (e.g., screenshots containing overlapping clickable regions, rendering failures, or elements outside the viewport after perturbation). Statistical significance was obtained via pre-specified paired t-tests on matched screenshot-instruction pairs; no post-hoc filtering of results occurred. The released augmentation pipeline already contains the generation scripts; we will also include a detailed pseudocode description and the precise exclusion rules in the revised manuscript to eliminate any ambiguity. revision: yes
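The paired test the response describes reduces to a few lines on matched per-pair outcomes. This is a generic sketch, not the authors' analysis code; in practice `scipy.stats.ttest_rel` would also supply the p-value.

```python
import math
import statistics

def paired_t(baseline, perturbed):
    """Paired t statistic on matched screenshot-instruction pairs:
    t = mean(d) / (sd(d) / sqrt(n)) over per-pair differences d.
    Returns (t, degrees of freedom); requires at least two pairs
    and non-identical differences."""
    diffs = [b - p for b, p in zip(baseline, perturbed)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample sd (ddof=1)
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1
```

Pairing on the same screenshot-instruction items is what makes the test pre-specifiable: each perturbed outcome has exactly one matched baseline outcome, so no post-hoc filtering enters the comparison.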
- Referee: [Fine-tuning Experiments] The finding that rank-8 LoRA fine-tuning with augmented data degrades performance (rather than improving it) is a key negative result. More information on the augmentation process, data mixture, and training details is needed to interpret whether this reflects true brittleness or an artifact of the fine-tuning protocol.
Authors: We will add a dedicated appendix with the full fine-tuning protocol: the exact data mixture ratios (original vs. perturbed samples), the specific perturbation variants included in augmentation, LoRA configuration (rank 8, alpha, dropout), optimizer, learning rate schedule, batch size, number of epochs, and early-stopping criteria. Validation loss curves and per-epoch accuracy on held-out perturbed test sets will also be reported. The consistent degradation across multiple random seeds and the fact that direct-instruction performance remained stable while relational performance declined suggest the result reflects genuine brittleness rather than a training artifact; the additional details will allow readers to reproduce and interpret the outcome. revision: yes
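For readers interpreting the promised appendix, the quantity being tuned in rank-8 LoRA is the low-rank update W' = W + (alpha/r)·BA (Hu et al., 2021). A toy pure-Python sketch with made-up dimensions follows; it is not the actual fine-tuning code.

```python
def matmul(A, B):
    # Minimal dense matrix multiply over nested lists; toy scale only.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_merge(W, A, B, rank=8, alpha=16):
    """Merged weight W + (alpha/rank) * B @ A, the rank-`rank` LoRA
    update. A is (rank x d_in), B is (d_out x rank); during fine-tuning
    only A and B are trained while W stays frozen."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

Because the update is confined to a rank-8 subspace of each adapted weight, the appendix's mixture ratios and seed-level curves matter: a degradation that persists across seeds while direct-instruction accuracy holds is hard to attribute to the optimizer rather than the model.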
Circularity Check
No circularity: purely empirical evaluation with direct measurements
Full rationale
The paper introduces GUI-Perturbed as a controlled perturbation framework and evaluates three models via direct accuracy measurements on relational vs. direct instructions, zoom levels, and fine-tuning. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Central claims rest on observed percentage-point drops and statistical significance from experiments, not on any reduction to inputs by construction. This matches the reader's assessment of an empirical study without self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Perturbations along visual and instruction axes are representative of real-world GUI variations and do not introduce unrelated artifacts.
Reference graph
Works this paper leans on
- [1] K. Cheng et al. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935, 2024. doi: 10.48550/arXiv.2401.10935
- [2] K. Li et al. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025. doi: 10.48550/arXiv.2504.07981
- [3] T. Xie et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024. doi: 10.48550/arXiv.2404.07972
- [4] X. Deng et al. Mind2Web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023. doi: 10.48550/arXiv.2306.06070
- [5] J. Yang et al. GUI-Robust: A comprehensive dataset for testing GUI agent robustness in real-world anomalies. arXiv preprint arXiv:2506.14477, 2025. doi: 10.48550/arXiv.2506.14477
- [6] H. H. Zhao, K. Yang, W. Yu, D. Gao, and M. Z. Shou. WorldGUI: An interactive benchmark for desktop GUI automation from any starting point. arXiv preprint arXiv:2502.08047, 2026. doi: 10.48550/arXiv.2502.08047
- [7] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017. doi: 10.48550/arXiv.1703.06907
- [8] S. Bai et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. doi: 10.48550/arXiv.2502.13923
- [9] Y. Qin et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025. doi: 10.48550/arXiv.2501.12326
- [10] Y. Yang et al. GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025. doi: 10.48550/arXiv.2507.05791
- [11] E. J. Hu et al. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. doi: 10.48550/arXiv.2106.09685
- [12] B. Gou et al. Mind2Web 2: Evaluating agentic search with agent-as-a-judge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=AUaW6DS9si
- [13] J. Y. Koh et al. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024
- [14] T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of Bits: An open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144. PMLR, 2017
- [15] T. Xie et al. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227, 2025. doi: 10.48550/arXiv.2505.13227
- [16] T. Xue et al. An illusion of progress? Assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025. doi: 10.48550/arXiv.2504.01382
- [17] X. Wang et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025. doi: 10.48550/arXiv.2508.09123
- [18] Y. Yang et al. Aria-UI: Visual grounding for GUI instructions. arXiv preprint arXiv:2412.16256, 2024
- [19] R. Kapoor et al. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web, 2024
- [20] Y. Li et al. Widget captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5495–5510, 2020
- [21] S. Nayak et al. UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction, 2025. URL https://arxiv.org/abs/2503.15661
- [22] Z. Wu et al. OS-ATLAS: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024. doi: 10.48550/arXiv.2410.23218
- [24] doi: 10.48550/arXiv.2601.21961
- [25] S.-Y. Liu et al. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024. doi: 10.48550/arXiv.2402.09353
- [26] Y. Peng, P. Wang, J. Liu, and S. Chen. GLAD: Generalizable tuning for vision-language models. arXiv preprint arXiv:2507.13089, 2025. doi: 10.48550/arXiv.2507.13089
- [27] T. Xue et al. EvoCUA: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026. doi: 10.48550/arXiv.2601.15876
- [28] H. Li et al. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025. doi: 10.48550/arXiv.2510.08531
- [29] W. Kang, B. Lei, G. Liu, C. Ding, and Y. Yan. GuirlVG: Incentivize GUI visual grounding via empirical exploration on reinforcement learning. arXiv preprint arXiv:2508.04389, 2025. doi: 10.48550/arXiv.2508.04389