pith. machine review for the scientific record.

arxiv: 2605.06664 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: unknown

BAMI: Training-Free Bias Mitigation in GUI Grounding

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords GUI grounding · bias mitigation · training-free · ScreenSpot-Pro · Masked Prediction Distribution · coarse-to-fine focus · candidate selection

The pith

A training-free method called BAMI corrects precision and ambiguity biases in GUI grounding models by analyzing masked predictions and applying coarse-to-fine focus plus candidate selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to raise the reliability of GUI grounding, the process by which AI agents locate on-screen elements to perform clicks or drags. It uses a Masked Prediction Distribution analysis to trace most errors to two sources: high-resolution inputs that dilute precision and complex interface elements that create prediction ambiguity. BAMI then applies two lightweight manipulations—first narrowing focus from coarse to fine regions, then choosing among candidate locations—to counteract those biases without retraining any model. This matters because stronger grounding directly improves the success rate of automated GUI agents in practical software interfaces. Experiments show the method lifts accuracy on the demanding ScreenSpot-Pro benchmark for several existing models.
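The masked-prediction diagnostic described above can be sketched in a few lines. Everything here is illustrative: `model_predict` stands in for whatever grounding model is being probed, and the paper's actual MPD computation may aggregate predictions quite differently.

```python
def mask_region(img, box, value=0):
    """Overwrite the pixels inside box=(x, y, w, h) with a constant value."""
    x, y, w, h = box
    for row in range(y, y + h):
        for col in range(x, x + w):
            img[row][col] = value
    return img

def masked_prediction_distribution(model_predict, image, instruction, n_rounds=4):
    """Repeatedly query the grounding model, masking out each predicted box
    before the next round, and return the resulting click points.
    Intuition (ours, not the paper's formal definition): a spread-out point
    cloud over several plausible elements suggests ambiguity bias, while a
    tight cluster offset from the target suggests precision bias.
    `model_predict(image, instruction) -> (x, y, w, h)` is a hypothetical API."""
    img = [row[:] for row in image]  # work on a copy; keep the screenshot intact
    points = []
    for _ in range(n_rounds):
        x, y, w, h = model_predict(img, instruction)
        points.append((x + w // 2, y + h // 2))
        mask_region(img, (x, y, w, h))
    return points
```

With a toy 4x8 "screenshot" holding two candidate pixels, the loop visits each in turn because masking hides the previous prediction.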

Core claim

Through the Masked Prediction Distribution (MPD) attribution method the authors determine that high image resolution produces precision bias while intricate elements produce ambiguity bias. They introduce Bias-Aware Manipulation Inference (BAMI), which counters these biases via coarse-to-fine focus to restore spatial detail and candidate selection to resolve competing predictions. The resulting procedure improves grounding performance across models in a training-free manner, for example raising TianXi-Action-7B accuracy on ScreenSpot-Pro from 51.9 percent to 57.8 percent while remaining stable under varied parameter choices.

What carries the argument

The Masked Prediction Distribution (MPD) attribution technique that isolates error sources, paired with Bias-Aware Manipulation Inference (BAMI) that executes coarse-to-fine focus and candidate selection to mitigate the identified biases.
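As a concrete reading of that two-stage pipeline, here is a minimal sketch assuming a `model_predict(image, instruction) -> (x, y, w, h)` interface. The crop ratio, majority-vote candidate selection, and coordinate handling are our guesses at the spirit of the method, not the paper's implementation.

```python
from collections import Counter

def crop(image, cx, cy, cw, ch):
    """Cut a cw-by-ch window out of a row-major 2D image."""
    return [row[cx:cx + cw] for row in image[cy:cy + ch]]

def bami_refine(model_predict, image, instruction, crop_ratio=0.5, n_candidates=3):
    """Training-free refinement sketch: (1) a coarse pass on the full
    screenshot, (2) coarse-to-fine focus -- crop a window around the
    initial box to restore pixel density, (3) candidate selection --
    re-predict on the crop several times and keep the most frequent box,
    mapped back to full-image coordinates. All names are illustrative."""
    h, w = len(image), len(image[0])
    # Step 1: coarse prediction on the full screenshot
    x, y, bw, bh = model_predict(image, instruction)
    # Step 2: crop around the coarse box, clamped to the image bounds
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)
    cx = min(max(x + bw // 2 - cw // 2, 0), w - cw)
    cy = min(max(y + bh // 2 - ch // 2, 0), h - ch)
    window = crop(image, cx, cy, cw, ch)
    # Step 3: gather candidates on the crop and take a majority vote
    votes = Counter(model_predict(window, instruction) for _ in range(n_candidates))
    bx, by, bbw, bbh = votes.most_common(1)[0][0]
    return (cx + bx, cy + by, bbw, bbh)
```

With a deterministic toy model the refined box simply agrees with the coarse one after the round trip through crop coordinates, which is the expected no-op case.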

If this is right

  • GUI grounding models achieve higher accuracy on complex benchmarks such as ScreenSpot-Pro without any additional training.
  • The same BAMI manipulations can be applied to multiple existing models, including TianXi-Action-7B, with consistent gains.
  • Ablation results confirm that the approach remains effective across different choices of focus granularity and candidate count.
  • Performance improvements occur in a training-free setting, preserving the original model weights and avoiding extra data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar bias-attribution steps could be tested on other screen-based vision tasks such as document layout analysis or mobile app testing.
  • If the manipulations prove general, they could be inserted into agent planning loops to raise success rates on long-horizon GUI tasks.
  • The method may lower deployment costs for GUI agents by removing the need for task-specific fine-tuning on new interfaces.
  • Extending MPD analysis to video or multi-frame screen recordings might uncover temporal biases not visible in static images.

Load-bearing premise

The two biases detected by MPD are the dominant error sources and the two manipulations correct them effectively without introducing fresh failure modes or requiring model-specific tuning.

What would settle it

Applying BAMI to a held-out GUI grounding model on ScreenSpot-Pro or a similar benchmark and observing no accuracy gain or a net loss would falsify the claim of reliable bias mitigation.

Figures

Figures reproduced from arXiv: 2605.06664 by Borui Zhang, Bo Wang, Bo Zhang, Jie Zhou, Jiwen Lu, Liang Tang, Wenzhao Zheng, Yiqiang Yan, Yuhao Cheng.

Figure 1: Compared with conventional grounding models, … view at source ↗
Figure 2: Bias Mitigation Strategy. To address accuracy bias and ambiguity bias, BAMI introduces two manipulations: coarse-to-fine focus and candidate selection. view at source ↗
Figure 3: Accuracy comparison on ScreenSpot-Pro. BAMI consistently improves performance across all model backbones. view at source ↗
Figure 4: Error Attribution Analysis. (a) Proportions of attribution types. (b) Attribution analysis of model predictions. The deep red regions in the heatmap indicate potential prediction locations, demonstrating how the MPD can clearly identify the sources of model errors. view at source ↗
Figure 5: Illustration of BAMI. Step 1: Based on the initial prediction results of the grounding model, BAMI performs cropping around these initial predictions at a predefined ratio. Step 2: The model conducts multiple predictions on the cropped images; after each prediction, the pixels within the predicted bounding box are randomly masked to ensure the diversity of multiple prediction results. Step 3: Using predefi… view at source ↗
Figure 6: Ablations on accuracy bias elimination. (a) Effect of crop ratio and iteration count. (b) Performance across different target types. view at source ↗
Figure 7: Comparison of candidate box generation strategies. view at source ↗
Figure 8: Visualizations of BAMI corrections. view at source ↗
Figure 9: More attribution visualizations. view at source ↗
original abstract

GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that GUI grounding models suffer from precision bias due to high image resolution and ambiguity bias due to intricate interface elements, as identified via a new Masked Prediction Distribution (MPD) attribution method. It introduces Bias-Aware Manipulation Inference (BAMI), a training-free approach using coarse-to-fine focus and candidate selection to mitigate these biases, reporting e.g. a lift from 51.9% to 57.8% accuracy on ScreenSpot-Pro for the TianXi-Action-7B model, plus ablation stability across parameter settings. Code is released.

Significance. If the central mechanism holds, BAMI would provide a practical, zero-training-cost improvement for GUI agents on challenging benchmarks, with potential for broader deployment. The training-free nature and public code are strengths that aid reproducibility and adoption. However, significance is tempered by the absence of evidence that gains arise specifically from bias-targeted interventions rather than generic test-time heuristics.

major comments (3)
  1. [Abstract] MPD is presented as the key tool for attributing primary error sources to high-resolution precision bias and intricate-element ambiguity bias, yet the abstract (and thus the core claim) supplies no implementation details on how MPD is computed, how candidates are selected, or any statistical tests confirming these are dominant causal factors rather than correlated symptoms.
  2. [Abstract] The reported 5.9-point gain on ScreenSpot-Pro lacks error bars, run counts, or direct comparisons to simple baselines (e.g., multi-scale cropping or top-k filtering), so it is impossible to determine whether BAMI's manipulations specifically counteract the claimed biases or merely deliver unrelated test-time benefits.
  3. [Ablation studies] While robustness across parameter ranges is asserted, the ablations do not measure whether coarse-to-fine focus and candidate selection reduce the targeted error categories (precision and ambiguity) versus trading one failure mode for another; this leaves the mechanistic explanation unverified.
minor comments (1)
  1. [Abstract] The phrase 'extensive experimental results' is used without enumerating the full set of models, benchmarks, or protocol details, which would aid immediate assessment of scope.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating revisions where they strengthen the presentation without altering our core claims.

point-by-point responses
  1. Referee: [Abstract] MPD is presented as the key tool for attributing primary error sources to high-resolution precision bias and intricate-element ambiguity bias, yet the abstract (and thus the core claim) supplies no implementation details on how MPD is computed, how candidates are selected, or any statistical tests confirming these are dominant causal factors rather than correlated symptoms.

    Authors: The abstract is intentionally concise per venue guidelines. Full details on MPD computation (masking procedure, prediction distribution analysis) appear in Section 3.1, candidate selection logic in Section 3.2, and supporting error attribution evidence (including qualitative examples and quantitative breakdowns) in Sections 4.1–4.2. No formal statistical hypothesis tests were performed in the original submission, as the attribution relies on empirical patterns across models and benchmarks. We will revise the abstract to include one sentence summarizing the MPD approach and reference the relevant sections for implementation and validation details. revision: partial

  2. Referee: [Abstract] The reported 5.9-point gain on ScreenSpot-Pro lacks error bars, run counts, or direct comparisons to simple baselines (e.g., multi-scale cropping or top-k filtering), so it is impossible to determine whether BAMI's manipulations specifically counteract the claimed biases or merely deliver unrelated test-time benefits.

    Authors: The 5.9-point gain (51.9% to 57.8%) reflects single-run evaluation on the fixed ScreenSpot-Pro test set, consistent with prior GUI grounding papers. We agree that variance estimates and baseline comparisons would better isolate BAMI's contribution. In the revision we will report means and standard deviations over 3–5 runs and add explicit comparisons against generic test-time heuristics (multi-scale cropping, top-k filtering) to demonstrate that gains exceed those from non-bias-targeted methods. revision: yes

  3. Referee: [Ablation studies] While robustness across parameter ranges is asserted, the ablations do not measure whether coarse-to-fine focus and candidate selection reduce the targeted error categories (precision and ambiguity) versus trading one failure mode for another; this leaves the mechanistic explanation unverified.

    Authors: Current ablations (Section 4.3) quantify overall accuracy impact when each BAMI component is removed. To directly link manipulations to bias reduction, we will augment the ablation section with an error-type breakdown: we will classify failures into precision-bias and ambiguity-bias categories on a held-out subset both before and after applying coarse-to-fine focus and candidate selection, showing targeted decreases rather than simple trade-offs. revision: yes
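The proposed error-type breakdown could look something like the following sketch. The decision rule here (a miss that lands inside a different interface element counts as ambiguity bias; a miss that lands inside no element counts as precision bias) is our hypothetical reading of the taxonomy, not the authors' definition.

```python
def inside(pt, box):
    """True if point pt=(x, y) falls within box=(x, y, w, h)."""
    x, y, w, h = box
    return x <= pt[0] < x + w and y <= pt[1] < y + h

def classify_failure(pred_center, gt_box, elements):
    """Hypothetical taxonomy for the rebuttal's breakdown: a click inside
    the ground-truth box is correct; a click inside some other element is
    scored as ambiguity bias (the model chose a competing target); a click
    inside no element is scored as precision bias (a spatial near-miss)."""
    if inside(pred_center, gt_box):
        return "correct"
    for el in elements:
        if el != gt_box and inside(pred_center, el):
            return "ambiguity_bias"
    return "precision_bias"
```

Running this classifier on failures before and after each BAMI manipulation would show whether the targeted category shrinks, which is exactly the evidence the referee asks for.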

Circularity Check

0 steps flagged

No significant circularity; empirical post-hoc method with external validation

full rationale

The paper proposes MPD for observational bias attribution and BAMI for training-free mitigation, then reports accuracy lifts on independent benchmarks (ScreenSpot-Pro). No equations, fitted parameters, or derivations are present that reduce to self-inputs by construction. MPD identifies error sources from data patterns, and BAMI applies fixed manipulations; neither step renames a known result, imports uniqueness via self-citation, nor defines the output in terms of the input. The 5.9-point gain is presented as an experimental outcome rather than a tautological prediction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim rests on the domain assumption that identified biases are primary and fixable by the two manipulations; MPD and BAMI are newly introduced methods rather than external entities with independent evidence.

axioms (1)
  • domain assumption: GUI grounding errors stem primarily from precision bias due to resolution and ambiguity bias due to interface complexity, and these can be mitigated post-hoc without retraining.
    Invoked to justify the MPD analysis and BAMI design.
invented entities (2)
  • Masked Prediction Distribution (MPD): no independent evidence
    purpose: Attribution method to identify bias sources in model predictions
    Newly proposed attribution technique described in the abstract.
  • Bias-Aware Manipulation Inference (BAMI): no independent evidence
    purpose: Training-free bias mitigation framework using coarse-to-fine focus and candidate selection
    Core new method introduced to address the identified biases.

pith-pipeline@v0.9.0 · 5519 in / 1441 out tokens · 64808 ms · 2026-05-08T12:17:05.528353+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 24 canonical work pages · 8 internal anchors

  1. [1] Marco Ancona, Cengiz Oztireli, and Markus Gross. Explaining deep neural networks with a polynomial time algorithm for Shapley value approximation. In ICML, pages 272–281, 2019.
  2. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv, abs/2502.13923, 2025.
  3. [3] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.
  4. [4] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv, abs/2401.10935, 2024.
  5. [5] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. NeurIPS, 36:28091–28114, 2023.
  6. [6] Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-time reinforcement learning for GUI grounding via region consistency. arXiv, abs/2508.05615, 2025.
  7. [7] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In ICLR, 2025.
  8. [8] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv, abs/2501.12948, 2025.
  9. [9] Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world WebAgent with planning, long context understanding, and program synthesis. In ICLR, 2024.
  10. [10] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In CVPR, 2024.
  11. [11] Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of GUI agent: A preliminary case study with Claude 3.5 Computer Use. arXiv, abs/2411.10323, 2024.
  12. [12] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv, abs/2410.21276, 2024.
  13. [13] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv, abs/2504.07981, 2025.
  14. [14] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In CVPR, pages 19498–19508, 2025.
  15. [15] Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In ECCV, pages 125–140. Springer, 2024.
  16. [16] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners. arXiv, abs/2504.14239, 2025.
  17. [17] Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for pure vision based GUI agent. arXiv, abs/2408.00203, 2024.
  18. [18] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. UI-R1: Enhancing efficient action prediction of GUI agents by reinforcement learning. arXiv, abs/2503.21620, 2025.
  19. [19] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. NeurIPS, 30, 2017.
  20. [20] Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv, abs/2504.10458, 2025.
  21. [21] Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R Manmatha, and Shabnam Ghadar. R-VLM: Region-aware vision language model for precise GUI grounding. arXiv, abs/2507.05673, 2025.
  22. [22] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv, abs/2501.12326, 2025.
  23. [23] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  24. [24] Lloyd S Shapley et al. A value for n-person games. 1953.
  25. [25] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, pages 3319–3328, 2017.
  26. [26] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G2: Gaussian reward modeling for GUI grounding. arXiv, abs/2507.15846, 2025.
  27. [27] Liang Tang, Shuxian Li, Yuhao Cheng, Yukang Huo, Zhepeng Wang, Yiqiang Yan, Kaer Huang, Yanzhe Jing, and Tiaonan Duan. SEA: Self-evolution agent with step-wise reward for computer use. arXiv, abs/2508.04037, 2025.
  28. [28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv, abs/2409.12191, 2024.
  29. [29] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.
  30. [30] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771, 2019.
  31. [31] Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, and Yiwei Wang. DiMo-GUI: Advancing test-time scaling in GUI grounding via modality-aware visual reasoning. arXiv, abs/2507.00008, 2025.
  32. [32] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-Atlas: A foundation action model for generalist GUI agents. arXiv, abs/2410.23218, 2024.
  33. [33] Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. arXiv, abs/2412.04454, 2024.
  34. [34] Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning. arXiv, abs/2505.12370, 2025.
  35. [35] Li Zhang, Longxi Gao, and Mengwei Xu. Does chain-of-thought reasoning help mobile GUI agent? An empirical study. arXiv, abs/2503.16788, 2025.
  36. [36] Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, et al. Phi-Ground tech report: Advancing perception in GUI grounding. arXiv, abs/2507.23779, 2025.
  37. [37] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-like training for visual grounding in GUI agents. arXiv, abs/2505.15810, 2025.
