BAMI: Training-Free Bias Mitigation in GUI Grounding
Pith reviewed 2026-05-08 12:17 UTC · model grok-4.3
The pith
A training-free method called BAMI corrects precision and ambiguity biases in GUI grounding models by analyzing masked predictions and applying coarse-to-fine focus plus candidate selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the Masked Prediction Distribution (MPD) attribution method, the authors determine that high image resolution produces precision bias and intricate interface elements produce ambiguity bias. They introduce Bias-Aware Manipulation Inference (BAMI), which counters these biases via coarse-to-fine focus to restore spatial detail and candidate selection to resolve competing predictions. The resulting procedure improves grounding performance across models in a training-free manner, for example raising TianXi-Action-7B accuracy on ScreenSpot-Pro from 51.9 percent to 57.8 percent, while remaining stable under varied parameter choices.
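As an illustration of what a masked-attribution probe of this kind involves, here is a minimal Python sketch: it occludes one grid cell of the screenshot at a time, re-runs the grounding model, and inspects the spread of the resulting predictions. The grid size, gray fill value, dispersion statistic, and the `model.predict_point` interface are assumptions for illustration, not the paper's MPD definition (which the authors place in their Section 3.1).

```python
# Minimal sketch of a masked-prediction-distribution style probe (assumptions noted above).
import numpy as np
from PIL import Image

def masked_prediction_distribution(model, image: Image.Image, instruction: str, grid: int = 4):
    """Occlude one grid cell at a time, re-run grounding, and collect the predictions."""
    w, h = image.size
    preds = []
    for row in range(grid):
        for col in range(grid):
            masked = image.copy()
            cell = (col * w // grid, row * h // grid,
                    (col + 1) * w // grid, (row + 1) * h // grid)
            masked.paste((127, 127, 127), cell)  # gray out one cell (assumes an RGB screenshot)
            # `model.predict_point` is a stand-in for any grounding model's (x, y) click output.
            preds.append(model.predict_point(masked, instruction))
    preds = np.asarray(preds, dtype=float)
    # A tight but offset cluster suggests a localization (precision) problem;
    # a widely dispersed cloud suggests competing elements (ambiguity).
    return preds, preds.std(axis=0)
```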
What carries the argument
The Masked Prediction Distribution (MPD) attribution technique that isolates error sources, paired with Bias-Aware Manipulation Inference (BAMI) that executes coarse-to-fine focus and candidate selection to mitigate the identified biases.
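To make the two manipulations concrete, the sketch below shows one plausible way they could be wired around an off-the-shelf grounding model: a coarse pass locates a region, a zoomed-in crop is re-queried for a finer point, and several candidate targets are re-scored so the best-matching one wins. The `predict_point`, `predict_candidates`, and `score` methods, the crop ratio, and the candidate count are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two BAMI-style manipulations (assumptions noted above).
from PIL import Image

def coarse_to_fine_focus(model, image: Image.Image, instruction: str, crop_ratio: float = 0.25):
    """Coarse pass on the full screenshot, then a fine pass on a crop centered
    on the coarse prediction; the fine result is mapped back to full-image coordinates."""
    x, y = model.predict_point(image, instruction)        # coarse pass
    w, h = image.size
    cw, ch = int(w * crop_ratio), int(h * crop_ratio)
    left = min(max(x - cw // 2, 0), w - cw)               # clamp the crop inside the image
    top = min(max(y - ch // 2, 0), h - ch)
    crop = image.crop((left, top, left + cw, top + ch))
    fx, fy = model.predict_point(crop, instruction)       # fine pass on the zoomed view
    return left + fx, top + fy

def candidate_selection(model, image: Image.Image, instruction: str, k: int = 4):
    """Collect several candidate targets and keep the one the model itself scores
    highest against the instruction, resolving ambiguity among competing elements."""
    candidates = model.predict_candidates(image, instruction, num_candidates=k)
    return max(candidates, key=lambda c: model.score(image, instruction, c))
```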
If this is right
- GUI grounding models achieve higher accuracy on complex benchmarks such as ScreenSpot-Pro without any additional training.
- The same BAMI manipulations can be applied to multiple existing models, including TianXi-Action-7B, with consistent gains.
- Ablation results confirm that the approach remains effective across different choices of focus granularity and candidate count.
- Performance improvements occur in a training-free setting, preserving the original model weights and avoiding extra data collection.
Where Pith is reading between the lines
- Similar bias-attribution steps could be tested on other screen-based vision tasks such as document layout analysis or mobile app testing.
- If the manipulations prove general, they could be inserted into agent planning loops to raise success rates on long-horizon GUI tasks.
- The method may lower deployment costs for GUI agents by removing the need for task-specific fine-tuning on new interfaces.
- Extending MPD analysis to video or multi-frame screen recordings might uncover temporal biases not visible in static images.
Load-bearing premise
The two biases detected by MPD are the dominant error sources and the two manipulations correct them effectively without introducing fresh failure modes or requiring model-specific tuning.
What would settle it
Applying BAMI to a held-out GUI grounding model on ScreenSpot-Pro or a similar benchmark and observing no accuracy gain or a net loss would falsify the claim of reliable bias mitigation.
Original abstract
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that GUI grounding models suffer from precision bias due to high image resolution and ambiguity bias due to intricate interface elements, as identified via a new Masked Prediction Distribution (MPD) attribution method. It introduces Bias-Aware Manipulation Inference (BAMI), a training-free approach using coarse-to-fine focus and candidate selection to mitigate these biases, reporting e.g. a lift from 51.9% to 57.8% accuracy on ScreenSpot-Pro for the TianXi-Action-7B model, plus ablation stability across parameter settings. Code is released.
Significance. If the central mechanism holds, BAMI would provide a practical, zero-training-cost improvement for GUI agents on challenging benchmarks, with potential for broader deployment. The training-free nature and public code are strengths that aid reproducibility and adoption. However, significance is tempered by the absence of evidence that gains arise specifically from bias-targeted interventions rather than generic test-time heuristics.
major comments (3)
- [Abstract] MPD is presented as the key tool for attributing primary error sources to high-resolution precision bias and intricate-element ambiguity bias, yet the abstract (and thus the core claim) supplies no implementation details on how MPD is computed, how candidates are selected, or any statistical tests confirming these are dominant causal factors rather than correlated symptoms.
- [Abstract] The reported 5.9-point gain on ScreenSpot-Pro lacks error bars, run counts, or direct comparisons to simple baselines (e.g., multi-scale cropping or top-k filtering), so it is impossible to determine whether BAMI's manipulations specifically counteract the claimed biases or merely deliver unrelated test-time benefits.
- [Ablation studies] While robustness across parameter ranges is asserted, the ablations do not measure whether coarse-to-fine focus and candidate selection reduce the targeted error categories (precision and ambiguity) versus trading one failure mode for another; this leaves the mechanistic explanation unverified.
minor comments (1)
- [Abstract] The phrase 'extensive experimental results' is used without enumerating the full set of models, benchmarks, or protocol details, which would aid immediate assessment of scope.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating revisions where they strengthen the presentation without altering our core claims.
Point-by-point responses
- Referee: [Abstract] MPD is presented as the key tool for attributing primary error sources to high-resolution precision bias and intricate-element ambiguity bias, yet the abstract (and thus the core claim) supplies no implementation details on how MPD is computed, how candidates are selected, or any statistical tests confirming these are dominant causal factors rather than correlated symptoms.
  Authors: The abstract is intentionally concise per venue guidelines. Full details on MPD computation (masking procedure, prediction distribution analysis) appear in Section 3.1, candidate selection logic in Section 3.2, and supporting error attribution evidence (including qualitative examples and quantitative breakdowns) in Sections 4.1–4.2. No formal statistical hypothesis tests were performed in the original submission, as the attribution relies on empirical patterns across models and benchmarks. We will revise the abstract to include one sentence summarizing the MPD approach and reference the relevant sections for implementation and validation details. revision: partial
- Referee: [Abstract] The reported 5.9-point gain on ScreenSpot-Pro lacks error bars, run counts, or direct comparisons to simple baselines (e.g., multi-scale cropping or top-k filtering), so it is impossible to determine whether BAMI's manipulations specifically counteract the claimed biases or merely deliver unrelated test-time benefits.
  Authors: The 5.9-point gain (51.9% to 57.8%) reflects single-run evaluation on the fixed ScreenSpot-Pro test set, consistent with prior GUI grounding papers. We agree that variance estimates and baseline comparisons would better isolate BAMI's contribution. In the revision we will report means and standard deviations over 3–5 runs and add explicit comparisons against generic test-time heuristics (multi-scale cropping, top-k filtering) to demonstrate that gains exceed those from non-bias-targeted methods. revision: yes
- Referee: [Ablation studies] While robustness across parameter ranges is asserted, the ablations do not measure whether coarse-to-fine focus and candidate selection reduce the targeted error categories (precision and ambiguity) versus trading one failure mode for another; this leaves the mechanistic explanation unverified.
  Authors: Current ablations (Section 4.3) quantify overall accuracy impact when each BAMI component is removed. To directly link manipulations to bias reduction, we will augment the ablation section with an error-type breakdown: we will classify failures into precision-bias and ambiguity-bias categories on a held-out subset both before and after applying coarse-to-fine focus and candidate selection, showing targeted decreases rather than simple trade-offs. revision: yes
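One hedged way to operationalize the error-type breakdown promised above: call a miss a precision failure when the predicted point falls just outside the ground-truth box, and an ambiguity failure when it lands inside some other element's box. The near-miss margin and the attribution rule below are assumptions for illustration, not the paper's definitions.

```python
# Sketch of a failure-type breakdown for grounding predictions (assumptions noted above).
def point_in_box(point, box):
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def classify_prediction(pred, gt_box, other_boxes, near_frac=0.05, image_size=(1920, 1080)):
    """Label a prediction as correct, precision_bias, ambiguity_bias, or other."""
    if point_in_box(pred, gt_box):
        return "correct"
    if any(point_in_box(pred, box) for box in other_boxes):
        return "ambiguity_bias"                      # landed on a competing element
    x, y = pred
    x0, y0, x1, y1 = gt_box
    dx = max(x0 - x, x - x1, 0)                      # horizontal distance outside the box
    dy = max(y0 - y, y - y1, 0)                      # vertical distance outside the box
    if max(dx, dy) <= near_frac * max(image_size):
        return "precision_bias"                      # near miss around the true target
    return "other"
```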
Circularity Check
No significant circularity; empirical post-hoc method with external validation
full rationale
The paper proposes MPD for observational bias attribution and BAMI for training-free mitigation, then reports accuracy lifts on independent benchmarks (ScreenSpot-Pro). No equations, fitted parameters, or derivations are present that reduce to self-inputs by construction. MPD identifies error sources from data patterns, and BAMI applies fixed manipulations; neither step renames a known result, imports uniqueness via self-citation, nor defines the output in terms of the input. The 5.9-point gain is presented as an experimental outcome rather than a tautological prediction. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GUI grounding errors stem primarily from precision bias due to resolution and ambiguity bias due to interface complexity, and these can be mitigated post-hoc without retraining.
invented entities (2)
- Masked Prediction Distribution (MPD): no independent evidence
- Bias-Aware Manipulation Inference (BAMI): no independent evidence