pith. sign in

arxiv: 2605.19260 · v1 · pith:3WBQTY3Lnew · submitted 2026-05-19 · 💻 cs.AI · cs.CV· cs.MA

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Pith reviewed 2026-05-20 06:16 UTC · model grok-4.3

classification 💻 cs.AI cs.CVcs.MA
keywords visual token reductionadaptive quadtreesGUI agentsmultimodal modelsinference-time optimizationtoken mergingtemporal consistencyscreenshot redundancy
0
0 comments X

The pith

AQuaUI merges tokens inside adaptive quadtree leaves on GUI screenshots to cut visual input size while keeping nearly all agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AQuaUI as a training-free way to exploit the fact that GUI screenshots contain large uniform areas with little task-relevant detail. It builds an adaptive quadtree that subdivides only where content is dense and replaces each leaf region with one merged token. Spatial coordinates of the retained tokens stay intact so the model’s position encodings continue to work unchanged. A conditional version of the quadtree carries information forward from the previous frame to avoid losing fine details during slow or static GUI transitions. If this holds, existing large multimodal GUI agents can run faster and cheaper on the same hardware without any retraining.

Core claim

AQuaUI partitions each screenshot with an adaptive quadtree and retains a single representative merged token per leaf, while a conditional refinement step uses the prior quadtree to stabilize the partition across consecutive frames; this produces up to 29.52 percent fewer visual tokens and 13.22 percent speedup on GUI-Owl-1.5-32B-Instruct while preserving 99.06 percent of the original grounding and navigation accuracy.

What carries the argument

Adaptive quadtree that recursively splits the screenshot according to local homogeneity and collapses each final leaf to one merged token, with position information preserved and a conditional variant that references the preceding frame’s partition.

If this is right

  • Existing GUI agent models obtain better accuracy-versus-speed trade-offs on grounding and navigation benchmarks without retraining.
  • Token usage falls by up to 29.52 percent and inference speeds up by up to 13.22 percent on models such as GUI-Owl-1.5-32B-Instruct.
  • Position encodings remain consistent because retained tokens keep their original spatial coordinates through the entire pipeline.
  • Temporal stability improves across multi-step interactions when the conditional quadtree reuses structure from prior frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quadtree partitioning idea could be tested on other visual-agent domains that also show large homogeneous regions, such as robotic camera feeds or document interfaces.
  • Varying the homogeneity threshold or leaf-size limit according to the current task might yield further efficiency gains beyond the fixed settings reported.
  • The method’s reliance on spatial redundancy suggests it could combine with existing attention-based compressors for hybrid token budgets.

Load-bearing premise

A single representative token per quadtree leaf still contains every piece of information the downstream agent needs to make correct grounding and navigation decisions.

What would settle it

Measure grounding or navigation accuracy on a standard benchmark set of screenshots that contain critical small text or icons inside large visually uniform panels; a drop below roughly 99 percent of full-token performance would indicate the representative-token step discards necessary detail.

Figures

Figures reproduced from arXiv: 2605.19260 by Ha Min Son, Muhao Chen, Tinghui Zhu, Xin Liu, Yuankai Li, Zhe Zhao.

Figure 1
Figure 1. Figure 1: An overview of the entire pipeline of AQuaUI. Given a GUI screenshot, AQuaUI will [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance breakdown on UI-Vision and ScreenSpot-V2. More detailed results can be [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance change with respect to different parameter and score function. Left: [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AquaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models. It constructs an adaptive quadtree on each GUI screenshot to exploit non-uniform spatial information density, retaining one representative merged token per quadtree leaf while preserving spatial positions. A conditional quadtree variant leverages continuity between consecutive screenshots for temporal consistency in multi-step interactions. Experiments on standard grounding and navigational benchmarks report improved accuracy-efficiency trade-offs, including up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance on GUI-Owl-1.5-32B-Instruct.

Significance. If the results hold, this work offers a practical contribution by enabling efficient inference for high-resolution GUI agents without retraining or attention-based compression. The training-free design, explicit preservation of spatial positions, and conditional quadtree for temporal consistency are clear strengths that align with the structured redundancy in GUI screenshots. The approach could meaningfully advance deployment of LMM-based agents in resource-limited settings, provided the core merging assumption is substantiated.

major comments (2)
  1. [Abstract] Abstract: the reported performance numbers (13.22% speedup, 29.52% token reduction, 99.06% retention on GUI-Owl-1.5-32B-Instruct) are presented without error bars, ablation studies on quadtree depth or merging parameters, or any description of how the representative token is computed inside each leaf (mean, max-pool, or other). These omissions make the central efficiency claim difficult to evaluate or reproduce.
  2. [Token-merging step] Token-merging step (described in abstract): the assumption that one representative merged token per adaptive quadtree leaf preserves all task-relevant information for downstream grounding and navigation is stated but unsupported. No analysis is given of the splitting criterion (e.g., pixel variance or homogeneity) or its reliability for small but semantically critical elements such as 8-12 px icons or single-line text in dense toolbars.
minor comments (1)
  1. [Abstract] The abstract refers to 'standard grounding and navigational benchmarks' without naming them (e.g., ScreenSpot, AndroidControl); explicit citation would improve comparability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and revised the paper to improve clarity, reproducibility, and substantiation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported performance numbers (13.22% speedup, 29.52% token reduction, 99.06% retention on GUI-Owl-1.5-32B-Instruct) are presented without error bars, ablation studies on quadtree depth or merging parameters, or any description of how the representative token is computed inside each leaf (mean, max-pool, or other). These omissions make the central efficiency claim difficult to evaluate or reproduce.

    Authors: We agree that these details are important for evaluation and reproducibility. In the revised manuscript, we describe the representative token computation as the mean pooling of the feature vectors within each quadtree leaf. We have added ablation studies varying the quadtree depth and merging parameters, with results reported in a new table in the experiments section. For error bars, as the quadtree construction is deterministic for a given image and threshold, we instead provide results across multiple benchmarks and models to demonstrate consistency. The abstract has been updated to mention these supporting analyses. revision: yes

  2. Referee: [Token-merging step] Token-merging step (described in abstract): the assumption that one representative merged token per adaptive quadtree leaf preserves all task-relevant information for downstream grounding and navigation is stated but unsupported. No analysis is given of the splitting criterion (e.g., pixel variance or homogeneity) or its reliability for small but semantically critical elements such as 8-12 px icons or single-line text in dense toolbars.

    Authors: We acknowledge the need for more explicit support of this assumption. We have added a detailed explanation of the splitting criterion in Section 3.2, which uses a homogeneity measure based on pixel variance to decide splits, ensuring that regions with text or icons are subdivided until they meet the homogeneity threshold or reach minimum size. To evaluate reliability for small elements, we include a new qualitative analysis with examples of dense toolbars, demonstrating that critical 8-12 px icons and text are preserved through adaptive splitting. The high retention rate of 99.06% on grounding tasks further substantiates that task-relevant information is maintained. We have expanded the discussion to address potential limitations in highly dense interfaces. revision: yes

Circularity Check

0 steps flagged

No significant circularity in AQuaUI derivation

full rationale

The paper describes AQuaUI as a training-free, inference-time procedural algorithm that builds an adaptive quadtree on GUI screenshots and retains one representative merged token per leaf while preserving positions. Performance results (e.g., 99.06% retention) are reported from empirical benchmarks on models like GUI-Owl-1.5-32B-Instruct rather than from any closed-form derivation or fitted parameters. No equations, self-definitional reductions, or load-bearing self-citations appear in the provided text that would make the claimed efficiency gains equivalent to the method's own inputs by construction. The approach is self-contained as an algorithmic exploitation of spatial redundancy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that GUI screenshots contain large homogeneous regions whose information can be summarized by a single token without harming agent performance; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption GUI screenshots exhibit highly non-uniform spatial information density with large homogeneous regions
    Invoked in the opening paragraph to motivate the quadtree construction.

pith-pipeline@v0.9.0 · 5867 in / 1262 out tokens · 37137 ms · 2026-05-20T06:16:07.819369+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 12 internal anchors

  1. [1]

    Developing a computer use model, 2024

    Anthropic. Developing a computer use model, 2024. URL https://www.anthropic.com/ news/developing-computer-use

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

  4. [4]

    R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

    Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, et al. R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

  5. [5]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

  6. [6]

    Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

    Kyle R Chickering, Bangzheng Li, and Muhao Chen. Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

  7. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

  8. [8]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  9. [9]

    Pact: Pruning and clustering-based token reduction for faster visual language models

    Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, and Aymen Shabou. Pact: Pruning and clustering-based token reduction for faster visual language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14582–14592, 2025

  10. [10]

    Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

    Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, and Chien-Sheng Wu. Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

  11. [11]

    Hunter and Kenneth Steiglitz

    Gregory M. Hunter and Kenneth Steiglitz. Operations on images using quad trees.IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):145–153, 1979. doi: 10.1109/TPAMI.1979.4766900

  12. [12]

    What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph

    Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4075–4083, 2025

  13. [13]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778– 8786, 2025

  14. [14]

    On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

    Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

  15. [15]

    Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024. 10

  16. [16]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

  17. [17]

    Ui-voyager: A self-evolving gui agent learning via failed experience

    Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience.arXiv preprint arXiv:2603.24533, 2026

  18. [18]

    Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract)

    Jizhihui Liu, Guangdao Zhu, and Feiyi Du. Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract). InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 41275–41277, 2026

  19. [19]

    Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

    Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

  20. [20]

    TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression.arXiv preprint arXiv:2604.04921, 2026

  21. [21]

    Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

    Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

  22. [22]

    Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

    Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, and Hwee Tou Ng. Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

  23. [23]

    Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

    Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, et al. Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

  24. [24]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  25. [25]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

  26. [26]

    ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

    Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Clawgui: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026

  27. [27]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  28. [28]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  29. [29]

    Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

    Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

  30. [30]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 11

  31. [31]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  32. [32]

    Mobile-agent-v3

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

  33. [33]

    Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

  34. [34]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025

  35. [35]

    Mobile-Agent-v3: Fundamental Agents for GUI Automation

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

  36. [36]

    Enhancing multi- modal large language models complex reason via similarity computation

    Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, and Jiawei Yao. Enhancing multi- modal large language models complex reason via similarity computation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10203–10211, 2025

  37. [37]

    Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

  38. [38]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 12 A Detailed Algorithm A.1 Grid Alignment and Boundary Handling Given a resized screenshot of size W×H , AQuaUI c...