AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Ha Min Son; Muhao Chen; Tinghui Zhu; Xin Liu; Yuankai Li; Zhe Zhao

arxiv: 2605.19260 · v1 · pith:3WBQTY3Lnew · submitted 2026-05-19 · 💻 cs.AI · cs.CV· cs.MA

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Yuankai Li , Tinghui Zhu , Ha Min Son , Zhe Zhao , Xin Liu , Muhao Chen This is my paper

Pith reviewed 2026-05-20 06:16 UTC · model grok-4.3

classification 💻 cs.AI cs.CVcs.MA

keywords visual token reductionadaptive quadtreesGUI agentsmultimodal modelsinference-time optimizationtoken mergingtemporal consistencyscreenshot redundancy

0 comments

The pith

AQuaUI merges tokens inside adaptive quadtree leaves on GUI screenshots to cut visual input size while keeping nearly all agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AQuaUI as a training-free way to exploit the fact that GUI screenshots contain large uniform areas with little task-relevant detail. It builds an adaptive quadtree that subdivides only where content is dense and replaces each leaf region with one merged token. Spatial coordinates of the retained tokens stay intact so the model’s position encodings continue to work unchanged. A conditional version of the quadtree carries information forward from the previous frame to avoid losing fine details during slow or static GUI transitions. If this holds, existing large multimodal GUI agents can run faster and cheaper on the same hardware without any retraining.

Core claim

AQuaUI partitions each screenshot with an adaptive quadtree and retains a single representative merged token per leaf, while a conditional refinement step uses the prior quadtree to stabilize the partition across consecutive frames; this produces up to 29.52 percent fewer visual tokens and 13.22 percent speedup on GUI-Owl-1.5-32B-Instruct while preserving 99.06 percent of the original grounding and navigation accuracy.

What carries the argument

Adaptive quadtree that recursively splits the screenshot according to local homogeneity and collapses each final leaf to one merged token, with position information preserved and a conditional variant that references the preceding frame’s partition.

If this is right

Existing GUI agent models obtain better accuracy-versus-speed trade-offs on grounding and navigation benchmarks without retraining.
Token usage falls by up to 29.52 percent and inference speeds up by up to 13.22 percent on models such as GUI-Owl-1.5-32B-Instruct.
Position encodings remain consistent because retained tokens keep their original spatial coordinates through the entire pipeline.
Temporal stability improves across multi-step interactions when the conditional quadtree reuses structure from prior frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quadtree partitioning idea could be tested on other visual-agent domains that also show large homogeneous regions, such as robotic camera feeds or document interfaces.
Varying the homogeneity threshold or leaf-size limit according to the current task might yield further efficiency gains beyond the fixed settings reported.
The method’s reliance on spatial redundancy suggests it could combine with existing attention-based compressors for hybrid token budgets.

Load-bearing premise

A single representative token per quadtree leaf still contains every piece of information the downstream agent needs to make correct grounding and navigation decisions.

What would settle it

Measure grounding or navigation accuracy on a standard benchmark set of screenshots that contain critical small text or icons inside large visually uniform panels; a drop below roughly 99 percent of full-token performance would indicate the representative-token step discards necessary detail.

Figures

Figures reproduced from arXiv: 2605.19260 by Ha Min Son, Muhao Chen, Tinghui Zhu, Xin Liu, Yuankai Li, Zhe Zhao.

**Figure 2.** Figure 2: Performance breakdown on UI-Vision and ScreenSpot-V2. More detailed results can be [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Performance change with respect to different parameter and score function. Left: [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AquaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AQuaUI shows a simple training-free quadtree method that trims GUI tokens at inference and reports solid speedups, but skips details on token selection and robustness checks.

read the letter

AQuaUI is a training-free inference-time method that builds adaptive quadtrees over GUI screenshots, merges to one representative token per leaf, and keeps spatial positions intact. It also adds a conditional variant that references prior frames for better consistency in multi-step agent runs. This targets the uneven information density in interfaces without relying on attention scores or extra training, which sets it apart from the baselines cited in the abstract.

Referee Report

2 major / 1 minor

Summary. The paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models. It constructs an adaptive quadtree on each GUI screenshot to exploit non-uniform spatial information density, retaining one representative merged token per quadtree leaf while preserving spatial positions. A conditional quadtree variant leverages continuity between consecutive screenshots for temporal consistency in multi-step interactions. Experiments on standard grounding and navigational benchmarks report improved accuracy-efficiency trade-offs, including up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance on GUI-Owl-1.5-32B-Instruct.

Significance. If the results hold, this work offers a practical contribution by enabling efficient inference for high-resolution GUI agents without retraining or attention-based compression. The training-free design, explicit preservation of spatial positions, and conditional quadtree for temporal consistency are clear strengths that align with the structured redundancy in GUI screenshots. The approach could meaningfully advance deployment of LMM-based agents in resource-limited settings, provided the core merging assumption is substantiated.

major comments (2)

[Abstract] Abstract: the reported performance numbers (13.22% speedup, 29.52% token reduction, 99.06% retention on GUI-Owl-1.5-32B-Instruct) are presented without error bars, ablation studies on quadtree depth or merging parameters, or any description of how the representative token is computed inside each leaf (mean, max-pool, or other). These omissions make the central efficiency claim difficult to evaluate or reproduce.
[Token-merging step] Token-merging step (described in abstract): the assumption that one representative merged token per adaptive quadtree leaf preserves all task-relevant information for downstream grounding and navigation is stated but unsupported. No analysis is given of the splitting criterion (e.g., pixel variance or homogeneity) or its reliability for small but semantically critical elements such as 8-12 px icons or single-line text in dense toolbars.

minor comments (1)

[Abstract] The abstract refers to 'standard grounding and navigational benchmarks' without naming them (e.g., ScreenSpot, AndroidControl); explicit citation would improve comparability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and revised the paper to improve clarity, reproducibility, and substantiation of our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the reported performance numbers (13.22% speedup, 29.52% token reduction, 99.06% retention on GUI-Owl-1.5-32B-Instruct) are presented without error bars, ablation studies on quadtree depth or merging parameters, or any description of how the representative token is computed inside each leaf (mean, max-pool, or other). These omissions make the central efficiency claim difficult to evaluate or reproduce.

Authors: We agree that these details are important for evaluation and reproducibility. In the revised manuscript, we describe the representative token computation as the mean pooling of the feature vectors within each quadtree leaf. We have added ablation studies varying the quadtree depth and merging parameters, with results reported in a new table in the experiments section. For error bars, as the quadtree construction is deterministic for a given image and threshold, we instead provide results across multiple benchmarks and models to demonstrate consistency. The abstract has been updated to mention these supporting analyses. revision: yes
Referee: [Token-merging step] Token-merging step (described in abstract): the assumption that one representative merged token per adaptive quadtree leaf preserves all task-relevant information for downstream grounding and navigation is stated but unsupported. No analysis is given of the splitting criterion (e.g., pixel variance or homogeneity) or its reliability for small but semantically critical elements such as 8-12 px icons or single-line text in dense toolbars.

Authors: We acknowledge the need for more explicit support of this assumption. We have added a detailed explanation of the splitting criterion in Section 3.2, which uses a homogeneity measure based on pixel variance to decide splits, ensuring that regions with text or icons are subdivided until they meet the homogeneity threshold or reach minimum size. To evaluate reliability for small elements, we include a new qualitative analysis with examples of dense toolbars, demonstrating that critical 8-12 px icons and text are preserved through adaptive splitting. The high retention rate of 99.06% on grounding tasks further substantiates that task-relevant information is maintained. We have expanded the discussion to address potential limitations in highly dense interfaces. revision: yes

Circularity Check

0 steps flagged

No significant circularity in AQuaUI derivation

full rationale

The paper describes AQuaUI as a training-free, inference-time procedural algorithm that builds an adaptive quadtree on GUI screenshots and retains one representative merged token per leaf while preserving positions. Performance results (e.g., 99.06% retention) are reported from empirical benchmarks on models like GUI-Owl-1.5-32B-Instruct rather than from any closed-form derivation or fitted parameters. No equations, self-definitional reductions, or load-bearing self-citations appear in the provided text that would make the claimed efficiency gains equivalent to the method's own inputs by construction. The approach is self-contained as an algorithmic exploitation of spatial redundancy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that GUI screenshots contain large homogeneous regions whose information can be summarized by a single token without harming agent performance; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption GUI screenshots exhibit highly non-uniform spatial information density with large homogeneous regions
Invoked in the opening paragraph to motivate the quadtree construction.

pith-pipeline@v0.9.0 · 5867 in / 1262 out tokens · 37137 ms · 2026-05-20T06:16:07.819369+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree... splitting criterion s(n)=wn hn ·Var(grey(n))
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

conditional quadtree algorithm that leverages the continuity between consecutive screenshots

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 12 internal anchors

[1]

Developing a computer use model, 2024

Anthropic. Developing a computer use model, 2024. URL https://www.anthropic.com/ news/developing-computer-use

work page 2024
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, et al. R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

work page arXiv 2025
[5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

work page 2024
[6]

Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

Kyle R Chickering, Bangzheng Li, and Muhao Chen. Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

work page arXiv 2025
[7]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

work page 2022
[8]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[9]

Pact: Pruning and clustering-based token reduction for faster visual language models

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, and Aymen Shabou. Pact: Pruning and clustering-based token reduction for faster visual language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14582–14592, 2025

work page 2025
[10]

Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, and Chien-Sheng Wu. Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

work page arXiv 2025
[11]

Hunter and Kenneth Steiglitz

Gregory M. Hunter and Kenneth Steiglitz. Operations on images using quad trees.IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):145–153, 1979. doi: 10.1109/TPAMI.1979.4766900

work page doi:10.1109/tpami.1979.4766900 1979
[12]

What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph

Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4075–4083, 2025

work page 2025
[13]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778– 8786, 2025

work page 2025
[14]

On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

work page 2024
[15]

Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024. 10

work page 2024
[16]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

work page 2025
[17]

Ui-voyager: A self-evolving gui agent learning via failed experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience.arXiv preprint arXiv:2603.24533, 2026

work page arXiv 2026
[18]

Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract)

Jizhihui Liu, Guangdao Zhu, and Feiyi Du. Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract). InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 41275–41277, 2026

work page 2026
[19]

Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

work page arXiv 2024
[20]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression.arXiv preprint arXiv:2604.04921, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

work page arXiv 2025
[22]

Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, and Hwee Tou Ng. Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

work page arXiv 2026
[23]

Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, et al. Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

work page arXiv 2025
[24]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Clawgui: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

work page arXiv 2025
[30]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024
[32]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

work page arXiv 2026
[33]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025

work page 2025
[35]

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

work page internal anchor Pith review arXiv 2025
[36]

Enhancing multi- modal large language models complex reason via similarity computation

Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, and Jiawei Yao. Enhancing multi- modal large language models complex reason via similarity computation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10203–10211, 2025

work page 2025
[37]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

work page arXiv 2025
[38]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 12 A Detailed Algorithm A.1 Grid Alignment and Boundary Handling Given a resized screenshot of size W×H , AQuaUI c...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Developing a computer use model, 2024

Anthropic. Developing a computer use model, 2024. URL https://www.anthropic.com/ news/developing-computer-use

work page 2024

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, et al. R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

work page arXiv 2025

[5] [5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

work page 2024

[6] [6]

Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

Kyle R Chickering, Bangzheng Li, and Muhao Chen. Qlip: A dynamic quadtree vision prior enhances mllm performance without retraining.arXiv preprint arXiv:2505.23004, 2025

work page arXiv 2025

[7] [7]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

work page 2022

[8] [8]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023

[9] [9]

Pact: Pruning and clustering-based token reduction for faster visual language models

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, and Aymen Shabou. Pact: Pruning and clustering-based token reduction for faster visual language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14582–14592, 2025

work page 2025

[10] [10]

Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, and Chien-Sheng Wu. Gui-kv: Ef- ficient gui agents via kv cache with spatio-temporal awareness.arXiv preprint arXiv:2510.00536, 2025

work page arXiv 2025

[11] [11]

Hunter and Kenneth Steiglitz

Gregory M. Hunter and Kenneth Steiglitz. Operations on images using quad trees.IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):145–153, 1979. doi: 10.1109/TPAMI.1979.4766900

work page doi:10.1109/tpami.1979.4766900 1979

[12] [12]

What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph

Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, and Yiyi Zhou. What kind of visual tokens do we need? training-free visual token pruning for multi-modal large language models from the perspective of graph. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4075–4083, 2025

work page 2025

[13] [13]

Screenspot-pro: Gui grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778– 8786, 2025

work page 2025

[14] [14]

On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

work page 2024

[15] [15]

Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024. 10

work page 2024

[16] [16]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

work page 2025

[17] [17]

Ui-voyager: A self-evolving gui agent learning via failed experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience.arXiv preprint arXiv:2603.24533, 2026

work page arXiv 2026

[18] [18]

Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract)

Jizhihui Liu, Guangdao Zhu, and Feiyi Du. Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models (student abstract). InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 41275–41277, 2026

work page 2026

[19] [19]

Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

work page arXiv 2024

[20] [20]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric kv compression.arXiv preprint arXiv:2604.04921, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

work page arXiv 2025

[22] [22]

Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, and Hwee Tou Ng. Focusui: Efficient ui grounding via position-preserving visual token selection.arXiv preprint arXiv:2601.03928, 2026

work page arXiv 2026

[23] [23]

Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, et al. Can visual input be compressed? a visual token compression benchmark for large multimodal models.arXiv preprint arXiv:2511.02650, 2025

work page arXiv 2025

[24] [24]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Clawgui: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[27] [27]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

work page arXiv 2025

[30] [30]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024

[32] [32]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

work page arXiv 2026

[33] [33]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025

work page 2025

[35] [35]

Mobile-Agent-v3: Fundamental Agents for GUI Automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

work page internal anchor Pith review arXiv 2025

[36] [36]

Enhancing multi- modal large language models complex reason via similarity computation

Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, and Jiawei Yao. Enhancing multi- modal large language models complex reason via similarity computation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 10203–10211, 2025

work page 2025

[37] [37]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

work page arXiv 2025

[38] [38]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 12 A Detailed Algorithm A.1 Grid Alignment and Boundary Handling Given a resized screenshot of size W×H , AQuaUI c...

work page internal anchor Pith review Pith/arXiv arXiv 2023