
arxiv: 2605.27134 · v1 · pith:6TUDGJIOnew · submitted 2026-05-26 · 💻 cs.AI

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Pith reviewed 2026-06-29 17:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords: vision language models · GUI navigation · reinforcement learning · data scaling · mobile agents · benchmarking toolkit · finetuning methods · out-of-domain generalization

The pith

Reinforcement-based finetuning outperforms supervised finetuning for vision-language agents navigating mobile GUIs, especially in new domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a large dataset and evaluation toolkit to test how vision-language models learn to control mobile apps through their screens. It demonstrates that increasing the amount of training data helps performance, but applying reinforcement learning on top of that data produces stronger results than standard supervised training alone. The advantage of reinforcement learning is most pronounced when the model encounters applications not seen during training. A sympathetic reader would care because reliable GUI agents could enable automation of everyday phone tasks without manual coding.

Core claim

Using HyperTrack, a dataset of over 16,000 tasks from more than 650 Chinese mobile applications, and the GUIEvalKit benchmark, the paper shows that reinforcement-based finetuning of VLMs consistently outperforms supervised finetuning. The gap widens in out-of-domain settings, which the authors read as a synergy between data scaling and reinforcement learning for GUI navigation agents.

What carries the argument

The argument rests on a comparison of supervised and reinforcement-based finetuning on scaled data from the HyperTrack dataset, with performance measured separately on in-domain and out-of-domain tasks.

Load-bearing premise

The 16,000 tasks from Chinese mobile applications represent general real-world GUI navigation challenges, and the offline evaluation in GUIEvalKit measures true agent capability without biases.

What would settle it

A follow-up study collecting tasks from non-Chinese apps and showing that reinforcement finetuning no longer outperforms supervised finetuning on those tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.27134 by Heng Qu, Jian Luan, Pengzhi Gao, Renren Jin, Wei Liu, Wenzong Zhang, Yike Liu.

Figure 1: Comparison of Type Match and Exact Match across models trained with different numbers of episodes under supervised and reinforcement-based finetuning. Type Match is the proportion of responses whose predicted action type matches the ground truth, i.e., R_action-type = 1. Exact Match is the proportion of responses with a correct action type and parameters, i.e., R_params = 1. … 0 otherwise. The parameter reward…
Figure 2: Main components in GUIEvalKit.
… described above. Click actions account for 69.3% of the HyperTrack training set, making this setting representative of the dominant action category. The results show that the overall scaling trend remains consistent across backbones and reward formulations. RL continues to outperform SFT at comparable data scales, and the Gaussian reward achieves performance comparable to the…
Figure 3: Correlation with AndroidWorld (AW) online success across six models. Left axis: SOEval/offline step exact match; right axis: SOEval progress; ρ: Spearman correlation; R²: coefficient of determination. Curves show second-order Legendre fits. These empirical results support interpreting SOEval as a context-aligned, static-data, and step-level approximation, not as a substitute for full online evaluation. t…
Figure 4
Figure 5: Relationship between exact match and online source ratio according to different on-policy mixing strategies, with Spearman correlation coefficients R_s reported.
Figure 6: PASS@n performance according to rollout size n.
Figure 7: Diversity distribution under different inference modes. We evaluate GUI-Owl-7B across temporal sampling gaps in [−1, 1].
Figure 8: Effect of reasoning on decision behavior. Left: reasoning consistently increases decision diversity for a substantial subset of tasks, but most diversity gains are accompanied by reduced decision stability. Right: stability gains are concentrated on samples with low instruct-only baseline stability, while stability losses primarily occur on samples with high baseline stability.
Figure 9: Distribution of app categories in the HyperTrack dataset.
[Plot panels: Task Length Distribution (number of tasks vs. number of steps); Instruction Length Distribution (number of instructions vs. number of words); Action Distribution (number of actions per action type: OPEN, CLICK, SCROLL, TYPE, STOP).]
Figure 10: HyperTrack dataset statistics.
A.1. Data Collection, Quality Control, and Privacy Constraints. The HyperTrack release format provides a static, reproducible representation of each interaction trajectory for offline training and evaluation. Each episode consists of a high-level user instruction, an ordered sequence of screenshots paired with textual screen observations, structured action annotations (includ…
Figure 11 reports the complementary Qwen3-VL-8B-Thinking scaling results under SFT, binary RL, and Gaussian spatial reward.
[Plot panels, accuracy vs. number of episodes (10² to 10⁴): IDD; Unseen APP; Unseen Device; Unseen APP & Device. Legend: RL-binary (Exact Match), RL-binary (Ty…]
Figure 12: Representative privacy-cleared Chinese-language HyperTrack screenshots from one trajectory. The corresponding instruction is to view the reputation-ranking page in a Chinese novel app.
B. GUIEvalKit Toolkit: Design Details
Action | Description
CLICK(point = (x, y)) | Click at the relative coordinates (x, y) ∈ [0, 1000].
LONG_PRESS(point = (x, y), duration = t) | Long press at point for duration t.
SCROLL(point …
Figure 13: Click executions clustering. Execution points are relatively uniformly dispersed while remaining clearly concentrated on the target UI component, forming a single stable cluster.
Figure 14: Click executions clustering. The execution distribution exhibits two well-separated but semantically aligned substructures. Due to the geometric shape of the target component, both subclusters are centered on the same UI element and correspond to a single decision.
Figure 15 presents a scenario in which two semantically distinct click decisions, targeting different UI elements, are cleanly separated into two well-defined clusters. In this case, ϵ = 70 correctly recovers the underlying decision structure without ambiguity.
Figure 16: Click executions clustering. Three major clusters correspond to distinct high-level functional regions of the interface. However, under a compact and dense layout with multiple adjacent small UI elements (top-left region), execution points targeting different components are merged into a single cluster at ϵ = 70, leading to a mild underestimation of decision diversity. In practical GUI tasks, the majority…
Figure 17: Behavior of nlogi+ under different parameter settings. Left: Fixed plb = 0, gap = 1, and E(p) = 0.5. Increasing κ sharpens the temporal transition and induces stronger temporal bias, while smaller values lead to a more linear evolution. Right: Fixed plb = 0, κ = 16, and E(p) = 0.3. Increasing gap concentrates probability mass toward near-decision contexts, resulting in stronger temporal bias; smaller gaps…
Figure 18: Relationship between exact match and online source ratio according to different on-policy mixing strategies for Qwen3-VL-8B-Thinking.
D.4. Horizon-Length Degradation Analysis. We analyze step-level exact match as a function of both absolute step index and relative step ratio. The step-ratio stratification avoids conflating states that have the same absolute index but correspond to different task phases, su…
Figure 19: Horizon-length analysis using absolute step index and relative step ratio. The curves indicate phase- and model-dependent degradation rather than a universal sharp drop after a fixed step count.
Figure 20: Execution target mismatch, corresponding to the action-target category in the R-E consistency failure taxonomy.
Figure 21: Execution type mismatch, corresponding to the action-type category in the R-E consistency failure taxonomy.
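
Figures 13 to 16 describe grouping click executions into clusters with a radius parameter ϵ = 70 and reading the number of clusters as decision diversity. The captions do not name the clustering algorithm; the ϵ parameter suggests a density-based method such as DBSCAN, so the sketch below assumes DBSCAN and should not be read as the authors' procedure. The function names, `min_pts = 3`, and the sample coordinates are hypothetical.

```python
import math

def dbscan(points, eps=70.0, min_pts=3):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be absorbed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

def decision_diversity(points, eps=70.0, min_pts=3):
    """Number of distinct click clusters, ignoring noise points."""
    return len({label for label in dbscan(points, eps, min_pts) if label != -1})

# Hypothetical click coordinates on a 0..1000 relative grid.
clicks = [
    (100, 100), (110, 105), (95, 120), (105, 90),   # clicks on one element
    (600, 800), (610, 790), (590, 805),             # clicks on another element
    (900, 50),                                      # isolated stray click
]
print(decision_diversity(clicks))  # 2
```

With a fixed radius, adjacent small targets closer than ϵ will merge into one cluster, which is the underestimation effect the Figure 16 caption describes.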
Original abstract

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces HyperTrack, a dataset of over 16,000 real-world tasks across 650+ Chinese mobile apps, and GUIEvalKit, an open-source toolkit for offline benchmarking of VLM agents on GUI navigation. It studies effects of data scaling on supervised finetuning versus reinforcement-based finetuning, reports that RL finetuning outperforms SFT (especially OOD), benchmarks SOTA VLMs, and analyzes the impact of interaction history and reasoning capabilities on task completion.

Significance. If the results hold under robust evaluation, the work supplies a large-scale public resource and empirical evidence for the value of RL over SFT in scaling VLM agents for mobile GUI tasks, which could inform future agent training pipelines in this domain.

major comments (1)
  1. [GUIEvalKit description and experimental setup] GUIEvalKit / offline evaluation protocol: the central claim that reinforcement-based finetuning outperforms supervised finetuning (especially OOD) rests on automated success metrics whose precise definition is not provided in sufficient detail. If success is computed via exact action-sequence match or final-screen similarity to a single reference trajectory, RL—which directly optimizes reward signals derived from those trajectories—will appear superior by construction even when real-world task completion rates are comparable; the OOD split by app does not automatically eliminate this artifact.
minor comments (2)
  1. The abstract states performance trends but supplies no quantitative numbers, metrics, statistical tests, or error bars, making it impossible to gauge effect sizes from the summary alone.
  2. [Dataset construction] The manuscript should clarify how task difficulty and success criteria were validated to be free of hidden biases when collecting the 16,000 tasks from Chinese apps.
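
For orientation on the metric question, the Figure 1 caption defines two step-level quantities: Type Match (predicted action type equals the ground truth) and Exact Match (action type and parameters both correct). A minimal sketch of how such metrics could be computed follows. The `Action` class, the distance tolerance for click points, and the string comparison for typed text are assumptions made for illustration; the source does not specify how parameters are compared.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Action:
    type: str                                    # e.g. "CLICK", "TYPE", "STOP"
    point: Optional[Tuple[float, float]] = None  # relative coordinates, if any
    text: Optional[str] = None                   # typed text, if any

def type_match(pred: Action, gold: Action) -> bool:
    return pred.type == gold.type

def exact_match(pred: Action, gold: Action, tol: float = 50.0) -> bool:
    """Type must match, and every gold parameter must be matched too.

    The click tolerance `tol` is a hypothetical choice, not from the paper.
    """
    if pred.type != gold.type:
        return False
    if gold.point is not None:
        if pred.point is None or math.dist(pred.point, gold.point) > tol:
            return False
    if gold.text is not None:
        if pred.text is None or pred.text.strip() != gold.text.strip():
            return False
    return True

def score(preds, golds):
    """Return (type_match_rate, exact_match_rate) over aligned steps."""
    n = len(golds)
    tm = sum(type_match(p, g) for p, g in zip(preds, golds)) / n
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    return tm, em

# Hypothetical three-step example.
golds = [Action("CLICK", point=(500, 500)), Action("TYPE", text="hello"), Action("STOP")]
preds = [Action("CLICK", point=(510, 505)), Action("TYPE", text="help"), Action("CLICK", point=(1, 1))]
tm, em = score(preds, golds)
```

Under these definitions Exact Match can never exceed Type Match, since an exact match requires a type match first.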

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the concern on the evaluation protocol below and will revise the manuscript to add the requested detail.

Point-by-point responses
  1. Referee: [GUIEvalKit description and experimental setup] GUIEvalKit / offline evaluation protocol: the central claim that reinforcement-based finetuning outperforms supervised finetuning (especially OOD) rests on automated success metrics whose precise definition is not provided in sufficient detail. If success is computed via exact action-sequence match or final-screen similarity to a single reference trajectory, RL—which directly optimizes reward signals derived from those trajectories—will appear superior by construction even when real-world task completion rates are comparable; the OOD split by app does not automatically eliminate this artifact.

    Authors: We agree that the current description of the success metric in GUIEvalKit is insufficiently detailed and will expand it in the revision. The metric is implemented as a functional task-completion checker that verifies whether the final app state satisfies the task goal (via screen embedding similarity above a threshold plus verification of critical UI elements or state changes), rather than requiring exact action-sequence or single-trajectory matching. This design intentionally permits multiple valid trajectories. We will add pseudocode, threshold values, and concrete examples of how the checker handles trajectory variation in the GUIEvalKit section. The OOD split (by app) further reduces the risk of leakage because reference trajectories are app-specific and the checker operates on semantic state equivalence, not surface-form matching. We will also report an additional human-verified subset to corroborate the automated metric. revision: yes
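
    The checker described in this simulated response could look roughly like the sketch below: a final-state test that combines an embedding-similarity threshold with a check for required UI elements. Everything here is hypothetical, including the function names, the 0.9 threshold, and the representation of UI elements as string identifiers; the response itself gives no pseudocode or threshold values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def task_completed(final_embedding, goal_embedding,
                   visible_elements, required_elements,
                   threshold=0.9):
    """Judge success from the final state only, not from the action path.

    threshold is an assumed value; the source does not report one.
    """
    similar = cosine(final_embedding, goal_embedding) >= threshold
    has_elements = set(required_elements) <= set(visible_elements)
    return similar and has_elements

# Hypothetical final state: embedding close to the goal, key element present.
ok = task_completed(
    final_embedding=[0.9, 0.1, 0.0],
    goal_embedding=[1.0, 0.0, 0.0],
    visible_elements={"order_confirmed", "total_price"},
    required_elements={"order_confirmed"},
)
```

    Because only the final state is inspected, two different action sequences that reach the same screen receive the same verdict, which is the property the response appeals to.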

Circularity Check

0 steps flagged

Empirical benchmarking study with no circular derivations

full rationale

The paper is a data-collection and benchmarking study: it introduces HyperTrack (16k tasks from Chinese apps) and GUIEvalKit, then reports experimental comparisons of SFT vs. RL finetuning and VLM benchmarks. No equations, derivations, or fitted parameters are presented whose outputs reduce to the inputs by construction. Central claims rest on new empirical results that are externally falsifiable via the released dataset and toolkit. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical derivations, free parameters, or invented entities are described. The work rests on standard assumptions that collected tasks reflect real usage and that offline success metrics correlate with deployed performance.
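
The assumption that offline metrics track deployed performance is testable: the Figure 3 caption reports Spearman correlation between offline step exact match and AndroidWorld online success. A minimal pure-Python sketch of Spearman's ρ (Pearson correlation of average ranks) is given below. The function names are illustrative, and the sample numbers are made-up placeholders, not results from the paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder scores for four hypothetical models (not from the paper).
offline_exact_match = [0.20, 0.40, 0.50, 0.70]
online_success = [0.10, 0.30, 0.25, 0.60]
rho = spearman(offline_exact_match, online_success)
```

A high ρ would indicate only that the offline metric orders models similarly to online success; it says nothing about whether absolute offline scores match online completion rates.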



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Xiaomi-GUI-0 Technical Report

    cs.AI · 2026-06 · unverdicted · novelty 4.0

    Xiaomi-GUI-0 reports 72.0% success on an in-house real-mobile benchmark and 78.9% on AndroidWorld after training a GUI agent in a real-device closed loop with an error-driven data flywheel and three-stage RL pipeline.
