pith. machine review for the scientific record.

arxiv: 2604.23941 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords grounding · goclick · element · lightweight · model · small · accuracy · agent

The pith

GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Phone screens contain buttons, icons, and text fields. The goal is to point at exactly the right one when given a command such as 'tap the login button'. Large AI models can do this but demand too much memory and power for phones. GoClick uses a smaller encoder-decoder structure instead of shrinking a big decoder-only model. It also filters a 10.8-million-example dataset down to 3.8 million high-quality examples by checking task type and adjusting data ratios. Experiments indicate the resulting model performs close to much larger systems and improves success rates when used inside full GUI agents.
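To make the task concrete, below is a minimal sketch of the point-in-box success criterion that GUI grounding benchmarks commonly use: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The abstract does not spell out GoClick's exact metric, so the data structures and normalization here are illustrative assumptions, not the paper's evaluation code.

```python
# Hypothetical sketch of the standard GUI grounding success criterion:
# a prediction is a hit when the predicted click point lies inside the
# ground-truth element's bounding box.
from dataclasses import dataclass

@dataclass
class GroundingExample:
    instruction: str                                    # e.g. "tap the login button"
    target_box: tuple[float, float, float, float]       # (x1, y1, x2, y2), normalized to [0, 1]

def is_hit(pred_xy: tuple[float, float],
           box: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted point lies inside the target box."""
    x, y = pred_xy
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predictions, examples) -> float:
    """Fraction of instructions whose predicted click lands on the right element."""
    hits = sum(is_hit(p, ex.target_box) for p, ex in zip(predictions, examples))
    return hits / max(len(examples), 1)
```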

Core claim

GoClick is a lightweight GUI element grounding VLM with only 230M parameters that achieves visual grounding accuracy on par with significantly larger models.

Load-bearing premise

That the encoder-decoder architecture choice, together with the generalizable high-quality core set produced by the Progressive Data Refinement pipeline (task type filtering plus data ratio adjustment), delivers the claimed accuracy gains over decoder-only downsizing.
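As a rough illustration of what the two named refinement steps could look like in practice, the sketch below first filters raw samples by task type and then re-balances the mixture toward target ratios when drawing the core set. The task taxonomy, keep-set, and ratios are invented placeholders; the paper's actual filtering criteria and mixture are not given in the abstract.

```python
# Hypothetical two-step refinement mirroring the abstract's description:
# (1) task type filtering, (2) data ratio adjustment. All constants are assumed.
import random
from collections import defaultdict

KEEP_TASK_TYPES = {"element_grounding", "icon_grounding", "text_grounding"}              # assumed taxonomy
TARGET_RATIOS = {"element_grounding": 0.5, "icon_grounding": 0.3, "text_grounding": 0.2}  # assumed mixture

def refine(raw_samples, core_size=3_800_000, seed=0):
    """raw_samples: iterable of dicts, each with a 'task_type' key."""
    rng = random.Random(seed)

    # Step 1: task type filtering -- drop samples whose task type is not kept.
    by_type = defaultdict(list)
    for s in raw_samples:
        if s["task_type"] in KEEP_TASK_TYPES:
            by_type[s["task_type"]].append(s)

    # Step 2: data ratio adjustment -- draw each kept type up to its share of the core set.
    core = []
    for task, ratio in TARGET_RATIOS.items():
        pool = by_type[task]
        quota = min(len(pool), int(core_size * ratio))
        core.extend(rng.sample(pool, quota))

    rng.shuffle(core)
    return core
```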

read the original abstract

Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language model (VLM) (more than 2.5B parameters), making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GoClick, a 230M-parameter encoder-decoder VLM designed for GUI element grounding. It posits that downsizing decoder-only VLMs leads to suboptimal performance at this scale, advocating instead for an encoder-decoder design paired with a Progressive Data Refinement pipeline involving task type filtering and data ratio adjustment to distill a 3.8M-sample core dataset from 10.8M raw samples. The model is reported to match the grounding accuracy of models exceeding 2.5B parameters on various benchmarks and to boost overall GUI agent success rates in a device-cloud collaborative setup.

Significance. Should the quantitative results and ablations confirm the claims, the work would offer a valuable contribution to efficient on-device GUI agents by delivering a compact model suitable for resource-limited devices. It provides insight into optimal architecture selection for small VLMs in domain-specific tasks and a systematic approach to data curation. The proposed device-cloud framework integration illustrates practical benefits for agent performance. These elements position the paper as a meaningful step toward deployable GUI interaction systems.

major comments (3)
  1. [Experiments section (likely §4)] The central claim of achieving parity with larger models while outperforming downsized decoder-only alternatives rests on experimental evidence. However, to establish that the encoder-decoder architecture (rather than the data refinement) is responsible for the gains, direct comparisons are needed where both architectures are trained on the identical 3.8M core set. The current description leaves open the possibility that performance differences arise from training data quality differences.
  2. [§3.2 (Progressive Data Refinement)] The pipeline's task type filtering and ratio adjustment are presented as key to extracting a high-quality core set. Yet, the manuscript should include ablations demonstrating the incremental benefits of each step and tests on benchmarks or data splits not involved in the filtering process to rule out selection bias toward easier or benchmark-specific samples.
  3. [Abstract] While the abstract asserts experimental superiority, it omits the names of the GUI element grounding benchmarks, the specific larger models compared, and any quantitative metrics or error bars. These details are essential in the abstract or early in the experiments section to allow readers to evaluate the strength of the performance claims.
minor comments (2)
  1. [Abstract] Specify the exact parameter counts of the 'significantly larger models' for clarity.
  2. [Method and Experiments] Include hardware details (e.g., GPU type) for the reported inference speed to enhance reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and committing to the necessary revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments section (likely §4)] The central claim of achieving parity with larger models while outperforming downsized decoder-only alternatives rests on experimental evidence. However, to establish that the encoder-decoder architecture (rather than the data refinement) is responsible for the gains, direct comparisons are needed where both architectures are trained on the identical 3.8M core set. The current description leaves open the possibility that performance differences arise from training data quality differences.

    Authors: We agree that isolating the architectural contribution requires training both models on identical data. We will add new experiments in the revised manuscript that train a decoder-only VLM of comparable size directly on the same 3.8M core set used for GoClick. These results will be presented alongside the existing comparisons in the Experiments section to clarify the source of the observed gains. revision: yes

  2. Referee: [§3.2 (Progressive Data Refinement)] The pipeline's task type filtering and ratio adjustment are presented as key to extracting a high-quality core set. Yet, the manuscript should include ablations demonstrating the incremental benefits of each step and tests on benchmarks or data splits not involved in the filtering process to rule out selection bias toward easier or benchmark-specific samples.

    Authors: We acknowledge the value of more granular validation. We will expand §3.2 with ablations that separately quantify the effects of task type filtering and data ratio adjustment on final grounding performance. To address potential selection bias, we will additionally evaluate the refined core set and resulting model on held-out data splits and extra GUI grounding benchmarks not used during the refinement process, reporting these results to demonstrate generalizability. revision: yes

  3. Referee: [Abstract] While the abstract asserts experimental superiority, it omits the names of the GUI element grounding benchmarks, the specific larger models compared, and any quantitative metrics or error bars. These details are essential in the abstract or early in the experiments section to allow readers to evaluate the strength of the performance claims.

    Authors: We agree that greater specificity in the abstract will improve clarity and allow readers to better assess the claims. We will revise the abstract to name the primary GUI element grounding benchmarks, identify the larger models (exceeding 2.5B parameters) used for comparison, and include key quantitative metrics with error bars. These details will also be emphasized at the start of the Experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture and data curation validated on external benchmarks

full rationale

The paper advances an empirical claim that an encoder-decoder VLM at 230M parameters, trained on a 3.8M-sample core set extracted via task-type filtering and ratio adjustment, matches larger models on GUI grounding benchmarks. All load-bearing steps are experimental: architecture comparisons, ablation on the Progressive Data Refinement pipeline, and direct evaluation against downsized decoder-only baselines and >2.5B models. No equations, first-principles derivations, fitted-parameter predictions, or self-citation chains appear; results are externally falsifiable on held-out benchmarks rather than reducing to inputs by construction. This is a standard empirical ML study whose central claims remain independent of the reported training choices.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical performance of a trained VLM plus two key design choices whose justification is stated as experimental observation rather than derivation.

free parameters (2)
  • target parameter count
    Fixed at 230M to meet on-device constraints; value chosen by authors.
  • core dataset size
    3.8M samples extracted from 10.8M raw via filtering; exact ratio and selection criteria are design choices.
axioms (2)
  • domain assumption: Encoder-decoder VLMs outperform decoder-only alternatives at small parameter scales for GUI grounding
    Invoked to justify architecture selection after experiments showed suboptimal downsizing results.
  • ad hoc to paper: Task type filtering and data ratio adjustment reliably extract a high-quality core set
    Central to the Progressive Data Refinement pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5602 in / 1306 out tokens · 54853 ms · 2026-05-08T04:42:08.801069+00:00 · methodology

discussion (0)

