arxiv: 2605.28629 · v1 · pith:FVN5RWYPnew · submitted 2026-05-27 · 💻 cs.CL

Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

Pith reviewed 2026-06-29 12:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords mobile agents · multimodal large language models · confidence estimation · human-agent interaction · supervised fine-tuning · direct preference optimization · task success rate

The pith

Mobile-Aptus trains agents to output confidence scores and corrects their bias for balanced interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage framework that first uses supervised fine-tuning to make mobile agents output both actions and confidence scores, then applies semantic similarity retrieval plus direct preference optimization to fix biases in those scores. This setup aims to stop agents from attempting tasks they cannot finish while also preventing them from requesting human help too frequently. If the approach works, agents reach higher task completion rates on standard benchmarks and require fewer interventions in live tests. A reader would care because it offers a practical way to make multimodal agents more reliable without constant oversight.

Core claim

The central claim is that a universal confidence integration framework produces agents that interact proactively and robustly. The framework first empowers interaction capability through supervised fine-tuning, then corrects confidence bias via semantic similarity retrieval and direct preference optimization. The paper reports state-of-the-art results on OS-Kairos, AITZ, Meta-GUI, and AndroidControl, with an average task success rate gain exceeding 17 percent, and a 26 percent gain in real-world dynamic tests using only 0.64 intervention steps per instruction.

What carries the argument

The confidence integration framework that combines supervised fine-tuning for joint action-and-confidence output with bias correction through retrieval and preference optimization.
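
To make the mechanism concrete, here is a minimal sketch of how a joint action-and-confidence output could drive the choice between acting and asking for help. It is an illustration under assumptions, not the authors' implementation: the `AgentStep` type, the `decide` function, the action string format, the [0, 1] confidence range, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    """One model output: a proposed action plus a self-reported confidence."""
    action: str        # hypothetical format, e.g. "CLICK(540, 1210)"
    confidence: float  # assumed here to be normalized to [0, 1]


def decide(step: AgentStep, threshold: float = 0.5) -> str:
    """Return "execute" or "ask_human" for a single step.

    The threshold is an illustrative hyperparameter, not a value from the paper.
    """
    if not 0.0 <= step.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "execute" if step.confidence >= threshold else "ask_human"
```

Under this reading, a threshold set too low lets the agent over-execute, while one set too high makes it over-solicit. The bias-correction stage matters because a fixed threshold is only useful if the scores themselves are trustworthy.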

If this is right

  • Agents reach higher success rates than prior methods across the four named benchmarks.
  • Average task success improves by more than 17 percent in offline evaluation.
  • Real-world success exceeds the baseline by 26 percent while keeping interventions low at 0.64 steps per instruction.
  • Both excessive autonomous attempts and excessive human requests are reduced at the same time.
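
The two headline quantities, task success rate and intervention steps per instruction, can be computed from per-instruction logs roughly as follows. This is a sketch under assumptions: the `Episode` record and its field names are hypothetical, and the paper's exact metric definitions may differ. The abstract also does not say whether the reported gains are absolute percentage points or relative improvements.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Episode:
    """Outcome of one instruction (hypothetical log record)."""
    success: bool            # did the agent complete the instruction?
    intervention_steps: int  # steps at which a human was asked to help


def summarize(episodes: Sequence[Episode]) -> Tuple[float, float]:
    """Return (task success rate, mean intervention steps per instruction)."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    success_rate = sum(1 for e in episodes if e.success) / n
    mean_interventions = sum(e.intervention_steps for e in episodes) / n
    return success_rate, mean_interventions
```

Reporting both numbers together is what lets a reader tell whether a success gain was bought with extra human help.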

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same confidence correction steps could be tested on non-mobile agent tasks such as web navigation or robotics.
  • If the scores prove reliable, they might be used to trigger other safety checks before execution begins.
  • Different retrieval methods could be swapped into the bias-correction stage to measure further gains.
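
The last extension is easiest to picture if the retrieval step takes its similarity function as a parameter. The sketch below is an assumption-laden illustration: the paper's embedding model, index, and retrieval interface are not described in the abstract, and `cosine_similarity` and `retrieve_top_k` are hypothetical names operating on plain lists of floats.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Vector,
    corpus: Sequence[Vector],
    k: int = 3,
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k corpus vectors most similar to query."""
    scored = [(similarity(query, vec), idx) for idx, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

Swapping in a different retrieval method then means passing another `similarity` callable, which keeps the comparison controlled.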

Load-bearing premise

The combination of supervised fine-tuning to produce confidence scores and later bias correction through retrieval and optimization will create accurate estimates that hold up on new data.
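
For readers unfamiliar with the optimization half of this premise, the standard direct preference optimization objective for one preference pair is sketched below. The formula is the published DPO loss. How Mobile-Aptus builds its chosen and rejected outputs is not stated in the abstract, so treating the chosen output as the one with the better-calibrated confidence is an assumption, as are the function name and the default `beta`.

```python
import math


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. beta=0.1 is illustrative, not from the paper.
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the chosen output's likelihood relative to the reference more than it raises the rejected one's. It says nothing by itself about whether the resulting confidence scores hold up on new data, which is why the premise is load-bearing.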

What would settle it

Run the trained agent on a fresh set of mobile tasks and check whether its reported confidence values reliably predict actual success or failure rates.
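
One common way to run that check is expected calibration error (ECE), which bins steps by reported confidence and compares each bin's mean confidence with its observed success rate. The sketch below assumes confidences normalized to [0, 1] and equal-width bins. Both are assumptions, and the authors' rebuttal states that the paper does not report ECE.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    outcomes: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy|."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must be in [0, 1]")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A value near zero on fresh tasks would support the premise. A large value would mean the scores cannot be trusted to gate human intervention, whatever the downstream success rate.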

Figures

Figures reproduced from arXiv: 2605.28629 by Aston Zhang, Gongshen Liu, Pengzhou Cheng, Tianjie Ju, Yuan Guo, Zheng Wu, Zhuosheng Zhang, Zongru Wu.

Figure 1: The decision boundary of a fully autonomous agent exceeds its [...]
Figure 2: (a) Example of interactive mobile-using agents executing instructions, which resolves the over-execution issue but introduces over-soliciting. They may [...]
Figure 3: Overview of the universal confidence integration framework. First, leverage the data that has ground-truth action and confidence score, we apply SFT [...]
Figure 4: (a) Results of the model scaling experiment. The results demonstrate the rationality of the probed mobile-using agent selection and the strong [...]
Figure 5: Comparison of attention heatmaps between Mobile-Aptus and OS [...]
Original abstract

Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show that Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at https://github.com/Wuzheng02/Mobile-Aptus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Mobile-Aptus, a two-stage framework for MLLM-based mobile-using agents. The first stage uses supervised fine-tuning to enable agents to output both actions and confidence scores. The second stage applies semantic similarity retrieval combined with direct preference optimization to correct biases in those confidence scores. The goal is to reduce both over-execution (attempting unresolvable tasks) and over-soliciting (excessive human requests). The manuscript reports state-of-the-art results on four benchmarks (OS-Kairos, AITZ, Meta-GUI, AndroidControl) with >17% average task success improvement over baselines, plus a 26% gain in real-world dynamic experiments at 0.64 intervention steps per instruction.

Significance. If the confidence estimates prove accurate and generalizable, the framework could meaningfully advance reliable mobile agents by enabling proactive, calibrated human intervention. The public code release at https://github.com/Wuzheng02/Mobile-Aptus is a positive contribution that supports reproducibility.

major comments (2)
  1. [Abstract] The headline performance claims (>17% average task success improvement across four benchmarks and +26% in real-world experiments) are load-bearing for the central contribution, yet the abstract supplies no methodological details on how confidence scores were validated, no error analysis, no ablation isolating the semantic-retrieval + DPO bias-correction stage, and no quantitative evidence (e.g., calibration metrics or held-out generalization results) that the two-stage pipeline produces accurate, unbiased estimates beyond the SFT training distribution.
  2. [Abstract] The weakest assumption—that SFT followed by semantic-similarity retrieval + DPO yields confidence scores that generalize and reduce both over-execution and over-soliciting—is not supported by any reported validation. Without such evidence, the reported gains cannot be confidently attributed to the proposed confidence-driven interaction rather than other implementation factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the validation of our confidence calibration approach. We address each point below and will revise the abstract to better summarize the supporting evidence from the full manuscript.

Point-by-point responses
  1. Referee: [Abstract] The headline performance claims (>17% average task success improvement across four benchmarks and +26% in real-world experiments) are load-bearing for the central contribution, yet the abstract supplies no methodological details on how confidence scores were validated, no error analysis, no ablation isolating the semantic-retrieval + DPO bias-correction stage, and no quantitative evidence (e.g., calibration metrics or held-out generalization results) that the two-stage pipeline produces accurate, unbiased estimates beyond the SFT training distribution.

    Authors: We agree the abstract is concise and could better reference the validation details. The full paper describes the SFT stage in Section 3.1 and the semantic similarity retrieval + DPO bias-correction stage in Section 3.2. Section 4.3 presents ablations that isolate the contribution of the bias-correction stage, Section 4.2 reports results on four held-out benchmarks demonstrating generalization, and Section 4.5 includes error analysis on over-execution and over-soliciting rates. We will revise the abstract to add a brief clause noting the two-stage validation and the ablation-supported gains from bias correction.

    Revision: yes

  2. Referee: [Abstract] The weakest assumption—that SFT followed by semantic-similarity retrieval + DPO yields confidence scores that generalize and reduce both over-execution and over-soliciting—is not supported by any reported validation. Without such evidence, the reported gains cannot be confidently attributed to the proposed confidence-driven interaction rather than other implementation factors.

    Authors: The manuscript does report supporting validation. Ablations in Section 4.3 directly compare the full two-stage model against the SFT-only baseline and show additional reductions in over-execution and over-soliciting attributable to the retrieval + DPO stage. Results on four held-out benchmarks plus the real-world dynamic experiments (0.64 intervention steps) provide evidence of generalization beyond the SFT distribution. These controlled comparisons allow attribution to the confidence-driven mechanism rather than other factors. We do not report traditional calibration metrics such as ECE, but the downstream task metrics serve as the primary validation of utility.

    Revision: no

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivations or self-referential equations

full rationale

The paper presents an empirical two-stage training pipeline (SFT for action+confidence, then semantic-retrieval + DPO for bias correction) and reports benchmark gains. No equations, uniqueness theorems, or derivation steps appear in the provided text. Performance claims rest on experimental results rather than any mathematical reduction that collapses to fitted inputs or self-citations by construction. The central assumption about generalization of confidence scores is an empirical claim open to falsification, not a definitional or self-referential loop. This is the normal non-circular case for applied ML papers without formal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.
