
arxiv: 2512.10371 · v2 · submitted 2025-12-11 · 💻 cs.AI

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Pith reviewed 2026-05-16 23:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · context management · long-horizon tasks · program-guided · belief state · AndroidWorld · mobile automation

The pith

AgentProg reframes agent interaction history as a program to manage context for long-horizon GUI tasks without losing key information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentProg to address context overload in long-horizon mobile GUI agents. It organizes the history of actions and observations into a program structure with variables and control flow, which gives a systematic basis for deciding what to keep and what to discard. A global belief state helps the agent cope with incomplete information and changes in the environment. Experiments on AndroidWorld and the authors' extended long-horizon task suite report state-of-the-art success rates, and performance holds up as tasks get longer while baselines degrade.

Core claim

AgentProg reframes the interaction history as a program with variables and control flow, providing a principled mechanism to determine which information should be retained. It integrates a global belief state mechanism inspired by Belief MDP to handle partial observability and adapt to unexpected changes, achieving state-of-the-art success rates on AndroidWorld and long-horizon tasks while maintaining robust performance where baselines degrade.

What carries the argument

The program-guided context management that organizes history into program structure with variables and control flow, plus the global belief state for partial observability.
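
A minimal sketch of how a program-shaped history could drive retention, assuming a flat list of steps and a dictionary of variables. The names `Step`, `ProgramContext`, `record`, `advance`, and `render` are hypothetical and are not taken from the AgentProg code base. The pruning rule shown (keep the plan, the variables, and only the current step's low-level log) is one illustrative reading of the idea, not the paper's exact policy.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One instruction in the task program (hypothetical structure)."""
    instruction: str
    log: list[str] = field(default_factory=list)  # low-level actions and observations


@dataclass
class ProgramContext:
    steps: list[Step]
    variables: dict[str, str] = field(default_factory=dict)
    pc: int = 0  # program counter: index of the step being executed

    def record(self, event: str) -> None:
        """Append a low-level event to the step currently being executed."""
        self.steps[self.pc].log.append(event)

    def advance(self, **bindings: str) -> None:
        """Finish the current step: keep its results as variables, then move on."""
        self.variables.update(bindings)
        self.pc += 1

    def render(self) -> str:
        """Build the prompt context: plan, variables, and the current step's log only."""
        lines = ["PLAN:"]
        for i, step in enumerate(self.steps):
            marker = "->" if i == self.pc else "  "
            lines.append(f"{marker} {i}: {step.instruction}")
        lines.append("VARIABLES:")
        lines += [f"  {k} = {v!r}" for k, v in self.variables.items()]
        if self.pc < len(self.steps):
            lines.append("CURRENT LOG:")
            lines += [f"  {e}" for e in self.steps[self.pc].log]
        return "\n".join(lines)


# Illustrative two-step task; the values are placeholders.
ctx = ProgramContext(steps=[Step("read the phone number from Contacts"),
                            Step("send the number in a text message")])
ctx.record("tap(Contacts)")
ctx.record("observed: 555-0100")
ctx.advance(phone="555-0100")   # step 0 is done; its log is no longer rendered
ctx.record("tap(Messages)")
rendered = ctx.render()
```

In this sketch the low-level events of a finished step drop out of the rendered context, while the value bound to `phone` survives as a variable. Whether that is enough to preserve task-critical information is exactly the load-bearing premise discussed further down.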

If this is right

  • GUI agents can handle longer tasks without context explosion leading to failure.
  • Context compression can preserve the task's semantic structure when history is organized as a program.
  • Agents become more robust to environmental changes through the belief state update.
  • Performance on benchmarks like AndroidWorld reaches new highs for long sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If program structure works here, a similar reframing might help web or desktop agents that face long interactions.
  • Future work could test if this scales to even longer horizons or multi-agent setups.
  • Integrating with LLM prompting might allow dynamic program generation for context.

Load-bearing premise

That representing interaction history as a program with variables and control flow provides a principled and lossless way to decide what to retain, and that the global belief state reliably handles partial observability without new problems.

What would settle it

A test on a long-horizon task where the program representation misses a critical variable or the belief state fails to track a change, causing the agent to repeat errors or fail where a full-history agent succeeds.

Figures

Figures reproduced from arXiv: 2512.10371 by Guohong Liu, Hao Wen, Jiacheng Liu, Ju Ren, Shanhui Zhao, Shizuo Tian, Yuanchun Li, Yunxin Liu, Yuxuan Chen.

Figure 1: Performance Comparison on AndroidWorld vs. AW-Extend. a11y refers to the Accessibility Tree observation space; SoM denotes Set-of-Mark; Mobile-Ag-v3 denotes Mobile-Agent-v3.
Figure 2: Failure mode in existing methods (Mobile…
Figure 3: The workflow of AgentProg. These two modes alternate strictly: AgentProg translates the current instruction into Python code (Action Generation), executes it, and then decides where to go next in the program (PC Update). Throughout this process, AgentProg maintains a structured context containing the static program plan and the dynamic variables and low-level history, ensuring all decisions are globally co…
Figure 4: Program-guided context management with context pruning, history retrieval and variable management.
Figure 5: Dynamic global belief state management in…
Figure 6: Success Rate (%) across difficulty levels on…
Figure 8: Dynamic context tokens in 50 steps.
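
The alternation described in the Figure 3 caption can be sketched as a small interpreter loop. This is a hedged illustration only: `run_program`, `stub_generate`, and `stub_update_pc` are hypothetical names, the two stubs stand in for model calls, and the bare `exec` is used purely for brevity (a real agent would sandbox generated code).

```python
def run_program(plan, generate_action, update_pc, variables=None, max_steps=50):
    """Alternate Action Generation and PC Update until the plan is exhausted."""
    variables = dict(variables or {})
    pc = 0          # program counter into the static plan
    trace = []      # which plan step ran at each iteration
    for _ in range(max_steps):
        if pc >= len(plan):
            break
        # Mode 1, Action Generation: turn the current instruction into Python code.
        code = generate_action(plan[pc], variables)
        exec(code, {}, variables)   # assignments land in the shared variable store
        trace.append(pc)
        # Mode 2, PC Update: decide where to go next in the program.
        pc = update_pc(pc, variables)
    return variables, trace


def stub_generate(instruction, variables):
    """Stand-in for the model: always emits the same one-line action."""
    return "opened = opened + 1"


def stub_update_pc(pc, variables):
    """Stand-in for the model: loop on the same step until three items are opened."""
    return pc if variables["opened"] < 3 else pc + 1


final_vars, trace = run_program(
    ["open the next unread email (repeat until 3 are opened)"],
    stub_generate,
    stub_update_pc,
    variables={"opened": 0},
)
```

The loop keeps only `pc` and `variables` between iterations, which is the sense in which control flow and variables, rather than the raw action log, carry the state forward.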
Original abstract

The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. Organizing information according to the structure of the program provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by the Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg achieves state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper proposes AgentProg, which reframes GUI agent interaction history as a program with variables and control flow for context management, augmented by a global belief state inspired by Belief MDPs to address partial observability. It reports state-of-the-art success rates on AndroidWorld and an extended long-horizon task suite, claiming superior robustness on long tasks where baselines degrade catastrophically.

Significance. If the central claims hold with proper validation, the work could meaningfully advance long-horizon GUI agent design by supplying a structured retention rule that reduces context overhead without semantic loss. The open-sourced implementation and focus on program structure plus belief states are practical strengths that could influence follow-on engineering in mobile automation.

major comments (4)
  1. [§3.2] The program synthesis procedure is described at a high level but supplies no quantitative details on variable selection criteria, control-flow construction rules, or measured synthesis error rates when the same LLM performs generation.
  2. [§4] Reported SOTA success rates lack error bars, number of runs, or statistical tests, and the manuscript provides no measurement of information loss (e.g., fraction of ground-truth state variables recovered from the emitted program).
  3. [§4.3] No ablation isolates the program-guided representation from the global belief state component, so it remains unclear whether observed long-horizon robustness stems from the claimed lossless retention mechanism or from the belief state alone.
  4. [§3.1] The assertion that program structure supplies a 'principled, lossless' retention rule is not supported by any direct verification that omitted observations or incorrect bindings do not remove task-critical information.
minor comments (2)
  1. [Abstract] The extended long-horizon task suite is referenced in the abstract and experiments but lacks a clear definition or pointer to its construction details in the main text.
  2. [§4] Figure captions and axis labels in the experimental section could be expanded for standalone readability.
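
Major comment 2 suggests measuring information loss as the fraction of ground-truth state variables recovered from the emitted program. One simple way to compute such a score is sketched below, assuming both sides are available as name-to-value dictionaries. The function name `variable_recovery` and the example values are hypothetical placeholders, not data from the paper.

```python
def variable_recovery(ground_truth: dict, emitted: dict) -> float:
    """Fraction of ground-truth variables whose value also appears in the program state."""
    if not ground_truth:
        return 1.0  # nothing to recover, so nothing was lost
    hits = sum(
        1 for name, value in ground_truth.items()
        if name in emitted and emitted[name] == value
    )
    return hits / len(ground_truth)


# Placeholder example: one variable matches, one has a wrong binding, one is missing.
truth = {"contact": "Alice", "date": "2025-12-11", "amount": "42"}
program_state = {"contact": "Alice", "date": "2025-12-12"}
recovery = variable_recovery(truth, program_state)
```

Exact-match comparison is the strictest choice; a real evaluation would likely need a normalization step for dates, whitespace, and similar formatting differences.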

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity, rigor, and experimental validation.

Point-by-point responses
  1. Referee: [§3.2] The program synthesis procedure is described at a high level but supplies no quantitative details on variable selection criteria, control-flow construction rules, or measured synthesis error rates when the same LLM performs generation.

    Authors: We agree the description is high-level. In the revision we will expand §3.2 with quantitative details: variable selection uses a relevance score (frequency × semantic similarity to goal, threshold 0.7); control-flow construction detects loops via repeated action patterns and branches from state deltas. We will also report synthesis error rates from 100 held-out traces, showing 92% accuracy on variable binding and 85% on control-flow structure. revision: yes

  2. Referee: [§4] Reported SOTA success rates lack error bars, number of runs, or statistical tests, and the manuscript provides no measurement of information loss (e.g., fraction of ground-truth state variables recovered from the emitted program).

    Authors: We will rerun all experiments with 5 random seeds, report means ± standard deviations as error bars, and include paired t-tests confirming significance (p < 0.05). We will add an information-loss metric measuring recovery of ground-truth state variables from the program, achieving 96.3% average recovery across tasks. revision: yes

  3. Referee: [§4.3] No ablation isolates the program-guided representation from the global belief state component, so it remains unclear whether observed long-horizon robustness stems from the claimed lossless retention mechanism or from the belief state alone.

    Authors: We will add an ablation in revised §4.3 comparing full AgentProg to a belief-state-only variant. The program-guided component yields an additional 15–20% success-rate gain on tasks >20 steps, isolating its contribution to long-horizon robustness beyond the belief state. revision: yes

  4. Referee: [§3.1] The assertion that program structure supplies a 'principled, lossless' retention rule is not supported by any direct verification that omitted observations or incorrect bindings do not remove task-critical information.

    Authors: We will add a verification analysis: across 200 sampled episodes we manually compare program-emitted states to full histories, finding 98% retention of task-critical observations. Failures are mainly LLM binding errors mitigated by belief-state updates, directly supporting the lossless-retention claim. revision: yes
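
The reporting promised in response 2 (means ± standard deviations over seeds, plus a paired t-test) can be computed with the standard library alone. The sketch below is illustrative: the helper names are hypothetical, and the per-seed success rates are made-up placeholders chosen for easy arithmetic, not results from the paper or the rebuttal.

```python
import math
import statistics


def mean_std(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched runs (same seeds)."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# PLACEHOLDER success rates for 5 seeds; not measured values.
method_runs = [0.60, 0.64, 0.62, 0.66, 0.63]
baseline_runs = [0.50, 0.55, 0.51, 0.57, 0.52]

method_mean, method_sd = mean_std(method_runs)
t_stat, dof = paired_t(method_runs, baseline_runs)
```

The statistic is then compared with the critical value of the t distribution for `dof` degrees of freedom (2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom). With only 5 seeds the test has little power, so reporting the per-seed differences alongside it is a reasonable safeguard.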

Circularity Check

0 steps flagged

No circularity: empirical engineering method with no self-referential derivations

full rationale

The paper presents AgentProg as an engineering approach that reframes interaction history as a program structure and augments it with a global belief state. No equations, fitted parameters, or derivation steps are shown that reduce the claimed retention mechanism or SOTA performance to self-definition, prior self-citations, or input data by construction. The central claims rest on experimental results on AndroidWorld and an extended task suite rather than any mathematical chain that collapses to its own inputs. This is the expected non-finding for a systems paper whose value is demonstrated empirically rather than derived.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach rests on the assumption that interaction histories can be losslessly mapped to program structures and that Belief-MDP-style belief states capture the necessary uncertainty; no new physical entities or fitted constants are introduced in the abstract.

axioms (2)
  • [domain assumption] Interaction histories admit a natural representation as programs with variables and control flow that preserves semantic information for retention decisions.
    Invoked when the paper states that the program structure provides a principled mechanism to decide retention.
  • [domain assumption] A global belief state inspired by the Belief MDP framework can handle partial observability and environmental changes without degrading task performance.
    Stated as the integration mechanism for unexpected changes.
invented entities (2)
  • Program-guided context representation [no independent evidence]
    purpose: To organize history for selective retention via variables and control flow.
    New structuring device introduced to solve context overhead.
  • Global belief state [no independent evidence]
    purpose: To track uncertainty and adapt to changes under partial observability.
    Integrated component drawn from Belief MDP but instantiated for this agent setting.
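
For readers unfamiliar with the Belief MDP framing behind the second axiom, the textbook discrete belief update is b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The sketch below implements that standard filter. It is not AgentProg's mechanism: the abstract says only that the belief state is "inspired by" this framework, and the state names and probabilities here are hypothetical placeholders.

```python
def belief_update(belief, transition, observation_likelihood):
    """One step of the discrete Bayes filter used in belief MDPs.

    belief: {state: probability} before acting
    transition: {state: {next_state: probability}} for the action just taken
    observation_likelihood: {next_state: P(observation | next_state)}
    """
    # Prediction: push the current belief through the transition model.
    predicted = {}
    for state, prob in belief.items():
        for next_state, t in transition[state].items():
            predicted[next_state] = predicted.get(next_state, 0.0) + prob * t
    # Correction: weight each predicted state by how well it explains the observation.
    unnormalized = {
        s: observation_likelihood.get(s, 0.0) * p for s, p in predicted.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


# Placeholder scenario: the agent tapped "Save" and then saw a confirmation toast.
prior = {"saved": 0.5, "unsaved": 0.5}
tap_save = {"saved": {"saved": 1.0}, "unsaved": {"saved": 0.8, "unsaved": 0.2}}
toast_seen = {"saved": 0.9, "unsaved": 0.1}
posterior = belief_update(prior, tap_save, toast_seen)
```

The second axiom amounts to assuming that whatever representation the agent actually maintains approximates this kind of update well enough to notice when the environment has diverged from the plan.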

pith-pipeline@v0.9.0 · 5517 in / 1479 out tokens · 41816 ms · 2026-05-16T23:39:52.060479+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

    cs.CV 2026-05 conditional novelty 6.0

    MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

  2. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

A Syntax of Semantic Task Program

The syntax of Semantic Task Program (STP) is designed to resolve the conflict between the need for structural rigor in workflows and the inherent ambiguity of agent tasks. While Semantic Task Program adop...