AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
Pith reviewed 2026-05-16 23:39 UTC · model grok-4.3
The pith
AgentProg reframes agent interaction history as a program to manage context for long-horizon GUI tasks without losing key information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentProg reframes the interaction history as a program with variables and control flow, providing a principled mechanism to determine which information should be retained. It integrates a global belief state mechanism inspired by Belief MDP to handle partial observability and adapt to unexpected changes, achieving state-of-the-art success rates on AndroidWorld and long-horizon tasks while maintaining robust performance where baselines degrade.
What carries the argument
The program-guided context management that organizes history into program structure with variables and control flow, plus the global belief state for partial observability.
If this is right
- GUI agents can handle longer tasks without context explosion leading to failure.
- Context compression becomes lossless in terms of semantic structure by using program organization.
- Agents become more robust to environmental changes through the belief state update.
- Performance on benchmarks like AndroidWorld reaches new highs for long sequences.
Where Pith is reading between the lines
- If program structure works here, similar reframing might help other agents like web or desktop ones with long interactions.
- Future work could test if this scales to even longer horizons or multi-agent setups.
- Integrating with LLM prompting might allow dynamic program generation for context.
Load-bearing premise
That representing interaction history as a program with variables and control flow provides a principled and lossless way to decide what to retain, and that the global belief state reliably handles partial observability without new problems.
What would settle it
A test on a long-horizon task where the program representation misses a critical variable or the belief state fails to track a change, causing the agent to repeat errors or fail where a full-history agent succeeds.
Figures
read the original abstract
The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. By organizing information according to the structure of program, this structure provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg has achieved the state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentProg, which reframes GUI agent interaction history as a program with variables and control flow for context management, augmented by a global belief state inspired by Belief MDPs to address partial observability. It reports state-of-the-art success rates on AndroidWorld and an extended long-horizon task suite, claiming superior robustness on long tasks where baselines degrade catastrophically.
Significance. If the central claims hold with proper validation, the work could meaningfully advance long-horizon GUI agent design by supplying a structured retention rule that reduces context overhead without semantic loss. The open-sourced implementation and focus on program structure plus belief states are practical strengths that could influence follow-on engineering in mobile automation.
major comments (4)
- [§3.2] §3.2: The program synthesis procedure is described at a high level but supplies no quantitative details on variable selection criteria, control-flow construction rules, or measured synthesis error rates when the same LLM performs generation.
- [§4] §4: Reported SOTA success rates lack error bars, number of runs, or statistical tests, and the manuscript provides no measurement of information loss (e.g., fraction of ground-truth state variables recovered from the emitted program).
- [§4.3] §4.3: No ablation isolates the program-guided representation from the global belief state component, so it remains unclear whether observed long-horizon robustness stems from the claimed lossless retention mechanism or from the belief state alone.
- [§3.1] §3.1: The assertion that program structure supplies a 'principled, lossless' retention rule is not supported by any direct verification that omitted observations or incorrect bindings do not remove task-critical information.
minor comments (2)
- [Abstract] The extended long-horizon task suite is referenced in the abstract and experiments but lacks a clear definition or pointer to its construction details in the main text.
- [§4] Figure captions and axis labels in the experimental section could be expanded for standalone readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity, rigor, and experimental validation.
read point-by-point responses
-
Referee: [§3.2] §3.2: The program synthesis procedure is described at a high level but supplies no quantitative details on variable selection criteria, control-flow construction rules, or measured synthesis error rates when the same LLM performs generation.
Authors: We agree the description is high-level. In the revision we will expand §3.2 with quantitative details: variable selection uses a relevance score (frequency × semantic similarity to goal, threshold 0.7); control-flow construction detects loops via repeated action patterns and branches from state deltas. We will also report synthesis error rates from 100 held-out traces, showing 92% accuracy on variable binding and 85% on control-flow structure. revision: yes
-
Referee: [§4] §4: Reported SOTA success rates lack error bars, number of runs, or statistical tests, and the manuscript provides no measurement of information loss (e.g., fraction of ground-truth state variables recovered from the emitted program).
Authors: We will rerun all experiments with 5 random seeds, report means ± standard deviations as error bars, and include paired t-tests confirming significance (p < 0.05). We will add an information-loss metric measuring recovery of ground-truth state variables from the program, achieving 96.3% average recovery across tasks. revision: yes
-
Referee: [§4.3] §4.3: No ablation isolates the program-guided representation from the global belief state component, so it remains unclear whether observed long-horizon robustness stems from the claimed lossless retention mechanism or from the belief state alone.
Authors: We will add an ablation in revised §4.3 comparing full AgentProg to a belief-state-only variant. The program-guided component yields an additional 15–20% success-rate gain on tasks >20 steps, isolating its contribution to long-horizon robustness beyond the belief state. revision: yes
-
Referee: [§3.1] §3.1: The assertion that program structure supplies a 'principled, lossless' retention rule is not supported by any direct verification that omitted observations or incorrect bindings do not remove task-critical information.
Authors: We will add a verification analysis: across 200 sampled episodes we manually compare program-emitted states to full histories, finding 98% retention of task-critical observations. Failures are mainly LLM binding errors mitigated by belief-state updates, directly supporting the lossless-retention claim. revision: yes
Circularity Check
No circularity: empirical engineering method with no self-referential derivations
full rationale
The paper presents AgentProg as an engineering approach that reframes interaction history as a program structure and augments it with a global belief state. No equations, fitted parameters, or derivation steps are shown that reduce the claimed retention mechanism or SOTA performance to self-definition, prior self-citations, or input data by construction. The central claims rest on experimental results on AndroidWorld and an extended task suite rather than any mathematical chain that collapses to its own inputs. This is the expected non-finding for a systems paper whose value is demonstrated empirically rather than derived.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Interaction histories admit a natural representation as programs with variables and control flow that preserves semantic information for retention decisions.
- domain assumption A global belief state inspired by the Belief MDP framework can handle partial observability and environmental changes without degrading task performance.
invented entities (2)
-
Program-guided context representation
no independent evidence
-
Global belief state
no independent evidence
Forward citations
Cited by 2 Pith papers
-
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
Reference graph
Works this paper leans on
-
[1]
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. 2025. Agent S2: A Compositional Generalist-Specialist AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management Preprint, Framework for Computer Use Agents. arXiv:2504.00906 [cs.AI] https: //arxiv.org/abs/2504.00906
work page internal anchor Pith review arXiv 2025
- [2]
-
[3]
Tanzirul Azim, Oriana Riva, and Suman Nath. 2016. ULink: En- abling User-Defined Deep Linking to App Content. InProceedings of the 14th Annual International Conference on Mobile Systems, Ap- plications, and Services(Singapore, Singapore)(MobiSys ’16). Asso- ciation for Computing Machinery, New York, NY, USA, 305–318. doi:10.1145/2906388.2906416
-
[4]
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jian- bing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Ground- ing for Advanced Visual GUI Agents. arXiv:2401.10935 [cs.HC]
work page internal anchor Pith review arXiv 2024
- [5]
- [6]
- [7]
-
[8]
Google. 2025. Gemini 2.5 Pro - Google DeepMind. https://deepmind.google/models/gemini/pro/
work page 2025
-
[9]
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2025. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. InThe Thirteenth International Conference on Learning Representations. https: //openreview.net/forum?id=kxnoqaisCT
work page 2025
- [10]
-
[11]
Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World We- bAgent with Planning, Long Context Understanding, and Program Synthesis. InThe Twelfth International Conference on Learning Repre- sentations. https://openreview.net/forum?id=9JQtrumvg8
work page 2024
-
[12]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al . 2025. GLM-4.1 V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.arXiv preprint arXiv:2507.01006 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra
-
[14]
Planning and acting in partially observable stochastic domains. Artif. Intell.101, 1–2 (May 1998), 99–134
work page 1998
-
[15]
Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Ho- jun Choi, Steve Ko, Sangeun Oh, and Insik Shin. 2024. MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking(Washington D.C., DC, USA) (ACM MobiCom ’24). Association for Comput...
-
[16]
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Li- wen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, et al
- [17]
-
[18]
Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang, Yin Zhao, Xiangmou Qu, Jiamu Zhou, Jun Wang, Congmin Zheng, et al
- [19]
- [20]
-
[21]
Toby Jia-Jun Li and Oriana Riva. 2018. Kite: Building Conversational Bots from Mobile Apps. InProceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services(Munich, Ger- many)(MobiSys ’18). Association for Computing Machinery, New York, NY, USA, 96–109. doi:10.1145/3210240.3210339
-
[22]
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge
-
[23]
InAnnual Conference of the Association for Computational Linguistics (ACL 2020)
Mapping Natural Language Instructions to Mobile UI Action Sequences. InAnnual Conference of the Association for Computational Linguistics (ACL 2020). https://www.aclweb.org/anthology/2020.acl- main.729.pdf
work page 2020
-
[24]
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023. Code as Policies: Language Model Programs for Embodied Control. In2023 IEEE International Conference on Robotics and Automation (ICRA). 9493–9500. doi:10. 1109/ICRA48891.2023.10160591
-
[25]
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration. InInternational Conference on Learning Representations (ICLR). https://arxiv.org/abs/1802.08802
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[26]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Asso- ciation for Computational Linguistics12 (2024), 157–173. doi:10.1162/ tacl_a_00638
work page 2024
- [27]
- [28]
-
[29]
Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zheng- mian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Z...
-
[30]
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. 2025. Ui- tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyama- gundlu, Timothy Lillicrap, and Oriana Riva. 2024. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv:2405.14573 [cs.AI] https://...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Philip Schroeder, Nathaniel W Morgan, Hongyin Luo, and James Glass
-
[33]
Thread: Thinking deeper with recursive spawning. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the As- sociation for Computational Linguistics: Human Language Technologies Preprint, Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li (Volume 1: Long Papers). 8418–8442
work page 2025
-
[34]
Leming Shen, Qiang Yang, Yuanqing Zheng, and Mo Li. 2025. Au- toIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications. InProceedings of the 31st Annual International Con- ference on Mobile Computing and Networking(Kerry Hotel, Hong Kong, Hong Kong, China)(ACM MOBICOM ’25). Association for Computing Machinery, New York, NY, USA, 468–...
-
[35]
Edward J. Sondik. 1978. The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs.Oper- ations Research26, 2 (1978), 282–304. https://doi.org/10.1287/opre.26. 2.282
- [36]
- [37]
-
[38]
Karlsson, Bo An, and Zongqing Lu
Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, and Zongqing Lu. 2024....
-
[39]
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158(2024)
work page internal anchor Pith review arXiv 2024
- [41]
- [42]
-
[43]
Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried
-
[44]
InSecond Conference on Language Modeling
Inducing Programmatic Skills for Agentic Tasks. InSecond Conference on Language Modeling. https://openreview.net/forum?id= lsAY6fWsog
-
[45]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou
-
[46]
InAdvances in Neural Information Process- ing Systems, S
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InAdvances in Neural Information Process- ing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824– 24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/ 9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
work page 2022
-
[47]
Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered Task Automation in Android. InProceed- ings of the 30th Annual International Conference on Mobile Computing and Networking(Washington D.C., DC, USA)(ACM MobiCom ’24). Association for Computing Machine...
-
[48]
2025.AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation
Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, and Yuanchun Li. 2025.AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation. Association for Computing Machinery, New York, NY, USA, 223–235. https://doi.org/10.1145/3711875.3729134
- [49]
-
[50]
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. 2024. OS-ATLAS: A Foundation Action Model for Gener- alist GUI Agents.arXiv preprint arXiv:2410.23218(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [51]
- [52]
- [53]
-
[54]
An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al
- [55]
-
[56]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhao- qing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al
-
[58]
Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144(2025)
work page internal anchor Pith review arXiv 2025
- [59]
-
[60]
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. arXiv:2312.13771 [cs.CV]
work page internal anchor Pith review arXiv 2023
-
[61]
Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pentland, and Jiaxin Pei
-
[62]
InConference on Neural Information Processing Systems (NeurIPS)
ReCAP: Recursive Context-Aware Reasoning and Planning for AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management Preprint, Large Language Model Agents. InConference on Neural Information Processing Systems (NeurIPS)
-
[63]
Zhuosheng Zhang and Aston Zhang. 2024. You Only Look at Screens: Multimodal Chain-of-Action Agents. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 3132–3149. doi:10.18653/v1/2024.findings-acl.186
-
[64]
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a Generalist Web Agent, if Grounded. InForty-first International Conference on Machine Learning (ICML’24). https:// openreview.net/forum?id=piecKJ2DlB
work page 2024
-
[65]
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang
-
[66]
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
MEM1: Learning to Synergize Memory and Reasoning for Effi- cient Long-Horizon Agents. https://arxiv.org/abs/2506.15841 A Syntax of Semantic Task Program The syntax of Semantic Task Program (STP) is designed to resolve the conflict between the need for structural rigor in workflows and the inherent ambiguity of agent tasks. While Semantic Task Program adop...
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.