UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
Pith reviewed 2026-05-10 13:55 UTC · model grok-4.3
The pith
UI-Copilot lets a GUI agent call a lightweight copilot for memory retrieval and calculations only when needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UI-Copilot is a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. Memory decoupling separates persistent observations from transient execution context. The policy agent is trained to selectively invoke the copilot as Retriever or Calculator using Tool-Integrated Policy Optimization, which optimizes tool selection separately through single-turn prediction and task execution through on-policy multi-turn rollouts. UI-Copilot-7B reaches state-of-the-art results on MemGUI-Bench and records a 17.1 percent absolute gain on AndroidWorld over the base Qwen model.
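The selective-invocation pattern described above can be made concrete with a short sketch. Everything here is an assumption for illustration, since the paper does not publish an interface: a memory split into persistent and transient stores, and a policy that emits a per-turn tool decision before acting.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    persistent: list = field(default_factory=list)  # long-lived observations
    transient: list = field(default_factory=list)   # current execution context

def agent_step(policy, copilot, memory: Memory, observation: str) -> str:
    """One hypothetical agent turn with on-demand copilot assistance."""
    memory.transient.append(observation)
    # The policy first decides whether this turn needs a tool at all.
    decision = policy.select_tool(observation, memory.transient)
    if decision == "retriever":
        # Copilot searches persistent memory, keeping the agent's context small.
        hint = copilot.retrieve(observation, memory.persistent)
    elif decision == "calculator":
        # Copilot handles arithmetic, sidestepping numerical hallucination.
        hint = copilot.calculate(observation)
    else:
        hint = None
    return policy.act(observation, memory.transient, hint)
```

The key property is that the copilot is consulted only when the policy's per-turn decision asks for it; on ordinary turns the agent acts directly on its transient context.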
What carries the argument
Tool-Integrated Policy Optimization (TIPO), which trains tool selection with single-turn prediction while training task execution with on-policy multi-turn rollouts.
If this is right
- Outperforms other 7B-scale GUI agents on MemGUI-Bench.
- Delivers a 17.1 percent absolute improvement on AndroidWorld over the base Qwen model.
- Reduces memory degradation and numerical hallucinations during extended interaction sequences.
- Maintains strong generalization when moving from benchmark tasks to real mobile applications.
Where Pith is reading between the lines
- The same selective-invocation pattern could be applied to web-browsing or desktop-automation agents that also suffer from context overload.
- Keeping the main model focused on execution while off-loading retrieval and arithmetic may lower overall token usage across many agent runs.
- The memory-decoupling idea could be tested in non-GUI domains such as multi-step planning or code generation where persistent state must be preserved.
Load-bearing premise
The learned policy for deciding when to call the copilot adds no new errors or latency that cancel out the gains from the assistance it provides.
What would settle it
A side-by-side deployment run that records whether average task completion time or failure rate rises when the policy and copilot calls are active compared with the base agent alone.
Original abstract
MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UI-Copilot, a collaborative framework for long-horizon GUI automation in which a primary GUI agent handles task execution while a lightweight copilot supplies on-demand assistance for memory retrieval (Retriever) and numerical computation (Calculator). It proposes memory decoupling to separate persistent observations from transient execution context and Tool-Integrated Policy Optimization (TIPO), which optimizes tool selection via single-turn prediction and task execution via on-policy multi-turn rollouts. The experiments report that the resulting UI-Copilot-7B model achieves state-of-the-art results on MemGUI-Bench, outperforming other 7B-scale agents such as GUI-Owl-7B and UI-TARS-1.5-7B, and delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model.
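One plausible reading of the TIPO split summarized above is that each logged trajectory yields two training views: per-step tool-selection labels consumed as single-turn examples, and the intact episode replayed for on-policy multi-turn optimization. The record fields below are assumed names, not the paper's data format.

```python
# Hypothetical illustration of TIPO's two training views built from one
# logged trajectory; 'state', 'tool', and 'action' are assumed field names.
def split_for_tipo(trajectory):
    """trajectory: list of step dicts with 'state', 'tool' (or None), 'action'."""
    # View 1: single-turn tool-selection examples (state -> gold tool label).
    tool_examples = [(step["state"], step["tool"] or "none") for step in trajectory]
    # View 2: the full episode, kept intact for on-policy multi-turn rollouts.
    episode = [(step["state"], step["action"]) for step in trajectory]
    return tool_examples, episode
```

Under this reading, tool selection is fit on View 1 while execution is optimized by rolling out View 2 on-policy; the open question the referee raises is whether accuracy on View 1 transfers to View 2 without compounding errors.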
Significance. If the performance gains are substantiated, the work offers a practical route to improving robustness of GUI agents on extended tasks by offloading memory and calculation burdens to specialized tools without enlarging the core model. The TIPO separation of concerns could generalize to other tool-augmented agent settings where single-turn decisions must support multi-turn reliability.
major comments (2)
- [5] Section 5 (Experiments): The reported SOTA performance on MemGUI-Bench and the 17.1% absolute improvement on AndroidWorld are presented without any description of training data composition, evaluation protocols, number of runs, statistical significance tests, or ablation controls on the TIPO components. This absence prevents verification of the central empirical claims and their attribution to the proposed method.
- [4.2] Section 4.2 (TIPO): The framework optimizes tool selection through single-turn prediction while optimizing execution through separate on-policy multi-turn rollouts, yet provides no explicit alignment, credit-assignment, or auxiliary loss that would allow the policy to internalize the downstream cost of erroneous invocations or to learn appropriate temporal context for when to call the copilot. No analysis demonstrates that single-turn accuracy transfers to full trajectories without compounding errors.
minor comments (2)
- [Abstract] The abstract refers to a 'lightweight copilot' but does not indicate its parameter count or architecture relative to the 7B agent, which would help readers assess the claimed efficiency benefit.
- [4] Notation for the two copilot modes (Retriever vs. Calculator) and the memory-decoupling variables should be introduced with explicit definitions in the method section rather than only in the experimental narrative.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments raise valid points about empirical rigor and methodological transparency that we will address through targeted revisions.
Point-by-point responses
Referee: Section 5 (Experiments): The reported SOTA performance on MemGUI-Bench and the 17.1% absolute improvement on AndroidWorld are presented without any description of training data composition, evaluation protocols, number of runs, statistical significance tests, or ablation controls on the TIPO components. This absence prevents verification of the central empirical claims and their attribution to the proposed method.
Authors: We agree that the current manuscript lacks sufficient experimental details. In the revised version, Section 5 will be expanded to include: a full description of training data composition and sources; complete evaluation protocols with task breakdowns for both MemGUI-Bench and AndroidWorld; the number of runs with mean and standard deviation reporting; statistical significance tests; and dedicated ablation studies on the single-turn and multi-turn components of TIPO. These additions will directly support verification and attribution of the reported gains.
Revision: yes
Referee: Section 4.2 (TIPO): The framework optimizes tool selection through single-turn prediction while optimizing execution through separate on-policy multi-turn rollouts, yet provides no explicit alignment, credit-assignment, or auxiliary loss that would allow the policy to internalize the downstream cost of erroneous invocations or to learn appropriate temporal context for when to call the copilot. No analysis demonstrates that single-turn accuracy transfers to full trajectories without compounding errors.
Authors: TIPO separates the phases deliberately: single-turn optimization trains accurate, context-aware tool selection, while on-policy multi-turn rollouts expose the policy to full trajectory outcomes, allowing it to learn the downstream effects of tool calls through direct reward signals without needing separate credit-assignment machinery. We will add an analysis subsection in the revision that reports single-turn tool accuracy, its correlation with multi-turn success, and examples of error propagation along with how multi-turn optimization reduces compounding. This will clarify the transfer without introducing auxiliary losses.
Revision: partial
Circularity Check
No circularity: empirical method and benchmark results
full rationale
The paper proposes UI-Copilot with memory decoupling and TIPO (separate single-turn tool selection and multi-turn on-policy rollouts), then reports empirical gains on MemGUI-Bench and AndroidWorld. No equations, derivations, or claims reduce any result to fitted inputs or self-citations by construction. TIPO is a training procedure whose outputs are evaluated on held-out benchmarks; the separation of optimization stages is a stated design choice, not a self-definitional loop. No uniqueness theorems, ansatzes smuggled via self-citation, or renaming of known results appear. The work is self-contained against external benchmarks and therefore receives the default non-circularity finding.
Reference graph
Works this paper leans on
- [1] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding. arXiv:2507.22025.
- [2] MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments. arXiv:2602.06075.
- [3] OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis. arXiv:2412.19723.
- [4] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. arXiv:2410.23218.
- [5] MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents. arXiv:2509.06477.